WO2009150591A1 - Method and device for the generation of a topic-specific vocabulary and computer program product - Google Patents


Info

Publication number
WO2009150591A1
Authority
WO
WIPO (PCT)
Prior art keywords
vocabulary
specific
topic
language
entries
Prior art date
Application number
PCT/IB2009/052386
Other languages
French (fr)
Inventor
Zsolt Saffer
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2009150591A1 publication Critical patent/WO2009150591A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L 2015/228 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Definitions

  • the present invention relates to a method for the computer-aided generation of a topic-specific vocabulary from public text corpora.
  • a vocabulary may be used in speech recognition, in speech synthesis systems, and in automatic processing of audio-visual lectures for information retrieval.
  • the invention further relates to a device for the generation of a topic-specific vocabulary from public text corpora.
  • the invention also relates to a computer program product that is arranged to perform the present method when being run on a computer.
  • Speech recognizer and speech synthesis systems apply vocabularies containing words and their pronunciation forms. Both the creation of pronunciation forms and the resulting phoneme sequences are called phonetic transcription. A word together with its phonetic transcription forms a vocabulary entry. Creating phonetic transcriptions of all words of a text and assigning them to the corresponding words is called phonetic labeling. Once a large collection of text - also called text corpora - is phonetically labeled, the lexicon generation simply reduces to collecting the vocabulary entries from the corpora and adding them to the lexicon. In current speech processing systems the vocabulary generation is only partly automated. Most prior art vocabulary generating systems work in a semi-automatic way.
  • the mainly used automatic phonetic transcription methods include look-up in a background lexicon, rule-based phonetic transcription and statistical phonetic transcription. After carrying out these steps, the automatically generated phoneme sequences are partly verified, and language specialists can then add further pronunciation variants to them.
  • the grapheme structure of the actual language dominates the automatic phonetic transcription process, i.e. entries from other languages are handled only exceptionally.
  • the prior art vocabulary generating systems can automatically produce vocabulary entries only for standard words fitting the grapheme structure of the actual language.
  • a method for the computer-aided generation of a topic-specific vocabulary from public text corpora comprising the steps of:
  • the step of automatic generation of the vocabulary entries comprises: - - a grapheme structure-based classification of the vocabulary entries, to classify the vocabulary entries according to a number of predetermined types, and
  • the invention provides a device for the generation of a topic-specific vocabulary from public text corpora and comprising
  • the vocabulary entry generation means including
  • - - grapheme structure-based classifier means for automatically classifying the vocabulary entries according to a number of predetermined types, and - - phonetic transcriber means connected to the classifier means, for automatically carrying out a vocabulary entry type-specific grapheme-to-phoneme conversion, to obtain phonetic transcriptions of words.
  • the invention provides a computer program product that can be loaded directly into a memory of a computer and comprises sections of a software code, for performing the method according to the invention when the computer program is run on the computer.
  • the input can be a specified list of source texts, e.g. the result of a browser search on WWW, i.e. a list of texts accessible via hyperlinks.
  • vocabulary entries are generated for every item of the text collection and added to a vocabulary.
  • a vocabulary entry is composed of an item (typically a word) and its corresponding phonetic transcription(s).
  • the words are described on the grapheme level as character sequences, as is usual per se, and similarly, the phonetic transcriptions are described on the phonetic level as phoneme sequences, as is usual, too.
  • An essential feature of the invention is that not only the relevant language-specific text collection but also, in combination therewith, a topic-specific text collection is automatically selected from the specified list of source texts.
  • the present technique is generic, i.e. it is applicable to the generation of any kind of vocabulary entries.
  • of special importance is that it also enables the generation of special vocabulary entries having a grapheme structure which is different from the language morphology. Such entries are e.g. words of foreign origin, abbreviations, e-mail addresses or http addresses.
  • four types of vocabulary entries are distinguished: vocabulary entries fitting the language grapheme structure; vocabulary entries fitting the grapheme structure of at least one other supported language; vocabulary entries which do not fit any language grapheme structure, e.g. abbreviations; and specific styles, like e-mail addresses or web addresses. In short, these can also be named normal type vocabulary entries, supported foreign language type vocabulary entries, abbreviation type vocabulary entries and specific styles vocabulary entries.
  • the specific styles are selected in a filtering operation applying pattern matching to find predefined specific styles.
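  As an illustration of such a filtering operation, the sketch below matches items against regular-expression patterns for the specific styles named in the text (e-mail addresses and web addresses). The pattern set and its exact form are assumptions, since the text leaves them open.

```python
import re

# Hypothetical patterns for the "specific styles" mentioned in the text;
# the actual predefined patterns are not specified, so these are
# illustrative only.
SPECIFIC_STYLE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "web": re.compile(r"^(https?://)?(www\.)?[\w-]+(\.[\w-]+)+(/\S*)?$"),
}

def match_specific_style(item: str):
    """Return the name of the first matching specific style, or None."""
    for name, pattern in SPECIFIC_STYLE_PATTERNS.items():
        if pattern.match(item):
            return name
    return None
```

  An ordinary word falls through both patterns and is passed on to the grapheme structure-based classification unchanged.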
  • a preferred embodiment is characterised in that after the specific styles filtering, the other types of vocabulary entries are distinguished by grapheme structure-based classification. Such a grapheme structure-based classification is known per se.
  • the classification of the vocabulary entries applies a language identification system based on the application of a neural network.
  • as to such a neural network, reference may be made to WO 2004/038606 A, where the use of a neural network application has already been described in principle in the present field of technique.
  • a statistical type-specific grapheme-to-phoneme conversion is carried out to obtain phonetic transcriptions.
  • a further preferred embodiment of the present invention is characterised in that after the selection of language- and topic-specific text, the text is pre-processed to prepare it for the later phonetic transcription, to eliminate parts, like annotations, not to be transcribed, and to format special words, like numbers, by grammar parsing. Such text pre-processing is useful to prepare the text for the phonetic transcription.
  • the vocabulary entries generated in this way are added to an already prepared vocabulary by means of a lexicon adaptor.
  • classifier means comprise a specific styles filter, for filtering specific styles from the texts by applying pattern matching to find predefined specific styles, like e-mail addresses, as well as a classifier for distinguishing other types of vocabulary entries by grapheme structure-based classification.
  • the phonetic transcriber means are connected to the classifier for generating the phonetic transcription of words according to the types of vocabulary entries as determined by the classifier, whereas predefined stored phonetic forms are associated with specific style type words.
  • text pre-processor means are connected to the selection means, for pre-processing the texts to eliminate parts not to be transcribed, like annotations, and to format special words, like numbers, by grammar parsing.
  • the present device may comprise a lexicon adaptor which is connected to the vocabulary entry generation means for adding the vocabulary entries to a vocabulary.
  • the basic idea of the invention is the automatic generation of a topic-specific vocabulary, i.e. automatic generation of any kind of vocabulary entry for every item of a text collection, including special items.
  • the emphasis is on special vocabulary entries, like words with foreign origin, abbreviations, e-mail addresses or http addresses.
  • a further aspect is the combined automatic identification of topic-specific electronic form text corpora from any publicly accessible location.
  • the list of relevant texts is filtered out from a specified list of source texts by applying language and topic identification techniques known per se. Then, the resulting text collection is pre-processed mainly to eliminate some items being irrelevant for further processing. For each item of the text collection, grapheme structure-based automatic classification follows to decide the type of the vocabulary entry corresponding to the actual item. Next, a vocabulary entry type-specific statistical grapheme-to-phoneme conversion is carried out to get the pronunciations of each item of the text collection. Then, vocabulary entry composition follows from the items and their corresponding pronunciations. Finally, all the vocabulary entries generated from the text collection are added to the topic-specific vocabulary.
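  The processing chain just described can be sketched as follows. The function and parameter names are hypothetical; each stage is passed in as a callable because the text leaves its concrete realization open.

```python
def generate_topic_specific_vocabulary(source_texts, language, topic,
                                       select_language, select_topic,
                                       preprocess, classify, transcribe):
    """Sketch of the overall pipeline from source texts to vocabulary."""
    # 1. Language identification: keep texts in the required language.
    texts = [t for t in source_texts if select_language(t) == language]
    # 2. Topic identification: keep texts matching the required topic.
    texts = [t for t in texts if select_topic(t) == topic]
    # 3. Pre-processing: drop annotations, format special words, etc.
    items = [item for t in texts for item in preprocess(t)]
    # 4.-6. Classification, type-specific G2P conversion, composition.
    vocabulary = []
    for item in items:
        entry_type = classify(item)
        for pron in transcribe(item, entry_type):
            vocabulary.append((item, pron))
    return vocabulary
```

  Each tuple in the result is a vocabulary entry: an item together with one of its phonetic transcriptions.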
  • the automatic grapheme structure-based classification of the vocabulary entry types and the application of the automatic language and topic detection for vocabulary generation purpose is of particular advantage, or importance, respectively.
  • the classification may be based either on grapheme- level (textual) character n-grams, or on neural networks, but any other form of realization of grapheme structure based classification is also possible.
  • the language identification is applied for each text of the specified list of source texts, and the texts matching the required language are selected.
  • the output of this step is a list of language-specific texts.
  • the topic identification is applied for each text of the list of language-specific texts, and those texts are selected which match the required topic. The result thereof is the topic-specific text collection.
  • the vocabulary entry generation means perform classification, phonetic transcription and vocabulary entry composition for each item of the input topic-specific pre-processed text collection. During the vocabulary entry generation each item is classified into one of the vocabulary entry types. This is advantageous for the phonetic transcription because its processing is vocabulary entry type-specific:
  • Vocabulary entry fitting the grapheme structure of a foreign language of a specified set of (supported) languages, e.g. foreign word or family name of foreign origin
  • Vocabulary entry fitting the grapheme structure of none of the supported languages, i.e. abbreviation, either pronounced normally (e.g. like "Philips") or pronounced by spelling (e.g. like "IBM")
  • the generation means select the specific styles and parse them into sequences of words and special signs. Consequently, the further classification and phonetic transcription of the words are reduced to the handling of the other vocabulary entry types, where a grapheme structure based classification may be carried out among the other types.
  • One representative example for the realization of the classifier is to apply a probabilistic framework using a character n-gram-based statistical method; another is to apply a language identification system based on a neural network.
  • the resulting vocabulary entry type information is used as input for the next step, namely for the vocabulary entry type specific phonetic transcription. This phonetic transcription preferably is performed by a joint n-gram-based statistical method.
  • the last step of the vocabulary entry generator is the composition of the vocabulary entry from the item and its phonetic transcription(s).
  • the lexicon adaptor adds the vocabulary entries obtained in this way to the vocabulary.
  • Fig. 1 is a schematic diagram showing the automatic topic-specific vocabulary generation in the main processing blocks
  • Fig. 2 is a block diagram illustrating the structure of a preferred vocabulary entry generator according to the invention
  • Fig. 3 shows a flow chart illustrating the specific styles filtering according to the invention.
  • Fig. 4 is a flow chart illustrating the vocabulary entry composition according to the invention.
  • Fig. 1 illustrates a device 1 for the automatic generation of a topic-specific (and language-specific) vocabulary.
  • this device 1 receives a specified list of source texts which is inputted to automatic selection means 3, for automatically selecting language- and topic-specific texts.
  • the selection means 3 comprise a first stage, namely an automatic language-specific text selector 4 which filters the texts having a predetermined specified language, e.g. English, from the input specified list of source texts.
  • the output 5 of stage 4 is a list of language-specific texts, and this list is then supplied to a second stage of the selection means 3, namely an automatic topic-specific text selector 6 which selects the texts having a specified topic from the input list of language-specific texts.
  • the output 7 of this second stage 6 is a topic-specific text collection which is then supplied to text pre-processor means 8.
  • the language and the topic can be predetermined by input means not shown.
  • the text pre-processor means 8 perform text replacements on the input topic-specific text collection, and the output 9 thereof is a topic-specific pre-processed text collection.
  • This pre-processed text collection is inputted to vocabulary entry generation means 10 which process each item of the input topic-specific pre-processed text collection. For a given item, first, the item is classified, then its phonetic transcription is created, and finally the corresponding vocabulary entries are composed and delivered as output 11. This vocabulary entry generation will be described hereafter more in detail with reference to Fig. 2.
  • a lexicon adaptor 12 simply adds then the input vocabulary entries to the vocabulary at 13.
  • the various stages of the device 1 of Fig. 1 will be described somewhat more in detail.
  • the automatic language-specific text selector 4 first identifies the languages of the texts and then selects the text parts written in the specified language.
  • the applied written language identification (LID) technology is based on character n-gram sequences. Such a method applying n-grams of characters is described e.g. in K. R. Beesley, see above.
  • a publicly accessible implementation based on it is e.g. van Noord's TextCat, see http://odur.let.rug.nl/~vannoord/TextCat, which is well known in the language technology community.
  • Thereafter, a classical topic selection method based on topic-specific language models may be applied in the topic-specific text selector 6.
  • topic-specific language models are trained for each hypothesized topic. Then the probability of the text is computed for each hypothesized topic by scoring the text with the corresponding topic-specific language model. The hypothesized topic having the highest probability is declared as the topic of the text.
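  This maximum-probability topic selection can be sketched as below with unigram language models, in line with the unigram-based method cited afterwards. The model representation (a word-to-probability mapping per topic) and the floor probability for unseen words are illustrative assumptions.

```python
import math

def topic_of(text, topic_unigram_models):
    """Score a text with each topic-specific unigram language model and
    return the topic with the highest (log-)probability. Unseen words
    receive a small floor probability so the logarithm stays finite."""
    words = text.lower().split()
    best_topic, best_score = None, float("-inf")
    for topic, model in topic_unigram_models.items():
        score = sum(math.log(model.get(w, 1e-6)) for w in words)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic
```

  Log-probabilities are summed instead of multiplying raw probabilities, a standard choice to avoid numeric underflow on long texts.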
  • a detailed description of similar maximum likelihood based topic detection method using topic-specific unigram language models can be found in R. Schwartz, T. Imai, F. Kubala, L. Nguyen and J. Makhoul, see above.
  • the statistic used here is the unigram count of keywords called keyword term frequency.
  • For instance, the known semantic lexicon of the English language called WordNet defines semantic relations among words. These relations are utilized as knowledge base information.
  • the method assigns a value to each semantically related group of keywords, which is computed from the keyword term frequency values and relevancy values between the terms.
  • the topic belonging to the group of keywords having the highest frequency-relevancy value is declared as the topic of the text.
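  The frequency-relevancy scoring could be sketched as follows. The exact combination of term frequency and relevancy values is not fixed by the text, so a weighted sum is assumed here for illustration, and the keyword groups are hypothetical.

```python
from collections import Counter

def topic_by_keyword_groups(text, keyword_groups):
    """keyword_groups maps a topic to {keyword: relevancy}. A weighted
    sum of term frequency and relevancy is an illustrative assumption;
    the text does not prescribe the combination formula."""
    tf = Counter(text.lower().split())  # keyword term frequency
    scores = {
        topic: sum(tf[kw] * rel for kw, rel in group.items())
        for topic, group in keyword_groups.items()
    }
    # Declare the topic of the group with the highest value.
    return max(scores, key=scores.get)
```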
  • the text pre-processor means 8 are useful to prepare the text for the phonetic transcription. Hence, annotations are eliminated and special words, like numbers, are formatted by means of grammar parsing.
  • Type "normal” vocabulary entry fitting to the grapheme structure of the current language, e.g. English (normal word or family name with non-foreign origin or composite word).
  • Type "abbreviation” vocabulary entry fitting to the grapheme structure of none of the supported languages, i.e. abbreviation, either pronounced normally (e.g. like
  • Type "specific styles” e-mail, http addresses or the like.
  • the processing blocks of the vocabulary entry generation means 10 are illustrated more in detail in Fig. 2.
  • the input to this vocabulary entry generation means 10 is the topic-specific pre-processed text collection as outputted at 9 from the text pre-processor 8, cf. Fig. 1.
  • an item iterator 14 goes through all the input words and feeds the other parts of the vocabulary entry generator 10 with them.
  • each term outputted from the item iterator 14 is supplied to a specific styles filter 15, which looks for predefined form specific styles and parses them into sequences of words and special signs. If a specific style is found, then the filter 15 goes through the parsed words and feeds the words into a classifier 16. In parallel, it sets a specific style trigger information at an output 17 to "true", and it puts the special signs belonging to the actual parsed word (which is subject to the classifier 16) to a special strings line 18.
  • if the current item is not a specific style, then it is a word, and it simply goes through the filter 15 and is supplied to the classifier 16, and the specific styles trigger information is set to "false" at output 17.
  • the classifier 16 assigns the input word to one of the (1) "normal", (2) "i-th supported foreign" and (3) "abbreviation" vocabulary entry types.
  • the output of the classifier 16 is the vocabulary entry type information at 19 as well as, together therewith, the word itself at 20. Thereafter the phoneme transcriber means 21 perform vocabulary entry type specific phonetic transcription on the input word.
  • the output is the word, at 22, together with its phonetic transcription(s), at 23.
  • This output 22, 23 is supplied to a vocabulary entry compositor 24 which puts together the respective vocabulary entries from the sequences of the input words and their corresponding input phonetic transcription(s).
  • This composition is controlled by the specific style trigger information on "line" 17. In the case of composing a vocabulary entry of a specific style, also the special strings information at 18 is needed.
  • the resulting vocabulary entry for each item is set to the output 25 of the vocabulary entry compositor 24.
  • the specific styles filter 15 applies pattern matching to find predefined form specific styles. If one of them is found then it is parsed into a sequence of words and the special signs.
  • the special signs are assigned to the words following them. For instance, the hyperlink example http://www.iaeng.org/WCE2008/ICAEM2008.html is parsed into the following sequence of words: www(http://) iaeng(.) org(.)
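  The parsing of a specific style into words and their preceding special signs might be sketched as follows. The treatment of "http://" as a single special sign follows the hyperlink example above; the rest of the tokenization is an illustrative assumption.

```python
import re

def parse_specific_style(item):
    """Split a specific-style item (e.g. a web address) into
    (word, preceding special signs) pairs, following the convention in
    the text that special signs are assigned to the word after them."""
    pairs = []
    pending = ""  # special signs seen since the last word
    rest = item
    while rest:
        # "https?://" is matched first so it stays one special sign.
        token = re.match(r"https?://|[A-Za-z0-9]+|.", rest).group()
        rest = rest[len(token):]
        if re.fullmatch(r"[A-Za-z0-9]+", token):
            pairs.append((token, pending))
            pending = ""
        else:
            pending += token
    return pairs
```

  Applied to the hyperlink example, this reproduces the sequence www(http://) iaeng(.) org(.).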
  • A flowchart of the operation of the specific styles filter 15 is shown in Fig. 3.
  • the specific style trigger information on line 17 is set to the value "true”, and the parsing of the specific style is initiated.
  • the next parsed word is taken as word, and the next parsed special sign is taken as special string, whereafter it is checked, according to field 37, whether this pair of parsed word and special sign is the last pair. If not, the process reverts to step 36. Otherwise, the decision stage 33 is the next stage, where it is checked whether there are further terms to be processed or not.
  • the n-gram-based statistical method is performed as follows.
  • a probabilistic framework based on n-grams of characters is applied for the identification of the grapheme structure of the words.
  • the probability of this word belonging to the s-th grapheme structure can be expressed by multiplying the conditional probabilities of its graphemes, hypothesizing that the given grapheme sequence belongs to the s-th grapheme structure. For a word consisting of the grapheme sequence g_1, ..., g_L, the selected structure is:
  • argmax_s prod_{l=1..L} p_s(g_l | g_1, ..., g_{l-1})
  • the conditional probabilities can be approximated by taking into account only a limited number of dependencies on the previous graphemes of the same word:
  • argmax_s prod_{l=1..L} p_{s,j}(g_l | g_{l-j+1}, ..., g_{l-1})
  • the conditional probability p_{s,j} is a grapheme j-gram.
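  A toy realization of this grapheme j-gram classification might look as follows. The model representation (a mapping from context strings of up to j-1 previous graphemes to grapheme probabilities) and the floor probability for unseen events are assumptions.

```python
import math

def classify_grapheme_structure(word, jgram_models, j=3):
    """Pick the grapheme structure s maximizing the product of grapheme
    j-gram probabilities p_s(g_l | g_{l-j+1} .. g_{l-1}); the product is
    computed as a sum of logs, with a small floor for unseen events."""
    best, best_lp = None, float("-inf")
    for s, model in jgram_models.items():
        lp = 0.0
        for l, g in enumerate(word):
            context = word[max(0, l - j + 1):l]  # up to j-1 graphemes
            lp += math.log(model.get(context, {}).get(g, 1e-6))
        if lp > best_lp:
            best, best_lp = s, lp
    return best
```

  In practice the j-gram probabilities would be estimated from training lexica per grapheme structure; the tiny model in the test below is hand-built.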
  • this formula shows that the method leads to a standard language model technique applied on graphemes as language models units. For more on language models techniques, it may be referred to F. Jelinek, Statistical methods for speech recognition, Language, Speech and Communication Series, MIT Press, 1997.
  • a neural network based method is carried out.
  • a neural network based language identification (LID) system is applied to distinguish among the (1) "normal", (2) "i-th supported foreign" and (3) "abbreviation" vocabulary entry types.
  • This system learns the grapheme structure of the possible language origins during a training process.
  • input character sequences corresponding to the allowed vocabulary entry types are fed to the system.
  • the input character sequences corresponding to the "abbreviation” vocabulary entry type are sequences consisting of random characters.
  • the independency of the characters of the "abbreviation" (type 3) vocabulary entries is utilized and reflected by the grapheme structure of these input training character sequences.
  • an "artifical random language” origin is associated to the "abbreviation" vocabulary entry type.
  • the system can be used to identify the language origin of the input word as an input character sequence, and the result corresponds to one of the allowed vocabulary entry types.
  • LID neural network based language identification
  • MLP multi-layer perceptron
  • the MLP network has a single hidden layer.
  • Each input unit of the network corresponds to a character of the input character sequence.
  • the output units of the network correspond to language origins, and they provide the probabilities of these language origins for the actual input character sequence. Since the neural network input units assume continuous values, the characters of the input character sequence are transformed to some numeric quantity.
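  A minimal forward pass of such a single-hidden-layer MLP might be sketched as below. The mapping of characters to numeric quantities (code points scaled to [0, 1]), the fixed input size with padding, and the softmax output normalization are illustrative choices, not prescribed by the text.

```python
import math

def mlp_language_probs(word, w_hidden, b_hidden, w_out, b_out, n_inputs=8):
    """Single-hidden-layer MLP forward pass for language identification.
    Each input unit corresponds to one character of the (padded or
    truncated) input character sequence; output units correspond to
    language origins and yield their probabilities."""
    # Transform characters to a continuous numeric quantity.
    x = [min(ord(c), 255) / 255.0 for c in word[:n_inputs]]
    x += [0.0] * (n_inputs - len(x))
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(wi * hi for wi, hi in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    # Softmax over language origins.
    z = max(logits)
    exps = [math.exp(v - z) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

  The weights would come from the training process described above; the values in the test are arbitrary placeholders.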
  • a vocabulary entry type specific phonetic transcription of the input word is performed in the phonetic transcriber means 21 by statistical phonetic transcription using statistical resources.
  • the process can be configured to create one or more pronunciation forms of the same input word.
  • Different statistical resources are trained for the (1) "normal" and for each (2) "i-th supported foreign" vocabulary entry type.
  • the input for such training is a prepared - not necessarily large - lexicon containing the pronunciation forms according to the corresponding language origin.
  • the phonetic transcriber means 21 use the vocabulary entry type information input at 19 to select the proper statistical resource to be used for the phonetic transcription of the words of the "normal" vocabulary entry type and for each "i-th supported foreign" vocabulary entry type.
  • the operation of the vocabulary entry compositor 24 depends on the specific style trigger information input at 17. If this trigger information is "true", the compositor 24 composes a vocabulary entry of a specific style, and it incrementally collects the incoming words, the corresponding phonetic transcriptions and the corresponding special strings. Every special string has a predefined phonetic form which is stored in storage means 26. Then, the vocabulary entry compositor 24 incrementally composes the phonetic transcription(s) of the original specific style from the input phonetic transcriptions and the predefined phonetic forms of the corresponding input special strings preceding the actual phonetic transcription. Additionally, the vocabulary entry compositor 24 also incrementally rebuilds the original specific style from the incoming words and the corresponding special strings preceding the actual word. If the specific styles trigger information at 17 changes to "false", then the compositor 24 puts together the vocabulary entry from the original specific style and its phonetic transcription(s).
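  The composition of a specific-style entry could be sketched as follows. The stored phonetic forms of the special strings are invented placeholders standing in for the contents of storage means 26.

```python
# Predefined phonetic forms for special strings (storage means 26); the
# SAMPA-like symbols below are invented placeholders, not real data.
SPECIAL_STRING_PHONETICS = {"http://": "eItS tI: tI: pI:", ".": "dQt"}

def compose_specific_style_entry(words, transcriptions, special_strings):
    """Rebuild the original specific-style item and its phonetic
    transcription from the parsed words, their transcriptions, and the
    special strings preceding each word."""
    item_parts, phon_parts = [], []
    for word, phon, special in zip(words, transcriptions, special_strings):
        item_parts.append(special + word)
        if special:
            phon_parts.append(SPECIAL_STRING_PHONETICS.get(special, special))
        phon_parts.append(phon)
    return "".join(item_parts), " ".join(phon_parts)
```

  For an ordinary word the compositor reduces to pairing the word with its transcription(s), as the text notes.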
  • the vocabulary entry compositor 24 simply puts together the vocabulary entry from the actual input word (at 22) and its corresponding input phonetic transcription(s) (from 23).
  • FIG. 4 illustrates the operation of the vocabulary entry compositor 24 somewhat more in detail.
  • an initiating step 41 follows where the specific style trigger information is set to the value "false".
  • in field 42 it is checked whether there has been a change from the value "false" to the value "true" with respect to the specific style trigger information on line 17. If not, then, according to field 43, the next word and the associated phonetic transcription are collected, and according to field 44, this next vocabulary entry is composed and outputted to the vocabulary at 25 in Fig. 2. Then, according to field 45, it is checked whether there are more words to be processed, and if not, then the end of the operation is reached at the "stop" field 46.
  • Another possibility for the realization of the phonetic transcriber means 21 is the combination with predefined background lexicons for the (1) "normal" and for each (2) "i-th supported foreign" vocabulary entry type.
  • the background lexicon belonging to the vocabulary entry type is checked for the transcription of the input word and for possible parts of it, hypothesising that it is a composite word.
  • the statistical phonetic transcription is applied only if the pronunciation of the word is not found in the background lexicon.
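  This lookup-with-fallback scheme might be sketched as below. All names are hypothetical, and the composite-word decomposition also mentioned in the text is omitted for brevity.

```python
def transcribe_with_background_lexicon(word, lexicon, statistical_g2p):
    """Look the word up in the background lexicon of its vocabulary
    entry type; fall back to statistical grapheme-to-phoneme conversion
    only if no pronunciation is found there."""
    if word in lexicon:
        return lexicon[word]
    return statistical_g2p(word)
```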
  • the present automatic topic-specific vocabulary generation enables an efficient set-up of emerging domain-specific applications.
  • Such applications are e.g. speech recognition systems with personalized vocabularies (such systems are used e.g. for dictating e-mails or for converting voice mail messages to e-mails).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

For the computer-aided generation of a topic-specific vocabulary from public text corpora, an automatic selection (4, 6) of a language- and topic-specific text, and an automatic generation (10) of vocabulary entries each comprising a word together with a phonetic transcription on the basis of the selected text is carried out, wherein the automatic generation of the vocabulary entries comprises a grapheme structure-based classification (16) of the vocabulary entries, to classify the vocabulary entries according to a number of predetermined types, and a vocabulary entry type-specific grapheme-to-phoneme conversion (21), to obtain phonetic transcriptions for words.

Description

Method and device for the generation of a topic-specific vocabulary and computer program product
FIELD OF INVENTION
The present invention relates to a method for the computer-aided generation of a topic-specific vocabulary from public text corpora. Such a vocabulary may be used in speech recognition, in speech synthesis systems, and in automatic processing of audio-visual lectures for information retrieval.
The invention further relates to a device for the generation of a topic-specific vocabulary from public text corpora.
The invention also relates to a computer program product that is arranged to perform the present method when being run on a computer.
BACKGROUND OF INVENTION
Speech recognizer and speech synthesis systems apply vocabularies containing words and their pronunciation forms. Both the creation of pronunciation forms and the resulting phoneme sequences are called phonetic transcription. A word together with its phonetic transcription forms a vocabulary entry. Creating phonetic transcriptions of all words of a text and assigning them to the corresponding words is called phonetic labeling. Once a large collection of text - also called text corpora - is phonetically labeled, the lexicon generation simply reduces to collecting the vocabulary entries from the corpora and adding them to the lexicon. In current speech processing systems the vocabulary generation is only partly automated. Most prior art vocabulary generating systems work in a semi-automatic way. The mainly used automatic phonetic transcription methods include look-up in a background lexicon, rule-based phonetic transcription and statistical phonetic transcription. After carrying out these steps, the automatically generated phoneme sequences are partly verified, and language specialists can then add further pronunciation variants to them. Usually, the grapheme structure of the actual language dominates the automatic phonetic transcription process, i.e. entries from other languages are handled only exceptionally. As a consequence, the prior art vocabulary generating systems can automatically produce vocabulary entries only for standard words fitting the grapheme structure of the actual language.
An increasing number of special applications require setting up topic-specific vocabularies. Such kinds of applications are e.g. domain-specific speech recognition systems or speech synthesis systems. The generation of such lexicons requires topic-specific phonetic-labeled text corpora. Manual labeling is time-consuming and expensive; hence there is a need for automatic vocabulary generation.
The prior art vocabulary generation from specified text corpora is not apt to automatically produce vocabulary entries for special items having a grapheme structure differing from that of the actually used language. Such special items are words of foreign origin (foreign language), abbreviations, e-mail addresses, http addresses or the like. Additionally - due to the lack of this capability - vocabulary generation in automatic form from any topic-specific text corpora which are publicly accessible is not possible.
SUMMARY OF INVENTION
It is an object of the invention to overcome the abovementioned problems, and to provide a method and a device for automatically generating a vocabulary entry for every item of specified text corpora, including also special ones. Furthermore, it is an object of the invention to automatically identify topic-specific electronic text corpora from any publicly accessible list of source texts. Moreover, it is an object of the invention to improve the automation of topic-specific vocabulary generation.
According to a first aspect of the present invention, there is provided a method for the computer-aided generation of a topic-specific vocabulary from public text corpora, said method comprising the steps of:
- an automatic selection of a language- and topic-specific text,
- an automatic generation of vocabulary entries each comprising a word together with a phonetic transcription on the basis of the selected text,
- wherein the step of automatic generation of the vocabulary entries comprises:
- - a grapheme structure-based classification of the vocabulary entries, to classify the vocabulary entries according to a number of predetermined types, and
- - a type-specific grapheme-to-phoneme conversion, to obtain phonetic transcriptions for words.
According to a second aspect, the invention provides a device for the generation of a topic-specific vocabulary from public text corpora and comprising
- selection means for automatically selecting language- and topic-specific texts, and
- vocabulary entry generation means for automatically generating the vocabulary entries on the basis of the selected texts, the vocabulary entry generation means including
- - grapheme structure-based classifier means for automatically classifying the vocabulary entries according to a number of predetermined types, and
- - phonetic transcriber means connected to the classifier means, for automatically carrying out a vocabulary entry type-specific grapheme-to-phoneme conversion, to obtain phonetic transcriptions of words.
Furthermore, the invention provides a computer program product that can be loaded directly into a memory of a computer and comprises sections of a software code, for performing the method according to the invention when the computer program is run on the computer.
According to the invention, it is possible to automatically generate a topic-specific vocabulary from any publicly accessible text corpora. The input can be a specified list of source texts, e.g. the result of a browser search on the WWW, i.e. a list of texts accessible via hyperlinks.
During the automatic vocabulary generation, vocabulary entries are generated for every item of the text collection and added to a vocabulary. A vocabulary entry is composed of an item (typically a word) and its corresponding phonetic transcription(s). The words are described on the grapheme level as character sequences, as is usual per se, and similarly, the phonetic transcriptions are described on the phonetic level as phoneme sequences, as is usual, too.
An essential feature of the invention is that not only the relevant language-specific text collection but also, in combination therewith, a topic-specific text collection is automatically selected from the specified list of source texts. The present technique is generic, i.e. it is applicable to the generation of any kind of vocabulary entries. Of special importance is that it also enables the generation of special vocabulary entries having a grapheme structure which is different from the language morphology. Such entries are e.g. words of foreign origin, abbreviations, e-mail addresses or http addresses. It has proven particularly advantageous to classify the vocabulary entries according to four types, namely vocabulary entries fitting to the language grapheme structure; vocabulary entries fitting to the grapheme structure of at least one other supported language; vocabulary entries which do not fit to any language grapheme structure, e.g. abbreviations; and specific styles, like e-mail addresses or web addresses. Accordingly, there are four types of vocabulary entries, which in short can also be named normal type vocabulary entries, supported foreign language type vocabulary entries, abbreviation type vocabulary entries and specific styles vocabulary entries.
Here, it is further preferred that the specific styles are selected in a filtering operation applying pattern matching to find predefined specific styles. Furthermore, a preferred embodiment is characterised in that after the specific styles filtering, the other types of vocabulary entries are distinguished by grapheme structure-based classification. Such a grapheme structure-based classification is known per se.
In particular, it is suitable to base the classification of the vocabulary entries on a probabilistic framework using an n-gram-based statistical method. Here, for further information, it may be referred to the document F. Jelinek, Statistical Methods for Speech Recognition, Language, Speech and Communication Series, MIT Press, 1997.
According to a further preferred embodiment of the present invention, the classification of the vocabulary entries applies a language identification system based on the application of a neural network. Here, it may be referred to WO 2004/038606 A, where the use of a neural network has already been described in principle in the present field of technology. Preferably, a statistical type-specific grapheme-to-phoneme conversion is carried out to obtain phonetic transcriptions.
With respect to grapheme-to-phoneme conversion, it may further be referred to M. Bisani and H. Ney, "Investigations on joint-multigram models for grapheme-to-phoneme conversion", Proceedings of ICSLP 2002, Denver.
Moreover, with respect to the topic selection, it is referred to R. Schwartz, T. Imai, F. Kubala, L. Nguyen and J. Makhoul, "A maximum likelihood model for topic classification of broadcast news", Proceedings of the European Conference on Speech Communication and Technology, Rhodes, Greece, 1997; as well as to M. Hwang, H. Kong, S. Baek and P. Kim, "TSM: Topic Selection Method of Web Documents", in Proceedings of the First Asia International Conference on Modelling & Simulation (AMS'07), 2007.
As far as the identification of languages, as for instance German, English, French and so on, is concerned, reference is made to K. R. Beesley, "Language identifier: A computer program for automatic natural-language identification of on-line text", Proceedings of the 29th Annual Conference of the American Translators Association, pages 47-54, 1988. A further preferred embodiment of the present invention is characterised in that after the selection of language- and topic-specific text, the text is pre-processed to prepare it for the later phonetic transcription, to eliminate parts, like annotations, not to be transcribed, and to format special words, like numbers, by grammar parsing. Such text pre-processing is useful to prepare the text for the phonetic transcription.
For an efficient composition of the vocabulary entries, that is, a combination of the respective words and phonetic transcriptions, it is advantageous to make such composition dependent on whether there is a specific style type word or not; in the case of a specific style, respective defined stored phonetic forms are associated with the corresponding sign parts of the specific style strings. Such predefined phonetic forms may be stored in a dedicated storage or memory section.
The vocabulary entries generated in this way are added to an already prepared vocabulary by means of a lexicon adaptor.
Correspondingly, with respect to the device according to the invention, there is provided a preferred embodiment according to which the classifier means comprise a specific styles filter, for filtering specific styles from the texts by applying pattern matching to find predefined specific styles, like e-mail addresses, as well as a classifier for distinguishing other types of vocabulary entries by grapheme structure-based classification. Here, it may further be provided that the phonetic transcriber means are connected to the classifier for generating the phonetic transcription of words according to the types of vocabulary entries as determined by the classifier, whereas predefined stored phonetic forms are associated with specific style type words. To provide for a particularly efficient phonetic transcription, it is advantageous if text pre-processor means are connected to the selection means, for pre-processing the texts to eliminate parts not to be transcribed, like annotations, and to format special words, like numbers, by grammar parsing.
Then, the present device may comprise a lexicon adaptor which is connected to the vocabulary entry generation means for adding the vocabulary entries to a vocabulary. Thus, the basic idea of the invention is the automatic generation of a topic-specific vocabulary, i.e. the automatic generation of any kind of vocabulary entry for every item of a text collection, including special items. Here, the emphasis is on special vocabulary entries, like words of foreign origin, abbreviations, e-mail addresses or http addresses. A further aspect is the combined automatic identification of topic-specific electronic form text corpora from any publicly accessible location.
According to a particularly preferred embodiment of the invention, first the list of relevant texts is filtered out from a specified list of source texts by applying language and topic identification techniques known per se. Then, the resulting text collection is pre-processed, mainly to eliminate items being irrelevant for further processing. For each item of the text collection, a grapheme structure-based automatic classification follows, to decide the type of the vocabulary entry corresponding to the actual item. Next, a vocabulary entry type-specific statistical grapheme-to-phoneme conversion is carried out to get the pronunciations of each item of the text collection. Then, vocabulary entry composition follows from the items and their corresponding pronunciations. Finally, all the vocabulary entries generated from the text collection are added to the topic-specific vocabulary.
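The processing chain described above can be sketched end to end as follows. This is an illustrative Python sketch only; all stage callables, their signatures and the toy stand-ins are assumptions for illustration, not part of the disclosure:

```python
def generate_topic_specific_vocabulary(source_texts, language, topic,
                                       identify_language, identify_topic,
                                       preprocess, generate_entries):
    """End-to-end sketch of the preferred embodiment: language filtering,
    topic filtering, pre-processing, entry generation, lexicon adaption."""
    # Automatic language-specific text selection
    language_texts = [t for t in source_texts if identify_language(t) == language]
    # Automatic topic-specific text selection
    collection = [t for t in language_texts if identify_topic(t) == topic]
    # Text pre-processing, vocabulary entry generation and lexicon adaption
    vocabulary = set()
    for text in collection:
        vocabulary.update(generate_entries(preprocess(text)))
    return vocabulary

# Toy stand-ins for the real stages:
vocab = generate_topic_specific_vocabulary(
    ["patient therapy", "der patient"], "en", "medicine",
    identify_language=lambda t: "de" if t.startswith("der") else "en",
    identify_topic=lambda t: "medicine",
    preprocess=lambda t: t,
    generate_entries=lambda t: {(w, w) for w in t.split()})
print(sorted(vocab))  # [('patient', 'patient'), ('therapy', 'therapy')]
```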
Here, the automatic grapheme structure-based classification of the vocabulary entry types and the application of automatic language and topic detection for vocabulary generation purposes are of particular advantage, or importance, respectively. The classification may be based either on grapheme-level (textual) character n-grams or on neural networks, but any other form of realization of grapheme structure-based classification is also possible.
The language identification is applied for each text of the specified list of source texts, and the texts matching the required language are selected. Thus, the output of this step is a list of language-specific texts.
The topic identification is applied for each text of the list of language-specific texts, and those texts which match the required topic are selected. The result thereof is the topic-specific text collection.
For performing language identification, as well as for performing topic identification, standard methods may be used, as referred to above.
The vocabulary entry generation means perform classification, phonetic transcription and vocabulary entry composition for each item of the input topic-specific pre-processed text collection. During the vocabulary entry generation, each item is classified into one of the vocabulary entry types. This is advantageous for the phonetic transcription because its processing is vocabulary entry type-specific:
Vocabulary entry fitting to the language grapheme structure (normal word, or family name of non-foreign origin, or composite word)
Vocabulary entry fitting to the grapheme structure of a foreign language of a specified set of (supported) languages (e.g. foreign word or family name of foreign origin). Vocabulary entry fitting to the grapheme structure of none of the supported languages, i.e. abbreviation, either pronounced normally (e.g. like "Philips") or pronounced by spelling (e.g. like "IBM")
Specific styles: e-mail, http addresses. First, the generation means select the specific styles and parse them into sequences of words and special signs. Consequently, the further classification and phonetic transcription of the words is reduced to the handling of the other vocabulary entry types, where a grapheme structure-based classification may be carried out among the other types. One representative example for the realization of the classifier is to apply a probabilistic framework using a character n-gram-based statistical method, and another one is to apply a language identification system based on a neural network. The resulting vocabulary entry type information is used as input for the next step, namely for the vocabulary entry type-specific phonetic transcription. This phonetic transcription preferably is performed by a joint n-gram-based statistical method. The last step of the vocabulary entry generator is the composition of the vocabulary entry from the item and its phonetic transcription(s).
Finally, the lexicon adaptor adds the vocabulary entries obtained in this way to the vocabulary.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter by reference to the drawings, of course without the intention to limit the invention to these preferred embodiments.
In the drawings:
Fig. 1 is a schematic diagram showing the automatic topic-specific vocabulary generation in its main processing blocks;
Fig. 2 is a block diagram illustrating the structure of a preferred vocabulary entry generator according to the invention;
Fig. 3 shows a flow chart illustrating the specific styles filtering according to the invention; and
Fig. 4 is a flow chart illustrating the vocabulary entry composition according to the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION
According to the preferred embodiment of the invention, the following main steps are carried out:
Automatic language-specific text selection
Automatic topic-specific text selection
Text pre-processing
Vocabulary entry generation
Lexicon adaption
The main processing blocks therefor are illustrated in Fig. 1, which shows a device 1 for the automatic generation of a topic-specific (and language-specific) vocabulary. At an input 2, this device 1 receives a specified list of source texts which is inputted to automatic selection means 3, for automatically selecting language- and topic-specific texts. The selection means 3 comprise a first stage, namely an automatic language-specific text selector 4, which filters the texts having a predetermined specified language, e.g. English, from the input specified list of source texts. The output 5 of stage 4 is a list of language-specific texts, and this list is then supplied to a second stage of the selection means 3, namely an automatic topic-specific text selector 6, which selects the texts having a specified topic from the input list of language-specific texts. The output 7 of this second stage 6 is a topic-specific text collection which is then supplied to text pre-processor means 8. The language and the topic can be predetermined by input means not shown. The text pre-processor means 8 perform text replacements on the input topic-specific text collection, and the output 9 thereof is a topic-specific pre-processed text collection. This pre-processed text collection is inputted to vocabulary entry generation means 10 which process each item of the input topic-specific pre-processed text collection. For a given item, first, the item is classified, then its phonetic transcription is created, and finally the corresponding vocabulary entries are composed and delivered as output 11. This vocabulary entry generation will be described hereafter in more detail with reference to Fig. 2.
A lexicon adaptor 12 then simply adds the input vocabulary entries to the vocabulary at 13. In the following, the various stages of the device 1 of Fig. 1 will be described in somewhat more detail.
The automatic language-specific text selector 4 first identifies the languages of the texts and then selects the text parts written in the specified language. The applied written language identification (LID) technology is based on character n-gram sequences. Such a method applying n-grams of characters is described e.g. in K. R. Beesley, see above. A publicly accessible implementation based on it is e.g. van Noord's TextCat, see http://odur.let.rug.nl/~vannoord/TextCat, which is well known in the language technology community.
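A minimal character n-gram language identifier in the spirit of this approach might look as follows. This Python sketch uses smoothed log-probabilities of character bigrams; real systems such as TextCat instead compare rank-ordered n-gram profiles, and the tiny training corpora are invented for illustration:

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    text = " " + text.lower() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramLID:
    """Toy character-bigram language identifier (illustrative only)."""
    def __init__(self, n=2):
        self.n = n
        self.models = {}          # language -> Counter of character n-grams
    def train(self, language, corpus):
        self.models[language] = Counter(char_ngrams(corpus, self.n))
    def score(self, language, text):
        counts = self.models[language]
        total = sum(counts.values())
        vocab = len(counts) + 1
        # Add-one smoothed log-probability of the text's character n-grams
        return sum(math.log((counts[g] + 1) / (total + vocab))
                   for g in char_ngrams(text, self.n))
    def identify(self, text):
        return max(self.models, key=lambda lang: self.score(lang, text))

lid = CharNgramLID()
lid.train("English", "the quick brown fox jumps over the lazy dog near the old barn")
lid.train("German", "der schnelle braune fuchs springt ueber den faulen alten hund")
print(lid.identify("the dog and the fox"))   # English
```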
Thereafter, a classical topic selection method based on topic-specific language models may be applied in the topic-specific text selector 6. According to this method, topic-specific language models are trained for each hypothesised topic. Then the probability of the text is computed for each hypothesised topic by scoring the text with the corresponding topic-specific language model. The hypothesised topic having the highest probability is declared as the topic of the text. A detailed description of a similar maximum likelihood based topic detection method using topic-specific unigram language models can be found in R. Schwartz, T. Imai, F. Kubala, L. Nguyen and J. Makhoul, see above.
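The maximum-likelihood topic selection with topic-specific unigram language models can be sketched as follows (Python; the toy topic models and the add-one smoothing are illustrative assumptions):

```python
import math
from collections import Counter

def select_topic(text, topic_models):
    """Declare the hypothesised topic whose unigram language model gives
    the text the highest add-one smoothed log-likelihood."""
    words = text.lower().split()
    def log_likelihood(counts):
        total = sum(counts.values())
        vocab = len(counts) + 1
        return sum(math.log((counts.get(w, 0) + 1) / (total + vocab))
                   for w in words)
    return max(topic_models, key=lambda t: log_likelihood(topic_models[t]))

topic_models = {
    "medicine": Counter("patient diagnosis therapy dose clinical symptom patient".split()),
    "finance": Counter("market stock bond yield portfolio trading market".split()),
}
print(select_topic("the patient showed a mild symptom after therapy", topic_models))
# medicine
```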
However, beside this standard solution, also other methods can be used to identify the topic of the text.
An emerging idea in topic selection is the combination of statistical and knowledge base information sources. Such a method is presented by M. Hwang, H. Kong, S. Baek and P. Kim, see above. The statistic used here is the unigram count of keywords, called keyword term frequency. For instance, the known semantic lexicon of the English language called WordNet defines semantic relations among words. These relations are utilized as knowledge base information. The method assigns a value to each semantic-related group of keywords, which is computed from the keyword term frequency values and relevancy values between the terms. The topic belonging to the group of keywords having the highest frequency-relevancy value is declared as the topic of the text. The text pre-processor means 8 are useful to prepare the text for the phonetic transcription. Hence, annotations are eliminated and special words, like numbers, are formatted by means of grammar parsing.
The operation of the vocabulary entry generation means is partly vocabulary entry type-specific. Therefore, the vocabulary entry types are introduced: (1) Type "normal": vocabulary entry fitting to the grapheme structure of the current language, e.g. English (normal word, or family name of non-foreign origin, or composite word). (2) Type "i-th supported foreign": vocabulary entry fitting to the grapheme structure of the i-th foreign language (e.g. German, French, ...) of a specified set (i = 1...N) of supported languages (e.g. foreign word or family name of foreign origin).
(3) Type "abbreviation": vocabulary entry fitting to the grapheme structure of none of the supported languages, i.e. abbreviation, either pronounced normally (e.g. like
Philips) or pronunced by spelling (e.g. like IBM).
(4) Type "specific styles": e-mail, http addresses or the like.
It should be noted here that some abbreviations can accidentally fit to the grapheme structure of the current language, so they belong to type "normal". The processing blocks of the vocabulary entry generation means 10 are illustrated in more detail in Fig. 2. As already mentioned, the input to these vocabulary entry generation means 10 is the topic-specific pre-processed text collection as outputted at 9 from the text pre-processor 8, cf. Fig. 1. Now, in a first stage, an item iterator 14 goes through all the input words and feeds the other parts of the vocabulary entry generator 10 with them. In particular, each term outputted from the item iterator 14 is supplied to a specific styles filter 15 which looks for specific styles of predefined form and parses them into sequences of words and special signs. If a specific style is found, then the filter 15 goes through the parsed words and feeds the words into a classifier 16. In parallel, it sets a specific style trigger information at an output 17 to "true", and it puts the special signs belonging to the actual parsed word (which is subject to the classifier 16) onto a special strings line 18.
If the current item is not a specific style, then it is a word, and it simply goes through the filter 15 and is supplied to the classifier 16, and the specific styles trigger information is set to "false" at output 17.
The classifier 16 assigns the input word to one of the (1) "normal", (2) "i-th supported foreign" and (3) "abbreviation" vocabulary entry types. The output of the classifier 16 is the vocabulary entry type information at 19 as well as, together therewith, the word itself, at 20. Thereafter, the phoneme transcriber means 21 perform a vocabulary entry type-specific phonetic transcription on the input word. The output is the word, at 22, together with its phonetic transcription(s), at 23. These outputs 22, 23 are supplied to a vocabulary entry compositor 24 which puts together the respective vocabulary entries from the sequences of the input words and their corresponding input phonetic transcription(s). This composition is controlled by the specific style trigger information on "line" 17. In the case of composing a vocabulary entry of specific style, also the special strings information at 18 is needed. The resulting vocabulary entry for each item is set to the output 25 of the vocabulary entry compositor 24.
More in detail, the specific styles filter 15 applies pattern matching to find specific styles of predefined form. If one of them is found, then it is parsed into a sequence of words and the special signs. The special signs are assigned to the words following them. For instance, the hyperlink example http://www.iaeng.org/WCE2008/ICAEM2008.html is parsed into the following sequence of words:
www(http://) iaeng(.) org(.) WCE2008(/) ICAEM2008(/) html(.)
The strings in the brackets are the special signs belonging to the given words.
A flowchart of the operation of the specific styles filter 15 is shown in Fig. 3.
In particular, in Fig. 3 it is illustrated that after a start step shown at 30, it is checked at step 31 whether the next term to be handled is a specific style. If not, then, according to step 32, the specific style trigger information is set to the value "false", and the next term is accepted as word (word = next term). Thereafter, it is checked at step 33 whether the current term is the last term to be processed, and if this is so, then the end of the filtering operation is reached at block 34 ("stop"). In the case that further terms are to be processed, the process reverts to step 31, and the next term is checked.
In the case that, at the check of step 31, the result is that the next term is a specific style term, the specific style trigger information on line 17 is set to the value "true", and the parsing of the specific style is initiated.
Thereafter, according to block 36, the next parsed word is taken as word, and the next parsed special sign is taken as special string, whereafter it is checked, according to field 37, whether this pair of parsed word and special sign is the last pair. If not, the process reverts to step 36. Otherwise, the decision stage 33 is the next stage, where it is checked whether there are further terms to be processed or not.
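The filtering loop of Fig. 3 can be written as a generator that emits, for each word, the trigger value and special string used downstream. In this Python sketch, the pattern-matching predicate and the parser are stand-ins supplied by the caller:

```python
def specific_styles_filter(terms, is_specific_style, parse):
    """Yield (word, trigger, special_string) triples following Fig. 3:
    plain words pass through with trigger False; specific styles are
    parsed and their words emitted with trigger True."""
    for term in terms:
        if is_specific_style(term):
            for word, sign in parse(term):
                yield word, True, sign
        else:
            yield term, False, ""

# Toy stand-ins for the pattern matcher and the parser:
out = list(specific_styles_filter(
    ["hello", "a@b.org"],
    is_specific_style=lambda t: "@" in t,
    parse=lambda t: [("a", ""), ("b", "@"), ("org", ".")]))
print(out)
# [('hello', False, ''), ('a', True, ''), ('b', True, '@'), ('org', True, '.')]
```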
Next, the operation of the classifier 16 and of the phonetic transcriber 21 is described in more detail. With respect to the grapheme structure-based classifier 16, it is to be noted that the basic idea of distinguishing among the (1) "normal", (2) "i-th supported foreign" and (3) "abbreviation" vocabulary entry types is that the language origin (for the "normal" and "i-th supported foreign" cases) or the independence of the characters of the word (for the "abbreviation" case) is reflected by the grapheme structure of that word. Therefore, the classifier works by applying grapheme structure identification. Then, the vocabulary entry type corresponding to the identified grapheme structure is declared as the right one.
According to one preferred embodiment of the classifier 16, an n-gram-based statistical method is performed. Here, a probabilistic framework based on n-grams of characters is applied for the identification of the grapheme structure of the words.
For instance, it is assumed that a word (w) has n graphemes, i.e. letters; g_i denotes the i-th grapheme of the word, i = 1...n. The probability of this word belonging to the s-th grapheme structure can be expressed by multiplying the conditional probabilities of its graphemes, hypothesising that the given grapheme sequence belongs to the s-th grapheme structure. This may be expressed as follows:
p(w|s) = p_s(g_1, ..., g_n) = \prod_{i=1}^{n} p_s(g_i | g_1, ..., g_{i-1})
Then, the identification of the grapheme structure of the word is performed by looking for the grapheme structure which has the highest probability that the given word belongs to it. So the probability to be maximized can be expressed in the following way:
\argmax_s p_s(g_1, ..., g_n) = \argmax_s \prod_{i=1}^{n} p_s(g_i | g_1, ..., g_{i-1})
These conditional probabilities can be approximated by taking into account only a limited number of dependencies on the previous graphemes of the same word:
\argmax_s \prod_{i=1}^{n} p_s(g_i | g_1, ..., g_{i-1}) = \argmax_s \prod_{i=1}^{n} p_{s,j}(g_i | g_{i-j+1}, ..., g_{i-1})
The conditional probability p_{s,j} is a grapheme j-gram. Hence, this formula shows that the method leads to a standard language model technique applied on graphemes as language model units. For more on language model techniques, it may be referred to F. Jelinek, Statistical Methods for Speech Recognition, Language, Speech and Communication Series, MIT Press, 1997.
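Under these assumptions, the j-gram classifier can be sketched as follows. This Python sketch uses trigram models (j = 3); the add-one smoothing over a nominal 27-symbol grapheme set and the tiny training word lists are invented for illustration:

```python
import math
from collections import Counter

class GraphemeStructureClassifier:
    """Toy j-gram classifier: declares the grapheme structure s maximising
    prod_i p_{s,j}(g_i | g_{i-j+1} ... g_{i-1}), here with j = 3."""
    def __init__(self, j=3):
        self.j = j
        self.ngrams = {}    # structure -> Counter of (history, grapheme)
        self.hists = {}     # structure -> Counter of histories
    def _events(self, word):
        padded = "#" * (self.j - 1) + word.lower()
        return [(padded[i:i + self.j - 1], padded[i + self.j - 1])
                for i in range(len(word))]
    def train(self, structure, words):
        ng, hi = Counter(), Counter()
        for w in words:
            for h, g in self._events(w):
                ng[(h, g)] += 1
                hi[h] += 1
        self.ngrams[structure], self.hists[structure] = ng, hi
    def classify(self, word):
        def log_prob(s):
            # Add-one smoothing over a nominal 27-symbol grapheme set
            return sum(math.log((self.ngrams[s][(h, g)] + 1) /
                                (self.hists[s][h] + 27))
                       for h, g in self._events(word))
        return max(self.ngrams, key=log_prob)

clf = GraphemeStructureClassifier()
clf.train("normal", ["house", "mouse", "street", "light", "night", "through"])
clf.train("foreign", ["schmidt", "schnell", "nacht", "licht", "sprache", "schule"])
print(clf.classify("through"), clf.classify("schnell"))  # normal foreign
```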
According to another advantageous embodiment of the classifier 16, a neural network-based method is carried out. A neural network-based language identification (LID) system is applied to distinguish among the (1) "normal", (2) "i-th supported foreign" and (3) "abbreviation" vocabulary entry types. This system learns the grapheme structure of the possible language origins during a training process. During the training process, input character sequences corresponding to the allowed vocabulary entry types are fed to the system. The input character sequences corresponding to the "abbreviation" vocabulary entry type are sequences consisting of random characters. Here, the independence of the characters of the "abbreviation" (type 3) vocabulary entries is utilized and reflected by the grapheme structure of these input training character sequences. In this way, an "artificial random language" origin is associated with the "abbreviation" vocabulary entry type. After training, the system can be used to identify the language origin of the input word as input character sequence, and the result corresponds to one of the allowed vocabulary entry types. In the neural network-based language identification (LID) system, a multi-layer perceptron (MLP) neural network is used. The MLP network has a single hidden layer. Each input unit of the network corresponds to a character of the input character sequence. The output units of the network correspond to language origins, and they provide the probabilities of these language origins for the actual input character sequence. Since the neural network input units assume continuous values, the characters of the input character sequence are transformed to some numeric quantity.
A detailed description of such an MLP neural network based LID system can be found in WO 2004/038606 A.
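The forward pass of such a single-hidden-layer MLP can be sketched in pure Python. This is illustrative only: the numeric character encoding, the layer sizes and the random (untrained) weights are assumptions; a real system would use trained weights as described in WO 2004/038606 A:

```python
import math
import random

def mlp_language_probs(chars, W1, b1, W2, b2):
    """Single-hidden-layer perceptron forward pass: maps a fixed-length
    character sequence to probabilities of the language origins."""
    x = [ord(c) / 128.0 for c in chars]               # numeric input encoding
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]                   # hidden layer
    z = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W2, b2)]                   # output layer
    m = max(z)
    e = [math.exp(zi - m) for zi in z]                # softmax normalisation
    return [ei / sum(e) for ei in e]

rng = random.Random(0)
seq_len, hidden, n_origins = 8, 16, 3   # e.g. normal / foreign / "random" origin
W1 = [[rng.gauss(0, 1) for _ in range(seq_len)] for _ in range(hidden)]
b1 = [0.0] * hidden
W2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_origins)]
b2 = [0.0] * n_origins
probs = mlp_language_probs("philips ", W1, b1, W2, b2)
print(round(sum(probs), 6))  # 1.0 (the outputs form a probability distribution)
```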
After classification, a vocabulary entry type-specific phonetic transcription of the input word is performed in the phonetic transcriber means 21 by statistical phonetic transcription using statistical resources. The process can be configured to create one or more pronunciation forms of the same input word. Different statistical resources are trained for the (1) "normal" and for each (2) "i-th supported foreign" vocabulary entry type. The input for such training is a prepared - not necessarily large - lexicon containing the pronunciation forms according to the corresponding language origin. Then, the phonetic transcriber means 21 use the vocabulary entry type information input at 19 to select the proper statistical resource to be used for the phonetic transcription of the words of the "normal" vocabulary entry type and of each "i-th supported foreign" vocabulary entry type.
For a word of the "abbreviation" vocabulary entry type, two phonetic transcriptions are created: one assuming that it is pronounced normally, and the other assuming that it is pronounced by spelling. For the first case, the phonetic transcriber 21 uses the statistical resource prepared for "normal" words because a normal pronunciation is hypothesised. In the second case, due to the hypothesised spelling pronunciation, a unique grapheme-to-phoneme assignment exists; hence simply this correspondence is utilized. The applied statistical phonetic transcription method is based on joint n-grams. For a detailed description, it may be referred to M. Bisani and H. Ney, "Investigations on joint-multigram models for grapheme-to-phoneme conversion", in Proceedings of ICSLP 2002, Denver. The operation of the vocabulary entry compositor 24 depends on the specific style trigger information input at 17. In the case this trigger information is "true", the compositor 24 composes a vocabulary entry of a specific style, and it incrementally collects the incoming words, the corresponding phonetic transcriptions and the corresponding special strings. Every special string has a predefined phonetic form which is stored in storage means 26. Then, the vocabulary entry compositor 24 incrementally composes the phonetic transcription(s) of the original specific style from the input phonetic transcriptions and the predefined phonetic forms of the corresponding input special strings preceding the actual phonetic transcription. Additionally, the vocabulary entry compositor 24 also incrementally restores the original specific style from the incoming words and the corresponding special strings preceding the actual word. If the specific styles trigger information at 17 changes to "false", then the compositor 24 puts together the vocabulary entry from the original specific style and its phonetic transcription(s).
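The type-specific dispatch, including the two transcriptions created for abbreviations, can be sketched as follows. In this Python sketch, the SAMPA-like letter-name table and the placeholder G2P callable are invented for illustration and do not reflect any actual trained resource:

```python
# Hypothetical letter-name pronunciations (SAMPA-like symbols, partial table)
LETTER_NAMES = {"i": "aI", "b": "bi:", "m": "Em"}

def spell_pronunciation(word):
    """Spelling pronunciation: the unique grapheme-to-phoneme assignment
    obtained by concatenating the letter names."""
    return " ".join(LETTER_NAMES.get(c.lower(), c.lower()) for c in word)

def transcribe(word, entry_type, g2p_resources):
    """Type-specific transcription: abbreviations get both a normal and a
    spelled form; other types use their own statistical resource."""
    if entry_type == "abbreviation":
        return [g2p_resources["normal"](word), spell_pronunciation(word)]
    return [g2p_resources[entry_type](word)]

# Placeholder standing in for a trained joint n-gram model:
g2p = {"normal": lambda w: w.lower()}
print(transcribe("IBM", "abbreviation", g2p))  # ['ibm', 'aI bi: Em']
```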
While the input specific styles trigger information is "false", the vocabulary entry compositor 24 simply puts together the vocabulary entry from the actual input word (at 22) and its corresponding input phonetic transcription(s) (from 23).
The flowchart of Fig. 4 illustrates the operation of the vocabulary entry compositor 24 somewhat more in detail.
According to Fig. 4, after a start step at field 40, an initiating step 41 follows, where the specific style trigger information is set to the value "false". According to field 42, it is checked whether there has been a change from the value "false" to the value "true" with respect to the specific style trigger information on line 17. If not, then, according to field 43, the next word and the associated phonetic transcription are collected, and according to field 44, this next vocabulary entry is composed and outputted to the vocabulary at 25 in Fig. 2. Then, according to field 45, it is checked whether there are more words to be processed, and if not, the end of the operation is reached at the "stop" field 46. If, however, more words are to be processed, the process reverts to field 42, where the next check with respect to a change in the specific style trigger information is carried out. If such a check of a possible change from "false" to "true" yields the result that the specific style trigger information is now "true", then the process is continued at field 47, according to which the next word, the phonetic transcription and the special string are stored. Then, according to field 48, a phonetic transcription of the special string is looked up in the storage means 26, and the found phonetic transcription is stored. Thereafter, it is again checked, according to field 49, whether there has been a change in the specific style trigger information, namely now from "true" to "false". If not, the process reverts to field 47. Otherwise, the process is continued at field 50, according to which the specific style is restored and the phonetic transcription is composed. Thereafter, the process continues at field 44.
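The compositor loop of Fig. 4 can be sketched as follows. In this Python sketch, the stream format and the table of predefined phonetic forms for special signs are assumptions standing in for the signals at 17, 18, 22 and 23 and for the storage means 26:

```python
# Assumed predefined phonetic forms of special signs (cf. storage means 26)
SIGN_PHONEMES = {"http://": "eItS ti: ti: pi:", ".": "dQt", "/": "sl{S", "": ""}

def compose_entries(stream):
    """Compose vocabulary entries from (word, transcription, trigger, sign)
    tuples: while the trigger is true, words and signs are collected, and
    the original specific style plus its transcription are restored when
    the trigger changes back to false (or the stream ends)."""
    entries, buf_word, buf_phon, in_specific = [], "", [], False
    for word, phon, trigger, sign in stream:
        if trigger:
            in_specific = True
            buf_word += sign + word
            buf_phon.append((SIGN_PHONEMES.get(sign, "") + " " + phon).strip())
        else:
            if in_specific:                      # trigger changed true -> false
                entries.append((buf_word, " ".join(buf_phon)))
                buf_word, buf_phon, in_specific = "", [], False
            entries.append((word, phon))
    if in_specific:                              # flush a trailing specific style
        entries.append((buf_word, " ".join(buf_phon)))
    return entries

entries = compose_entries([
    ("hello", "hE l@U", False, ""),
    ("www", "dVb", True, "http://"),
    ("org", "O:g", True, "."),
])
print(entries)
# [('hello', 'hE l@U'), ('http://www.org', 'eItS ti: ti: pi: dVb dQt O:g')]
```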
Another possibility for the realization of the phonetic transcriber means 21 is the combination with predefined background lexicons for the (1) "normal" and for each (2) "i-th supported foreign" vocabulary entry type. In this case, the background lexicon belonging to the vocabulary entry type is first checked for the transcription of the input word and for possible parts of it, hypothesising that it is a composite word. The statistical phonetic transcription is applied only if the pronunciation of the word is not found in the background lexicon.
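This lexicon-first strategy can be sketched as follows. The lexicon contents, the two-part compound split, and the fallback function are placeholders chosen for illustration, not the patent's actual data or algorithm.

```python
# Hypothetical sketch of the lexicon-backed transcriber described above:
# consult the background lexicon of the entry's type first, then try a
# compound-word split, and fall back to statistical grapheme-to-phoneme
# conversion only when no lexicon pronunciation is found.

BACKGROUND_LEXICONS = {
    "normal": {"speech": "s p iy ch", "lab": "l ae b"},
    "foreign_1": {"rendezvous": "r aa n d ey v uw"},
}

def statistical_g2p(word):
    # Stand-in for a trained statistical grapheme-to-phoneme model;
    # here it just spells out the graphemes.
    return " ".join(word)

def transcribe(word, entry_type):
    lexicon = BACKGROUND_LEXICONS.get(entry_type, {})
    # 1. Whole word found in the type-specific background lexicon.
    if word in lexicon:
        return lexicon[word]
    # 2. Hypothesise a composite word: try every two-part split whose
    #    parts are both in the lexicon.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return lexicon[head] + " " + lexicon[tail]
    # 3. Only now fall back to statistical transcription.
    return statistical_g2p(word)
```

Under this sketch, "speechlab" is transcribed from its two lexicon parts, while an unknown word such as "xyz" falls through to the statistical model.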
The present automatic topic-specific vocabulary generation enables an efficient set-up of emerging domain-specific applications. Such applications are e.g.: Speech recognition systems with personalized vocabularies (such systems are used e.g. for dictating e-mails or for converting voice mail messages to e-mails).
Speech recognition systems for different domains (here, also domain specific language models are used which are built up also from the same domain specific corpora).
Domain-specific speech synthesis systems.
Automatic processing of audio-visual lectures for information retrieval (this includes the transcription of the audio-visual information, automatic indexing of the lectures, which enables navigation, the creation of such lectures, and their automatic translation into other languages). For reference, A. Park, T. J. Hazen and J. R. Glass, "Automatic processing of audio lectures for information retrieval: vocabulary selection and language modeling", in Proc. ICASSP 2005, may be mentioned.
Automatic summarization technology.

Claims

CLAIMS:
1. A method for the computer-aided generation of a topic-specific vocabulary from public text corpora, said method comprising the steps of:
- an automatic selection of a language- and topic-specific text,
- an automatic generation of vocabulary entries each comprising a word together with a phonetic transcription on the basis of the selected text,
- wherein the step of automatic generation of the vocabulary entries comprises:
- - a grapheme structure-based classification of the vocabulary entries, to classify the vocabulary entries according to a number of predetermined types, and
- - a vocabulary entry type-specific grapheme-to-phoneme conversion, to obtain phonetic transcriptions for words.
2. The method according to claim 1, wherein the vocabulary entries are classified according to four types, namely vocabulary entries fitting to a language grapheme structure; vocabulary entries fitting to the grapheme structure of at least one other supported language; vocabulary entries which do not fit to any language grapheme structure, as e.g. abbreviations; and specific styles, like e-mail addresses or web addresses.
3. The method according to claim 2, wherein the specific styles are selected in a filtering operation applying pattern matching to find predefined specific styles.
4. The method according to claim 2, wherein after the specific styles filtering, the other types of vocabulary entries are distinguished by grapheme structure-based classification.
5. The method according to any one of claims 1 to 4, wherein the classification of the vocabulary entries is based on the application of an n-gram-based statistical method.
6. The method according to claim 1, wherein a statistical grapheme-to-phoneme conversion is carried out to obtain the phonetic transcription.
7. The method according to claim 1, wherein, after the selection of language- and topic-specific text, the text is pre-processed to prepare it for the later phonetic transcription, to eliminate parts, like annotations, not to be transcribed, and to format special words, like numbers, by grammar parsing.
8. The method according to claim 2, wherein the vocabulary entries comprising the respective words together with the phonetic transcriptions thereof are composed dependent on whether there is a specific style type word or not, and in the case of a specific style, respective predefined stored phonetic forms are associated with corresponding sign parts of specific style strings.
9. The method according to claim 1, wherein the vocabulary entries are added to an already prepared vocabulary by means of a lexicon adaptor.
10. A device for the generation of a topic-specific vocabulary from public text corpora and comprising:
- selection means for automatically selecting language- and topic-specific texts, and
- vocabulary entry generation means for automatically generating the vocabulary entries on the basis of the selected texts, the vocabulary entry generation means including
- - grapheme structure-based classifier means for automatically classifying the vocabulary entries according to a number of predetermined types, and
- - phonetic transcriber means connected to the classifier means, for automatically carrying out a vocabulary entry type-specific grapheme-to-phoneme conversion, to obtain phonetic transcriptions of words.
11. The device according to claim 10, wherein the classifier means comprise a specific styles filter, for filtering specific styles from the texts by applying pattern matching to find predefined specific styles, like e-mail addresses, as well as a classifier for distinguishing other types of vocabulary entries by grapheme structure-based classification.
12. The device according to claim 11, wherein the phonetic transcriber means are connected to the classifier for generating the phonetic transcription of words according to the types of vocabulary entries as determined by the classifier, whereas predefined stored phonetic forms are associated with specific style type words.
13. The device according to claim 10, wherein text pre-processor means are connected to the selection means, for pre-processing the texts to eliminate parts not to be transcribed, like annotations, and to format special words, like numbers, by grammar parsing.
14. The device according to claim 10, wherein a lexicon adaptor is connected to the vocabulary entry generation means, for adding the vocabulary entries to a vocabulary.
15. A computer program product that can be loaded directly into a memory of a computer and comprises sections of a software code, for performing the method according to claim 1 when the computer program product is on the computer.
PCT/IB2009/052386 2008-06-11 2009-06-05 Method and device for the generation of a topic-specific vocabulary and computer program product WO2009150591A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP08158034 2008-06-11
EP08158034.2 2008-06-11

Publications (1)

Publication Number Publication Date
WO2009150591A1 true WO2009150591A1 (en) 2009-12-17

Family

ID=40852270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2009/052386 WO2009150591A1 (en) 2008-06-11 2009-06-05 Method and device for the generation of a topic-specific vocabulary and computer program product

Country Status (1)

Country Link
WO (1) WO2009150591A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006084144A2 (en) * 2005-02-03 2006-08-10 Voice Signal Technologies, Inc. Methods and apparatus for automatically extending the voice-recognizer vocabulary of mobile communications devices
US20080126092A1 (en) * 2005-02-28 2008-05-29 Pioneer Corporation Dictionary Data Generation Apparatus And Electronic Apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAISON B ET AL: "Pronunciation modeling for names of foreign origin", AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, 2003 (ASRU '03), 2003 IEEE WORKSHOP, ST. THOMAS, VI, USA, PISCATAWAY, NJ, USA, IEEE, 30 November 2003 (2003-11-30), pages 429-434, XP010713325, ISBN: 978-0-7803-7980-0 *
STANLEY F CHEN ET AL: "Using Place Name Data to Train Language Identification Models", EUROSPEECH 2003, 1 September 2003 (2003-09-01), Geneva, pages 1349 - 1352, XP007006681 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8498857B2 (en) 2009-05-19 2013-07-30 Tata Consultancy Services Limited System and method for rapid prototyping of existing speech recognition solutions in different languages
US9733901B2 (en) 2011-07-26 2017-08-15 International Business Machines Corporation Domain specific language design
US10120654B2 (en) 2011-07-26 2018-11-06 International Business Machines Corporation Domain specific language design
DE102013219828A1 (en) * 2013-09-30 2015-04-02 Continental Automotive Gmbh Method for phonetizing text-containing data records with multiple data record parts and voice-controlled user interface
DE102013219828B4 (en) * 2013-09-30 2019-05-02 Continental Automotive Gmbh Method for phonetizing text-containing data records with multiple data record parts and voice-controlled user interface
CN113282746A (en) * 2020-08-08 2021-08-20 西北工业大学 Novel network media platform variant comment confrontation text generation method
CN113282746B (en) * 2020-08-08 2023-05-23 西北工业大学 Method for generating variant comment countermeasure text of network media platform
CN112530404A (en) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 Voice synthesis method, voice synthesis device and intelligent equipment

Similar Documents

Publication Publication Date Title
US6963831B1 (en) Including statistical NLU models within a statistical parser
US7860719B2 (en) Disfluency detection for a speech-to-speech translation system using phrase-level machine translation with weighted finite state transducers
JP2848458B2 (en) Language translation system
JP5330450B2 (en) Topic-specific models for text formatting and speech recognition
US8719021B2 (en) Speech recognition dictionary compilation assisting system, speech recognition dictionary compilation assisting method and speech recognition dictionary compilation assisting program
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
US20040039570A1 (en) Method and system for multilingual voice recognition
JP2000353161A (en) Method and device for controlling style in generation of natural language
US20080215519A1 (en) Method and data processing system for the controlled query of structured saved information
JP2007087397A (en) Morphological analysis program, correction program, morphological analyzer, correcting device, morphological analysis method, and correcting method
WO2009150591A1 (en) Method and device for the generation of a topic-specific vocabulary and computer program product
CN110942767B (en) Recognition labeling and optimization method and device for ASR language model
CN112489655A (en) Method, system and storage medium for correcting error of speech recognition text in specific field
JP2012037790A (en) Voice interaction device
JP2000200273A (en) Speaking intention recognizing device
Ostrogonac et al. Morphology-based vs unsupervised word clustering for training language models for Serbian
JP2004354787A (en) Interactive method using statistic information and system for the same, interactive program and recording medium recorded with the program
US20060136195A1 (en) Text grouping for disambiguation in a speech application
Iosif et al. Speech understanding for spoken dialogue systems: From corpus harvesting to grammar rule induction
Raymond et al. Belief confirmation in spoken dialog systems using confidence measures
Zhou et al. Statistical natural language generation for speech-to-speech machine translation
Yeh et al. Speech recognition with word fragment detection using prosody features for spontaneous speech
L’haire FipsOrtho: A spell checker for learners of French
KR100487716B1 (en) Method for machine translation using word-level statistical information and apparatus thereof
Wutiwiwatchai et al. Hybrid statistical and structural semantic modeling for Thai multi-stage spoken language understanding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09762118

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09762118

Country of ref document: EP

Kind code of ref document: A1