EP2815395A1

EP2815395A1 - Method and device for phonetising data sets containing text

Info

Publication number: EP2815395A1
Application number: EP13705421.9A
Authority: EP
Inventors: Jens Walther
Original assignee: Continental Automotive GmbH
Current assignee: Continental Automotive GmbH
Priority date: 2012-02-16
Filing date: 2013-02-11
Publication date: 2014-12-24
Also published as: WO2013120794A1; CN104115222B; US9436675B2; DE102012202391A1; US20150302001A1; CN104115222A

Abstract

The invention relates to a method for phonetising data sets (2) containing text and to a device for carrying out said method. The data sets (2), which are in the form of graphemes, are converted into phonemes and saved as phonetised data sets (8) in said device. The graphemes are prepared for phonetising in a pre-processing step, in particular by being modified in a language-defined and/or user-defined manner. According to the invention, the pre-processing of the graphemes and the conversion of the graphemes into phonemes are carried out in parallel on different calculation units (5, 6) or in different parts of calculation units (5, 6).

Description

Method and device for phonetizing text-containing data records

The invention relates to a method and a device for phonetizing text-containing data records, in particular records with different contents such as music titles, music artists, music albums, phonebook entries, contact names or the like, which are used in voice-operated user interfaces to control certain operations in which the user passes voice commands containing that content to the user interface. Without limiting the invention to this preferred application, a preferred field of application of the invention is motor vehicle control devices, in particular multimedia control units in motor vehicles, which serve for information, entertainment and/or communication. Such control units may in particular contain music playback and telephone functions.

In the method proposed according to the invention, the data sets present as graphemes, i.e. as a sequence of individual grapheme symbols, in particular as a letter sequence or standardized letter sequence, are converted into phonemes, i.e. a sequence of individual phoneme symbols, and stored as phonetized data sets, for example in a phonetized data list. According to the usual definition, a phoneme is a sound representation that forms the smallest meaning-distinguishing unit in a language, i.e. has a distinctive function. In the present text, the term "phonemes" is understood in particular as a sequence of a plurality of individual phoneme symbols. The same applies to the term "graphemes", which is understood in the present text in particular as a sequence of individual grapheme symbols. Similar to a phoneme, a grapheme (grapheme symbol) represents the smallest meaning-distinguishing unit in the graphical representation of a text, and is often defined by the letters of a script.

In the proposed method, the graphemes are prepared for the actual phonetization in a preprocessing step, in particular by being modified in a language-defined and/or user-defined manner before the conversion into phonemes is performed. The phonetized data sets, for example in the form of a phonetized data list, can then be used in a manner known per se, for example in the speech recognition of a voice-controlled user interface.

The background to the preprocessing is that the graphemes (and also the phonemes) depend on the language used. Frequently, however, data records contain entries in different languages, which must be identified and adapted for phonetization. Accordingly, preprocessing can comprise recognizing foreign-language texts, replacing abbreviations, omitting prefixes (such as "Mr.", "Mrs.", "Dr.", the English article "the" or the like), expanding acronyms and/or offering pronunciation variants that are selectable by the user.

Such preprocessing can at least partially overcome the mostly language-related limitations of grapheme-to-phoneme conversion, which supports only a given set of digits and strings to be spelled, by replacing characters of the graphemes that are not supported by the language-dependent acoustic models used in the phonetization.

In existing systems, however, preprocessing has the drawback that it precedes the actual grapheme-to-phoneme conversion, so that the time needed for preprocessing adds to the total latency of the grapheme-to-phoneme conversion.

Since the preprocessing can also be very compute-intensive, depending on the effort involved, either long latencies or a restriction of the capability of the preprocessing must be expected, for example by ignoring unsupported characters of the grapheme representation during phonetization. Due to the scarcity of resources available for preprocessing, the known implementations of preprocessing are adaptable to specific application requirements only to a limited extent and are in particular hard-coded, especially with regard to the number of variants and the available substitutions or modifications.

It is therefore an object of the present invention to propose a phonetization approach in which the time required for preprocessing and the subsequent conversion of the graphemes into phonemes is reduced.

This object is achieved by a method having the features of claim 1, a device having the features of claim 7 and a computer program product having the features of claim 8.

In the proposed method, it is provided in particular that the preprocessing of the graphemes and the conversion of the graphemes into phonemes are performed in parallel on different arithmetic units or parts of arithmetic units, in particular different processors or processor parts. The different arithmetic units can be implemented in different computing devices or in a computing device as a dual or multi-arithmetic unit, in particular dual or multi-processor.

The parallel execution of the preprocessing of the graphemes and the conversion of the graphemes into phonemes can, in particular, take place in such a way that graphemes provided for phonetization are preprocessed in a first step in a first arithmetic unit, transmitted to a second arithmetic unit and phonetized in the second arithmetic unit, i.e. converted into phonemes. During the phonetization of the graphemes in the second arithmetic unit, graphemes subsequently provided for phonetization can then already be preprocessed in the first arithmetic unit.

As already mentioned, the data sets are usually in the form of graphemes, that is to say sequences of individual grapheme symbols (in particular letters), so that in each case a subsequence can be processed according to the capacity of the respective arithmetic unit, for example in the manner of a FIFO (first-in-first-out) buffer memory. Optionally, according to the invention, a buffer may be provided between the first and the second arithmetic unit in order to synchronize the computing processes of both arithmetic units and to compensate for fluctuations in the computing performance of the two arithmetic units by briefly buffering the preprocessed graphemes.
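The arrangement described above, two stages coupled by a FIFO buffer, can be sketched roughly as follows. This is only an illustration under stated assumptions: Python threads stand in for the two arithmetic units (they do not give true CPU parallelism in CPython), and `preprocess`, `to_phonemes`, `run_pipeline` and `buffer_size` are hypothetical names that do not appear in the source.

```python
import queue
import threading

SENTINEL = None  # marks the end of the packet stream


def preprocess(packet):
    # Hypothetical stand-in for the preprocessing in the first arithmetic unit.
    return [entry.strip() for entry in packet]


def to_phonemes(packet):
    # Hypothetical stand-in for grapheme-to-phoneme conversion in the second unit.
    return [entry.lower() for entry in packet]


def run_pipeline(packets, buffer_size=4):
    # Bounded FIFO buffer between the stages; it absorbs short-term speed
    # differences between the two units.
    buffer = queue.Queue(maxsize=buffer_size)
    results = []

    def stage_one():
        for packet in packets:
            buffer.put(preprocess(packet))  # blocks while the buffer is full
        buffer.put(SENTINEL)

    def stage_two():
        while True:
            packet = buffer.get()  # blocks while the buffer is empty
            if packet is SENTINEL:
                break
            results.append(to_phonemes(packet))

    workers = [threading.Thread(target=stage_one),
               threading.Thread(target=stage_two)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

While the second stage converts one packet, the first stage can already preprocess the next one; the bounded queue keeps either stage from running arbitrarily far ahead of the other.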

A particularly advantageous use of the proposed method arises in dynamic speech recognition, in which the phonemes are generated from constantly changing text-containing data sets only during operation, in contrast to use with a static database, in which the phonetization of the graphemes is done once and the voice control then accesses the fixed phonemes.

According to a particularly preferred embodiment of the proposed method, the data sets present as graphemes, i.e. as a sequence of individual grapheme symbols, can be decomposed into grapheme subpackets, which may also be referred to as packets of grapheme subsequences, wherein one grapheme subpacket at a time is preprocessed in a first arithmetic unit and then phonetized, i.e. converted into phonemes, in a second arithmetic unit, and wherein both arithmetic units are set up to process different grapheme subpackets in parallel, in particular at the same time. The packet-wise distribution of the data to be processed allows particularly effective use of the available processor resources, so that a time-optimized implementation of the phonetization with preprocessing and conversion is possible.

In this case, according to the invention, it is particularly advantageous if the size of a grapheme subpacket is specified, for example matched to the available computing power of the arithmetic unit (i.e. dependent on the platform). For example, a grapheme subpacket with a maximum length of 50 entries (or grapheme symbols) can be specified. It has been found that grapheme subpackets whose size is matched to the platform (arithmetic unit) can be preprocessed and converted particularly effectively, since in this case an optimal ratio of the amount of data to be processed to the messaging overhead results. The messaging overhead arises because the data packets (grapheme subpackets) have to be exchanged between the various arithmetic units or parts of arithmetic units and the exchange must be coordinated. Since both arithmetic units must buffer the data, the amount of data processed in each grapheme subpacket must furthermore be limited in order to enable effective and fast processing in each arithmetic unit.
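As a minimal sketch of fixed-size packetization, the following splits a list of entries into subpackets of at most 50 entries, the example figure given above. The function name `split_into_subpackets` and the parameter `max_entries` are assumptions for illustration; how the limit is actually tuned to a platform is not specified in the text.

```python
def split_into_subpackets(entries, max_entries=50):
    # Cut the entry list into consecutive subpackets of at most max_entries.
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    return [entries[i:i + max_entries]
            for i in range(0, len(entries), max_entries)]
```

A larger `max_entries` means fewer exchanges between the units (less messaging overhead) but more data to buffer per packet, which is the trade-off the paragraph above describes.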

In this context, it can also be particularly advantageous according to the invention to determine the size of a packet by applying defined rules, in particular before or at the beginning of the preprocessing, in order to take the content context of individual grapheme symbols into account in the preprocessing and conversion. These rules may include, for example, the recognition of certain grapheme symbols that represent blanks or delimiters, and/or a content assessment, optionally combined with a maximum and possibly also a minimum predetermined length of the subsequences, i.e. a length limit or a length interval for the subsequences. The maximum predetermined length allows in particular the computing power of the arithmetic unit to be taken into account. The minimum predetermined length ensures context-sensitive preprocessing and/or conversion, in which related graphemes can also be assessed and taken into account in terms of content.
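One possible reading of such a rule, sketched under assumptions: a packet is cut at the last delimiter that still respects a minimum length, and is cut hard at the maximum length if no delimiter is found. The name `split_by_rules`, the default lengths and the delimiter set are hypothetical; the text does not fix any of them.

```python
def split_by_rules(symbols, min_len=10, max_len=50, delimiters=" ,;"):
    # Split a grapheme symbol sequence (here a str) into subsequences.
    if not 1 <= min_len <= max_len:
        raise ValueError("require 1 <= min_len <= max_len")
    packets = []
    start = 0
    while start < len(symbols):
        end = min(start + max_len, len(symbols))
        if end < len(symbols):
            # Prefer the last delimiter that keeps the packet >= min_len,
            # so that related symbols stay together in one packet.
            for cut in range(end, start + min_len - 1, -1):
                if symbols[cut - 1] in delimiters:
                    end = cut
                    break
        packets.append(symbols[start:end])
        start = end
    return packets
```

In this sketch only the final packet can be shorter than `min_len`, since it simply takes whatever symbols remain.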

In a particular embodiment of the proposed method, the preprocessing according to the invention may comprise a grammar-based parser, which in particular comprises rules for text modification and/or pronunciation variants, it being possible for different languages to be taken into account. Particularly preferably, this grammar-based parser is parameterizable, for example by specifying files containing rules. As a consequence, the rules for the pattern matching and/or the linking of rules can easily be edited, extended and exchanged. For this purpose, existing software modules can be used, for example the GNU parser generators Flex and Bison, whose use for dynamic databases becomes possible only through the parallel processing of preprocessing and conversion of the individual grapheme subsequences proposed according to the invention.
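The idea of rules held in an exchangeable file can be illustrated with a much simpler stand-in than a Flex/Bison grammar: a list of regular-expression substitutions loaded from text. The `pattern => replacement` format, the names `load_rules`, `apply_rules` and `ENGLISH_RULES`, and the "vol." rule are all assumptions made for this sketch, not part of the source.

```python
import re


def load_rules(rule_text):
    # Parse lines of the hypothetical form "pattern => replacement".
    rules = []
    for line in rule_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, replacement = line.split("=>", 1)
        rules.append((re.compile(pattern.strip(), re.IGNORECASE),
                      replacement.strip()))
    return rules


def apply_rules(text, rules):
    # Apply every rule in order, then collapse leftover whitespace.
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())


# Example rule set for English; swapping this text swaps the behavior.
ENGLISH_RULES = r"""
# omit titles and the leading English article
^(mr|mrs|dr)\.\s+ =>
^the\s+ =>
# hypothetical abbreviation rule
\bvol\. => volume
"""
```

Because the rules live in data rather than code, a different language or an updated rule set could be supplied without touching `apply_rules`.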

Another aspect of the preprocessing proposed according to the invention is that it comprises a conversion of characters that are not supported by the acoustic model of the grapheme-to-phoneme conversion (for example because language support is missing), such as characters of another language, into grapheme symbols supported by the acoustic model, in particular Latin base characters. In this way, flexible language support for databases with very different contents can be achieved, which particularly preferably can also be parameterized and/or adapted according to the aforementioned aspect, so that the preprocessing can be adapted automatically, for example as part of a firmware update, if the data contents provided, and thus the text-containing data records intended for phonetization, change.
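A rough sketch of reducing accented characters to Latin base characters follows, using Unicode decomposition from the Python standard library. This is an assumed technique, not the conversion described in the source, which is language-dependent; the table `EXTRA_REPLACEMENTS` and the function name `to_latin_base` are hypothetical.

```python
import unicodedata

# Hypothetical replacements for characters without a Unicode decomposition.
EXTRA_REPLACEMENTS = {"ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE"}


def to_latin_base(text):
    out = []
    for char in text:
        char = EXTRA_REPLACEMENTS.get(char, char)
        # NFKD splits e.g. "ö" into "o" plus a combining diaeresis.
        decomposed = unicodedata.normalize("NFKD", char)
        # Drop the combining marks and keep the base characters.
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)
```

Characters from non-Latin scripts pass through unchanged here; handling them would need transliteration tables, and a language-dependent variant might, for example, map "ö" to "oe" for German instead of "o".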

The invention also relates to a device for phonetizing text-containing data records, for example in or for use in a voice-controlled user interface, such as a multimedia control unit of a motor vehicle, in particular with a music control, a car telephone and/or a hands-free device, wherein the multimedia control unit has a data store, such as a database, containing the text-containing data records, which can also be displayed in a graphical user interface if necessary. The device is equipped with a data interface for inputting or reading in the text-containing data records, for example in the form of list entries, and has a computing device which is set up to convert the data records present as graphemes into phonemes and to carry out the preceding preprocessing. According to the invention, the computing device has at least one first arithmetic unit and one second arithmetic unit, wherein the first and the second arithmetic unit are set up to carry out the above-described method or parts thereof.

Accordingly, the invention also relates to a computer program product with program code means which are suitable for setting up a computing device of a device for phonetizing text-containing data records with two arithmetic units, in particular as described above, for carrying out the method described above or parts thereof.

Further advantages, features and possible applications of the present invention will become apparent from the following description of an embodiment and the drawing. All of the described and/or illustrated features, alone or in any combination, form the subject matter of the present invention, independently of their combination in the claims or their back references.

The single figure, FIG. 1, schematically shows an embodiment of the proposed device together with the sequence of the method for phonetizing text-containing data records 2.

FIG. 1 shows a particularly preferred embodiment of a device 1 for phonetizing text-containing data records 2 contained in a data memory or a database. The device 1 for phonetization can be integrated into a voice-controlled user interface, such as a multimedia control unit of a motor vehicle, and has a data interface 3 for inputting or reading in the text-containing data records 2. Furthermore, a computing device 4 is provided in the device 1, which is set up to convert the text-containing data records 2 present as graphemes into phonemes and to preprocess the graphemes before the conversion. For the sake of clarity, the computing device 4 is shown in FIG. 1 next to the device 1, although this computing device 4 is part of the device 1 and of the user interface containing it.

This computing device 4 has a first arithmetic unit 5 and a second arithmetic unit 6, which are suitable according to the invention for the parallel, independent processing of data.

It should be noted that the computing device 4 shown in FIG. 1 represents only the functions of the solution according to the invention that are described in more detail below, and does not reproduce all of the processes running in the computing device 4 or in the arithmetic units 5, 6 of the computing device 4.

The first arithmetic unit 5 is set up for preprocessing the graphemes and the second arithmetic unit 6 for converting the graphemes into phonemes, wherein the second arithmetic unit 6 may preferably also contain a speech recognizer, which is used by the voice-controlled user interface, and stored phonetized data records, for example in the form of a phonetized data list.

The phonetization method proposed according to the invention is carried out as described below:

After the text-containing data records 2 have been read into the device 1 for phonetization via the data interface 3, the graphemes, i.e. the sequence of individual grapheme symbols, are first broken down into grapheme subsequences of a predetermined length of, for example, 50 grapheme symbols or units. This is illustrated by the arrow 7, which is shown in FIG. 1 outside the computing device 4, although the process of decomposition 7 takes place in an arithmetic unit of the computing device 4, optionally an additional one, and can be understood, for example, as a first process step of the preprocessing.

Subsequently, each grapheme subsequence is fed to the first arithmetic unit 5, which takes over the preprocessing of the graphemes. The graphemes of each grapheme subsequence can be modified in a language-defined and/or user-defined manner, for example by replacing abbreviations, recognizing foreign-language texts, omitting prefixes, expanding acronyms and/or offering pronunciation variants which can be selected by the user. The preprocessing implemented in the first arithmetic unit 5 preferably comprises a grammar-based parser, which comprises rules for text modification and/or pronunciation variants, it being possible for different languages to be taken into account. In addition, in the preprocessing implemented in the first arithmetic unit 5, characters not supported by the acoustic model of the grapheme-to-phoneme conversion are converted into grapheme symbols supported by the acoustic model.

After the preprocessing in the first arithmetic unit 5, the (preprocessed) grapheme subsequence is supplied to the second arithmetic unit 6, in which the actual grapheme-to-phoneme conversion takes place. This procedure is generally known and therefore need not be described in detail here.

As a result of the grapheme-to-phoneme conversion in the second arithmetic unit 6, a phonetized data list 8 is generated and stored in the computing device 4 or in a memory device of the phonetization device 1, so that a voice-controlled user interface can access this phonetized data list 8. The phonetized data list 8 thus represents the phonetized data sets.

Due to the parallel processing of the preprocessing and the conversion in different independent arithmetic units, only the waiting time for a first packet is added to the total latency for the phonetization of the text-containing data sets, even if complex preprocessing is carried out which, in addition to a substitution of acronyms and the like, may include a language-dependent conversion of characters of other languages that are not supported by the acoustic model of the phonetization into Latin base characters. Due to the parallel processing, it is also possible to carry out comprehensive preprocessing that is configurable, so that the preprocessing rules can easily be introduced into the system. Moreover, these rules are well documented and easy to change. Furthermore, according to the invention, the processor resources are utilized efficiently during the phonetization, so that, despite elaborate preprocessing, the waiting times until the phonetized data list used for voice control becomes available increase only imperceptibly.
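The latency claim can be checked with a small model. The sketch below assumes an unbounded buffer between the units and no messaging overhead, which is a simplification; the function names and the example timings are hypothetical.

```python
def sequential_latency(pre_times, g2p_times):
    # One unit does everything: all preprocessing plus all conversion.
    return sum(pre_times) + sum(g2p_times)


def pipelined_latency(pre_times, g2p_times):
    # Two units: unit 1 preprocesses packets back to back, unit 2 converts
    # each packet as soon as it is both preprocessed and unit 2 is free.
    pre_done = 0.0
    g2p_done = 0.0
    for pre, g2p in zip(pre_times, g2p_times):
        pre_done += pre
        g2p_done = max(g2p_done, pre_done) + g2p
    return g2p_done
```

With four packets that each take 2 time units to preprocess and 5 to convert, `sequential_latency` gives 28 and `pipelined_latency` gives 22, i.e. the conversion time of 20 plus the preprocessing of the first packet only, as long as conversion is the slower stage.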

A concrete embodiment will be described below, in which the inventive method is used in a vehicle entertainment device. The vehicle entertainment device has an interface for Bluetooth devices, USB data carriers, iPod devices or the like.

The pieces of music contained therein are read by the central unit of the vehicle entertainment device, referred to as the head unit, wherein meta-attributes of the pieces of music are stored in a database. This database can be searched via a graphical interface, and single or multiple titles can be played. In addition to the haptic selection of pieces of music, voice-based operation of the vehicle entertainment device is also provided, in which pieces of music (albums, artists) are to be selected by name.

Often, the meta-attributes of the pieces of music are not suitable for voice control, so that it is impossible, or possible only in an unnatural way, for the user of the system to make a selection by voice. One known approach to solving the problem is to compare the characteristics of the audio signal with a database hosted on the system, which provides the meta-attributes to the speech recognizer so that the user can easily select the titles. The disadvantage of such a solution is that the database does not know the latest titles and therefore has to be constantly updated. In addition, licensing costs are incurred and, on embedded systems, a significant amount of memory is required, which would increase the fixed costs of such a product.

Instead, a preprocessing according to the invention is applied, which comprises in particular the following steps:

1. After the vehicle entertainment device has detected an inserted USB device or the like, a device-internal database is filled by indexing the pieces of music and their meta-attributes.

2. The meta-attributes are read, sorted by category, from the database of the vehicle entertainment device into the voice-controlled user interface of the vehicle entertainment device.

3. The computing device 4 of the voice-controlled user interface, which for example is suitably set up as the phonetization device 1, reads the data packet by packet or breaks the data into individual packets of a predefined size, i.e. into grapheme subsequences or grapheme subpackets. A grapheme subpacket is passed to the preprocessor (the first arithmetic unit 5).

4. The first arithmetic unit 5 (preprocessor) essentially consists of a parser module, which searches the data for specific patterns. These patterns are partly language-dependent and therefore interchangeable depending on the selected language. As input, the first arithmetic unit 5, i.e. the preprocessor, receives the grapheme from the database (primary grapheme) as well as the current category descriptor.

5. The first arithmetic unit 5 then creates, for example, an alternative text in the preprocessing and corrects the primary grapheme. For example, the common suffix "feat. <Artist>" is expanded to "featuring <Artist>" for the primary grapheme. In the alternative, the primary grapheme "feat. [...] Often the attribute for the title contains the index on the album as well as the artist and album name; the primary grapheme is then cleaned of the unnecessary parts, and no alternative is created for this case.

6. The preprocessed grapheme subpacket is forwarded to the speech recognizer, which preferably resides on another, second arithmetic unit 6.

7. In parallel with the computationally expensive phonetization (g2p) in the second arithmetic unit 6, the second subpacket, or more generally a further grapheme subpacket, is processed at the preprocessor, i.e. in the first arithmetic unit 5.

8. In parallel with the preprocessor (first arithmetic unit 5) and the speech recognizer with the grapheme-to-phoneme conversion (second arithmetic unit 6), the voice-controlled user interface 1 queries the next packets from the database, so that a chain of packet processing is present in the voice-controlled user interface 1. Of the parallel steps of database query, preprocessing and phonetization, the latter is the slowest. The parallelism of preprocessor and speech recognizer does not create any additional latency beyond the preprocessing of the first subpacket.
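The correction of the primary grapheme in step 5 might look roughly like the following sketch, which takes the category descriptor from step 4 as a second argument. The category value "title", the assumed layout of a leading track index, and the names `correct_primary`, `FEAT` and `TRACK_INDEX` are illustrative assumptions; the sketch covers only the primary grapheme and does not generate an alternative text.

```python
import re

# "feat." as an abbreviation, case-insensitive.
FEAT = re.compile(r"\bfeat\.\s*", re.IGNORECASE)
# Assumed layout of a leading album index, e.g. "07 - " or "3. ".
TRACK_INDEX = re.compile(r"^\s*\d{1,3}\s*[-._]?\s+")


def correct_primary(primary, category):
    text = primary
    if category == "title":
        # Titles often carry the index on the album; strip it.
        text = TRACK_INDEX.sub("", text)
    # Expand the common abbreviation so that it can be spoken naturally.
    text = FEAT.sub("featuring ", text)
    return " ".join(text.split())
```

A pattern this simple would also strip the number from a title that genuinely begins with one followed by a space, so a real rule set would need more context than this sketch uses.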

As a result, in this embodiment, operation is improved for the user without a significant deterioration in latency or an increase in memory consumption.

Claims

Method for phonetizing text-containing data records (2), in which the data records (2) present as graphemes are converted into phonemes and stored as phonetized data records (8), the graphemes being prepared for the phonetization in a preprocessing, in particular by modifying the graphemes in a language-defined and/or user-defined manner, characterized in that the preprocessing of the graphemes and the conversion of the graphemes into phonemes are performed in parallel on different arithmetic units (5, 6) or different parts of arithmetic units (5, 6).

A method according to claim 1, characterized in that the datasets (2) present as graphemes are decomposed into grapheme subpackets, wherein in each case a grapheme subpacket is preprocessed in one arithmetic unit (5) and subsequently phonetized in another arithmetic unit (6) and both arithmetic units (5, 6) are arranged to process different grapheme subsequences in parallel.

Method according to claim 2, characterized in that the size of a subpacket is predetermined, in particular by a ratio, dependent on the arithmetic unit (5), of the amount of data of the grapheme subpackets to the messaging overhead which arises in the communication between the two arithmetic units (5, 6).

Method according to claim 2 or 3, characterized in that the size of a packet is determined by applying defined rules.

Method according to one of the preceding claims, characterized in that the preprocessing includes a grammar-based parser.

Method according to one of the preceding claims, characterized in that the preprocessing involves conversion of characters not supported by the acoustic model of the grapheme-to-phoneme conversion into grapheme symbols supported by the acoustic model.

Device for phonetizing text-containing data records (2) with a data interface (3) for inputting the text-containing data records (2) and with a computing device (4), which is set up for converting the data records present as graphemes into phonemes and for preprocessing the graphemes, characterized in that the computing device (4) has at least one first arithmetic unit (5) and one second arithmetic unit (6), wherein the first and the second arithmetic unit (5, 6) are set up to carry out the method according to one of claims 1 to 6.

Computer program product with program code means which are suitable for setting up a computing device (4) of a device (1) for phonetizing text-containing data records (2) with at least two arithmetic units (5, 6) for carrying out the method according to one of claims 1 to 6.