WO2004072846A2 - Automatic processing of templates with speech recognition


Info

Publication number
WO2004072846A2
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
text
template
document
predefined
Prior art date
Application number
PCT/IB2004/050081
Other languages
French (fr)
Other versions
WO2004072846A8 (en)
WO2004072846A3 (en)
Inventor
Dieter Hoi
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2004072846A2 publication Critical patent/WO2004072846A2/en
Publication of WO2004072846A3 publication Critical patent/WO2004072846A3/en
Publication of WO2004072846A8 publication Critical patent/WO2004072846A8/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G06F3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G06F40/186 - Templates
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems


Abstract

A speech recognition device (1) for processing a predefined form (2), having speech recognition means (7) which can be fed a spoken text (GT) and transcribe it into a recognized text (ET), has analysis means (12) for analyzing and converting the predefined form (2) into a form template (FV) having a data format which can be processed by the speech recognition means (7), wherein the form template (FV) and the recognized text (ET) can be combined by the speech recognition means (7) to form a document (DK).

Description

Automatic processing of templates with speech recognition
The invention relates to a speech recognition device for processing a predefined form, wherein the speech recognition device has speech recognition means which can be fed a spoken text and transcribe it into a recognized text.
The invention further relates to a speech recognition method for processing a predefined form, comprising the reception of spoken text and the transcribing of the spoken text into a recognized text.
The document WO 98/43181 discloses a system for completing documents or for filling out forms in a text processing program, or for entering data in a database. The term "document" should in this case be understood in the broad sense and comprises any file, created using a text processing program or a database, in which the user intends to perform at least one data entry. The known system for completing documents comprises data input means for inputting data in written or spoken form and comparison means for comparing the input data with stored reference data, wherein the correspondence of input data with given reference data is used to identify the data category to which the input data belong. By means of this link between the input data and their data category, the system can complete a document or form created using a text processing program by entering the input data in the input field associated with the assigned data category, or fill out a data record by entering the assigned input data in the individual database fields associated with the respective data categories.
In an arrangement of this known system for completing documents or for filling out forms or entering data in a database, WO 98/43181 proposes a method of producing a data input system for a specific user program, in which an existing document format or data input form used in the user program is analyzed in order to identify and characterize input fields. The result of the analysis is used to generate additional program code which is added to the operating system on which the user program runs, or to the user program itself. The effect of the additional program code is that each data input is compared with the stored reference data in order to identify the data category of the data input, and the input data are then entered in that input field of the document format or data input form whose data category corresponds to that of the input data. To generate the additional program code, use may advantageously be made of ActiveX technology, which is compatible with a large number of more advanced programming languages such as Visual Basic, C++, etc. A description is given of an ActiveX control which automatically carries out the above method steps.
The known system for completing documents and for filling out forms is used in particular to make it easier for unpracticed computer users to input data in input fields of documents created using a text processing program or in input masks of databases. By way of example of unpracticed computer users of this type, mention may be made of doctors and attorneys who are inexperienced at using a computer and who encounter problems even when navigating between input fields within an input mask using a keyboard or mouse.
Although the system known from the document WO 98/43181 for completing documents or for filling out forms does provide unpracticed users with support in correctly filling out documents and forms, this known system and method restrict the more practiced user when processing documents and forms more than they help him, since he is forced to move within the predefined or recognized input fields and has no access to the other parts of the document or form. As a result, the known system and method can only be used on forms and input masks with a rigid structure, for example for filling out a patient information sheet which comprises only a few input fields such as, for example, patient name, address, date of birth, name of the physician administering treatment, diagnosis, etc. However, the system does not offer the user the option of navigating outside the predefined input fields in the document and of processing it, for example by adding and deleting text or changing the formatting.
Furthermore, essential functions of existing speech recognition programs cannot be used with the known system. This includes what is referred to as playback with synchronous "highlighting", i.e. the user's dictation can be played back repeatedly after it has been recorded and recognized by the speech recognition software, wherein the user can hear the spoken text and, in synchronism therewith, follow the corresponding text parts recognized by the speech recognition software by means of colored highlighting on the screen, and make corrections where necessary. This function assumes that the speech recognition software knows the entire content of the document or form in a format that it can process, which is not the case with the system and method for completing documents known from WO 98/43181.
It is an object of the invention to provide a speech recognition device of the type specified in the first paragraph and a speech recognition method of the type specified in the second paragraph, in which the abovementioned disadvantages are avoided.
To achieve the abovementioned object, in such a speech recognition device use is made of analysis means for analyzing and converting the predefined form into a form template having a data format which can be processed by the speech recognition means, wherein the form template and the recognized text can be combined by the speech recognition means to form a document. To achieve the abovementioned object, such a speech recognition method comprises the following further method steps: analyzing the predefined form and converting it into a form template having a data format which corresponds to that of the recognized text, and combining the form template and the recognized text to form a document.
By means of the features according to the invention, a user wishing to process a form receives said form in the form of a form template which can be processed by him, on which the functions of advanced speech recognition software, such as navigation using voice commands, synchronous playback with highlighting, substitution, etc., can also be used. The term "process" should in this case be understood in the broad sense and comprises, on the one hand, the filling out or amending of defined text fields but, on the other hand, also the amending, supplementing and/or formatting of predefined elements of the form. The proposed solution therefore offers the user all the possibilities for creating individual documents by means of speech recognition, which up to now could only be used if the entire document was created from a dictation, since only then was the necessary information available to the speech recognition software. The proposed method goes far beyond the conventional methods of filling out documents and inserting data into input masks, as disclosed in the document WO 98/43181, in that it is now possible to configure each document individually.
In accordance with the measures of Claims 2 and 7, the advantage is obtained that a large number of forms can be accessed in a simple manner, wherein the computer files may have different formats so that the highest possible degree of flexibility in using the invention is ensured. Just as the computer files may be in different formats which can be automatically recognized and therefore treated correctly, the forms stored in the computer files may also be in different formats which are automatically recognized and processed correctly. In this connection, the word "automatically" is to be understood to mean that no action on the part of the user of the speech recognition device is required.
In accordance with the measures of Claims 3 and 8, the advantage is obtained that, as templates, use can also be made of those documents and working templates which have been created using software generally used in office applications, such as documents created using text processing programs, tables and diagrams created using table calculation programs, or reports drawn up using database programs. These files are automatically recognized and converted into a format which can be processed by the speech recognition device. In accordance with the measures of Claims 4 and 9, the advantage is obtained that use can also be made of those forms which the user has available only in paper form, without said user having to input these forms manually or by dictation.
In accordance with the measures of Claims 5 and 10, the advantage is obtained that forms which comprise text fields as the essential component to be processed by the user can be completed quickly and by means of simple navigation.
The invention will be further described with reference to examples of embodiments shown in the drawings to which, however, the invention is not restricted. Fig. 1 shows a speech recognition device for processing a predefined form.
Fig. 2 shows a form which is to be completed by a user. Fig. 3 shows the form of Fig. 2 after it has been completed by a user. Fig. 4 shows the form of Fig. 3, wherein a user has performed additional formatting.
Fig. 1 shows a speech recognition device 1 for processing a predefined form 2. The speech recognition device 1 may be formed by a computer which implements speech recognition software. The speech recognition device 1 comprises speech recognition means 7, storage means 8, parameter storage means 9, command storage means 10 and an adaptation stage 11. An audio signal A representing spoken text GT can be output via a microphone 5 to an A/D converter 6 which converts the audio signal A into digital audio data AD that can be fed to the speech recognition means 7. The speech recognition means 7 convert the digital audio data AD into recognized text ET that is stored in the storage means 8. For this purpose, parameter information PI that is stored in the parameter storage means 9 is taken into account, said parameter information PI comprising vocabulary information, speech model information and acoustic information.
The vocabulary information comprises all words that can be recognized by the speech recognition means 7, together with phoneme sequences. The speech model information comprises statistical information regarding the sequences of words that are customary in the speech of the spoken text GT. The acoustic information comprises information about the characteristics of the accent of a user of the speech recognition device 1 and about acoustic properties of the microphone 5 and of the A/D converter 6. Speech model information and acoustic information can be configured in a user-specific manner.
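Purely by way of illustration, the parameter information PI described above can be pictured as a simple data structure grouping the three kinds of information; the following Python sketch is not part of the original disclosure, and the class and field names are assumptions chosen for readability.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class ParameterInformation:
        # Vocabulary information: every recognizable word with its phoneme sequence(s).
        vocabulary: Dict[str, List[str]] = field(default_factory=dict)
        # Speech model information: statistics on customary word sequences
        # (sketched here as bigram log probabilities).
        speech_model: Dict[Tuple[str, str], float] = field(default_factory=dict)
        # Acoustic information: user accent profile and properties of the
        # microphone 5 and the A/D converter 6.
        acoustic_profile: Dict[str, float] = field(default_factory=dict)

Since the speech model information and the acoustic information can be configured in a user-specific manner, one such structure could be kept per user.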
The document US 5,031,113, the disclosure of which is incorporated by way of reference into the disclosure of the present document, discloses the implementation of a speech recognition method taking into account such parameter information PI, and for this reason no more details of this will be given in the present text. As a result of the speech recognition method, the speech recognition means 7 can store text data comprising the recognized text ET in the storage means 8. Furthermore, the spoken text GT can be stored in the storage means 8 in digitized form.
Sequences of words which are recognized as a command by the speech recognition means 7 are stored in the command storage means 10. Such commands comprise, for example, the sequence of words "next word bold" in order to make the next word in the recognized text ET bold. It should be mentioned that commands can be matched in a user-specific manner, so that not all users need to use exactly the same sequence of words. Furthermore, commands can be stored in a document-specific manner, so that a sequence of command words has a fixed reference to a specific document.
A predefined form 2 which is to be processed by the user of the speech recognition device 1 can be either in paper form or in the form of a computer file 3. The term "process" comprises in this connection the filling out, formatting, supplementing or deletion of elements of a form. If the form 2 is in the form of a computer file 3, then this computer file 3 can be on any desired storage medium, such as the hard disk of the computer on which the speech recognition software is implemented, on a floppy disk or on a CD-ROM. However, the computer file 3 can also be made available via a computer network, such as the Internet for example.
The computer file 3 is read into analysis means 12 which are provided with computer file recognition means 13 that recognize the format of the computer file 3. On the one hand, the computer file 3 can be in a proprietary or standardized format for speech recognition software which can be processed directly by the speech recognition means without reformatting. However, the computer file 3 can also be in one of many formats produced by software generally used in office applications, which may be text documents possibly mixed with image elements, tables, etc. Such formats are recognized by the computer file recognition means 13 and are converted by means of the computer file conversion means 14 into a data format which can be processed directly by the speech recognition means 7. In general, the data format produced by the computer file conversion means 14 will be the same as the data format in which the recognized text ET is stored in the storage means 8. If the speech recognition means are configured to process a number of data formats in which the recognized text ET may be stored, then the recognized text ET and the data format produced by the computer file conversion means 14 may also differ from one another. The data format produced by the analysis means 12 from the form 2 is stored as a form template FV in storage means 16.
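The following sketch illustrates, under assumptions of the editor rather than of the disclosure, how the computer file recognition means 13 and the computer file conversion means 14 could cooperate; the file extensions, the hypothetical native format and the entry syntax (taken from the example listings further below) are illustrative only.

    import pathlib

    NATIVE_FORMATS = {".srx"}                           # hypothetical format processed without reformatting
    OFFICE_FORMATS = {".txt", ".rtf", ".doc", ".xls"}   # examples of office-application formats

    def recognize_format(path):
        """Computer file recognition means 13: determine the format of computer file 3."""
        suffix = pathlib.Path(path).suffix.lower()
        if suffix in NATIVE_FORMATS:
            return "native"
        if suffix in OFFICE_FORMATS:
            return "office"
        raise ValueError("unsupported file format: " + suffix)

    def convert_to_form_template(path):
        """Computer file conversion means 14: produce form template FV entries in the
        same data format as the recognized text ET (highly simplified)."""
        recognize_format(path)                          # reject files that cannot be handled
        text = pathlib.Path(path).read_text(errors="ignore")
        entries = []
        for line in text.splitlines():
            for word in line.split():
                entries.append('<TEMPLATE TEXT, "' + word + '">')
            entries.append("<TEMPLATE TEXT, newline>")
        return entries                                  # stored as form template FV in storage means 16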
If the form 2 to be processed is in paper form, it can be converted using a scanner 4, communicating with the analysis means 12, into a computer image data format BF which is subsequently converted by character recognition means 15, contained in the analysis means 12, into the form template FV which is stored in the storage means 16.
The speech recognition means 7 combine the form template FV with the recognized text ET to form a document DK which is stored in storage means 17. This document DK can be processed by the user of the speech recognition device 1 like any other document which has been created directly using the speech recognition device 1. In particular, all the functions of advanced speech recognition software can be used on the document DK.
For example, the document DK can be read into reproduction and correction means 18, to which a keyboard 19, a monitor 20 and a loudspeaker 21 are connected. The reproduction and correction means 18 are designed for the visual display of the form 2 on the monitor 20 and also, when they are in an activated synchronous reproduction mode of operation, for the acoustic reproduction of the spoken text GT together with the synchronous visual marking of the associated recognized text ET in the document DK and of the analyzed elements of the form 2. In this reproduction mode of operation, the document DK can be corrected by input via the keyboard and, simultaneously, by means of voice commands via the speech recognition device 1.
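The combination performed by the speech recognition means 7 can be thought of as merging two lists of entries, the form template FV and the dictated entries of the recognized text ET, at the position the user has navigated to. The following sketch uses the entry syntax of the example listings below and is an illustration only, not the claimed implementation.

    def combine(form_template, dictated, insert_after):
        """Combine form template FV and recognized text ET into document DK.
        insert_after is the index of the FV entry behind which the dictation is placed,
        e.g. the heading selected by a voice command."""
        document = list(form_template)
        document[insert_after + 1:insert_after + 1] = dictated
        return document                                 # stored as document DK in storage means 17

    # Example: place the dictated patient name behind the "Patient name:" heading.
    fv = ['<TEMPLATE TEXT, underline, "Patient">',
          '<TEMPLATE TEXT, underline, "name">',
          '<TEMPLATE TEXT, underline, ":">',
          '<TEMPLATE TEXT, newline>']
    et = ['<DICTATION, 0-2500, "Henry">', '<DICTATION, 2500-3800, "Schmidt">']
    dk = combine(fv, et, insert_after=2)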
Hereinbelow, a simple example is given of how a user of the speech recognition device 1 can process, according to the invention, the form 2 illustrated in Fig. 2. The form 2 illustrated in Fig. 2 is the template for a radiology report which is to be completed by a radiologist with information about the patient name, clinical information and a summary. It should be mentioned that, for the purposes of the present invention, it is not necessary for the form to comprise separate text fields, since the user can navigate at will in the form template FV produced by the speech recognition means 7 from the form and can therefore perform corresponding inputs at any point in the form template. It is therefore completely sufficient, for example, if the form comprises only individual headings. However, it is of course also possible to process forms which comprise text fields, such as the text field 22 in the upper right-hand corner of the form 2, said text field being surrounded by an outline and comprising a date.
By means of the measures according to the invention, it is possible for the user both to process the text field 22 (for example by inputting the date when the report was compiled) and to process all other elements of the form, that is to say the headings, or to add, delete and format any desired elements in the document created from the form.
When the form 2 is to be filled out for the first time, the user can use the analysis means 12 to convert it into a form template FV. If the form 2 is in paper form, the conversion is carried out using the scanner 4 and the character recognition means 15. If the form 2 is in the form of a computer file 3, the conversion is carried out using the computer file recognition means 13 and, where appropriate, the computer file conversion means 14. The form template FV produced by the analysis means 12 is presented to the user on the monitor 20, along with the original form 2 illustrated in Fig. 2, and has for example the following data format:
<ROOT>
<TEMPLATE TEXT, bold, "Radiology">
<TEMPLATE TEXT, bold, "Report">
<TEMPLATE TEXTFIELD, line, column, date [dd.mm.yyyy]>
<TEMPLATE TEXT, newline>
<TEMPLATE TEXT, underline, "Patient">
<TEMPLATE TEXT, underline, "name">
<TEMPLATE TEXT, underline, ":">
<TEMPLATE TEXT, newline>
<TEMPLATE TEXT, "Clinical">
<TEMPLATE TEXT, "information">
<TEMPLATE TEXT, ":">
<TEMPLATE TEXT, newline>
<TEMPLATE TEXT, "Summary">
<TEMPLATE TEXT, ":">
This form template FV can now be filled out and processed by the user by means of dictation. For this purpose, the user dictates into the speech recognition device 1 for example the following spoken text GT:
"patient name" (in command mode)
"Henry Schmidt"
"Summary" (in command mode) "bold on" (in command mode)
"Healing fracture mid left femoral diaphysis period"
"bold off' (in command mode)
"Clinical information" (in command mode)
"The fracture fragments are near anatomic alignment. A small amount of periosteal reaction has developed period"
This spoken text GT is converted by the speech recognition means 7 into recognized text ET which is stored in the storage means 8. The recognized text ET and the form template FV are subsequently combined to form a single document DK which is stored in the storage means 17 where the reproduction and correction means 18 can access it. According to the above dictation, the document DK has the following content:
<ROOT>
<TEMPLATE TEXT, bold, "Radiology">
<TEMPLATE TEXT, bold, "Report">
<TEMPLATE TEXTFIELD, line, column, date [dd.mm.yyyy]>
<TEMPLATE TEXT, newline>
<TEMPLATE TEXT, underline, "Patient">
<TEMPLATE TEXT, underline, "name">
<TEMPLATE TEXT, underline, ":">
<DICTATION, 0-2500, "Henry">
<DICTATION, 2500-3800, "Schmidt">
<TEMPLATE TEXT, newline>
<TEMPLATE TEXT, "Clinical">
<TEMPLATE TEXT, "information">
<TEMPLATE TEXT, ":">
<DICTATION, 12200-12700, "The">
<DICTATION, 22100-23300, "developed">
<DICTATION, 23300-23800, ".">
<TEMPLATE TEXT, newline>
<TEMPLATE TEXT, "Summary">
<TEMPLATE TEXT, ":">
<DICTATION, 3800-4500, bold, "Healing">
<DICTATION, 10500-11800, bold, "diaphysis">
<DICTATION, 11800-12200, bold, ".">
It should be mentioned that the keyword TEMPLATE TEXT serves as an indication to the speech recognition means 7 and the reproduction and correction means 18 that the associated text originates from a predefined form and that therefore no audio information is available for it. After the comma there may be any formatting information, which of course can also be input as a command by means of dictation. The keyword TEMPLATE TEXTFIELD indicates a text field which originates from a predefined form, so that again no audio information is available. The parameters line and column indicate the position of the text field in the form. The parameter date [dd.mm.yyyy] provides more detailed information about the text field: it is accordingly a date field which represents a date with two digits for the day, two digits for the month and four digits for the year, for example 21.02.2003. The keyword DICTATION indicates dictated text. The value after the comma indicates the audio position of the respective word (beginning and end in milliseconds relative to the start of dictation).
This results in the document illustrated in Fig. 3.
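Because every DICTATION entry carries the start and end of its audio in milliseconds, the reproduction and correction means 18 can determine, at any playback position, which recognized word is to be highlighted on the monitor 20. The sketch below assumes the example entry syntax shown above and is merely illustrative; TEMPLATE TEXT entries carry no audio information and are therefore never highlighted.

    import re

    DICTATION_RE = re.compile(r'<DICTATION, (\d+)-(\d+)(?:, \w+)*, "(.*)">')

    def word_at(document, playback_ms):
        """Return the dictated word whose audio span covers the current playback
        position, so it can be marked during synchronous reproduction."""
        for entry in document:
            match = DICTATION_RE.match(entry)
            if match:
                start, end, word = int(match.group(1)), int(match.group(2)), match.group(3)
                if start <= playback_ms < end:
                    return word
        return None                                     # a template entry or a pause is being played

    # At a playback position of 2600 ms the word "Schmidt" (2500-3800) would be highlighted.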
Should the radiologist be dissatisfied with the content or the formatting of the document, he can make changes to it at will. For example, he can dictate the following commands into the speech recognition device 1:
"Patient name" (in command mode)
"italic" (in command mode)
"Clinical information" (in command mode)
"italic underline" (in command mode) "Summary" (in command mode)
"italic underline" (in command mode)
Once the commands have been processed by the speech recognition means 7, the result is the representation shown in Fig. 4 for the document DK. It should be understood that this is only a simple example of the possibilities that the invention provides, but that in fact all the processing possibilities of advanced speech recognition software are available to the user.
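A possible way of resolving such dictated formatting commands against the document DK is sketched below; the command names and the way formatting attributes are attached to entries are assumptions made for illustration and do not describe the actual software.

    def apply_format(document, heading, formatting):
        """Attach formatting attributes (e.g. 'italic underline') to the TEMPLATE TEXT
        entry carrying the named heading, mimicking the command sequences above."""
        result = []
        attrs = ", ".join(formatting.split())           # 'italic underline' -> 'italic, underline'
        for entry in document:
            if entry.startswith("<TEMPLATE TEXT") and '"' + heading + '"' in entry:
                entry = entry.replace("<TEMPLATE TEXT,", "<TEMPLATE TEXT, " + attrs + ",", 1)
            result.append(entry)
        return result

    dk = ['<TEMPLATE TEXT, "Summary">', '<TEMPLATE TEXT, ":">']
    dk = apply_format(dk, "Summary", "italic underline")
    # dk[0] is now '<TEMPLATE TEXT, italic, underline, "Summary">'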

Claims

CLAIMS:
1. A speech recognition device (1) for processing a predefined form (2), wherein the speech recognition device (1) has speech recognition means (7) which can be fed a spoken text (GT) and transcribe it into a recognized text (ET), characterized by analysis means (12) for analyzing and converting the predefined form (2) into a form template (FV) having a data format which can be processed by the speech recognition means (7), wherein the form template (FV) and the recognized text (ET) can be combined by the speech recognition means (7) to form a document (DK).
2. A speech recognition device as claimed in Claim 1, characterized in that the predefined form (2) can be fed to the analysis means (12) as a computer file (3) and in that the analysis means (12) comprise a computer file recognition means (13).
3. A speech recognition device as claimed in Claim 2, characterized in that the computer file recognition means (13) comprise a computer file conversion means (14).
4. A speech recognition device as claimed in Claim 1, characterized in that the analysis means (12) comprise a scanner (4) and character recognition means (15).
5. A speech recognition device as claimed in Claim 1, characterized in that the combining of the form template (FV) and of the recognized text (ET) by the speech recognition means (7) comprises the filling out of at least one text field (22) in the form template with recognized text.
6. A speech recognition method for processing a predefined form (2), comprising the reception of spoken text (GT) and the transcribing of the spoken text into a recognized text (ET), characterized by the steps of: analyzing the predefined form (2) and converting it into a form template (FV) having a data format which corresponds to that of the recognized text (ET), and combining the form template (FV) and the recognized text (ET) to form a document (DK).
7. A speech recognition method as claimed in Claim 6, characterized in that the analyzing of the predefined form (2) comprises the reading in of the form (2) as a computer file (3) and recognition of the file type of the computer file.
8. A speech recognition method as claimed in Claim 7, characterized in that the analyzing of the predefined form comprises the conversion of the read-in computer file (3) into a different data type.
9. A speech recognition method as claimed in Claim 6, characterized in that the analyzing of the predefined form comprises the scanning of the form and recognition of form text from the data (BF) obtained during the scanning.
10. A speech recognition method as claimed in Claim 6, characterized in that the combining of the form template (FV) and of the recognized text (ET) comprises the filling out of at least one text field (22) in the form template with recognized text.
PCT/IB2004/050081 2003-02-13 2004-02-05 Automatic processing of templates with speech recognition WO2004072846A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP03100332 2003-02-13
EP03100332.0 2003-02-13

Publications (3)

Publication Number Publication Date
WO2004072846A2 true WO2004072846A2 (en) 2004-08-26
WO2004072846A3 WO2004072846A3 (en) 2004-10-07
WO2004072846A8 WO2004072846A8 (en) 2004-12-09

Family

ID=32865046

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2004/050081 WO2004072846A2 (en) 2003-02-13 2004-02-05 Automatic processing of templates with speech recognition

Country Status (1)

Country Link
WO (1) WO2004072846A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996013009A1 (en) * 1994-10-25 1996-05-02 Ho Janet Chung Kong System and method for generating database input forms
WO2002082318A2 (en) * 2001-02-22 2002-10-17 Volantia Holdings Limited System and method for extracting information
US20020143533A1 (en) * 2001-03-29 2002-10-03 Mark Lucas Method and apparatus for voice dictation and document production

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US9996517B2 (en) 2015-11-05 2018-06-12 Lenovo (Singapore) Pte. Ltd. Audio input of field entries
GB2545320A (en) * 2015-11-05 2017-06-14 Lenovo Singapore Pte Ltd Audio input of field entries
GB2545320B (en) * 2015-11-05 2020-08-05 Lenovo Singapore Pte Ltd Audio input of field entries
WO2017083205A1 (en) * 2015-11-11 2017-05-18 Microsoft Technology Licensing, Llc Provide interactive content generation for document
DK201670539A1 (en) * 2016-03-14 2017-10-02 Apple Inc Dictation that allows editing
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
CN110210014A (en) * 2019-05-31 2019-09-06 贵州精准医疗电子有限公司 Intelligent form system
CN110210014B (en) * 2019-05-31 2023-05-30 贵州精准医疗电子有限公司 Intelligent form system
CN111243596A (en) * 2020-01-08 2020-06-05 中保车服科技服务股份有限公司 Insurance information acquisition method, device and equipment based on voice recognition and storage medium

Also Published As

Publication number Publication date
WO2004072846A8 (en) 2004-12-09
WO2004072846A3 (en) 2004-10-07

Similar Documents

Publication Publication Date Title
WO2004072846A2 (en) Automatic processing of templates with speech recognition
US11586808B2 (en) Insertion of standard text in transcription
US7516070B2 (en) Method for simultaneously creating audio-aligned final and verbatim text with the assistance of a speech recognition program as may be useful in form completion using a verbal entry method
US8046226B2 (en) System and methods for reporting
US7979281B2 (en) Methods and systems for creating a second generation session file
JP2768727B2 (en) Report creation apparatus and method
DE60033106T2 (en) Correction of operating mode errors, control or dictation, in the speech recognition
US8504369B1 (en) Multi-cursor transcription editing
US20060190249A1 (en) Method for comparing a transcribed text file with a previously created file
US6915258B2 (en) Method and apparatus for displaying and manipulating account information using the human voice
US20090037171A1 (en) Real-time voice transcription system
US20020095290A1 (en) Speech recognition program mapping tool to align an audio file to verbatim text
EA004352B1 (en) Automated transcription system and method using two speech converting instances and computer-assisted correction
Lai et al. MedSpeak: Report creation with continuous speech recognition
JPH10507857A (en) System and method for generating a database input form
JP2014013399A (en) Method and system for processing dictated information
US7120581B2 (en) System and method for identifying an identical audio segment using text comparison
US20150293902A1 (en) Method for automated text processing and computer device for implementing said method
US20030097253A1 (en) Device to edit a text in predefined windows
US20070067168A1 (en) Method and device for transcribing an audio signal
JPH11272673A (en) Method and processor for document processing and record medium where computer program for document processing is recorded
JP2004287192A (en) Device and program for editing synthesized speech
WO2001093058A1 (en) System and method for comparing text generated in association with a speech recognition program
Schiavon et al. Radiological Reporting in the United States
JPH08314930A (en) Proofreading method for japanese sentence

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WR Later publication of a revised version of an international search report
122 Ep: pct application non-entry in european phase