WO2003079328A1 - Apparatus, method, and program for audio-video conversion - Google Patents
Apparatus, method, and program for audio-video conversion
- Publication number
- WO2003079328A1 (PCT/JP2003/003305)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- unit
- video
- language
- data
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- Audio-video conversion apparatus and method, and audio-video conversion program
Description
- The present invention relates to an audio-video conversion device and method, and to an audio-video conversion program.
Background art
- NHK uses a voice recognition method with a repeater (re-speaker) when captioning broadcast video.
- A press release of January 20, 2003 announced that Daikin Industries, Ltd. had released "non-linear character generator software with speech recognition" (m0sPy). This software plays back video and audio while alternating pause and playback, and transcribes the audio through a voice recognition device.
Disclosure of the invention
- An object of the present invention is to provide an audio-video conversion device and method, and an audio-video conversion program, in which a repeater converts the voice of an unspecified speaker into his or her own voice, which is converted into characters by a voice recognition device, while the video of the speaker (for example, the speaker's facial expression) is delayed, so that the characters and the video can be displayed together on a screen or the like, making it easier for a hearing-impaired person or the like to understand what the speaker has said.
- Another object is to provide an audio-video conversion device and method, and an audio-video conversion program, for meeting support, in which a repeater reads back the voice of a speaker or an interpreter and inputs it to a voice recognition device, and the resulting character string is displayed on the screen together with the video of the speaker.
- The present invention also aims to provide interpretation for international conferences and the like held in different languages, with immediate transcription of the proceedings (information compensation); to support meetings and classes in which hearing-impaired persons or the like participate; and to provide users with transcriptions of voices relayed from a telephone to repeaters. A further object is to provide an audio-video conversion device and method, and an audio-video conversion program, for assisting communication between a speaker and a user who use different language systems.
- A telecommunication circuit that communicates over a telecommunication line such as the Internet transfers the voices and images of speakers to translators, repeaters, and correctors at remote locations or at home, so that users can use this system wherever they are.
- A further object of the present invention is to create home-based work for repeaters and interpreters, and to assist the employment of housebound disabled persons who find it difficult to go out, by enabling them to work as repeaters.
- a video delay unit that gives a preset delay time difference to a video signal captured by the camera and outputs delayed video data
- a first voice input unit for inputting the first-language speech of a first repeater who repeats the first-language content spoken by the speaker;
- a second voice input unit for inputting the second-language speech of a second repeater who reads back the second-language content of the interpreter who has interpreted the first-language content spoken by the speaker;
- first and second voice recognition units that recognize the first- and second-language content input from the first and second voice input units, convert it into first and second visible language data, and output the data;
- a layout setting unit that inputs the first and second visible language data output from the first and second voice recognition units and the delayed video data of the speaker delayed by the video delay unit, sets the display state, and generates a display video in which these data are synchronized or substantially synchronized;
- a character image display unit for displaying a display image obtained by synchronizing or substantially synchronizing the first and second visible language data with the delayed image data in accordance with an output from the layout setting unit;
- An input unit for performing various settings for any one or a plurality of the first and second voice recognition units, the video delay unit, and the layout setting unit;
- a processing unit that controls each of the first and second voice recognition units, the video delay unit, the input unit, and the layout setting unit;
- a video delay unit that gives a preset delay time difference to a video signal captured by the camera and outputs delayed video data
- a first speech input unit for inputting the first-language speech of a first repeater who repeats the first-language content spoken by the speaker or the interpreter;
- a first speech recognition unit that recognizes the first-language content input from the first speech input unit, converts it into first visible language data, and outputs the data;
- a layout setting unit that inputs the first visible language data output from the first speech recognition unit and the delayed video data of the speaker delayed by the video delay unit, sets the display state, and generates a display video by synchronizing or substantially synchronizing these data;
- a character video display unit for displaying a display video in which the first visible language data and the delayed video data are synchronized or substantially synchronized in accordance with an output from the layout setting unit;
- An input unit for performing various settings for any one of the first voice recognition unit, the video delay unit, the layout setting unit, or a plurality of units;
- the processing unit sets the first and second voice recognition units and the video delay unit according to a command from the input unit or a setting predetermined by an appropriate storage unit;
- a processing unit configured to set a layout setting unit according to a command from the input unit or a setting predetermined by an appropriate storage unit;
- a step in which the video delay unit, according to the setting and control by the processing unit, delays the video input from the camera, performs appropriate image processing as necessary, and outputs delayed video data;
- a first voice input unit for inputting the first language content by the first repeater who repeats the content of the first language by the speaker
- a step in which a first speech recognition unit recognizes the first-language content input by the first repeater to the first speech input unit and converts it into first visible language data;
- a step of inputting, to a second speech input unit, the second-language content read back by a second repeater, who reads back the content translated into the second language by the interpreter;
- a step in which a second speech recognition unit recognizes the second-language content input by the second repeater and converts it into second visible language data;
- a step in which, according to the setting and control, the layout setting unit inputs the first and second visible language data from the first and second voice recognition units and the delayed video data from the video delay unit, sets the display layout of these data, and generates and outputs by image processing a display video in which these data are synchronized or substantially synchronized;
- a step in which the character video display unit displays, in accordance with the output from the layout setting unit, a display video in which the first and second visible language data and the delayed video data are synchronized or substantially synchronized;
- the processing unit sets the first voice recognition unit and the video delay unit according to a command from the input unit or a setting predetermined by an appropriate storage unit.
- a step in which the processing unit sets the layout setting unit according to a command from the input unit or a setting predetermined in an appropriate storage unit;
- a step in which the video delay unit, according to the setting and control by the processing unit, delays the video input from the camera, performs appropriate image processing as necessary, and outputs delayed video data;
- a step of inputting, to the first speech input unit, the repetition by the first repeater of the first-language content spoken by the speaker or the interpreter;
- a first speech recognition unit for recognizing the content of the first language by the first repeater input to the first speech input unit and converting the first language content into first visible language data
- a step in which, according to the setting and control, the layout setting unit inputs the first visible language data from the first speech recognition unit and the delayed video data from the video delay unit, sets the display layout of these data, and generates and outputs by image processing a display video in which these data are synchronized or substantially synchronized; and a step in which the character video display unit displays, in accordance with the output from the layout setting unit, the display video in which the first visible language data and the delayed video data are synchronized or substantially synchronized;
- a first speech recognition unit that recognizes the first-language content of the first repeater, who repeats the first-language content spoken by the speaker, converts it into first visible language data, and outputs the data; a first input unit for performing various settings of the first speech recognition unit; and a first processing unit that controls the first speech recognition unit and the first input unit, these constituting a first recognition device;
- a second speech recognition unit that recognizes the second-language content of the second repeater, who repeats the second-language content of the interpreter interpreting the first-language content spoken by the speaker, converts it into second visible language data, and outputs the data; a second input unit for performing various settings of the second speech recognition unit; and a second processing unit that controls the second speech recognition unit and the second input unit, these constituting a second recognition device;
- a display device for receiving an output from the first and second recognition devices and displaying characters and images
- the display device includes
- a video delay unit that gives a preset delay time difference to a video signal captured by the camera and outputs delayed video data;
- a layout setting unit that inputs the first and second visible language data output from the first and second recognition devices and the delayed video data of the speaker delayed by the video delay unit, sets the display state, and generates a display video in which these data are synchronized or substantially synchronized;
- a character video display unit for displaying the display video output from the layout setting unit;
- a third input unit for performing various settings of the video delay unit and the layout setting unit;
- a third processing unit that controls each unit of the video delay unit, the third input unit, and the layout setting unit;
- a first voice recognition unit that recognizes the first-language content of the first repeater, who repeats the first-language content spoken by the speaker or the interpreter, converts it into first visible language data, and outputs the data;
- a first input unit for performing various settings of the first voice recognition unit; and a first processing unit that controls the first voice recognition unit and the first input unit, these constituting a first recognition device;
- a display device for receiving the output from the first recognition device and displaying characters and images
- the display device includes
- a video delay unit that gives a preset delay time difference to a video signal captured by the camera and outputs delayed video data
- a layout setting unit that inputs the first visible language data output from the first recognition device and the delayed video data of the speaker delayed by the video delay unit, sets the display state, and generates a display video in which these data are synchronized or substantially synchronized;
- a character image display unit for displaying a display image output from the layout setting unit
- a third input unit for performing various settings of the video delay unit and the layout setting unit;
- a third processing unit that controls each unit of the video delay unit, the third input unit, and the layout setting unit;
- An audio-video conversion method for converting a speaker's voice into visible language data and displaying it together with the speaker's video data
- a step in which the first and second processing units and the third processing unit set the first and second voice recognition units and the video delay unit, respectively, in accordance with commands from the first and second input units and the third input unit, or in accordance with settings predetermined in an appropriate storage unit;
- a step in which the third processing unit sets the layout setting unit according to a command from the third input unit or a setting predetermined in an appropriate storage unit;
- a step in which the video delay unit, according to the setting and control by the third processing unit, delays the speaker video input from the camera, performs appropriate image processing as necessary, and outputs delayed video data;
- a first speech recognition unit for recognizing the content of the first language by the first repeater who repeats the content of the first language by the speaker and converting the content into the first visible language data
- a step in which the second speech recognition unit recognizes the second-language content read back by the second repeater, the second language being the interpreter's interpretation of the first-language content spoken by the speaker, and converts it into second visible language data;
- a step in which the layout setting unit, according to the setting and control by the third processing unit, inputs the first and second visible language data from the first and second voice recognition units and the delayed video data from the video delay unit, sets the display layout of these data, and generates and outputs by image processing a display video in which these data are synchronized or substantially synchronized;
- a step in which the character video display unit displays, in accordance with the output from the layout setting unit, a display video in which the first and second visible language data and the delayed video data are synchronized or substantially synchronized;
- An audio-video conversion method for converting a speaker's voice into visible language data and displaying it together with the speaker's video data
- a step in which the first and third processing units set the first voice recognition unit and the video delay unit, respectively, in accordance with commands from the first and third input units or settings predetermined in an appropriate storage unit;
- a step in which the third processing unit sets the layout setting unit according to a command from the third input unit or a setting predetermined in an appropriate storage unit;
- a step in which the video delay unit, according to the setting and control by the third processing unit, delays the speaker video input from the camera, performs appropriate image processing as necessary, and outputs delayed video data;
- a step in which the first speech recognition unit recognizes the first-language content read back by the first repeater, who repeats the first-language content of the speaker or the interpreter, and converts it into first visible language data;
- a step in which the layout setting unit, according to the setting and control by the third processing unit, inputs the first visible language data from the first voice recognition unit and the delayed video data from the video delay unit, sets the display layout of these data, and generates and outputs by image processing a display video in which these data are synchronized or substantially synchronized; and
- FIG. 1 is a schematic configuration diagram of an audio-video conversion device according to a first embodiment.
- FIG. 2 is a flowchart of a first embodiment of the voice conversion processing by the processing unit.
- FIG. 3 is a schematic configuration diagram of an audio-video conversion device according to a second embodiment.
- FIG. 4 is a flowchart of a second embodiment of the voice conversion processing by the processing unit.
- FIG. 5 is a schematic configuration diagram of an audio-video conversion device according to a third embodiment.
- FIG. 6 is a schematic configuration diagram of an audio-video conversion device according to a fourth embodiment.
- FIG. 1 is a schematic configuration diagram of a first embodiment of an audio-video conversion device.
- The present embodiment particularly supports communication in multilingual settings such as international, multilateral, and bilateral conferences, as well as meetings, lectures, classes, and education.
- The audio-video conversion device comprises a camera 1, a video delay unit 2, first and second audio input units 3 and 4, first and second audio recognition units 5 and 6, a character display unit 7, a layout setting unit 8, a character video display unit 9, an input unit 10, and a processing unit 11.
- Camera 1 captures the facial expression of speaker A.
- the video delay unit 2 gives a preset delay time difference to the video signal from the camera 1 and outputs delayed video data.
- The video delay unit 2 delays the speaker's facial-expression video so that it is displayed together with the recognized characters, giving a predetermined video delay time to assist the recipient in understanding the language. This video delay time can be changed as appropriate according to the reading ability of conference participants such as the hearing impaired, and according to the speed and skill of speaker A, repeater B or C, and interpreter D. The video delay unit 2 may also perform appropriate image processing, such as enlargement or reduction of the video of speaker A's expression.
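The delay described above amounts to a frame buffer whose depth is the delay time multiplied by the frame rate. A minimal sketch in Python (the class name, the `fps` parameter, and the per-frame interface are illustrative assumptions, not part of the patent):

```python
from collections import deque

class VideoDelayUnit:
    """Buffers camera frames so that captions produced later can be
    shown in (near) synchrony with the speaker's image. Hypothetical
    sketch; the patent does not specify an implementation."""

    def __init__(self, delay_seconds: float, fps: int = 30):
        self.capacity = int(delay_seconds * fps)
        self.buffer = deque()

    def set_delay(self, delay_seconds: float, fps: int = 30) -> None:
        # The delay can be changed to match the reading ability of the
        # participants and the pace of the speaker and repeaters.
        self.capacity = int(delay_seconds * fps)

    def push(self, frame):
        """Accept the current camera frame; return the delayed frame
        once the buffer has filled, else None."""
        self.buffer.append(frame)
        if len(self.buffer) > self.capacity:
            return self.buffer.popleft()
        return None
```

With a 0.1 s delay at 30 fps, the first three pushes return nothing while the buffer fills, after which each push returns the frame captured three frames earlier.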
- the first voice input unit 3 is configured by a microphone or the like, and inputs the content of the voice of the specific first repeater B who has heard the voice of the speaker A.
- The second voice input unit 4 inputs the voice of the specific second repeater C, who listens to interpreter D interpreting the content spoken by speaker A and repeats it.
- By having repeater B or C input voice through the first or second voice input unit 3 or 4 (for example, a close-talking microphone) in a quiet place provided at the conference venue, the effects of environmental noise and microphone quality can also be eliminated.
- The first and second voice recognition units 5 and 6 recognize the voices input from the first and second voice input units 3 and 4, respectively, convert them into first and second visible language data such as character data or ideographic data, and output the data.
- The first voice recognition unit 5 receives the repetition in the first language by the first repeater B, who has heard the first language (e.g., Japanese) spoken by speaker A, and outputs visible language data of the first language (e.g., a Japanese character string).
- Interpreter D interprets the first language (e.g., Japanese) spoken by speaker A into the second language (e.g., a foreign language such as English). The second voice recognition unit 6 receives the content read back by the second repeater C, who has heard the second language spoken by interpreter D, and outputs visible language data of the second language (e.g., a foreign-language character string such as English).
- The first and/or second voice recognition units 5 and 6 may be configured so that one or both of the voice read back by the first repeater B and the voice of interpreter D read back by the second repeater C can be selected.
- The first and/or second voice recognition units 5 and 6 are set to recognize the voices of the repeaters. Repeaters B and C may also be provided with a selection unit that can select a language database registered in the first and/or second voice recognition units 5 and 6 according to the topic spoken by speaker A or the content of the conference.
- The first and/or second speech recognition units 5 and 6 may be provided with a misconversion probability calculator that calculates the probability of an incorrect kana-to-kanji conversion, and an output determination unit that determines whether to output kanji or kana characters according to the probability calculated by the misconversion probability calculator.
- The first and/or second speech recognition units 5 and 6 can calculate, before speech recognition, the probability of misrecognition for the kanji processing of Japanese homophones, and display words in kana characters when that probability is high. Words that are not registered in the first and/or second voice recognition units 5 and 6 may also be displayed in kana characters at the judgment of the first and/or second repeaters B and C.
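The kana fallback described above can be sketched as a simple threshold decision (a hypothetical illustration; the function name, the threshold value, and the probability model are assumptions, not specified by the patent):

```python
def render_word(word_kanji: str, word_kana: str,
                misconversion_prob: float,
                threshold: float = 0.3) -> str:
    """Output the kanji form only when the estimated probability of an
    incorrect kana-to-kanji conversion is below the threshold;
    otherwise fall back to the safer kana reading."""
    if misconversion_prob >= threshold:
        return word_kana
    return word_kanji
```

For example, a homophone whose conversion is judged unreliable would be shown in kana rather than risking a wrong kanji on screen.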
- the character display unit 7 visually displays the visible language data of the first language output by the first voice recognition unit 5.
- Interpreter D may refer to the first visible language data displayed on the character display unit 7 when interpreting.
- The layout setting unit 8 inputs the first and second visible language data output as the recognition results of the first and second speech recognition units 5 and 6, and the delayed video data of speaker A delayed by the video delay unit 2, and sets the display state on the character video display unit 9.
- For the first and second visible language data (character data) and the delayed video data displayed on the character video display unit 9, the processing unit 11 sets one or more display-format parameters such as the number of lines per unit time, the number of characters per unit time, the number of characters per line, color, size, and display position. According to the settings made by the processing unit 11, the layout setting unit 8 executes appropriate image processing, such as enlargement or reduction of the first and second visible language data and the delayed video data, to generate a display video.
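The display-format parameters listed above might be collected as in the following sketch, together with a line-wrapping step that keeps only the most recent caption lines on screen (all names and default values are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class LayoutSettings:
    # Display-format parameters named in the description; the default
    # values are assumptions, not values from the patent.
    lines_per_screen: int = 2
    chars_per_line: int = 20
    font_size_px: int = 32
    color: str = "white"
    caption_position: str = "bottom"
    video_scale: float = 1.0  # enlargement/reduction of speaker video

def wrap_caption(text: str, settings: LayoutSettings) -> list[str]:
    """Break recognized text into display lines of the configured
    width, keeping only the most recent lines that fit on screen."""
    n = settings.chars_per_line
    lines = [text[i:i + n] for i in range(0, len(text), n)]
    return lines[-settings.lines_per_screen:]
```

A renderer following the layout setting unit's output would overlay `wrap_caption(...)` onto the delayed video at `caption_position`.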
- The character video display unit 9 displays the first and second visible language data output as the recognition results of the first and second voice recognition units 5 and 6, together with the delayed video data of speaker A delayed by the video delay unit 2.
- the input unit 10 performs various settings of the first and second voice recognition units 5 and 6, the video delay unit 2, the layout setting unit 8, and the like, and instructs data input to an appropriate database or memory.
- the processing unit 11 is a small computer, and controls each unit such as the first and second voice recognition units 5 and 6, the video delay unit 2, the input unit 10, and the layout setting unit 8.
- FIG. 2 shows a flowchart of the first embodiment of the voice conversion processing by the processing unit.
- The processing unit 11 sets the first and second voice recognition units 5 and 6 and the video delay unit 2 according to a command from the input unit 10 or a setting predetermined in an appropriate storage unit (S01).
- For the first and second speech recognition units 5 and 6, for example, a threshold value for the kanji misrecognition rate and the language database to be used are set.
- For the video delay unit 2, for example, the delay time of the speaker video is set or selected.
- The processing unit 11 sets the layout setting unit 8 according to a command from the input unit 10 or a setting predetermined in an appropriate storage unit (S03). In the layout setting unit 8, the display state and layout of the first and second visible language data and the delayed video data displayed on the character video display unit 9 are set.
- For the visible language data, for example, the number of displayed character strings, the size, font, and color of the displayed characters, and the display position of the character strings are set; for the delayed video data, the size and display position of the speaker image, and so on, are each set as appropriate.
- Camera 1 inputs the video of speaker A (S05).
- The video delay unit 2 delays the video input from camera 1, performs appropriate image processing as necessary according to the settings and control by the processing unit 11, and outputs delayed video data (S07).
- the first voice input unit 3 inputs the voice of the first repeater B (S11).
- The first voice recognition unit 5 recognizes the first language spoken by the first repeater B and input to the first voice input unit 3, according to the settings and control by the processing unit 11, and generates the first visible language data (e.g., a Japanese character string) (S13). Further, if necessary, the character display unit 7 displays the first visible language data output from the first voice recognition unit 5 (S15).
- The second repeater C repeats the voice interpreted by interpreter D on the basis of the speaker's voice and/or the first visible language data displayed on the character display unit 7, and the second voice input unit 4 inputs the read-back voice (S17).
- The second voice recognition unit 6 recognizes the second language spoken by the second repeater C and input to the second voice input unit 4, according to the setting and control by the processing unit 11, and generates the second visible language data (e.g., a foreign-language character string) (S19).
- The layout setting unit 8 inputs the first and second visible language data from the first and second speech recognition units 5 and 6 and the delayed video data from the video delay unit 2, sets the display layout of these data, and generates and outputs a display video by appropriate image processing as needed (S21).
- The character video display unit 9 displays the first and second visible language data and the delayed video data as appropriate according to the output from the layout setting unit 8 (S23).
- If there is a setting change, the process returns to step S01 and the processing is executed again (S25). If there is no setting change and no change in speaker A, the processing unit 11 proceeds to the processing after step S03. If there is a change in speaker A, the processing is terminated (S27) and can be executed again.
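The loop of steps S01 through S27 can be paraphrased as the following control sketch (a hypothetical reading of the flowchart; the event interface and the step labels recorded in the trace are assumptions):

```python
def conversion_loop(events):
    """One possible reading of the S01-S27 flow: configure, then loop
    over capture/recognize/display until the speaker changes.

    `events` yields (setting_changed, speaker_changed) flag pairs, one
    per pass through the capture-and-display steps."""
    trace = ["S01 configure recognizers and video delay",
             "S03 configure layout"]
    for setting_changed, speaker_changed in events:
        trace.append("S05-S23 capture, delay, recognize, lay out, display")
        if setting_changed:
            # S25: a setting change sends control back to S01.
            trace.append("S01 configure recognizers and video delay")
            trace.append("S03 configure layout")
        elif speaker_changed:
            # S27: a change of speaker ends the run.
            trace.append("S27 end")
            break
    return trace
```

Running it with a setting change followed by a speaker change reproduces the reconfigure-then-terminate behavior described above.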
- FIG. 3 is a schematic configuration diagram of a second embodiment of the audio-video converter.
- the present embodiment particularly supports communication in meetings, meetings, and lectures such as domestic conferences and bilateral conferences.
- the audio-video converter according to the present embodiment includes a camera 1, a video delay unit 2, first and second audio input units 3, 4, a first audio recognition unit 5, a character display unit 7, a layout setting unit 8, It comprises a character video display section 9, an input section 10, a processing section 11, and a selection section 20.
- the second embodiment differs from the first embodiment in that the second speech recognition unit is omitted and a selection unit 20 is further provided, but other configurations and operations are the same. Note that the second voice input unit and the selection unit 20 may be further omitted as necessary.
- FIG. 4 shows a flowchart of a second embodiment of the voice conversion processing by the processing unit.
- Either the voice of repeater B, who has read back the speaker's voice, or the voice of interpreter D, who has interpreted the speaker's voice, is input to the first voice input unit 3.
- The processing unit 11 sets the first voice recognition unit 5, the video delay unit 2, and the selection unit 20 according to a command from the input unit 10 or a setting predetermined in an appropriate storage unit (S101).
- If the selection unit 20 is omitted, this setting is unnecessary.
- For the first voice recognition unit 5, for example, a threshold value for the kanji misrecognition rate and the language database to be used are set.
- The processing unit 11 sets the layout setting unit 8 in accordance with a command from the input unit 10 or a setting predetermined in an appropriate storage unit (S103). In the layout setting unit 8, the display state and layout of the first visible language data (in this example, a Japanese character string or a foreign-language character string) and the delayed video data are set.
- For the visible language data, for example, the number of displayed character strings, the size, font, and color of the displayed characters, and the display position of the character strings are set; for the delayed video data, the size and display position of the speaker image, and so on, are set as appropriate.
- Camera 1 inputs the video of speaker A (S105).
- The video delay unit 2 delays the video input from camera 1, performs appropriate image processing as necessary according to the settings and control by the processing unit 11, and outputs delayed video data (S107).
- the first voice input unit 3 inputs the voice of the first repeater B or the second repeater C (S111).
- The first speech recognition unit 5 recognizes the first language (in this example, Japanese or a foreign language) spoken by the first repeater B or the second repeater C and input to the first speech input unit 3, according to the setting and control by the processing unit 11, and converts it into the first visible language data (in this example, a Japanese or foreign-language character string) (S113). Further, if necessary, the character display unit 7 displays the first visible language data output from the first voice recognition unit 5 (S115).
- The layout setting unit 8 inputs the first visible language data from the first voice recognition unit 5 and the delayed video data from the video delay unit 2, according to the settings and control by the processing unit 11, sets the display layout of those data, and generates and outputs a display video by appropriate image processing as needed (S121).
- The character video display unit 9 displays the first visible language data and the delayed video data as appropriate according to the output from the layout setting unit 8 (S123).
- If there is a setting change, the processing unit 11 returns to step S101 and executes the processing again (S125). If there is no setting change and no change in speaker A, the processing unit 11 proceeds to the processing after step S103; if there is a change in speaker A, the processing ends (S127) and can be executed again.
- FIG. 5 is a schematic configuration diagram of a third embodiment of the audio-video conversion device.
- In the present embodiment, a third party such as a repeater converts the speaker's spoken language information into character language information, and the linguistic information and the speaker's non-verbal information are presented via a telecommunication circuit, thereby assisting communication between a speaker and a user who use different language systems.
- the audio-video conversion device of the present embodiment includes a speaker device 100, an interpreter device 200, first and second repeater devices 300 and 400, first and second recognition devices 500 and 600, a display device 700, and a telecommunication circuit 800.
- the speaker device 100 comprises a camera 1 and, if necessary, a microphone.
- the interpreter device 200 includes a receiver and a microphone.
- the first and second repeater devices 300 and 400 include first and second voice input units 3 and 4, respectively, and a receiver.
- the first and second recognition devices 500 and 600 include first and second speech recognition units 5 and 6, input units 10-b and 10-c, and processing units 11-b and 11-c, respectively.
- the display device 700 includes the video delay unit 2, the character display unit 7, the layout setting unit 8, the character video display unit 9, an input unit 10-a, and a processing unit 11-a.
- the black circles in the figure represent the telecommunication circuit 800: various telecommunication lines such as the Internet, a LAN, a wireless LAN, a mobile phone network, or a PDA link, together with an interface to those lines provided in each of the devices 100 to 700. The speaker device 100, the interpreter device 200, the first and second repeater devices 300 and 400, the first and second recognition devices 500 and 600, and the display device 700 are connected as needed by such a telecommunication circuit 800, and audio and/or video signals are communicated between them. Devices may also be connected directly, by wire or wirelessly, without going through any of the telecommunication circuits 800 in the figure.
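As a rough illustration of this routing (not from the patent; the inbox model and all names are invented), the circuit 800 can be modeled as a message router with one named inbox per device:

```python
class TelecomCircuit:
    """Toy stand-in for the telecommunication circuit 800: each device gets
    a named inbox, and the circuit routes audio/text messages between them."""

    def __init__(self):
        self.inboxes = {}

    def register(self, name):
        # Models the line interface said to be provided in each device 100-700.
        self.inboxes[name] = []

    def send(self, dest, payload):
        self.inboxes[dest].append(payload)

    def receive(self, name):
        msgs, self.inboxes[name] = self.inboxes[name], []
        return msgs


def relay_utterance(circuit, audio):
    """One round trip: speaker -> repeater -> recognizer -> display."""
    circuit.send("repeater", audio)
    for heard in circuit.receive("repeater"):
        circuit.send("recognizer", "re-spoken:" + heard)
    for respoken in circuit.receive("recognizer"):
        circuit.send("display", "text(" + respoken + ")")
    return circuit.receive("display")
```

In a real deployment each `send`/`receive` pair would be a network connection (for example over the Internet), which is what allows the repeater, interpreter, and corrector to work from remote locations.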
- apart from the display device 700, which is installed in a venue or the like, the devices can be located anywhere and arranged as appropriate.
- the input unit 10-a performs various settings for each unit, such as the video delay unit 2 and the layout setting unit 8, and instructs data input to an appropriate database or memory.
- the processing unit 11-a is a small computer, and controls each unit such as the video delay unit 2, the input units 10-a, 10-b, and 10-c, and the layout setting unit 8.
- the input units 10-b and 10-c perform various settings of the first and second speech recognition units 5 and 6 and input data to an appropriate database or memory.
- the processing units 11-b and 11-c are small computers, and control each unit such as the first and second speech recognition units 5 and 6, respectively.
- the flowchart of the audio-video conversion process according to the third embodiment is the same as that of the first embodiment, and the device operates as described above.
- FIG. 6 is a schematic configuration diagram of a fourth embodiment of the audio-video converter.
- as in the third embodiment, a third party such as a repeater converts the speaker's audible language information into visible language information (text), and presents that linguistic information together with the speaker's non-verbal information via a telecommunication circuit, thereby assisting communication between the speaker and users of different language systems.
- the audio-video conversion device of the present embodiment includes a speaker device 100, an interpreter device 200, first and second repeater devices 300 and 400, a first recognition device 500, a display device 700, and a telecommunication circuit 800.
- compared with the third embodiment, the second recognition device 600 including the second speech recognition unit is omitted, and the first recognition device 500 is further provided with a selection unit 20.
- the other configurations and operations are the same.
- the configuration and operation of the selection unit 20 are the same as those of the second embodiment. Note that the second voice input unit and the selection unit 20 may be further omitted as necessary.
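A minimal sketch of what such a selection unit might do, assuming the voice input units are modeled as callables (the class and names are invented for illustration, not taken from the patent):

```python
class SelectionUnit:
    """Cf. selection unit 20: connects one of several voice input units to
    a single speech recognition unit, so the second recognizer can be
    omitted while still serving multiple repeaters."""

    def __init__(self, *sources):
        # Each source stands in for a voice input unit (e.g. units 3 and 4).
        self.sources = list(sources)
        self.active = 0

    def select(self, index):
        """Switch which voice input feeds the recognizer."""
        self.active = index

    def read(self):
        return self.sources[self.active]()
```

Only one recognizer is needed because at any moment the recognizer sees exactly one selected input stream.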
- the flowchart of the voice conversion process according to the fourth embodiment is similar to that of the third embodiment, and operates as described above.
- since the speech recognition device uses the pre-registered voice database of the repeater, and what is input to the speech recognition device is the repeater's re-spoken rendition of speaker A's voice, a high recognition rate can be obtained for any speaker A.
- likewise, the repeater can re-speak the voice of interpreter D, so that a foreign language can be rendered into Japanese text with a high recognition rate.
- conversely, interpreter D translates the voice into a foreign language and the voice is re-spoken in that foreign language, so that Japanese can be rendered into a foreign language with a high recognition rate.
- since the voice of a questioner can also be displayed as characters, bidirectional meeting support can be realized.
- this embodiment can be used not only for domestic conferences but also for communication support at international conferences.
- the method of capturing the video of speaker A and displaying it, after a certain delay time, together with the character string of the recognition result is adopted.
- the video can be used as a clue for understanding the speech.
- the video delay time by the video delay unit 2 can be changed according to the reading ability of the hearing impaired.
- a hearing-impaired person who is proficient in lip reading can compensate for the roughly 5% of errors in speech recognition by reading the speaker's lips in the delayed video.
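One hypothetical way to derive the delay time from reading ability (this formula is illustrative only and does not appear in the patent; the function name and parameters are invented):

```python
def delay_for_reader(caption_chars, reading_speed_cps, margin_s=0.5):
    """Illustrative rule: give the viewer enough time to read a typical
    caption (caption_chars characters at reading_speed_cps characters per
    second) plus a safety margin before the matching video is shown.
    Faster readers therefore get a shorter video delay."""
    return caption_chars / reading_speed_cps + margin_s
```

For example, a 20-character caption read at 10 characters per second suggests a delay of about 2.5 seconds, while a reader twice as fast could use a delay of 1.5 seconds.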
- the audio-video conversion method or the audio-video conversion device/system of the present invention can be provided as an audio-video conversion program for causing a computer to execute the respective steps, as a computer-readable recording medium storing the audio-video conversion program, as a program product including the conversion program and loadable into the internal memory of a computer, as a computer such as a server including the program, or the like.
- the repeater re-speaks the voice of the speaker, which is converted to characters via a speech recognition device; by delaying the video of the speaker, such as the speaker's facial expression, and displaying it on a screen together with the characters, the present invention provides an audio-video conversion device and method, and an audio-video conversion program, that make it easy for hearing-impaired persons and others to understand what the speaker has spoken.
- in a conference such as an international conference attended by a hearing-impaired person, a multilateral or bilateral conference, or the like, the voice of the speaker or the interpreter is re-spoken by the repeater and input to the speech recognition device.
- the invention can be applied to the translation and immediate transcription (information compensation) of international conferences and the like held in different languages, to conferences attended by hearing-impaired persons, to teaching support, and to transferring voices from a telephone to the repeater so as to provide the user with text information.
- by using a telecommunication circuit that communicates over a telecommunication line such as the Internet, the voice and video of the speaker can be transferred to an interpreter, a repeater, or a corrector at a remote location or at home.
- this allows an intervening repeater or interpreter to run a home business; furthermore, a disabled person at home who has difficulty going out can gain employment by becoming a repeater.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Machine Translation (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Studio Circuits (AREA)
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2003220916A AU2003220916A1 (en) | 2002-03-20 | 2003-03-19 | Audio video conversion apparatus and method, and audio video conversion program |
| US10/506,220 US20050228676A1 (en) | 2002-03-20 | 2003-03-19 | Audio video conversion apparatus and method, and audio video conversion program |
| EP03744531A EP1486949A4 (en) | 2002-03-20 | 2003-03-19 | AUDIO VIDEO TRANSFER DEVICE AND METHOD AND AUDIO VIDEO TRANSLATION PROGRAM |
| CA002479479A CA2479479A1 (en) | 2002-03-20 | 2003-03-19 | Audio video conversion apparatus, audio video conversion method, and audio video conversion program |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2002-77773 | 2002-03-20 | ||
| JP2002077773 | 2002-03-20 | ||
| JP2003-68440 | 2003-03-13 | ||
| JP2003068440A JP2003345379A (ja) | 2002-03-20 | 2003-03-13 | 音声映像変換装置及び方法、音声映像変換プログラム |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2003079328A1 true WO2003079328A1 (fr) | 2003-09-25 |
Family
ID=28043788
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2003/003305 Ceased WO2003079328A1 (fr) | 2002-03-20 | 2003-03-19 | Appareil, procede et programme de conversion audio video |
Country Status (7)
| Country | Link |
|---|---|
| US (1) | US20050228676A1 (en) |
| EP (1) | EP1486949A4 (en) |
| JP (1) | JP2003345379A (ja) |
| CN (1) | CN1262988C (zh) |
| AU (1) | AU2003220916A1 (en) |
| CA (1) | CA2479479A1 (en) |
| WO (1) | WO2003079328A1 (fr) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220335951A1 (en) * | 2019-09-27 | 2022-10-20 | Nec Corporation | Speech recognition device, speech recognition method, and program |
Families Citing this family (36)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6603835B2 (en) | 1997-09-08 | 2003-08-05 | Ultratec, Inc. | System for text assisted telephony |
| US8416925B2 (en) | 2005-06-29 | 2013-04-09 | Ultratec, Inc. | Device independent text captioned telephone service |
| US8515024B2 (en) | 2010-01-13 | 2013-08-20 | Ultratec, Inc. | Captioned telephone service |
| JP4761568B2 (ja) | 2004-05-12 | 2011-08-31 | 貴司 吉峰 | 会話支援装置 |
| JP2006240826A (ja) * | 2005-03-03 | 2006-09-14 | Mitsubishi Electric Corp | エレベータかご内表示装置 |
| US11258900B2 (en) | 2005-06-29 | 2022-02-22 | Ultratec, Inc. | Device independent text captioned telephone service |
| KR100856407B1 (ko) * | 2006-07-06 | 2008-09-04 | 삼성전자주식회사 | 메타 데이터를 생성하는 데이터 기록 및 재생 장치 및 방법 |
| US7844460B2 (en) * | 2007-02-15 | 2010-11-30 | Motorola, Inc. | Automatic creation of an interactive log based on real-time content |
| CN101309390B (zh) * | 2007-05-17 | 2012-05-23 | 华为技术有限公司 | 视讯通信系统、装置及其字幕显示方法 |
| WO2008154542A1 (en) * | 2007-06-10 | 2008-12-18 | Asia Esl, Llc | Program to intensively teach a second language using advertisements |
| US8149330B2 (en) * | 2008-01-19 | 2012-04-03 | At&T Intellectual Property I, L. P. | Methods, systems, and products for automated correction of closed captioning data |
| US8358328B2 (en) * | 2008-11-20 | 2013-01-22 | Cisco Technology, Inc. | Multiple video camera processing for teleconferencing |
| JP4930564B2 (ja) * | 2009-09-24 | 2012-05-16 | カシオ計算機株式会社 | 画像表示装置及び方法並びにプログラム |
| CN102934107B (zh) * | 2010-02-18 | 2016-09-14 | 株式会社尼康 | 信息处理装置、便携式装置以及信息处理系统 |
| US8670018B2 (en) | 2010-05-27 | 2014-03-11 | Microsoft Corporation | Detecting reactions and providing feedback to an interaction |
| US8963987B2 (en) * | 2010-05-27 | 2015-02-24 | Microsoft Corporation | Non-linguistic signal detection and feedback |
| JP5727777B2 (ja) | 2010-12-17 | 2015-06-03 | 株式会社東芝 | 会議支援装置および会議支援方法 |
| CN104424955B (zh) * | 2013-08-29 | 2018-11-27 | 国际商业机器公司 | 生成音频的图形表示的方法和设备、音频搜索方法和设备 |
| CN103632670A (zh) * | 2013-11-30 | 2014-03-12 | 青岛英特沃克网络科技有限公司 | 语音和文本消息自动转换系统及其方法 |
| US10878721B2 (en) | 2014-02-28 | 2020-12-29 | Ultratec, Inc. | Semiautomated relay method and apparatus |
| US12482458B2 (en) | 2014-02-28 | 2025-11-25 | Ultratec, Inc. | Semiautomated relay method and apparatus |
| US10389876B2 (en) | 2014-02-28 | 2019-08-20 | Ultratec, Inc. | Semiautomated relay method and apparatus |
| US20180270350A1 (en) | 2014-02-28 | 2018-09-20 | Ultratec, Inc. | Semiautomated relay method and apparatus |
| US20180034961A1 (en) | 2014-02-28 | 2018-02-01 | Ultratec, Inc. | Semiautomated Relay Method and Apparatus |
| US9741342B2 (en) * | 2014-11-26 | 2017-08-22 | Panasonic Intellectual Property Corporation Of America | Method and apparatus for recognizing speech by lip reading |
| KR102281341B1 (ko) * | 2015-01-26 | 2021-07-23 | 엘지전자 주식회사 | 싱크 디바이스 및 그 제어 방법 |
| US10397645B2 (en) * | 2017-03-23 | 2019-08-27 | Intel Corporation | Real time closed captioning or highlighting method and apparatus |
| US11017778B1 (en) * | 2018-12-04 | 2021-05-25 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
| US10573312B1 (en) * | 2018-12-04 | 2020-02-25 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
| CN110246501B (zh) * | 2019-07-02 | 2022-02-01 | 思必驰科技股份有限公司 | 用于会议记录的语音识别方法及系统 |
| US11132535B2 (en) * | 2019-12-16 | 2021-09-28 | Avaya Inc. | Automatic video conference configuration to mitigate a disability |
| US11539900B2 (en) | 2020-02-21 | 2022-12-27 | Ultratec, Inc. | Caption modification and augmentation systems and methods for use by hearing assisted user |
| CN111814451B (zh) * | 2020-05-21 | 2024-11-12 | 北京嘀嘀无限科技发展有限公司 | 文本处理方法、装置、设备和存储介质 |
| US11488604B2 (en) | 2020-08-19 | 2022-11-01 | Sorenson Ip Holdings, Llc | Transcription of audio |
| JP2023133782A (ja) * | 2022-03-14 | 2023-09-27 | 本田技研工業株式会社 | 音声認識テキスト表示システム、音声認識テキスト表示装置、音声認識テキスト表示方法およびプログラム |
| KR102583764B1 (ko) * | 2022-06-29 | 2023-09-27 | (주)액션파워 | 외국어가 포함된 오디오의 음성 인식 방법 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS63219067A (ja) * | 1987-03-09 | 1988-09-12 | Agency Of Ind Science & Technol | 辞書検索装置 |
| JPH0850698A (ja) * | 1994-08-05 | 1996-02-20 | Mazda Motor Corp | 音声対話型ナビゲーション装置 |
| JPH10234016A (ja) * | 1997-02-21 | 1998-09-02 | Hitachi Ltd | 映像信号処理装置及びそれを備えた映像表示装置及び記録再生装置 |
| JP2002010138A (ja) * | 2000-06-20 | 2002-01-11 | Nippon Telegr & Teleph Corp <Ntt> | 情報処理方法及び情報処理装置 |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5294982A (en) * | 1991-12-24 | 1994-03-15 | National Captioning Institute, Inc. | Method and apparatus for providing dual language captioning of a television program |
| US5701153A (en) * | 1994-01-14 | 1997-12-23 | Legal Video Services, Inc. | Method and system using time information in textual representations of speech for correlation to a second representation of that speech |
| CN1208969C (zh) * | 1998-03-31 | 2005-06-29 | 松下电器产业株式会社 | 传送装置以及传送方法 |
| US7110951B1 (en) * | 2000-03-03 | 2006-09-19 | Dorothy Lemelson, legal representative | System and method for enhancing speech intelligibility for the hearing impaired |
| GB2379312A (en) * | 2000-06-09 | 2003-03-05 | British Broadcasting Corp | Generation subtitles or captions for moving pictures |
| US7035797B2 (en) * | 2001-12-14 | 2006-04-25 | Nokia Corporation | Data-driven filtering of cepstral time trajectories for robust speech recognition |
-
2003
- 2003-03-13 JP JP2003068440A patent/JP2003345379A/ja not_active Withdrawn
- 2003-03-19 CA CA002479479A patent/CA2479479A1/en not_active Abandoned
- 2003-03-19 WO PCT/JP2003/003305 patent/WO2003079328A1/ja not_active Ceased
- 2003-03-19 AU AU2003220916A patent/AU2003220916A1/en not_active Abandoned
- 2003-03-19 EP EP03744531A patent/EP1486949A4/en not_active Withdrawn
- 2003-03-19 CN CN03806570.3A patent/CN1262988C/zh not_active Expired - Fee Related
- 2003-03-19 US US10/506,220 patent/US20050228676A1/en not_active Abandoned
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS63219067A (ja) * | 1987-03-09 | 1988-09-12 | Agency Of Ind Science & Technol | 辞書検索装置 |
| JPH0850698A (ja) * | 1994-08-05 | 1996-02-20 | Mazda Motor Corp | 音声対話型ナビゲーション装置 |
| JPH10234016A (ja) * | 1997-02-21 | 1998-09-02 | Hitachi Ltd | 映像信号処理装置及びそれを備えた映像表示装置及び記録再生装置 |
| JP2002010138A (ja) * | 2000-06-20 | 2002-01-11 | Nippon Telegr & Teleph Corp <Ntt> | 情報処理方法及び情報処理装置 |
Non-Patent Citations (3)
| Title |
|---|
| KATO ET AL.: "Onsei ninshiki gijutsu o mochiita chokaku shogaisha muke no kokusai kaigi sanka shien system no sekkei", THE JAPAN SOCIETY OF MECHANICAL ENGINEERS (NO. 02-6) ROBOTICS MECHATRONICS KOENKAI '02 KOEN RONBUNSHU, 7 June 2002 (2002-06-07), pages 1A1-C08, XP002968815 * |
| KOBAYASHI ET AL.: "Chokaku shogaisha no tame no onsei ninshiki o katsuyo shita real time jimaku sony system (2)", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS GIJUTSU KENKYU HOKOKU (ET2000-108), vol. 100, no. 600, 20 January 2001 (2001-01-20), pages 129 - 134, XP002968814 * |
| See also references of EP1486949A4 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220335951A1 (en) * | 2019-09-27 | 2022-10-20 | Nec Corporation | Speech recognition device, speech recognition method, and program |
Also Published As
| Publication number | Publication date |
|---|---|
| CN1262988C (zh) | 2006-07-05 |
| CA2479479A1 (en) | 2003-09-25 |
| AU2003220916A1 (en) | 2003-09-29 |
| US20050228676A1 (en) | 2005-10-13 |
| CN1643573A (zh) | 2005-07-20 |
| JP2003345379A (ja) | 2003-12-03 |
| EP1486949A4 (en) | 2007-06-06 |
| EP1486949A1 (en) | 2004-12-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2003079328A1 (fr) | Appareil, procede et programme de conversion audio video | |
| JP2003345379A6 (ja) | 音声映像変換装置及び方法、音声映像変換プログラム | |
| US9111545B2 (en) | Hand-held communication aid for individuals with auditory, speech and visual impairments | |
| US10885318B2 (en) | Performing artificial intelligence sign language translation services in a video relay service environment | |
| US8494859B2 (en) | Universal processing system and methods for production of outputs accessible by people with disabilities | |
| US6377925B1 (en) | Electronic translator for assisting communications | |
| US20090012788A1 (en) | Sign language translation system | |
| US20120209588A1 (en) | Multiple language translation system | |
| JP2005513619A (ja) | リアルタイム翻訳機および多数の口語言語のリアルタイム翻訳を行う方法 | |
| US11848026B2 (en) | Performing artificial intelligence sign language translation services in a video relay service environment | |
| CN116527840B (zh) | 一种基于云边协同的直播会议智能字幕显示方法和系统 | |
| US12243551B2 (en) | Performing artificial intelligence sign language translation services in a video relay service environment | |
| JP2009122989A (ja) | 翻訳装置 | |
| Priya et al. | Indian and English language to sign language translator-an automated portable two way communicator for bridging normal and deprived ones | |
| US20040012643A1 (en) | Systems and methods for visually communicating the meaning of information to the hearing impaired | |
| JP7152454B2 (ja) | 情報処理装置、情報処理方法、情報処理プログラム及び情報処理システム | |
| WO2024008047A1 (zh) | 数字人手语播报方法、装置、设备及存储介质 | |
| CN118520885A (zh) | 一种音视频翻译方法及系统 | |
| TWI795209B (zh) | 多種手語轉譯系統 | |
| Balamani et al. | IYAL: Real-Time Voice to Text Communication for the Deaf | |
| JP5424359B2 (ja) | 理解支援システム、支援端末、理解支援方法およびプログラム | |
| Zimmermann et al. | Internet Based Personal Services on Demand | |
| GB2342202A (en) | Simultaneous translation | |
| JP2025135936A (ja) | 会話支援装置、会話支援システム、会話支援方法、およびプログラム | |
| KR20240074329A (ko) | 청각 장애인을 위한 작업장 음성 지원 장치 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| WWE | Wipo information: entry into national phase |
Ref document number: 2479479 Country of ref document: CA |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2003744531 Country of ref document: EP |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 20038065703 Country of ref document: CN |
|
| WWP | Wipo information: published in national office |
Ref document number: 2003744531 Country of ref document: EP |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 10506220 Country of ref document: US |