CN111354362A - Method and device for assisting hearing-impaired communication - Google Patents

Method and device for assisting hearing-impaired communication Download PDF

Info

Publication number
CN111354362A
CN111354362A (application CN202010096561.1A)
Authority
CN
China
Prior art keywords
hearing
voice
impaired person
emotion
sign language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010096561.1A
Other languages
Chinese (zh)
Inventor
冯博豪
张小帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010096561.1A priority Critical patent/CN111354362A/en
Publication of CN111354362A publication Critical patent/CN111354362A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/117Biometrics derived from hands

Abstract

Embodiments of the disclosure provide a method and apparatus for assisting hearing-impaired people in communication. One embodiment of the method comprises: acquiring the voice of a voice user; converting the voice into text; performing natural language processing on the text and analyzing the emotion of the voice user; outputting the text and the emotion of the voice user in text form; acquiring a set of sign language images of a hearing-impaired person; inputting the sign language image set into a pre-trained neural network model and outputting the intention the hearing-impaired person expresses in sign language; and outputting the intention as speech and/or text. This embodiment helps a hearing-impaired person understand the meaning of a speaker and express personal opinions to a speaker who does not understand sign language.

Description

Method and device for assisting hearing-impaired communication
Technical Field
Embodiments of the disclosure relate to the field of computer technology, and in particular to a method and a device for assisting hearing-impaired people in communication.
Background
Disabled people, especially those with hearing impairments, cannot freely express their will because of physical limitations. In particular, their ability to participate in litigation is limited, and they may be unable to exercise the litigation rights the law grants them. When a people's court handles a case involving a disabled person, a sign language interpreter is generally invited to take part in the court trial in order to fully safeguard that person's litigation rights.
Currently, a sign language interpreter is often required when a hearing-impaired person communicates with a voice user. In court trials in particular, judges and lawyers tend to slow their speech deliberately so that the interpreter can convey the information to the hearing-impaired person. However, ambiguity is inevitable in the interpretation process, and in serious cases it can have adverse consequences.
The current practice of relying on sign language interpretation has the following problems:
(1) Finding a suitable sign language interpreter is not easy. The interpreter must understand the professional vocabulary and know how to express it in sign language. If the meaning drifts during this intermediate step, the results can be unpredictable.
(2) Interaction efficiency is low. Every time the voice user speaks a sentence, the interpreter must translate it for the hearing-impaired person and repeatedly confirm that the hearing-impaired person has understood the voice user's meaning. Likewise, when expressing personal opinions, the hearing-impaired person must repeatedly confirm that the interpreter has understood them.
(3) Expression is constrained. In courtroom debate it is often necessary to continually explain and clarify the points in dispute. A hearing-impaired person's means of expression are limited, making it difficult to state personal opinions or convey attitudes in a timely manner, so the person is often at a disadvantage throughout the courtroom debate.
(4) While the hearing-impaired person and the sign language interpreter communicate in sign language, everyone else can only wait patiently and has no way of knowing what is being exchanged between them.
Disclosure of Invention
Embodiments of the present disclosure provide a method and an apparatus for assisting hearing-impaired communication.
In a first aspect, embodiments of the present disclosure provide a method for assisting hearing-impaired communication, comprising: acquiring the voice of a voice user; converting the voice into text; performing natural language processing on the text and analyzing the emotion of the voice user; outputting the text and the emotion of the voice user in text form; acquiring a sign language image set of a hearing-impaired person; inputting the sign language image set into a pre-trained neural network model and outputting the intention the hearing-impaired person expresses in sign language; and outputting the intention in speech form and/or text form.
In some embodiments, performing natural language processing on the text further comprises: correcting the text through natural language processing, wherein the corrected content comprises at least one of the following: incoherent sentences, grammatical errors, and wrongly written characters.
In some embodiments, inputting the sign language image set into a pre-trained neural network model and outputting the intention the hearing-impaired person expresses in sign language comprises: performing image processing on the sign language image set to obtain a gesture image set and a contour image set; inputting the gesture image set into a pre-trained gesture classifier to obtain a gesture classification result; inputting the contour image set into a pre-trained contour classifier to obtain a contour classification result; and fusing the gesture classification result and the contour classification result into the intention the hearing-impaired person expresses in sign language.
In some embodiments, before outputting the intention in speech form, the method further comprises: receiving an emotion selected by the hearing-impaired person; and synthesizing the intention into speech according to the emotion selected by the hearing-impaired person.
In some embodiments, before outputting the intention in speech form, the method further comprises: performing natural language processing on the intention and analyzing the emotion of the hearing-impaired person; and synthesizing the intention into speech according to the emotion of the hearing-impaired person.
In some embodiments, outputting the intention in text form comprises: receiving an emotion selected by the hearing-impaired person, or performing natural language processing on the intention and analyzing the emotion of the hearing-impaired person; and outputting the intention together with the emotion of the hearing-impaired person.
In some embodiments, the method further comprises: saving the voice of the voice user and the speech recognition result, together with the sign language image set of the hearing-impaired person and the sign language recognition result, as an interaction record.
In some embodiments, the method further comprises: receiving a query request input by the hearing-impaired person; and searching pre-stored knowledge or interaction records in a database according to the query.
In a second aspect, embodiments of the present disclosure provide an apparatus for assisting hearing-impaired communication, comprising: a voice acquisition unit configured to acquire the voice of a voice user; a voice conversion unit configured to convert the voice into text; a natural language processing unit configured to perform natural language processing on the text and analyze the emotion of the voice user; an output unit configured to output the text and the emotion of the voice user in text form; an image acquisition unit configured to acquire a sign language image set of a hearing-impaired person; and a sign language recognition unit configured to input the sign language image set into a pre-trained neural network model and output the intention the hearing-impaired person expresses in sign language; the output unit is further configured to output the intention in speech form and/or text form.
In some embodiments, the natural language processing unit is further configured to: correct the text through natural language processing, wherein the corrected content comprises at least one of the following: incoherent sentences, grammatical errors, and wrongly written characters.
In some embodiments, the sign language recognition unit is further configured to: perform image processing on the sign language image set to obtain a gesture image set and a contour image set; input the gesture image set into a pre-trained gesture classifier to obtain a gesture classification result; input the contour image set into a pre-trained contour classifier to obtain a contour classification result; and fuse the gesture classification result and the contour classification result into the intention the hearing-impaired person expresses in sign language.
In some embodiments, the apparatus further comprises an interaction unit configured to: receive an emotion selected by the hearing-impaired person before the intention is output in speech form; and synthesize the intention into speech according to the emotion selected by the hearing-impaired person.
In some embodiments, the natural language processing unit is further configured to: before the intention is output in speech form, perform natural language processing on the intention and analyze the emotion of the hearing-impaired person; and synthesize the intention into speech according to the emotion of the hearing-impaired person.
In some embodiments, the output unit is further configured to: receive an emotion selected by the hearing-impaired person, or perform natural language processing on the intention and analyze the emotion of the hearing-impaired person; and output the intention together with the emotion of the hearing-impaired person.
In some embodiments, the apparatus further comprises a recording unit configured to: save the voice of the voice user and the speech recognition result, together with the sign language image set of the hearing-impaired person and the sign language recognition result, as an interaction record.
In some embodiments, the apparatus further comprises a querying unit configured to: receive a query request input by the hearing-impaired person; and search pre-stored knowledge or interaction records in a database according to the query.
In a third aspect, an embodiment of the present disclosure provides an electronic device for assisting hearing-impaired communication, comprising: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of the first aspect.
In a fourth aspect, embodiments of the disclosure provide a computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of the first aspect.
The invention combines multiple artificial intelligence technologies, such as speech recognition, sign language image recognition, natural language processing, and emotion analysis, to help hearing-impaired people understand the meaning of a voice user and remove communication barriers. In addition, the invention can help hearing-impaired people express personal opinions. The method does not rely on human interpretation and improves the efficiency of communication between the hearing-impaired person and the voice user.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method for assistive hearing impaired communication according to the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a method for assisted hearing impaired communication according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a method for assistive hearing impaired communication according to the present disclosure;
FIG. 5 is a schematic diagram illustrating one embodiment of an apparatus for assistive hearing impaired communication according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the disclosed method for assistive hearing-impaired communication or apparatus for assistive hearing-impaired communication may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102 to interact with the server 104 over the network 103 to receive or send messages or the like.
The terminal device 101 is equipped with a microphone and a speaker. The microphone can collect the voice of the voice user, and the speaker can play speech synthesized from text converted from sign language. The terminal device 101 may also be equipped with a camera that captures the voice user's mouth movements, so that a hearing-impaired person who is far from the voice user can still read the speech from the mouth shape. A camera-equipped terminal device 101 can likewise be used to collect sign language images of the hearing-impaired person.
The terminal device 102 is provided with a camera for collecting sign language images of the hearing-impaired person. It may also be equipped with a microphone and a speaker, providing the same functions as the terminal device 101.
The server 104 may be a server that provides various services, for example performing speech recognition on the speech collected by the terminal device 101, converting it into text, and outputting the text on the screens of the terminal devices 101 and 102, and also on a publicly visible large screen. The server 104 also performs sign language recognition on the sign language images acquired by the terminal device 102; the recognized intention of the hearing-impaired person is then displayed as text on the screens of the terminal devices 101 and 102. The server 104 may further synthesize the text into speech and play it for the voice user through the speaker of the terminal device 101.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for assisting the hearing-impaired person to communicate provided by the embodiment of the present disclosure is generally performed by the server 104, and accordingly, the apparatus for assisting the hearing-impaired person to communicate is generally disposed in the server 104.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for assisted hearing impaired communication is shown in accordance with the present disclosure. The method for assisting the hearing-impaired person to communicate comprises the following steps:
step 201, acquiring the voice of the voice user.
In the present embodiment, the execution subject of the method for assisting the hearing-impaired person to communicate (e.g., the server shown in fig. 1) can collect the voice of the voice user through a microphone. For example, in a court trial, the voice of a judge, a lawyer, or other persons in the courtroom is acquired in real time.
Step 202, converting the speech into text.
In this embodiment, the collected voice information is converted into text so that the content can be analyzed subsequently. The speech conversion uses either an RNN-CTC model or an LSTM-DNN model; these models are trained on a large Chinese speech dataset.
The whole speech conversion process consists of the following steps (a minimal code sketch of the pipeline is given after the list):
1. Signal processing. Ordinary speech segments often contain a lot of noise, so preprocessing such as noise elimination and channel enhancement is required before the speech data enters the acoustic model.
2. Feature extraction. The signal is converted from the time domain to the frequency domain, and effective feature vectors are extracted for the subsequent acoustic model.
3. Converting feature vectors into pinyin. The acoustic model converts the feature vectors obtained in the preprocessing stage into a Chinese pinyin vector and gives a confidence score.
4. Converting the pinyin vector into text. The language model converts the Chinese pinyin vector produced by the acoustic model into Chinese text.
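The sketch below only illustrates this four-stage pipeline under stated assumptions: the class names and the simplified feature extraction are hypothetical placeholders rather than the actual implementation, and a trained RNN-CTC or LSTM-DNN acoustic model plus a Chinese language model would fill in the two stubs.

```python
# Illustrative sketch of the four-stage speech-to-text pipeline described above.
# Class names and feature-extraction details are hypothetical placeholders.
import numpy as np


def extract_features(waveform: np.ndarray, frame_size: int = 256) -> np.ndarray:
    """Stages 1-2: rough noise removal and time-to-frequency conversion."""
    denoised = waveform - np.mean(waveform)             # stand-in for real denoising
    usable = (len(denoised) // frame_size) * frame_size
    frames = denoised[:usable].reshape(-1, frame_size)
    return np.abs(np.fft.rfft(frames, axis=1))          # crude spectral feature vectors


class AcousticModel:
    """Stage 3: maps feature vectors to pinyin tokens with confidence scores."""
    def predict(self, features: np.ndarray) -> list[tuple[str, float]]:
        raise NotImplementedError("plug in a trained RNN-CTC / LSTM-DNN model")


class LanguageModel:
    """Stage 4: maps the pinyin sequence to Chinese text."""
    def decode(self, pinyin: list[tuple[str, float]]) -> str:
        raise NotImplementedError("plug in a trained language model")


def speech_to_text(waveform: np.ndarray, am: AcousticModel, lm: LanguageModel) -> str:
    features = extract_features(waveform)
    pinyin = am.predict(features)
    return lm.decode(pinyin)
```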
Step 203, performing natural language processing on the text and analyzing the emotion of the voice user.
In this embodiment, emotion analysis is performed on the text converted from speech, so that the hearing-impaired person, who only sees the text, can tell whether the speech of the judge or lawyer carries a positive emotion (e.g., happy, cheerful) or a negative emotion (e.g., sad, pained, angry). The model used for emotion analysis may be a BERT model or an ERNIE model. Both are pre-trained models whose training data consist of words, entities, and entity relations from encyclopedia data. Both achieve high accuracy on emotion analysis tasks, exceeding 95%. A rough sketch of this step is given below.
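As an illustration only, the text could be passed through a pre-trained Chinese text classifier, for example via the Hugging Face transformers library. The model path below is a placeholder, not a real checkpoint; a BERT or ERNIE model fine-tuned for emotion classification would be substituted in practice.

```python
# Sketch of emotion analysis on the recognized text. The model path is a
# hypothetical placeholder for a fine-tuned BERT or ERNIE checkpoint.
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="path/to/chinese-emotion-model",  # placeholder checkpoint
)


def analyze_emotion(text: str) -> tuple[str, float]:
    """Return the predicted emotion label and its confidence score."""
    result = emotion_classifier(text)[0]    # e.g. {"label": "negative", "score": 0.97}
    return result["label"], result["score"]
```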
Optionally, the speech recognition output may be corrected. The text obtained from speech conversion may contain incoherent sentences, grammatical errors, wrongly written characters, and similar problems; the natural language processing module can correct these errors.
Step 204, outputting the text and the emotion of the voice user in text form.
In this embodiment, the result of the natural language processing is displayed in real time on a large on-site screen, and the emotion of the voice user is also displayed in text form, for the hearing-impaired person and other participants to view.
The result of the natural language processing can be projected onto the large screen, and it can also be output to the information interaction interface of the terminal device used by the voice user and to that of the terminal device used by the hearing-impaired user; it can further be stored in a database through a recording unit for subsequent query.
Optionally, the terminal device used by the voice user may also be equipped with a keyboard, through which the voice user can correct errors in the speech recognition.
Step 205, acquiring a sign language image set of the hearing-impaired person.
In this embodiment, a real-time sign language video of the hearing-impaired person, captured by the camera, is obtained. A video is a collection of a large number of images.
Step 206, inputting the sign language image set into a pre-trained neural network model and outputting the intention the hearing-impaired person expresses in sign language.
In this embodiment, the class to which each image belongs can be determined by a pre-trained classifier, and the classes of the individual images are then combined to obtain the intention the hearing-impaired person wants to express.
Sign language recognition can be realized with a convolutional neural network method that fuses multi-modal data: features are extracted from the gesture images and contour images in both the spatial and temporal dimensions, and a convolutional neural network then performs the classification.
For example, inputting the sign language image set into a pre-trained neural network model and outputting the intention the hearing-impaired person expresses in sign language comprises: performing image processing on the sign language image set to obtain a gesture image set and a contour image set; inputting the gesture image set into a pre-trained gesture classifier to obtain a gesture classification result; inputting the contour image set into a pre-trained contour classifier to obtain a contour classification result; and fusing the gesture classification result and the contour classification result into the intention the hearing-impaired person expresses in sign language. The collected images are color images; after processing by an image processing algorithm they become black-and-white contour images and color gesture images. A contour image captures changes in the overall outline without detail changes, while a gesture image records the detail changes. The gesture classification result is a category and a probability, and so is the contour classification result. Weights for the gesture and contour classification results can be preset; the weighted sum of the probabilities of each category is then computed, and the category with the largest weighted sum is taken as the fused intention the hearing-impaired person expresses in sign language, as shown in the sketch below.
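A minimal sketch of this weighted fusion follows. It assumes each classifier returns a probability per intent category and that the two weights are preset configuration values; the example categories and weights are illustrative only.

```python
# Sketch of fusing the gesture and contour classification results by a
# preset weighted sum of category probabilities. Categories and weights
# here are illustrative only.
def fuse_results(gesture_probs: dict[str, float],
                 contour_probs: dict[str, float],
                 gesture_weight: float = 0.6,
                 contour_weight: float = 0.4) -> str:
    """Return the category whose weighted probability sum is largest."""
    categories = set(gesture_probs) | set(contour_probs)
    weighted = {
        c: gesture_weight * gesture_probs.get(c, 0.0)
           + contour_weight * contour_probs.get(c, 0.0)
        for c in categories
    }
    return max(weighted, key=weighted.get)


# Example: both classifiers lean toward the same intent category.
intent = fuse_results({"object to the claim": 0.7, "agree": 0.3},
                      {"object to the claim": 0.6, "agree": 0.4})
print(intent)  # -> "object to the claim"
```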
The training data for sign language recognition is from a database. The data recorded after each communication may be saved in a database for later model training.
Step 207, outputting the intention in a voice mode and/or a text mode.
In this embodiment, the sign language recognition result in text form may be output directly to the public large screen, or to the individual terminal devices. The text can also be synthesized into speech and output through a speaker. Speech and text may be output simultaneously.
Optionally, the hearing-impaired person may correct text produced by incorrect sign language recognition.
The on-site audio recording, the sign language images, and the emotion-annotated text produced by the system can be organized and stored. For court trials, this makes it easier for the court clerk to compile the trial record. In trials that rely on an invited sign language interpreter, the clerk can only take notes based on what the interpreter relays, which loses information and is inefficient.
The invention can record, in real time, everything said by all participants, including the hearing-impaired person, which can greatly improve the efficiency of court trial recording.
The invention processes the court trial content and then stores the data in the database.
The database stores a large amount of information such as legal terms and other professional terminology, historical sign language image data, and historical voice data. It can provide training samples for sign language recognition and speech recognition, and it can help the hearing-impaired person complete information queries. Both the hearing-impaired person and the voice user can perform queries. The speaker's content can be displayed in real time, together with an emotion score, on the information interaction interface shown on the screen in front of the user. A sketch of such a query is given below.
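As an illustration of the query function, the sketch below assumes the knowledge base and interaction records are kept in a simple SQLite store; the table and column names are hypothetical.

```python
# Sketch of querying pre-stored legal terms and past interaction records.
# Assumes an SQLite store; table and column names are hypothetical.
import sqlite3


def query_terms(db_path: str, keyword: str) -> list[tuple[str, str]]:
    """Return (term, explanation) pairs whose term contains the keyword."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT term, explanation FROM legal_terms WHERE term LIKE ?",
            (f"%{keyword}%",),
        )
        return cur.fetchall()


def query_interactions(db_path: str, keyword: str) -> list[tuple[str, str]]:
    """Return (timestamp, text) pairs of saved interaction records matching the keyword."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT created_at, text FROM interaction_records WHERE text LIKE ?",
            (f"%{keyword}%",),
        )
        return cur.fetchall()
```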
Besides the text information, the invention also converts the speaker's content into sign language displayed on the interactive interface, for hearing-impaired people with a lower literacy level to view. Text or speech can be converted into sign language by a pre-trained neural network. Through the information interaction interface, the hearing-impaired person can also look up legal terms, courtroom terminology, and other entries stored in the information database, and the voice user can query the historical conversation record.
The module also has an information input function: the hearing-impaired person can enter sign language or text through the information interaction module, and the content is converted into speech information and text information. The interaction module also provides emotion options, from which the hearing-impaired person can select the appropriate emotion as needed.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for assisting the hearing-impaired person to communicate according to the present embodiment. In the application scenario of fig. 3, the voice user says: "Please state your name, date of birth, nationality, education level, occupation, work unit, job title, and home address." The microphone in front of the voice user collects the speech and sends it to the server for speech recognition. The server outputs the recognized text to a screen visible to everyone. After reading it, the hearing-impaired person understands what the voice user said and responds in sign language. The camera in front of the hearing-impaired person captures the sign language images and sends them to the server. The server recognizes the sign language and outputs the recognized text to the large screen. The server can also synthesize the text into speech and play it for the voice user and other non-hearing-impaired people to hear.
With further reference to fig. 4, a flow diagram 400 of yet another embodiment of a method for assistive hearing impaired communication is shown. The process 400 of the method for assistive hearing impaired communication includes the steps of:
step 401, acquiring the voice of the voice user.
Step 402, converting the speech into text.
Step 403, performing natural language processing on the text and analyzing the emotion of the voice user.
Step 404, outputting the text and the emotion of the voice user in text form.
Step 405, acquiring a sign language image set of the hearing-impaired person.
Step 406, inputting the sign language image set into a pre-trained neural network model and outputting the intention the hearing-impaired person expresses in sign language.
Steps 401 to 406 are substantially the same as steps 201 to 206 and are therefore not described again.
Step 407, obtaining the emotion of the hearing-impaired person.
In this embodiment, there are two ways to obtain the emotion of the hearing-impaired person: 1. the hearing-impaired person makes an emotion selection through the interactive interface, which may provide emotion options (such as sadness, pain, or anger) for the emotion the speech should carry when it is played; 2. the emotion contained in the text information is analyzed through natural language processing.
Step 408, outputting the intention and the emotion of the hearing-impaired person in speech form and/or text form.
In the present embodiment, if the intention is output as text, the emotion is also output as text. If the intention is output as speech, speech with the desired tone and expression is synthesized according to the emotion. The intention and emotion can also be output in both ways simultaneously.
The speech synthesis may employ a common neural network model, such as Baidu's Deep Voice model. This model is divided into five parts: phoneme boundary segmentation, grapheme-to-phoneme conversion, phoneme duration prediction, fundamental frequency prediction, and audio synthesis. Each component is a small neural network. A sketch of emotion-conditioned output is given below.
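The sketch below does not reproduce the Deep Voice pipeline itself; it only illustrates how the recognized intention text and the selected or inferred emotion could be combined before calling a neural TTS backend. The preset values and the `synthesize` stub are hypothetical.

```python
# Sketch of emotion-conditioned speech output. The presets and the
# `synthesize` stub are hypothetical; a trained neural TTS model
# (e.g. a Deep Voice-style system) would replace the stub.
EMOTION_PRESETS = {
    "neutral": {"pitch_scale": 1.00, "rate": 1.00},
    "sad":     {"pitch_scale": 0.90, "rate": 0.85},
    "angry":   {"pitch_scale": 1.10, "rate": 1.15},
}


def synthesize(text: str, pitch_scale: float, rate: float) -> bytes:
    """Stub for a neural TTS backend that returns audio bytes."""
    raise NotImplementedError("call a trained neural TTS model here")


def speak_intent(intent_text: str, emotion: str = "neutral") -> bytes:
    """Synthesize the hearing-impaired person's intention with the chosen emotion."""
    preset = EMOTION_PRESETS.get(emotion, EMOTION_PRESETS["neutral"])
    return synthesize(intent_text, **preset)
```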
The invention greatly simplifies the process by which a hearing-impaired person participates in an inquiry dialogue. Specifically, the invention has the following advantages:
1. High efficiency. Because no sign language interpreter is needed as an intermediary to relay information, the parties interact directly, which greatly improves the overall efficiency of the inquiry dialogue and greatly shortens its duration.
2. More accurate information transfer with less loss of intermediate information. Because no sign language interpreter is involved, information can be conveyed more accurately.
3. Real-time conversion, because the system's recognition and conversion are very fast. This reduces the disadvantage the hearing-impaired person suffers because of hearing loss and strengthens their ability to protect their personal rights and interests.
4. The system integrates voice recording, real-time on-site video recording, text conversion, and related processes, which can greatly reduce the workload of the court clerk.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for assisting the communication of the hearing impaired, which corresponds to the method embodiment shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for assisting the hearing-impaired person to communicate of the present embodiment includes: a voice acquisition unit 501, a voice conversion unit 502, a natural language processing unit 503, an output unit 504, an image acquisition unit 505, and a sign language recognition unit 506. The voice acquisition unit 501 is configured to acquire the voice of a voice user; the voice conversion unit 502 is configured to convert the voice into text; the natural language processing unit 503 is configured to perform natural language processing on the text and analyze the emotion of the voice user; the output unit 504 is configured to output the text and the emotion of the voice user in text form; the image acquisition unit 505 is configured to acquire a sign language image set of the hearing-impaired person; and the sign language recognition unit 506 is configured to input the sign language image set into a pre-trained neural network model and output the intention the hearing-impaired person expresses in sign language. The output unit 504 is further configured to output the intention in speech form and/or text form.
In some optional implementations of this embodiment, the natural language processing unit 503 is further configured to: correct the text through natural language processing, wherein the corrected content comprises at least one of the following: incoherent sentences, grammatical errors, and wrongly written characters.
In some optional implementations of this embodiment, the sign language recognition unit 506 is further configured to: perform image processing on the sign language image set to obtain a gesture image set and a contour image set; input the gesture image set into a pre-trained gesture classifier to obtain a gesture classification result; input the contour image set into a pre-trained contour classifier to obtain a contour classification result; and fuse the gesture classification result and the contour classification result into the intention the hearing-impaired person expresses in sign language.
In some optional implementations of this embodiment, the apparatus 500 further comprises an interaction unit (not shown in the drawings) configured to: receiving the emotion selected by the hearing-impaired person before outputting the intention in a voice mode; the intention is synthesized to speech according to the emotion selected by the hearing impaired.
In some optional implementations of this embodiment, the natural language processing unit 503 is further configured to: before outputting the intention in a voice mode, carrying out natural language processing on the intention, and analyzing the emotion of the hearing-impaired person; and synthesizing the intention into voice according to the emotion of the hearing-impaired person.
In some optional implementations of this embodiment, the output unit is further configured to: receiving the emotion selected by the hearing-impaired person, or carrying out natural language processing on the intention, and analyzing the emotion of the hearing-impaired person; and outputting the intention and the emotion of the hearing-impaired person.
In some optional implementations of this embodiment, the apparatus 500 further comprises a recording unit (not shown in the drawings) configured to: and saving the voice of the voice user and the result of the voice recognition, the sign language image set of the hearing-impaired person and the result of the sign language recognition as an interaction record.
In some optional implementations of this embodiment, the apparatus 500 further comprises a querying unit (not shown in the drawings) configured to: receiving a query request input by a hearing-impaired person; and searching the pre-stored knowledge or interaction records in the database according to the query.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device/server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring the voice of a voice user; converting the speech into text; performing natural language processing on the text, and analyzing the emotion of the voice user; outputting the emotion of the text and voice user in a text mode; acquiring a sign language image set of a hearing-impaired person; inputting the sign language image set into a pre-trained neural network model, and outputting the intention of the hearing-impaired person expressed by the sign language; the intent is output in a phonetic and/or textual manner.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a voice acquisition unit, a voice conversion unit, a natural language processing unit, an output unit, an image acquisition unit, and a sign language recognition unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the voice acquiring unit may also be described as a "unit that acquires the voice of a voice user".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with features disclosed in this disclosure (but not limited thereto) that have similar functions.

Claims (18)

1. A method for assistive hearing impaired communication, comprising:
acquiring the voice of a voice user;
converting the speech to text;
performing natural language processing on the text, and analyzing the emotion of the voice user;
outputting the text and the emotion of the voice user in a text mode;
acquiring a sign language image set of a hearing-impaired person;
inputting the sign language image set into a pre-trained neural network model, and outputting the intention of the hearing-impaired person expressed by the sign language;
outputting the intention in a voice manner and/or a text manner.
2. The method of claim 1, wherein performing natural language processing on the text further comprises:
correcting the text through natural language processing, wherein the corrected content comprises at least one of the following:
incoherent sentences, grammatical errors, and wrongly written characters.
3. The method of claim 1, wherein the inputting the set of sign language images into a pre-trained neural network model, outputting the hearing impaired's intent expressed in a sign language, comprises:
carrying out image processing on the sign language image set to obtain a gesture image set and a contour image set;
inputting the gesture image set into a pre-trained gesture classifier to obtain a gesture classification result;
inputting the contour image set into a pre-trained contour classifier to obtain a contour classification result;
and fusing the gesture classification result and the contour classification result into the intention of the hearing-impaired person expressed in sign language.
4. The method of claim 1, wherein prior to outputting the intent phonetically, the method further comprises:
receiving the emotion selected by the hearing-impaired person;
and synthesizing the intention into voice according to the emotion selected by the hearing-impaired person.
5. The method of claim 1, wherein prior to outputting the intent phonetically, the method further comprises:
performing natural language processing on the intention, and analyzing the emotion of the hearing-impaired person;
and synthesizing the intention into voice according to the emotion of the hearing-impaired person.
6. The method of claim 1, wherein outputting the intent textually comprises:
receiving the emotion selected by the hearing-impaired person, or performing natural language processing on the intention, and analyzing the emotion of the hearing-impaired person;
outputting the intention and the emotion of the hearing impaired person.
7. The method according to one of claims 1-6, wherein the method further comprises:
and storing the voice of the voice user and the result of the voice recognition, the sign language image set of the hearing-impaired person and the result of the sign language recognition as an interactive record.
8. The method of claim 7, wherein the method further comprises:
receiving a query request input by the hearing-impaired person;
and searching the pre-stored knowledge or interaction records in the database according to the query.
9. An apparatus for assistive hearing impaired communication, comprising:
a voice acquisition unit configured to acquire a voice of a voice user;
a voice conversion unit configured to convert the voice into text;
the natural language processing unit is configured to perform natural language processing on the text and analyze the emotion of the voice user;
an output unit configured to output the text and the emotion of the voice user in a text manner;
an image acquisition unit configured to acquire a sign language image set of an hearing-impaired person;
a sign language recognition unit configured to input the sign language image set into a pre-trained neural network model and output an intention of the hearing impaired person expressed in a sign language;
the output unit is further configured to output the intent in a speech manner and/or a text manner.
10. The apparatus of claim 9, wherein the natural language processing unit is further configured to:
correcting the text through natural language processing, wherein the corrected content comprises at least one of the following:
incoherent sentences, grammatical errors, and wrongly written characters.
11. The apparatus of claim 9, wherein the sign language recognition unit is further configured to:
carrying out image processing on the sign language image set to obtain a gesture image set and a contour image set;
inputting the gesture image set into a pre-trained gesture classifier to obtain a gesture classification result;
inputting the contour image set into a pre-trained contour classifier to obtain a contour classification result;
and fusing the gesture classification result and the contour classification result into the intention of the hearing-impaired person expressed in sign language.
12. The apparatus of claim 9, wherein the apparatus further comprises an interaction unit configured to:
receiving an emotion selected by the hearing impaired person before outputting the intent in speech;
and synthesizing the intention into voice according to the emotion selected by the hearing-impaired person.
13. The apparatus of claim 9, wherein the natural language processing unit is further configured to:
before outputting the intention in a voice mode, carrying out natural language processing on the intention, and analyzing the emotion of the hearing-impaired person;
and synthesizing the intention into voice according to the emotion of the hearing-impaired person.
14. The apparatus of claim 9, wherein the output unit is further configured to:
receiving the emotion selected by the hearing-impaired person, or performing natural language processing on the intention, and analyzing the emotion of the hearing-impaired person;
outputting the intention and the emotion of the hearing impaired person.
15. The apparatus according to one of claims 9-14, wherein the apparatus further comprises a recording unit configured to:
and storing the voice of the voice user and the result of the voice recognition, the sign language image set of the hearing-impaired person and the result of the sign language recognition as an interactive record.
16. The apparatus of claim 15, wherein the apparatus further comprises a querying unit configured to:
receiving a query request input by the hearing-impaired person;
and searching the pre-stored knowledge or interaction records in the database according to the query.
17. An electronic device for assistive hearing impaired communication, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
18. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-8.
CN202010096561.1A 2020-02-14 2020-02-14 Method and device for assisting hearing-impaired communication Pending CN111354362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096561.1A CN111354362A (en) 2020-02-14 2020-02-14 Method and device for assisting hearing-impaired communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096561.1A CN111354362A (en) 2020-02-14 2020-02-14 Method and device for assisting hearing-impaired communication

Publications (1)

Publication Number Publication Date
CN111354362A true CN111354362A (en) 2020-06-30

Family

ID=71196997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096561.1A Pending CN111354362A (en) 2020-02-14 2020-02-14 Method and device for assisting hearing-impaired communication

Country Status (1)

Country Link
CN (1) CN111354362A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN203819120U (en) * 2013-11-19 2014-09-10 浙江吉利汽车研究院有限公司 Device for assisting deaf in taking bus
CN104853257A (en) * 2015-04-30 2015-08-19 北京奇艺世纪科技有限公司 Subtitle display method and device
CN108597522A (en) * 2018-05-10 2018-09-28 北京奇艺世纪科技有限公司 A kind of method of speech processing and device
CN109063624A (en) * 2018-07-26 2018-12-21 深圳市漫牛医疗有限公司 Information processing method, system, electronic equipment and computer readable storage medium
CN110070065A (en) * 2019-04-30 2019-07-30 李冠津 The sign language systems and the means of communication of view-based access control model and speech-sound intelligent
CN110322760A (en) * 2019-07-08 2019-10-11 北京达佳互联信息技术有限公司 Voice data generation method, device, terminal and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536947A (en) * 2021-06-21 2021-10-22 中山市希道科技有限公司 Face attribute analysis method and device
CN113781876A (en) * 2021-08-05 2021-12-10 深兰科技(上海)有限公司 Method and device for converting text into sign language action video
CN113781876B (en) * 2021-08-05 2023-08-29 深兰科技(上海)有限公司 Conversion method and device for converting text into sign language action video


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination