CN108615009B

CN108615009B - Sign language translation and communication system based on dynamic gesture recognition

Info

Publication number: CN108615009B
Application number: CN201810373367.6A
Authority: CN
Inventors: 吕蕾; 李燕; 张凯; 张桂娟; 刘弘
Original assignee: Shandong Normal University
Current assignee: Shandong center information technology Limited by Share Ltd.
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2019-07-23
Anticipated expiration: 2038-04-24
Also published as: CN108615009A

Abstract

The invention discloses a sign language translation and communication system based on dynamic gesture recognition, comprising a sign language translation module and a speech recognition module. The sign language translation module comprises, connected in sequence, a sign language gesture acquisition module, a sign language gesture recognition module, a first text display module, and a voice playback module. The speech recognition module comprises, connected in sequence, a voice acquisition module, a speech recognition module, a second text display module, and a sign language animation demonstration module. The invention addresses real-time communication among deaf-mute people and between deaf-mute people and hearing people.

Description

Sign language translation and communication system based on dynamic gesture recognition

Technical field

The present invention relates to a sign language translation and communication system, and in particular to a sign language translation and communication system based on dynamic gesture recognition.

Background art

The number of deaf-mute people in China is very large, on the order of ten million, and it grows every year. Because most hearing people do not understand sign language, there is a major barrier to communication between deaf-mute people and hearing people.

Sign language translation in China is still at an early stage, and some sign language translation systems remain at the laboratory stage or even the conceptual stage.

Existing sign language translation systems mainly comprise data gloves, armbands, and Kinect-based products. A recognition system based on data gloves requires the user to wear complicated gloves, is not portable, and is expensive. Armband recognition is not continuous and requires the user's signing to be highly standardized, so it does not achieve good recognition results. Kinect-based systems cannot be carried around because the equipment is too bulky.

Summary of the invention

To overcome the shortcomings of the prior art, the present invention provides a sign language translation and communication system based on dynamic gesture recognition. Compared with existing systems, it offers higher translation accuracy, more complete functionality, simpler operation, and lower cost.

To achieve the above object, the technical solution of the present invention is as follows:

A sign language translation and communication system based on dynamic gesture recognition comprises a sign language translation module and a speech recognition module.

The sign language translation module is configured to:

obtain dynamic sign language gestures;

perform feature extraction and temporal modeling on the sign language gesture images and output a time series of sign language labels, completing recognition of the sign language gestures;

synthesize a sentence from the sign language labels, show the sentence on a display screen, and at the same time play the sentence as speech.

The speech recognition module is configured to:

obtain speech;

perform recognition on the speech data to obtain discrete words;

retrieve the gesture animations corresponding to the discrete words from a sign language gesture animation database, splice them together, and demonstrate them on the display screen.

Further, the sign language translation module comprises, connected in sequence, a sign language gesture acquisition module, a sign language gesture recognition module, a first text display module, and a voice playback module; the speech recognition module comprises, connected in sequence, a voice acquisition module, a speech recognition module, a second text display module, and a sign language animation demonstration module.

Further, the sign language gesture acquisition module captures dynamic sign language gestures with a camera, resizes each captured frame to a uniform size, and stores the frames as arrays in a memory queue.

Further, the sign language gesture recognition module uses a multilayer convolutional neural network to extract multiple feature maps from the sign language gesture images in the memory queue, uses a long short-term memory (LSTM) neural network to model the extracted feature maps over time, and outputs a time series of sign language labels, completing recognition of the sign language gestures.

Further, the first text display module uses a recurrent neural network to synthesize a sentence from the sign language labels output by the sign language gesture recognition module, and shows the sentence on the display screen.

Further, the sentence synthesis process includes:

forming discrete words from the sign language labels and, using an existing Chinese corpus database, selecting the template sentence most similar to the input discrete words to generate an initial sentence;

initializing the sentence through a recurrent neural network structure and revising it through network iterations, using similar-word substitution during revision to improve the accuracy and coherence of the sentence.
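The template-selection step above can be sketched as follows. The patent does not say how similarity is measured; this sketch assumes Jaccard overlap between word sets. The names `jaccard` and `pick_template` and the small English corpus are hypothetical and serve only as illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two word collections (assumed metric)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def pick_template(words, corpus):
    """Return the corpus sentence (a list of words) most similar to `words`."""
    return max(corpus, key=lambda sentence: jaccard(words, sentence))


# Hypothetical toy corpus; the described system uses a Chinese corpus database.
corpus = [
    ["I", "want", "to", "drink", "water"],
    ["where", "is", "the", "hospital"],
    ["thank", "you", "very", "much"],
]
initial = pick_template(["I", "drink", "water"], corpus)
```

The selected template would then serve as the initial sentence that the recurrent network revises; that revision step is not sketched here because the text gives no detail about it.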

Further, the voice playback module plays the sentence generated in the first text display module as speech.

Further, the voice acquisition module captures voice data as a sound wave and stores it in memory as an array; the column dimension of the array represents the real-time sound wave sequence, and the row dimension represents the acoustic features of the sound wave.

Further, the speech recognition module uses an end-to-end recurrent neural network algorithm to process and recognize the collected sound wave data, obtaining discrete words, and shows the discrete words through the second text display module; a CTC language model is also added at the tail of the speech recognition module.

Further, the sign language animation demonstration module searches the sign language gesture animation database for the discrete words obtained by the speech recognition module; after retrieval, it splices the retrieved animation segments in the original word order and demonstrates them on the display screen.

Each discrete word in the sign language gesture animation database corresponds to one fixed animation.

Compared with the prior art, the beneficial effects of the present invention are:

1) The sign language translation and communication system of the present invention processes RGB images, which improves recognition accuracy. A deep convolutional neural network better extracts fine-grained image features. An LSTM network models the temporal feature maps of dynamic sign language gestures, so that later frames can correct the recognition of earlier frames; this makes full use of features in the time dimension and improves the accuracy of dynamic recognition. The sentence generation part uses a recurrent neural network algorithm, which can draw on the preceding t-1 words when generating the word at position t, improving the coherence and readability of the sentence.

2) The speech recognition module of the present invention uses end-to-end acoustic recognition with a recurrent neural network algorithm, which makes effective use of features in the time dimension and automatically extracts characteristic information from the sound wave. A CTC language model is added at the tail of the recognition module to further improve recognition accuracy, so that the recognition results better match real usage. The system offers complete functions including text display, voice playback, and sign language animation demonstration: the recognition results for a deaf-mute person's sign language and for a hearing person's speech can be converted to text and shown on a liquid crystal display, the text can be played as speech, and the text can also be expressed through sign language animation.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions explain the application and do not unduly limit it.

Fig. 1 is a structural block diagram of the sign language translation and communication system based on dynamic gesture recognition according to the invention;

Fig. 2 is a work flow diagram of the sign language translation and communication system based on dynamic gesture recognition according to the invention;

Fig. 3 is a structure diagram of the CNN-LSTM used for gesture recognition in the sign language translation and communication system based on dynamic gesture recognition according to the invention.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the technical field of the application.

It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are relative terms chosen only to make it easier to describe the structural relationships of the components or elements of the invention; they do not refer to any particular component or element and are not to be understood as limiting the invention.

In the present invention, terms such as "fixed", "connected", and "coupled" are to be understood broadly: a connection may be fixed, integral, or detachable, and it may be direct or indirect through an intermediary. Researchers or technicians in the field can determine the specific meaning of these terms in the present invention according to the circumstances, and the terms are not to be understood as limiting the invention.

As described in the background section, the prior art suffers from the problems that deaf-mute people and hearing people have difficulty communicating and that sign language translation systems are still at an early stage and cannot achieve good translation results. To solve these technical problems, the application provides a sign language translation and communication system based on dynamic gesture recognition that offers higher translation accuracy, more complete functionality, simpler operation, and lower cost.

The sign language translation module is configured to:

obtain dynamic sign language gestures;

The speech recognition module is configured to:

obtain speech;

As shown in Fig. 1, the sign language translation module comprises, connected in sequence, a sign language gesture acquisition module, a sign language gesture recognition module, a first text display module, and a voice playback module; the speech recognition module comprises, connected in sequence, a voice acquisition module, a speech recognition module, a second text display module, and a sign language animation demonstration module.

As shown in Fig. 2, after recognition processing by the system, the speech of a hearing person can be converted into sign language through sign language demonstration, and the sign language of a deaf-mute person can be converted into voice information. The system therefore enables effective communication between hearing people and deaf-mute people.

In a specific implementation:

The sign language gesture acquisition module captures dynamic sign language gestures with an ordinary RGB camera.

The sign language gesture acquisition module films the live scene with an RGB color camera and uses Python to call the OpenCV library, creating a VideoCapture object to obtain the current real-time frame. At time t, the previously constructed VideoCapture object first acquires one frame, denoted I_t. To reduce the computational complexity of the algorithm, every image is resized to [368,368,3], where the first dimension indicates rows, the second dimension indicates columns, and the third dimension indicates channels. Each acquired frame is stored as an array in a memory queue. To keep the memory queue from overflowing, the maximum length of the queue is set to max_length. Images captured by the camera enter the queue at the tail, and the recognition module extracts data from the head. When the total length of the queue reaches the maximum length, data is deleted from the head of the queue.
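The queue behavior described above can be sketched with the standard library alone. This is a minimal illustration under assumptions: the class name `FrameQueue` is hypothetical, and strings stand in for the image arrays so the sketch runs without a camera.

```python
from collections import deque


class FrameQueue:
    """Bounded FIFO of frames: enqueue at the tail, dequeue at the head."""

    def __init__(self, max_length):
        # deque(maxlen=...) discards the head item automatically when a
        # full queue receives a new item at the tail.
        self._frames = deque(maxlen=max_length)

    def put(self, frame):
        self._frames.append(frame)

    def get(self):
        """Return the oldest frame, or None if the queue is empty."""
        return self._frames.popleft() if self._frames else None

    def __len__(self):
        return len(self._frames)


# Stand-in frames; in the described system each item would be a
# 368x368x3 array obtained via cv2.VideoCapture.read() and cv2.resize().
q = FrameQueue(max_length=3)
for t in range(5):
    q.put(f"I_{t}")
oldest = q.get()
```

With a maximum length of 3 and five frames enqueued, the two oldest frames are dropped from the head, so the recognition side receives `I_2` first.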

The sign language gesture recognition module uses a multilayer convolutional neural network to extract multiple feature maps from the sign language gesture images in the memory queue, uses a long short-term memory (LSTM) neural network to model the extracted feature maps over time, and outputs a time series of sign language labels, completing recognition of the sign language gestures.

The sign language gesture recognition module resizes the images by calling OpenCV's resize function. The resized images are input to the deep convolutional neural network for processing.

The first layer of the deep convolutional neural network is a convolutional layer with kernel size [96,11,11,3], where the first dimension indicates the number of kernels, the second the kernel height, the third the kernel width, and the fourth the number of kernel channels. The kernel slides along the x and y directions with a stride of 4. After convolution, the ReLU function is used for activation. ReLU is a piecewise linear function, which reduces the computational cost of both forward propagation and gradient computation in backpropagation. Convolution is a linear operation; applying a non-linear activation yields a more expressive feature mapping. The convolution result is then pooled with a [3,3] pooling region using max pooling, that is, the maximum value in each [3,3] region becomes the new pixel and the other pixels are discarded; the stride of the pooling window is 3. Pooling leaves the number of channels unchanged and reduces the height and width, which suppresses overfitting. Local response normalization (LRN) is then applied to the pooled result to further improve the generalization ability of the model.
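As a quick check of the first layer's spatial dimensions, the usual output-size formula can be applied to the stated kernel sizes and strides. The patent does not state the padding, so this sketch assumes no padding; `out_size` is a hypothetical helper name.

```python
def out_size(n, kernel, stride):
    """Spatial size after a convolution or pooling window, assuming no padding."""
    return (n - kernel) // stride + 1


side = 368                                            # input height and width
after_conv1 = out_size(side, kernel=11, stride=4)     # 11x11 kernel, stride 4
after_pool1 = out_size(after_conv1, kernel=3, stride=3)  # 3x3 max pool, stride 3
```

Under that assumption a 368x368 input becomes 90x90 after the first convolution and 30x30 after the first pooling step, with 96 channels throughout the pooling since pooling does not change the channel count.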

The second layer of the deep convolutional neural network uses a kernel of [256,7,7,3] with a stride of 2 and ReLU activation, followed by max pooling with a [3,3] window and a sliding stride of 2. Finally, local response normalization is applied.

The third layer of the deep convolutional neural network uses a kernel of [256,5,5,3] with a stride of 1 and ReLU activation, followed by max pooling with a [2,2] window and a stride of 2. Finally, local response normalization is applied. After these three layers, four pure convolution layers follow, each with a kernel of [384,3,3,3] and ReLU activation.

The output of the convolution operations is passed to the fully connected layers. There are two fully connected layers, and each is followed by dropout, that is, certain units are randomly ignored and take no part in the next computation step; the dropout probability is set to 0.5. Both fully connected layers have 4096 neurons and use the tanh activation function. The resulting 4096-dimensional vector is input to the LSTM unit at time t. The LSTM unit's output goes in two directions: one serves as P_t (the prediction at time t), and the other serves as input to the LSTM unit at time t+1, where it is combined with the feature vector that the CNN produces at time t+1 to make the prediction for time t+1. The output of the LSTM unit is a probability vector whose dimension equals the total number of sign language gestures; the sign corresponding to the position of the largest value in the probability vector is selected as the prediction at time t.
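The final selection step, taking the sign at the position of the largest probability, can be sketched as follows. The function name `predict_label`, the label list, and the probability values are hypothetical and stand in for the LSTM output.

```python
def predict_label(probabilities, labels):
    """Pick the label whose probability is largest (argmax over the vector)."""
    if len(probabilities) != len(labels):
        raise ValueError("one probability per label is required")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best]


labels = ["hello", "thanks", "water"]   # hypothetical sign vocabulary
p_t = [0.1, 0.7, 0.2]                   # hypothetical LSTM output at time t
prediction = predict_label(p_t, labels)
```

Repeating this at every time step yields the time series of sign language labels that the later sentence synthesis consumes.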

Whenever the LSTM unit at time t has output its result, VideoCapture is called to obtain the next frame and the CNN computation is run again, as shown in Fig. 3.

A discriminant function is provided in this application: when no gesture motion is detected in 10 consecutive frames, the sentence is judged to have ended.
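The 10-frame rule can be sketched as a small counter. How "no gesture motion" is detected per frame is not specified in the text, so the sketch assumes a boolean flag is available for each frame; `SentenceEndDetector` is a hypothetical name.

```python
class SentenceEndDetector:
    """Signals the end of a sentence after N consecutive frames without a gesture."""

    def __init__(self, idle_frames=10):
        self.idle_frames = idle_frames
        self._idle = 0

    def update(self, gesture_detected):
        """Feed one frame's detection flag; return True when the sentence ends."""
        if gesture_detected:
            self._idle = 0          # any gesture restarts the count
            return False
        self._idle += 1
        if self._idle >= self.idle_frames:
            self._idle = 0          # start counting afresh for the next sentence
            return True
        return False


detector = SentenceEndDetector()
flags = [detector.update(False) for _ in range(10)]
```

Only the tenth consecutive idle frame triggers the end-of-sentence signal; a single frame with a gesture in between resets the count.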

The first text display module uses a recurrent neural network to synthesize a sentence from the sign language labels output by the sign language gesture recognition module, and shows the sentence on the display screen.

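The sentence synthesis process includes:

forming discrete words from the sign language labels and, using an existing Chinese corpus database, selecting the template sentence most similar to the input discrete words to generate an initial sentence;

initializing the sentence through a recurrent neural network structure and revising it through network iterations, using similar-word substitution during revision to improve the accuracy and coherence of the sentence.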

The voice playback module plays the sentence generated in the first text display module as speech.

The voice playback module uses Python to call the pyttsx library to convert text to speech. pyttsx.init() initializes the conversion engine object engine, and engine.say(text) then performs the voice broadcast, where text is the text to be converted. In the implementation, a new thread can be created that is independent of the recognition module described above. An event listener function in this thread monitors the recognition module, and whenever the recognition module outputs a complete sentence, the thread calls engine.say() to perform the conversion.
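The independent playback thread can be sketched with the standard library. This is a sketch under assumptions, not the patent's implementation: it uses a `queue.Queue` as the notification mechanism in place of the event listener function mentioned in the text, and it takes the speaking function as a parameter so that it runs without an audio device. `start_speech_worker` is a hypothetical name.

```python
import queue
import threading


def start_speech_worker(speak):
    """Start a thread that calls `speak` on every sentence put on the returned queue."""
    sentences = queue.Queue()

    def worker():
        while True:
            text = sentences.get()      # blocks until a sentence arrives
            if text is None:            # sentinel value: stop the worker
                break
            speak(text)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return sentences, thread


# With pyttsx, `speak` could wrap engine.say(text) followed by
# engine.runAndWait(); here a list stands in for the audio output.
spoken = []
sentences, thread = start_speech_worker(spoken.append)
sentences.put("hello")
sentences.put(None)
thread.join()
```

Because the recognition side only puts finished sentences on the queue, speech synthesis never blocks frame capture or recognition.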

The voice acquisition module captures voice data as a sound wave and stores it in memory as an array; the column dimension of the array represents the real-time sound wave sequence, and the row dimension represents the acoustic features of the sound wave.

The speech recognition module uses an end-to-end recurrent neural network algorithm to process and recognize the collected sound wave data, obtaining discrete words, and shows the discrete words through the second text display module. To strengthen the recognition of the end-to-end algorithm, a CTC language model is combined in the post-processing stage of the network to increase recognition accuracy.
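The text says only that a CTC language model is added at the tail of the network and gives no decoding detail. As an illustration of the general idea behind CTC output, the sketch below applies the standard CTC collapsing rule to a per-frame label sequence (best-path decoding): repeated labels are merged and blank symbols are dropped. The function name, the blank symbol `_`, and the example input are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse consecutive repeats, then drop blanks (CTC collapsing rule)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame's label
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Hypothetical per-frame outputs; the blank between the two "l" runs
# is what allows a genuine double letter to survive the collapse.
frames = list("__hh_e_ll_l_oo_")
word = "".join(ctc_greedy_decode(frames))
```

In the described system the decoded units would be words or sub-word units rather than letters, and a language model could rescore candidate sequences instead of taking the single best path.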

The sign language animation demonstration module searches the sign language gesture animation database for the discrete words obtained by the speech recognition module; after retrieval, it splices the retrieved animation segments in the original word order and demonstrates them on the display screen.
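The retrieve-and-splice step can be sketched with a dictionary standing in for the animation database. The function name `splice_animations`, the database contents, and the handling of words that have no stored clip are assumptions; the text does not say what happens when a word is missing.

```python
def splice_animations(words, animation_db):
    """Concatenate the stored clips for `words`, keeping the word order."""
    frames = []
    missing = []
    for word in words:
        clip = animation_db.get(word)
        if clip is None:
            missing.append(word)    # no clip stored for this word (assumed policy: skip)
            continue
        frames.extend(clip)
    return frames, missing


# Hypothetical database: each word maps to one fixed list of animation frames.
animation_db = {
    "I": ["I_0", "I_1"],
    "drink": ["drink_0"],
    "water": ["water_0", "water_1"],
}
frames, missing = splice_animations(["I", "drink", "tea", "water"], animation_db)
```

Returning the missing words separately lets the caller decide whether to skip them, fingerspell them, or report them on the text display.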

The foregoing describes only preferred embodiments of the application and is not intended to limit the application. For those skilled in the art, various modifications and changes to the application are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its scope of protection.

Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the present invention without creative effort remain within the scope of protection of the invention.

Claims

1. A sign language translation and communication system based on dynamic gesture recognition, characterized by comprising a sign language translation module and a speech recognition module;

the sign language translation module is configured to:

obtain dynamic sign language gestures and store them as arrays in a memory queue;

use a multilayer convolutional neural network to extract multiple feature maps from the sign language gesture images in the memory queue, use a long short-term memory neural network to model the extracted feature maps over time, and output a time series of sign language labels, completing recognition of the sign language gestures; a discriminant function is provided, and when no gesture motion is detected in 10 consecutive frames, the sentence is judged to have ended;

synthesize a sentence from the sign language labels, show the sentence on a display screen, and at the same time play the sentence as speech;

the speech recognition module is configured to:

obtain speech;

retrieve the gesture animations corresponding to the discrete words from a sign language gesture animation database, splice them together, and demonstrate them on the display screen;

the sign language translation module comprises, connected in sequence, a sign language gesture acquisition module, a sign language gesture recognition module, a first text display module, and a voice playback module, and the speech recognition module comprises, connected in sequence, a voice acquisition module, a speech recognition module, a second text display module, and a sign language animation demonstration module;

the voice playback module, in converting text to speech, creates a thread independent of the sign language gesture recognition module, an event listener function being provided in the thread for monitoring, and whenever the first text display module outputs a complete sentence, the thread is started to perform the conversion;

the speech recognition module uses an end-to-end recurrent neural network algorithm to process and recognize the collected sound wave data, obtaining discrete words, and shows the discrete words through the second text display module, a CTC language model also being added at the tail of the speech recognition module;

the sign language animation demonstration module searches the sign language gesture animation database for the discrete words obtained by the speech recognition module and, after retrieval, splices the retrieved animation segments in the original word order and demonstrates them on the display screen.

2. The sign language translation and communication system based on dynamic gesture recognition according to claim 1, characterized in that the sign language gesture acquisition module captures dynamic sign language gestures with a camera, resizes each captured frame to a uniform size, and stores the frames as arrays in the memory queue.

3. The sign language translation and communication system based on dynamic gesture recognition according to claim 1, characterized in that the first text display module uses a recurrent neural network to synthesize a sentence from the sign language labels output by the sign language gesture recognition module, and shows the sentence on the display screen.

4. The sign language translation and communication system based on dynamic gesture recognition according to claim 3, characterized in that the sentence synthesis process includes:

forming discrete words from the sign language labels and, using an existing Chinese corpus database, selecting the template sentence most similar to the input discrete words to generate an initial sentence;

initializing the sentence through a recurrent neural network structure and revising it through network iterations, using similar-word substitution during revision to improve the accuracy and coherence of the sentence.

5. The sign language translation and communication system based on dynamic gesture recognition according to claim 1, characterized in that the voice playback module plays the sentence generated in the first text display module as speech.

6. The sign language translation and communication system based on dynamic gesture recognition according to claim 1, characterized in that the voice acquisition module captures voice data as a sound wave and stores it in memory as an array, the column dimension of the array representing the real-time sound wave sequence and the row dimension representing the acoustic features of the sound wave.