CN110322760A - Voice data generation method, device, terminal and storage medium - Google Patents
Voice data generation method, device, terminal and storage medium
- Publication number
- CN110322760A (Application CN201910611471.9A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- type
- video frame
- target video
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7837—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
- G06F16/784—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B21/00—Teaching, or communicating with, the blind, deaf or mute
- G09B21/009—Teaching or communicating with deaf persons
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- Acoustics & Sound (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The disclosure relates to the field of Internet technology and provides a voice data generation method, device, terminal and storage medium. The method includes: obtaining at least one target video frame from a video to be processed; performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame; obtaining a target sentence based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and generating voice data corresponding to the target sentence. By playing the voice data, the content expressed by the sign language in the video can be understood, enabling barrier-free communication between hearing-impaired and hearing people. The video to be processed can be captured by an ordinary camera, so the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers at no additional cost, and can be easily popularized among hearing-impaired users.
Description
Technical field
This disclosure relates to the field of Internet technology, and in particular to a voice data generation method, device, terminal and storage medium.
Background
China has more than 20 million hearing-impaired people. In daily life they can only communicate with others through sign language or text, but most people do not understand sign language well. Hearing-impaired people therefore have to communicate by writing or by typing text on an electronic device, which significantly reduces communication efficiency.
At present, hearing-impaired people can also communicate with other users through certain motion-sensing devices equipped with a depth camera. Such a device captures the user's gesture motion through the depth camera, analyzes the gesture motion to obtain the corresponding text information, and displays the obtained text on a screen.
However, such motion-sensing devices are usually bulky and cannot be carried around, so this scheme still fails to achieve normal communication between hearing-impaired people and others.
Summary of the invention
The disclosure provides a voice data generation method, device, terminal and storage medium, to at least solve the problem in the related art that communication between hearing-impaired and hearing people is difficult. The technical solution of the disclosure is as follows:
According to a first aspect of embodiments of the present disclosure, a voice data generation method is provided, the method including:
obtaining at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
obtaining a target sentence based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
generating voice data corresponding to the target sentence according to the target sentence.
In a possible implementation, performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on the hand image of each target video frame, and obtaining the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame;
determining the gesture type corresponding to each target video frame based on the gesture shape of the target video frame and a correspondence between gesture shapes and gesture types.
In a possible implementation, before obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words, the method further includes:
when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to the consecutive target video frames.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
when the recognized gesture type is a target gesture type, obtaining, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
combining the at least one word to obtain the target sentence.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
each time a gesture type is recognized, obtaining the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and taking the word as the target sentence.
In a possible implementation, after generating the voice data corresponding to the target sentence according to the target sentence, the method further includes:
when the recognized gesture type is the target gesture type, performing grammar detection on the words corresponding to the target video frames between the first target video frame and the second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
when the grammar detection fails, regenerating a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range of the face image, obtaining timbre data corresponding to the age range based on the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining timbre data corresponding to the gender type based on the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining emotion data corresponding to the change speed of the gesture type based on the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes:
obtaining a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and a correspondence between character elements and pronunciations;
generating the voice data corresponding to the target sentence based on the pronunciation sequence.
In a possible implementation, obtaining the at least one target video frame from the video to be processed includes:
inputting the video to be processed into a convolutional neural network, and splitting the video to be processed into multiple video frames through the convolutional neural network;
for any video frame, when it is detected that the video frame includes a hand image, annotating the hand image and taking the video frame as a target video frame;
when it is detected that the video frame does not include a hand image, discarding the video frame.
According to a second aspect of embodiments of the present disclosure, a voice data generation device is provided, the device including:
an acquisition unit, configured to obtain at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
a recognition unit, configured to perform gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
a sentence generation unit, configured to obtain a target sentence based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
a voice data generation unit, configured to generate voice data corresponding to the target sentence according to the target sentence.
In a possible implementation, the recognition unit includes:
a gesture shape obtaining subunit, configured to perform gesture recognition on the hand image of each target video frame, and to obtain the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame;
a gesture type obtaining subunit, configured to determine the gesture type corresponding to each target video frame based on the gesture shape of the target video frame and a correspondence between gesture shapes and gesture types.
In a possible implementation, the device further includes:
a determination unit, configured to, when a target number of consecutive target video frames have the same gesture type, take that gesture type as the gesture type corresponding to the consecutive target video frames.
In a possible implementation, the sentence generation unit includes:
a word obtaining subunit, configured to, when the recognized gesture type is a target gesture type, obtain, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
a combining subunit, configured to combine the at least one word to obtain the target sentence.
In a possible implementation, the sentence generation unit is further configured to, each time a gesture type is recognized, obtain the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and take the word as the target sentence.
In a possible implementation, the device further includes:
a grammar detection unit, configured to, when the recognized gesture type is the target gesture type, perform grammar detection on the words corresponding to the target video frames between the first target video frame and the second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
the sentence generation unit is configured to, when the grammar detection fails, regenerate a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In a possible implementation, the voice data generation unit is configured to perform any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range of the face image, obtaining timbre data corresponding to the age range based on the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining timbre data corresponding to the gender type based on the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining emotion data corresponding to the change speed of the gesture type based on the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In a possible implementation, the voice data generation unit includes:
a pronunciation sequence obtaining subunit, configured to obtain a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and a correspondence between character elements and pronunciations;
a voice data obtaining subunit, configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
In a possible implementation, the acquisition unit includes:
an input subunit, configured to input the video to be processed into a convolutional neural network and split the video to be processed into multiple video frames through the convolutional neural network;
an annotation subunit, configured to, for any video frame, when it is detected that the video frame includes a hand image, annotate the hand image and take the video frame as a target video frame;
a discarding subunit, configured to, when it is detected that the video frame does not include a hand image, discard the video frame.
According to a third aspect of embodiments of the present disclosure, a terminal is provided, including:
one or more processors;
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method described in any one of the above aspects.
According to a fourth aspect of embodiments of the present disclosure, a server is provided, including:
one or more processors;
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method described in any one of the above aspects.
According to a fifth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided. When instructions in the storage medium are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method described in any one of the above aspects.
According to a sixth aspect of embodiments of the present disclosure, a computer program product including executable instructions is provided. When the instructions in the computer program product are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method described in any one of the above embodiments.
The technical solution provided by the embodiments of the present disclosure at least brings the following beneficial effects:
In the voice data generation method, device, terminal and storage medium provided by the embodiments of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the user's gesture types; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and voice data of the sentence is generated. By playing the voice data, the content expressed by the sign language in the video can be understood, enabling barrier-free communication between hearing-impaired and hearing people. The video to be processed can be captured by an ordinary camera, so the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers at no additional cost, and can be easily popularized among hearing-impaired users.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the disclosure.
Brief description of the drawings
The accompanying drawings herein are incorporated into and constitute a part of this specification, show embodiments consistent with the disclosure, and together with the specification serve to explain the principles of the disclosure; they do not constitute an improper limitation of the disclosure.
Fig. 1 is a flow chart of a voice data generation method according to an exemplary embodiment;
Fig. 2 is a flow chart of a voice data generation method according to an exemplary embodiment;
Fig. 3 is a schematic diagram of target video frames according to an exemplary embodiment;
Fig. 4 is a flow chart of a voice data generation method according to an exemplary embodiment;
Fig. 5 is a flow chart of another voice data generation method according to an exemplary embodiment;
Fig. 6 is a block diagram of a voice data generation device according to an exemplary embodiment;
Fig. 7 is a block diagram of another voice data generation device according to an exemplary embodiment;
Fig. 8 is a block diagram of a terminal according to an exemplary embodiment;
Fig. 9 is a block diagram of a server according to an exemplary embodiment.
Specific embodiment
In order to enable those of ordinary skill in the art to better understand the technical solutions of the disclosure, the technical solutions in the embodiments of the disclosure are described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "target", "second" and the like in the specification, claims and drawings of the disclosure are used to distinguish similar objects, and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the disclosure described herein can be implemented in an order other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the disclosure; rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
The embodiments of the present disclosure can be applied to any scenario in which sign language needs to be translated.
For example, in a live-streaming scenario, the streamer may be a hearing-impaired person. A terminal shoots a video of the streamer and uploads the video to the server of the live-streaming platform. The server analyzes and processes the sign language video, translates the sign language in the video into voice data, and delivers the voice data to viewing terminals, which play the voice data, so that viewers understand the meaning the streamer wants to express and normal communication between the streamer and viewers is achieved.
For another example, in a scenario where a hearing-impaired person communicates face to face with a hearing person, the hearing-impaired person can shoot a video of his or her own sign language with a terminal such as a mobile phone. The terminal analyzes and processes the sign language video, translates the sign language in the video into voice data, and plays the voice data, so that the other person can quickly understand the meaning the user wants to express.
In addition to the above scenarios, the method provided by the embodiments of the present disclosure can also be applied to other scenarios, for example a user watching a video shot by a hearing-impaired person, where the viewing terminal translates the sign language in the video into voice data; the embodiments of the present disclosure do not limit this.
Fig. 1 is a flow chart of a voice data generation method according to an exemplary embodiment. As shown in Fig. 1, the voice data generation method can be applied to a computer device, which can be a terminal such as a mobile phone or a computer, or a server associated with an application, and includes the following steps:
In step S11, at least one target video frame is obtained from a video to be processed, the target video frame being a video frame that includes a hand image.
In step S12, gesture recognition is performed on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame.
In step S13, a target sentence is obtained based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type.
In step S14, voice data corresponding to the target sentence is generated according to the target sentence.
In the voice data generation method, device, terminal and storage medium provided by the embodiments of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the user's gesture types; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and voice data of the sentence is generated. By playing the voice data, the content expressed by the sign language in the video can be understood, enabling barrier-free communication between hearing-impaired and hearing people. The video to be processed can be captured by an ordinary camera, so the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers at no additional cost, and can be easily popularized among hearing-impaired users.
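As a non-limiting illustration of the overall flow of steps S11 to S14, the following minimal Python sketch is provided. The callables detect_hand, classify_gesture and synthesize, and the gesture_to_words table, are placeholders standing in for the detection network, gesture classifier, gesture-word correspondence and speech synthesizer described in this disclosure; they are assumptions for illustration only.

```python
def generate_speech_from_sign_video(frames, detect_hand, classify_gesture,
                                    gesture_to_words, synthesize):
    """frames: decoded video frames; the callables stand in for the networks
    and tables described in steps S11-S14."""
    words, last_gesture = [], None
    for frame in frames:
        hand = detect_hand(frame)            # S11: keep frames with a hand image
        if hand is None:
            continue
        gesture = classify_gesture(hand)     # S12: gesture type of the frame
        if gesture != last_gesture and gesture in gesture_to_words:
            words.append(gesture_to_words[gesture][0])   # S13: gesture type -> word
        last_gesture = gesture
    target_sentence = "".join(words)         # Chinese text needs no spaces
    return synthesize(target_sentence)       # S14: voice data of the sentence
```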
In a possible implementation, performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on the hand image of each target video frame, and obtaining the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame;
determining the gesture type corresponding to each target video frame based on the gesture shape of the target video frame and a correspondence between gesture shapes and gesture types.
In a possible implementation, before obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words, the method further includes:
when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to the consecutive target video frames.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
when the recognized gesture type is a target gesture type, obtaining, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
combining the at least one word to obtain the target sentence.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
each time a gesture type is recognized, obtaining the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and taking the word as the target sentence.
In a possible implementation, after generating the voice data corresponding to the target sentence according to the target sentence, the method further includes:
when the recognized gesture type is the target gesture type, performing grammar detection on the words corresponding to the target video frames between the first target video frame and the second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
when the grammar detection fails, regenerating a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range of the face image, obtaining timbre data corresponding to the age range based on the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining timbre data corresponding to the gender type based on the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining emotion data corresponding to the change speed of the gesture type based on the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes:
obtaining a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and a correspondence between character elements and pronunciations;
generating the voice data corresponding to the target sentence based on the pronunciation sequence.
In a possible implementation, obtaining the at least one target video frame from the video to be processed includes:
inputting the video to be processed into a convolutional neural network model, and splitting the video to be processed into multiple video frames through the convolutional neural network model;
for any video frame, when it is detected that the video frame includes a hand image, annotating the hand image and taking the video frame as a target video frame;
when it is detected that the video frame does not include a hand image, discarding the video frame.
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the disclosure, which are not described in detail here.
Fig. 2 is a flow chart of a voice data generation method according to an exemplary embodiment. As shown in Fig. 2, this method can be applied to a computer device, which can be a terminal such as a mobile phone or a computer, or a server associated with an application. This embodiment is described using the server as the executing entity, and includes the following steps:
In step S21, the server obtains at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image.
The video to be processed may be a complete video uploaded after a terminal finishes shooting, or a video that is shot by a terminal and sent to the server in real time. The video to be processed is composed of a sequence of still images, each still image being a video frame.
A specific implementation of step S21 may be as follows: after obtaining the video to be processed, the server performs hand image detection on each video frame of the video to determine whether the video frame includes a hand image. When the video frame includes a hand image, the region where the hand image is located is annotated and the video frame is taken as a target video frame; when the video frame does not include a hand image, the video frame is discarded. By discarding useless video frames, the number of video frames to be processed subsequently is reduced, which in turn reduces the computational load of the server and improves processing speed.
The specific process by which the server determines whether a video frame includes a hand image can be implemented by a first network, which can be an SSD (Single Shot MultiBox Detector) network, an HMM (Hidden Markov Model) based network, or another convolutional neural network. Correspondingly, in a possible implementation of step S21, the server splits the video to be processed into multiple video frames; for any video frame, the server obtains the feature data of the video frame through the first network and determines whether the feature data includes target feature data, the target feature data being the feature data corresponding to a hand. When the feature data includes the target feature data, the position of the hand image is determined according to the position of the target feature data, the position of the hand image is marked with a rectangular box, and the target video frame with the rectangular box annotation is output; when the feature data does not include the target feature data, the video frame is discarded. By analyzing the video to be processed with a convolutional neural network, the video can be analyzed quickly and accurately.
Target video frames annotated with rectangular boxes may be as shown in Fig. 3, which shows three target video frames, the hand image in each target video frame being marked out with a rectangular box.
The first network can be obtained by training a convolutional neural network with training samples. For example, in the stage of training the convolutional neural network, a large number of pictures containing hand images can be prepared, and the hand images in these pictures are annotated, that is, the regions where the hand images are located are marked out with rectangular boxes. The annotated pictures are then used to train the convolutional neural network, yielding the trained first network.
It should be noted that this embodiment is described by taking the analysis of the video to be processed by the first network as an example. In some embodiments, the video to be processed can also be analyzed by other methods such as image scanning; the embodiments of the present disclosure do not limit the method used to analyze the video to be processed.
In step S22, the server performs gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame.
In this embodiment, the server can perform gesture recognition on the hand images of the at least one target video frame at either of the following times: (1) after obtaining all target video frames of the video to be processed, gesture recognition is performed on the hand images of the target video frames, which reduces running memory by dividing the processing of the video frames into two stages; (2) after obtaining each target video frame, gesture recognition is performed on the hand image of that target video frame, and after the gesture type of the target video frame is obtained, the step of obtaining the next target video frame is performed; by fully processing each video frame in turn, the real-time performance of communication is improved.
In addition, the specific process by which the server recognizes the hand image of the at least one target video frame may include: the server performs gesture recognition on the hand image of each target video frame and obtains the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame; the gesture type corresponding to each target video frame is then determined based on the gesture shape of the target video frame and the correspondence between gesture shapes and gesture types.
In addition, the analysis of the hand image can be implemented by a second network, which can be an SSD network, an HMM-based network, or another convolutional neural network. Correspondingly, in a possible implementation of step S22, the server performs target detection with the first network to obtain the hand image, and tracks the hand image with the second network to obtain the gesture type corresponding to the hand image. That is, in the embodiments of the present disclosure, while the server classifies a gesture with the second network, it can also perform target detection on the next video frame with the first network; the two networks work together to obtain the gesture type classification, which speeds up gesture classification.
The training process of the second network may be as follows: a large number of pictures of different gesture shapes are prepared and labeled by class. For example, all pictures whose gesture type is "finger heart" are labeled 1, and all pictures whose gesture type is "good" are labeled 2. The labeled pictures are input into a convolutional neural network for training, and the trained second network is obtained.
In addition, the server can also implement the analysis of the hand image through the first network alone, that is, target detection and target classification are implemented by the same network. The server uses the first network to detect whether a video frame includes a hand image and, after a hand image is detected, performs gesture recognition on the hand image to obtain the corresponding gesture type. Only one network is needed to complete target detection and classification, so the algorithm for analyzing the video occupies less memory and is easy for a terminal to call.
It should be noted that, when the gesture type is recognized through the second network, the input to the second network can be the target video frame or the hand image in the target video frame; the embodiments of the present disclosure do not limit this.
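As a non-limiting illustration of obtaining the gesture shape from the hand contour in step S22, the following sketch (assuming OpenCV and a simple threshold-based segmentation, which is only a placeholder for real hand segmentation) extracts the largest contour of the hand region and passes it to a classifier; classify_shape stands in for the second network or the gesture shape/gesture type correspondence.

```python
import cv2  # assumption: OpenCV is available


def gesture_type_from_crop(hand_crop, classify_shape):
    """hand_crop: BGR image of the annotated hand region of a target video frame.
    classify_shape(contour) -> gesture type; a stand-in for the 'second network'."""
    gray = cv2.cvtColor(hand_crop, cv2.COLOR_BGR2GRAY)
    # Otsu threshold is only a placeholder for proper hand/background segmentation.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] keeps this working with both OpenCV 3.x and 4.x return conventions.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    hand_contour = max(contours, key=cv2.contourArea)  # largest blob = hand profile
    return classify_shape(hand_contour)
```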
In step S23, when a target number of consecutive target video frames have the same gesture type, the server takes that gesture type as the gesture type corresponding to the consecutive target video frames.
Since a video captures multiple video frames per second, the same gesture motion appears in multiple video frames when the user makes a gesture. In the process of changing from one gesture to another, the user may also produce motions corresponding to other gesture types. The gestures produced during such transitions last only a short time, while the sign-language gesture the user intends to make lasts relatively long. In order to determine which motion is the sign-language gesture the user intends and which is merely a transitional motion, when a target number of consecutive target video frames have the same gesture type, the server can take that gesture type as the gesture type corresponding to those consecutive target video frames. In this way, the server generates only one corresponding word or sentence for each gesture the user makes, avoids misrecognizing the intermediate gestures produced during gesture transitions, improves user experience and recognition accuracy, and also avoids generating multiple repeated words for a single motion of the user.
A specific implementation of step S23 may be as follows: after obtaining a gesture type, the server takes that gesture type as the gesture type to be determined, and then obtains the gesture type of the next target video frame. When the gesture type of the next target video frame is the same as the gesture type to be determined, the consecutive count of the gesture type to be determined is increased by 1, and the step of obtaining the gesture type of the next target video frame continues. When the gesture type of the next video frame differs from the gesture type to be determined, the server determines whether the consecutive count of the gesture type to be determined is less than the target number. If the consecutive count is not less than the target number, the gesture type to be determined is determined to be a valid gesture type, and the gesture type of the next video frame is taken as the new gesture type to be determined; if the consecutive count is less than the target number, the gesture type to be determined is determined to be an invalid gesture type, and the gesture type of the next target video frame is taken as the new gesture type to be determined.
The target number can be any value such as 10, 15 or 20, and can be determined by the number of video frames per second, by the speed at which the user's gestures change, or in other ways; the embodiments of the present disclosure do not limit this.
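As a non-limiting illustration of step S23, the following sketch keeps a gesture type only when it persists for at least a target number of consecutive target video frames, discarding short transitional gestures; the default threshold is an assumption chosen from the example values above.

```python
def stable_gesture_types(per_frame_types, target_count=10):
    """per_frame_types: gesture type of each consecutive target video frame.
    A gesture type is kept only if it appears in at least `target_count`
    consecutive frames; shorter runs are treated as transitional motions."""
    effective = []
    pending, run = None, 0
    for gesture in per_frame_types:
        if gesture == pending:
            run += 1
        else:
            if pending is not None and run >= target_count:
                effective.append(pending)     # the pending gesture was valid
            pending, run = gesture, 1
    if pending is not None and run >= target_count:
        effective.append(pending)             # flush the final run
    return effective
```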
In step s 24, when the gesture-type identified is target gesture-type, it is based on the corresponding hand of target video frame
The corresponding relationship of gesture type, gesture-type and word, server obtain between first object video frame and the second target video frame
The corresponding word of target video frame, first object video frame is that this identifies the target video frame of target gesture-type, the
Two target video frames are the preceding target video frame for once identifying target gesture-type.
The target gesture type can be a preset gesture type used to indicate that the statement of a sentence is complete. When the target gesture type is detected, it means that the user wants to indicate that the sentence has been fully expressed. In addition, one gesture type can correspond to at least one word.
The specific process by which the server obtains the words corresponding to the target video frames between the first target video frame and the second target video frame may be as follows: the server obtains the gesture types corresponding to the multiple consecutive target video frames, and obtains from a database the at least one word corresponding to each gesture type, the database storing gesture types in correspondence with the at least one word corresponding to each gesture type.
It should be noted that the embodiments of the present disclosure are described by taking the target gesture as indicating the completion of a sentence as an example. In some embodiments, a button can also be provided on the terminal that shoots the video, and the completion of a sentence is indicated by clicking the button, or the completion of a sentence can be indicated in other ways; the embodiments of the present disclosure do not limit the way in which the server determines whether a sentence is complete.
In step S25, the server combines the at least one word to obtain multiple sentences.
When the server obtains a single word, the word is directly taken as the sentence. When the server obtains multiple words, the specific process of generating sentences may be: obtaining multiple sentences by combining the multiple words in order; or retrieving a corpus based on the multiple words to obtain multiple sentences from the corpus, where the corpus contains a large number of real sentences.
In a possible implementation, the server obtains multiple sentences by combining the multiple words in order. The specific process may be as follows: the server combines the words corresponding to each gesture type according to the chronological order of the gesture types to obtain a sentence. Since some gesture types correspond to multiple words, the server needs to combine each word of such a gesture type with the words of the other gesture types, thereby obtaining multiple sentences. Since the word order of sign language is the same as the word order of spoken expression, the words corresponding to the gesture types can be directly permuted and combined in chronological order, which speeds up sentence generation while ensuring accuracy.
In another possible implementation, the server retrieves a corpus based on the multiple words to obtain multiple sentences from the corpus. The specific process may be as follows: the server stores a corpus locally; after obtaining the multiple words, the server combines them into a search term, retrieves the corpus, and obtains multiple sentences from the corpus, where each sentence includes the word corresponding to each gesture type. By searching for real sentences in the corpus, the fluency of the resulting sentences is ensured.
Since some gesture types correspond to multiple words, each word corresponding to such a gesture type needs to be combined with the words corresponding to the other gesture types to obtain multiple search terms. For each search term, at least one corresponding sentence is obtained from the corpus.
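As a non-limiting illustration of step S25, the following sketch enumerates every combination of candidate words in the chronological order of the gestures; gesture_to_words is an assumed in-memory table standing in for the database described above.

```python
from itertools import product


def candidate_sentences(gesture_sequence, gesture_to_words):
    """gesture_sequence: gesture types in chronological order (the segment
    between two occurrences of the target gesture type).
    gesture_to_words: dict mapping a gesture type to its candidate words."""
    word_lists = [gesture_to_words[g] for g in gesture_sequence
                  if g in gesture_to_words]
    if not word_lists:
        return []
    # One candidate sentence per combination of word choices, with word order
    # following the chronological order of the gestures.
    return ["".join(choice) for choice in product(*word_lists)]
```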
In step S26, the server scores each sentence and takes the sentence with the highest score as the target sentence.
The server can score each sentence according to conditions such as whether the sentence is fluent, whether it includes the word corresponding to each gesture type, and whether the order of the words in the sentence is consistent with the chronological order of the corresponding gesture types. Depending on how the sentences were generated, the server can score them according to different conditions, and can also combine any one or more of these conditions for scoring.
Taking the case where the server obtains multiple sentences by combining multiple words in order as an example, the server can score each sentence according to its fluency and take the sentence with the highest score as the target sentence. Since some gesture types correspond to multiple words whose meanings may differ considerably, the sentence is fluent when the word selected for a gesture type is the word the user intends, and may not be fluent when the selected word is not the one the user intends. By judging the fluency of the sentences, the word the user intends is selected from the multiple words corresponding to a gesture type, which improves the precision of sign language translation.
The server can judge whether a sentence is fluent based on an N-gram algorithm. The N-gram algorithm can judge whether every N adjacent words collocate well; the server can determine, based on the N-gram algorithm, the collocation degree of every N adjacent words in a sentence, and determine the fluency of the sentence based on these collocation degrees, where N can be any number such as 2, 3, 4 or 5, or can be the number of words contained in the sentence. The higher the collocation degree of adjacent words, the more fluent the sentence. Using the N-gram algorithm, the fluency of a sentence can be judged accurately, so that the sentence meeting the user's requirement is determined, which further improves the precision of sign language translation.
Taking the case where the server retrieves the corpus based on the multiple words and obtains multiple sentences from the corpus as an example, each sentence is scored based on the chronological order of the gesture types and the order of the words in the sentence: the more consistent the order of the words in the sentence is with the chronological order of the corresponding gesture types, the higher the score of the sentence. Since the sentences in the corpus are real sentences without word-order or logic problems, the sentences retrieved from the corpus are real sentences from daily life that do not need to be checked for word-order or logic problems; this better simulates communication between ordinary users and improves the effect of sign language translation. Only whether the order of the words in the sentence matches the chronological order of the gesture types needs to be verified, which simplifies the judgment process.
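As a non-limiting illustration of this order-consistency check, the following sketch scores a corpus-retrieved sentence by the fraction of adjacent required words that appear in the same order as their gestures occurred; using the first occurrence of each word is a simplifying assumption.

```python
def order_consistency(sentence, words_in_gesture_order):
    """sentence: a retrieved corpus sentence (a string).
    words_in_gesture_order: required words in the chronological order of the
    gestures. Returns a score in [0, 1]; higher means more consistent order."""
    positions = []
    for word in words_in_gesture_order:
        idx = sentence.find(word)
        if idx < 0:
            return 0.0            # a required word is missing from the sentence
        positions.append(idx)
    if len(positions) < 2:
        return 1.0
    in_order = sum(1 for a, b in zip(positions, positions[1:]) if a <= b)
    return in_order / (len(positions) - 1)
```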
In step S27, the server generates the voice data corresponding to the target sentence based on the target sentence.
The voice data is the audio data of the target sentence.
The specific implementation of step S27 may be as follows: the server obtains a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the correspondence between character elements and pronunciations, and generates the voice data corresponding to the target sentence based on the pronunciation sequence.
The specific process by which the server obtains the pronunciation sequence of the target sentence and generates the corresponding voice data may include the following steps: the server processes the target sentence by text normalization, converting the non-Chinese-character characters in the target sentence into Chinese characters to obtain a first target sentence; the server performs word segmentation and part-of-speech tagging on the first target sentence to obtain at least one segmented word and the part-of-speech result corresponding to each segmented word; based on the part-of-speech result of each segmented word and the correspondence with pronunciations, the pronunciation of each segmentation result is obtained; based on the pronunciation of each segmentation result, prosody prediction is performed on each segmentation result through a prosody model to obtain a pronunciation sequence with prosody tags; the server predicts acoustic parameters for each pronunciation unit in the pronunciation sequence using an acoustic model to obtain the acoustic parameters corresponding to each pronunciation unit; and the server converts the acoustic parameters corresponding to each pronunciation unit into the corresponding voice data. The acoustic model can be an LSTM (Long Short-Term Memory) network model.
By the way that the pronunciation of word segmentation result to be handled to rhythm model, the voice being subsequently generated can be made more lively,
Normal communication between two users of more preferable simulation, enhances user experience, improves sign language interpreter effect.
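As an illustration only, the following Python sketch traces the data flow of the front end described above. The `segmenter`, `pron_dict`, `prosody_model`, `acoustic_model` and `vocoder` objects are hypothetical trained components standing in for the modules named in the disclosure, and the toy digit-to-Chinese-numeral mapping is only a placeholder for full text regularization.

```python
# Toy regularization table: map ASCII digits to Chinese numerals.
DIGIT_MAP = str.maketrans("0123456789", "零一二三四五六七八九")

def normalize_text(sentence):
    # Placeholder for text regularization of non-Chinese-character tokens.
    return sentence.translate(DIGIT_MAP)

def synthesize(target_sentence, segmenter, pron_dict, prosody_model,
               acoustic_model, vocoder):
    """Schematic of the described text-to-speech pipeline (assumed APIs)."""
    # 1. Text regularization produces the first target sentence.
    normalized = normalize_text(target_sentence)

    # 2. Word segmentation with part-of-speech tagging.
    tokens = segmenter.segment(normalized)            # [(word, pos), ...]

    # 3. Pronunciation lookup keyed on (word, pos) to resolve polyphones.
    pronunciations = [pron_dict[(word, pos)] for word, pos in tokens]

    # 4. Prosody prediction adds break/stress tags to the pronunciation sequence.
    pron_seq = prosody_model.predict(tokens, pronunciations)

    # 5. Acoustic parameters per pronunciation unit (e.g. an LSTM acoustic
    #    model), then waveform generation.
    acoustic_params = [acoustic_model.predict(unit) for unit in pron_seq]
    return vocoder.to_waveform(acoustic_params)
```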
In addition, when generating the voice data, the state of the user may also be taken into account, so that voice data consistent with the state of the user is output. In one possible implementation, multiple expression types and tone information corresponding to each expression type are stored in the server. When the target video frame includes a face image, the server performs face recognition on the face image to obtain the expression type corresponding to the face image, and generates first voice data based on the expression type, the tone of the first voice data matching the expression type. For example, when the server detects that the expression type of the user is happy, first voice data with a more cheerful tone may be generated.
In another possible implementation, multiple age ranges and timbre data corresponding to each age range are stored in the server. When the target video frame includes a face image, the server performs face recognition on the face image to obtain the age range to which the face belongs, obtains the timbre data corresponding to that age range, and generates second voice data based on the timbre data, the timbre of the second voice data matching the age range. For example, when the server detects that the age range of the user is 5-10 years old, second voice data with a relatively childlike timbre is generated.
In another possible implementation, gender types and timbre data corresponding to each gender type are stored in the server. When the target video frame includes a face image, face recognition is performed on the face image to obtain the gender type corresponding to the face image; based on the gender type, the corresponding timbre data is obtained, and third voice data is generated based on the timbre data, the timbre of the third voice data matching the gender type. For example, when the server detects that the user is female, third voice data with a female timbre may be generated.
In another possible implementation, multiple change speeds and emotion data corresponding to each change speed are stored in the server. Based on the change speed of the gesture types, the server determines the corresponding emotion data and generates fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed. For example, when the gesture change speed of the user is fast, indicating that the user's mood is relatively excited, fourth voice data with a higher intonation is generated.
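A minimal sketch of how the detected user state might be mapped to synthesis parameters is given below; the lookup tables, parameter names and the gesture-speed threshold are illustrative assumptions, since the disclosure does not specify the stored values.

```python
# Illustrative lookup tables; the actual data stored on the server is not
# specified by the disclosure.
TONE_BY_EXPRESSION = {"happy": "cheerful", "sad": "soft", "neutral": "plain"}
TIMBRE_BY_AGE = {(0, 12): "child", (13, 59): "adult", (60, 120): "elderly"}
TIMBRE_BY_GENDER = {"female": "female_voice", "male": "male_voice"}

def voice_params(expression=None, age=None, gender=None, gesture_speed=None):
    """Map detected user state to synthesis parameters (assumed schema)."""
    params = {}
    if expression is not None:
        params["tone"] = TONE_BY_EXPRESSION.get(expression, "plain")
    if age is not None:
        for (low, high), timbre in TIMBRE_BY_AGE.items():
            if low <= age <= high:
                params["timbre"] = timbre
                break
    if gender is not None:
        params["timbre"] = TIMBRE_BY_GENDER.get(gender, params.get("timbre"))
    if gesture_speed is not None:
        # Faster signing is taken as higher emotional arousal -> higher pitch.
        params["pitch_scale"] = 1.2 if gesture_speed > 1.5 else 1.0
    return params
```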
Combining the above steps, the voice data generation method provided by the embodiment of the present disclosure is shown in Fig. 4: a hearing-impaired person performs a passage of sign language in front of a camera; the camera captures a video containing the sign language; a sign language recognition module analyzes the video to obtain multiple gesture types; a sign language translation module obtains the words corresponding to the gesture types and combines at least one word into a target sentence; and a speech synthesis module generates the voice data of the target sentence, which is played to a hearing person, so that normal communication between the hearing-impaired person and the hearing person is realized.
It should be noted that any one of the above four ways of generating voice data may be chosen, or several of them may be combined, and the user may also select a preferred timbre or tone for generating the voice data. The embodiment of the present disclosure merely illustrates ways of improving the sound effect and does not limit the specific manner of doing so.
In the voice data generation method provided by the embodiment of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the gesture types of the user; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words; and the voice data of the sentence is generated. By subsequently playing the voice data, the content expressed by the sign language in the video can be recognized, realizing barrier-free communication between hearing-impaired people and hearing people. The video to be processed can be shot by an ordinary camera, so the scheme does not depend on special equipment and can run directly on terminals such as mobile phones and computers without additional cost, making it easy to popularize among the hearing-impaired population.
In addition, effective and invalid gestures are distinguished by the duration for which a gesture is detected, which prevents the intermediate gestures produced while a gesture is changing from being misrecognized, improves the accuracy of sign language translation, and improves the user experience.
In addition, after obtaining multiple target sentences, the server may also score the multiple target sentences according to certain conditions and take the highest-scoring sentence as the target sentence, so that the target sentence better matches the user's intention, which improves the user experience and enhances the effect of sign language translation.
In addition, the server may also generate voice data consistent with the state of the user according to that state, so that the voice data better simulates communication between ordinary users and makes the communication process more vivid.
The embodiments illustrated in Fig. 2 to Fig. 4 above are described using the example in which the voice data corresponding to the words is generated after the user finishes expressing a complete sentence. In another possible embodiment, the server may generate the voice data corresponding to a gesture type in real time after the gesture type is obtained. This is further described below with reference to Fig. 5. Fig. 5 is a flowchart of a voice data generation method according to an exemplary embodiment; as shown in Fig. 5, the method is used in a server and includes the following steps:
In step S51, the server obtains at least one target video frame from the video to be processed, the at least one target video frame being a video frame that includes a hand image.
In step S52, the server performs gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame.
In step S53, when the gesture types of a target number of consecutive target video frames are identical, the server takes the identical gesture type as the gesture type corresponding to those consecutive target video frames.
Steps S51 to S53 are similar to steps S21 to S23 and are not described herein again.
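As an illustration only, the following Python sketch shows one way step S53 might work: a gesture type becomes effective only after it has been recognized in a target number of consecutive target video frames, which filters out the transient shapes produced while the hand moves from one gesture to the next. The threshold of 5 frames is an assumption of the example.

```python
def effective_gestures(frame_gesture_types, target_count=5):
    """Yield a gesture type only after it has been recognized in
    `target_count` consecutive target video frames (assumed threshold).

    Transient shapes that appear while the hand is changing gestures never
    reach the threshold and are therefore ignored.
    """
    current, run, emitted = None, 0, False
    for gesture in frame_gesture_types:
        if gesture == current:
            run += 1
        else:
            current, run, emitted = gesture, 1, False
        if run >= target_count and not emitted:
            emitted = True
            yield current
```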
In step S54, each time the server identifies a gesture type, it obtains the word corresponding to that gesture type based on the gesture type and the correspondence between gesture types and words, and takes the word as the target sentence.
Here one gesture type corresponds to one word. Because the word and the gesture type are in one-to-one correspondence, and the word order of sign language is the same as the spoken word order of hearing people, after the server determines a gesture type it can determine the unique word corresponding to that gesture type as the target sentence, and the word can accurately express the semantics of the sign language.
In step S55, the server generates, based on the target sentence, the voice data corresponding to the target sentence.
Step S55 is similar to step S27 and is not described herein again.
In step S56, when the identified gesture type is the target gesture type, the server performs grammar detection on the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time.
That is, when the user finishes expressing an intended sentence through sign language, the server, in addition to outputting the words in real time, arranges the words in chronological order to form a sentence and performs grammar detection on it to determine whether the sentence output in real time is accurate.
In step S57, when the grammar detection fails, the server regenerates a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
That is, when there is a grammar problem, the words are output again; this is similar to steps S24 to S26 and is not described herein again.
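A minimal sketch of the real-time embodiment described in steps S54 to S57 is given below: each effective gesture type is immediately converted to its word and spoken, and when the sentence-end (target) gesture type is recognized, the words accumulated since the previous sentence end are grammar-checked and, if the check fails, a new sentence is regenerated from those words. All callables here are hypothetical stand-ins for the modules described above.

```python
def realtime_pipeline(gesture_stream, word_of_gesture, is_sentence_end,
                      grammar_ok, regenerate_sentence, speak):
    """Sketch of the real-time embodiment (Fig. 5); all parameters are
    hypothetical components, not concrete APIs from the disclosure."""
    buffered_words = []
    for gesture_type in gesture_stream:          # effective gesture types
        if is_sentence_end(gesture_type):        # the "target gesture type"
            sentence = "".join(buffered_words)
            if not grammar_ok(sentence):
                # Grammar detection failed: regenerate a grammatical sentence
                # from the same words and output it again.
                speak(regenerate_sentence(buffered_words))
            buffered_words = []
        else:
            word = word_of_gesture[gesture_type]
            buffered_words.append(word)
            speak(word)                          # real-time per-word output
```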
It should be noted that, when the grammar detection passes, the analysis of the next video frame continues.
In step S58, the server generates, based on the new target sentence, the voice data corresponding to the new target sentence.
Step S58 is similar to step S27 and is not described herein again.
In the voice data generation method provided by the embodiment of the present disclosure, the voice data corresponding to an effective gesture type is output as soon as that gesture type is determined. Real-time translation increases the speed of translation and improves the communication experience between hearing-impaired people and hearing people, better simulating spoken communication between hearing people. Moreover, after a sentence has been output, the server may perform grammar detection on the words and, when there is a grammar problem, regenerate a sentence that conforms to the grammar, which improves the accuracy of the translation.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which are not described herein one by one.
Fig. 6 is a block diagram of a voice data generating apparatus according to an exemplary embodiment. Referring to Fig. 6, the apparatus includes an acquiring unit 601, a recognition unit 602, a sentence generation unit 603 and a voice data generation unit 604.
The acquiring unit 601 is configured to obtain at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
the recognition unit 602 is configured to perform gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame;
the sentence generation unit 603 is configured to obtain a target sentence based on the at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
the voice data generation unit 604 is configured to generate, according to the target sentence, the voice data corresponding to the target sentence.
In the voice data generating apparatus provided by the embodiment of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the gesture types of the user; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words; and the voice data of the sentence is generated. By subsequently playing the voice data, the content expressed by the sign language in the video can be recognized, realizing barrier-free communication between hearing-impaired people and hearing people. The video to be processed can be shot by an ordinary camera, so the scheme does not depend on special equipment and can run directly on terminals such as mobile phones and computers without additional cost, making it easy to popularize among the hearing-impaired population.
In one possible implementation, as shown in Fig. 7, the recognition unit 602 includes:
a gesture shape obtaining subunit 6021, configured to perform gesture recognition on the hand image of each target video frame and obtain the gesture shape of each target video frame based on the hand contour in the hand image of each target video frame;
a gesture type obtaining subunit 6022, configured to determine the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the correspondence between gesture shapes and gesture types.
In one possible implementation, as shown in Fig. 7, the apparatus further includes:
a determination unit 605, configured to, when the gesture types of a target number of consecutive target video frames are identical, take the identical gesture type as the gesture type corresponding to those consecutive target video frames.
In one possible implementation, as shown in Fig. 7, the sentence generation unit 603 includes:
a word obtaining subunit 6031, configured to, when the identified gesture type is the target gesture type, obtain the words corresponding to the target video frames between a first target video frame and a second target video frame based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time;
a combining subunit 6032, configured to combine the at least one word to obtain the target sentence.
In one possible implementation, as shown in Fig. 7, the sentence generation unit 603 is further configured to, each time a gesture type is identified, obtain the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and take the word as the target sentence.
In one possible implementation, as shown in Fig. 7, the apparatus further includes:
a grammar detection unit 606, configured to, when the identified gesture type is the target gesture type, perform grammar detection on the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time;
the sentence generation unit 603 is configured to, when the grammar detection fails, regenerate a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In one possible implementation, as shown in Fig. 7, the voice data generation unit 604 is configured to perform any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range to which the face belongs, obtaining the timbre data corresponding to the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining the timbre data corresponding to the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining, based on the change speed of the gesture types, the emotion data corresponding to the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In one possible implementation, as shown in Fig. 7, the voice data generation unit 604 includes:
a pronunciation sequence obtaining subunit 6041, configured to obtain the pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the correspondence between character elements and pronunciations;
a voice data obtaining subunit 6042, configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, as shown in Fig. 7, the acquiring unit 601 includes:
an input subunit 6011, configured to input the video to be processed into a convolutional neural network model, which splits the video to be processed into multiple video frames;
a marking subunit 6012, configured to, for any video frame, when it is detected that the video frame includes a hand image, annotate the hand image and take the video frame as a target video frame;
a discarding subunit 6013, configured to discard a video frame when it is detected that the video frame does not include a hand image.
It should be understood that the voice data generating apparatus provided by the above embodiment is illustrated, when generating voice data, only by the division into the above functional units; in practical applications, the above functions may be allocated to different functional units as needed, that is, the internal structure of the voice data generating apparatus may be divided into different functional units to complete all or part of the functions described above. In addition, the voice data generating apparatus provided by the above embodiment belongs to the same concept as the voice data generation method embodiments; its specific implementation process is detailed in the method embodiments and is not described herein again.
Fig. 8 is a structural block diagram of a terminal provided by an embodiment of the present disclosure. The terminal 800 is used to perform the steps performed by the terminal in the above embodiments and may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal or desktop terminal.
In general, the terminal 800 includes a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 801 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 802 may include one or more computer-readable storage media, which may be non-transitory. The memory 802 may also include a high-speed random access memory and a non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, and the at least one instruction is executed by the processor 801 to implement the voice data generation method provided by the method embodiments of the present application.
In some embodiments, the terminal 800 may optionally further include a peripheral device interface 803 and at least one peripheral device. The processor 801, the memory 802 and the peripheral device interface 803 may be connected by a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 803 by a bus, a signal line or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 804, a touch display screen 805, a camera component 806, an audio circuit 807, a positioning component 808 and a power supply 809.
The peripheral device interface 803 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 801 and the memory 802. In some embodiments, the processor 801, the memory 802 and the peripheral device interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802 and the peripheral device interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 804 is used to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices through electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 804 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may also include an NFC (Near Field Communication) related circuit, which is not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video and any combination thereof. When the display screen 805 is a touch display screen, the display screen 805 also has the ability to acquire touch signals on or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. At this time, the display screen 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 805, arranged on the front panel of the terminal 800; in other embodiments, there may be at least two display screens 805, respectively arranged on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display screen 805 may be a flexible display screen, arranged on a curved surface or a folded surface of the terminal 800. The display screen 805 may even be arranged in a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 805 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera component 806 is used to capture images or video. Optionally, the camera component 806 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera component 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals and input them to the processor 801 for processing, or input them to the radio frequency circuit 804 to realize voice communication. For stereo collection or noise reduction, there may be multiple microphones, respectively arranged at different parts of the terminal 800. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 to implement navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia or the Galileo system of the European Union.
The power supply 809 is used to supply power to the various components in the terminal 800. The power supply 809 may be an alternating current, a direct current, a disposable battery or a rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, the terminal 800 further includes one or more sensors 810. The one or more sensors 810 include, but are not limited to, an acceleration sensor 811, a gyroscope sensor 812, a pressure sensor 813, a fingerprint sensor 814, an optical sensor 815 and a proximity sensor 816.
The acceleration sensor 811 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 can be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 801 can control the touch display screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 can also be used to collect motion data of a game or the user.
The gyroscope sensor 812 can detect the body direction and rotation angle of the terminal 800, and can cooperate with the acceleration sensor 811 to collect the user's 3D motion on the terminal 800. Based on the data collected by the gyroscope sensor 812, the processor 801 can implement the following functions: motion sensing (for example, changing the UI according to the tilting operation of the user), image stabilization during shooting, game control and inertial navigation.
The pressure sensor 813 may be arranged on the side frame of the terminal 800 and/or the lower layer of the touch display screen 805. When the pressure sensor 813 is arranged on the side frame of the terminal 800, the user's grip signal on the terminal 800 can be detected, and the processor 801 performs left-hand/right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 813. When the pressure sensor 813 is arranged on the lower layer of the touch display screen 805, the processor 801 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 805. The operable controls include at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used to collect the user's fingerprint; the processor 801 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 801 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be arranged on the front, back or side of the terminal 800. When a physical button or a manufacturer's logo is arranged on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the manufacturer's logo.
The optical sensor 815 is used to collect ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch display screen 805 according to the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 805 is decreased. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera component 806 according to the ambient light intensity collected by the optical sensor 815.
The proximity sensor 816, also referred to as a distance sensor, is usually arranged on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually becomes smaller, the processor 801 controls the touch display screen 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually becomes larger, the processor 801 controls the touch display screen 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will understand that the structure shown in Fig. 8 does not constitute a limitation on the terminal 800, which may include more or fewer components than illustrated, combine certain components, or adopt a different component arrangement.
Fig. 9 is a block diagram of a server 900 according to an exemplary embodiment. The server 900 may vary greatly due to differences in configuration or performance, and may include one or more processors (central processing units, CPUs) 901 and one or more memories 902, where at least one instruction is stored in the memory 902 and is loaded and executed by the processor 901 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The server 900 may be used to perform the steps performed by the server in the above voice data generation method.
In an exemplary embodiment, a computer-readable storage medium is further provided. When the instructions in the storage medium are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method provided by the embodiments of the present disclosure.
In an exemplary embodiment, a computer program product including executable instructions is further provided. When the instructions in the computer program product are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method provided by the embodiments of the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or conventional technical means in the art not disclosed herein. The specification and embodiments are to be regarded as illustrative only, and the true scope and spirit of the present disclosure are indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A voice data generation method, characterized in that the method comprises:
obtaining at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame;
obtaining a target sentence based on the at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and
generating, according to the target sentence, the voice data corresponding to the target sentence.
2. The method according to claim 1, characterized in that performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame comprises:
performing gesture recognition on the hand image of each target video frame, and obtaining the gesture shape of each target video frame based on the hand contour in the hand image of each target video frame; and
determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the correspondence between gesture shapes and gesture types.
3. The method according to claim 2, characterized in that before obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words, the method further comprises:
when the gesture types of a target number of consecutive target video frames are identical, taking the identical gesture type as the gesture type corresponding to the consecutive target video frames.
4. The method according to claim 1, characterized in that obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words comprises:
when the identified gesture type is a target gesture type, obtaining, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time; and
combining the at least one word to obtain the target sentence.
5. The method according to claim 1, characterized in that obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words comprises:
each time a gesture type is identified, obtaining the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and taking the word as the target sentence.
6. The method according to claim 5, characterized in that after generating, according to the target sentence, the voice data corresponding to the target sentence, the method further comprises:
when the identified gesture type is a target gesture type, performing grammar detection on the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time; and
when the grammar detection fails, regenerating a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
7. A voice data generating apparatus, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
a recognition unit, configured to perform gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame;
a sentence generation unit, configured to obtain a target sentence based on the at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and
a voice data generation unit, configured to generate, according to the target sentence, the voice data corresponding to the target sentence.
8. A terminal, characterized by comprising:
one or more processors; and
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method according to any one of claims 1 to 6.
9. A server, characterized by comprising:
one or more processors; and
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910611471.9A CN110322760B (en) | 2019-07-08 | 2019-07-08 | Voice data generation method, device, terminal and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910611471.9A CN110322760B (en) | 2019-07-08 | 2019-07-08 | Voice data generation method, device, terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110322760A true CN110322760A (en) | 2019-10-11 |
CN110322760B CN110322760B (en) | 2020-11-03 |
Family
ID=68123138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910611471.9A Active CN110322760B (en) | 2019-07-08 | 2019-07-08 | Voice data generation method, device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110322760B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716648A (en) * | 2019-10-22 | 2020-01-21 | 上海商汤智能科技有限公司 | Gesture control method and device |
CN110730360A (en) * | 2019-10-25 | 2020-01-24 | 北京达佳互联信息技术有限公司 | Video uploading and playing methods and devices, client equipment and storage medium |
CN110826441A (en) * | 2019-10-25 | 2020-02-21 | 深圳追一科技有限公司 | Interaction method, interaction device, terminal equipment and storage medium |
CN111144287A (en) * | 2019-12-25 | 2020-05-12 | Oppo广东移动通信有限公司 | Audio-visual auxiliary communication method, device and readable storage medium |
CN111354362A (en) * | 2020-02-14 | 2020-06-30 | 北京百度网讯科技有限公司 | Method and device for assisting hearing-impaired communication |
CN113031464A (en) * | 2021-03-22 | 2021-06-25 | 北京市商汤科技开发有限公司 | Device control method, device, electronic device and storage medium |
CN113656644A (en) * | 2021-07-26 | 2021-11-16 | 北京达佳互联信息技术有限公司 | Gesture language recognition method and device, electronic equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101527092A (en) * | 2009-04-08 | 2009-09-09 | 西安理工大学 | Computer assisted hand language communication method under special session context |
CN101605399A (en) * | 2008-06-13 | 2009-12-16 | 英华达(上海)电子有限公司 | A kind of portable terminal and method that realizes Sign Language Recognition |
CN102096467A (en) * | 2010-12-28 | 2011-06-15 | 赵剑桥 | Light-reflecting type mobile sign language recognition system and finger-bending measurement method |
CN103116576A (en) * | 2013-01-29 | 2013-05-22 | 安徽安泰新型包装材料有限公司 | Voice and gesture interactive translation device and control method thereof |
CN108846378A (en) * | 2018-07-03 | 2018-11-20 | 百度在线网络技术(北京)有限公司 | Sign Language Recognition processing method and processing device |
CN109063624A (en) * | 2018-07-26 | 2018-12-21 | 深圳市漫牛医疗有限公司 | Information processing method, system, electronic equipment and computer readable storage medium |
CN109446876A (en) * | 2018-08-31 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Sign language information processing method, device, electronic equipment and readable storage medium storing program for executing |
CN109858357A (en) * | 2018-12-27 | 2019-06-07 | 深圳市赛亿科技开发有限公司 | A kind of gesture identification method and system |
CN109934091A (en) * | 2019-01-17 | 2019-06-25 | 深圳壹账通智能科技有限公司 | Auxiliary manner of articulation, device, computer equipment and storage medium based on image recognition |
- 2019-07-08 CN CN201910611471.9A patent/CN110322760B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101605399A (en) * | 2008-06-13 | 2009-12-16 | 英华达(上海)电子有限公司 | A kind of portable terminal and method that realizes Sign Language Recognition |
CN101527092A (en) * | 2009-04-08 | 2009-09-09 | 西安理工大学 | Computer assisted hand language communication method under special session context |
CN102096467A (en) * | 2010-12-28 | 2011-06-15 | 赵剑桥 | Light-reflecting type mobile sign language recognition system and finger-bending measurement method |
CN103116576A (en) * | 2013-01-29 | 2013-05-22 | 安徽安泰新型包装材料有限公司 | Voice and gesture interactive translation device and control method thereof |
CN108846378A (en) * | 2018-07-03 | 2018-11-20 | 百度在线网络技术(北京)有限公司 | Sign Language Recognition processing method and processing device |
CN109063624A (en) * | 2018-07-26 | 2018-12-21 | 深圳市漫牛医疗有限公司 | Information processing method, system, electronic equipment and computer readable storage medium |
CN109446876A (en) * | 2018-08-31 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Sign language information processing method, device, electronic equipment and readable storage medium storing program for executing |
CN109858357A (en) * | 2018-12-27 | 2019-06-07 | 深圳市赛亿科技开发有限公司 | A kind of gesture identification method and system |
CN109934091A (en) * | 2019-01-17 | 2019-06-25 | 深圳壹账通智能科技有限公司 | Auxiliary manner of articulation, device, computer equipment and storage medium based on image recognition |
Non-Patent Citations (1)
Title |
---|
Chen Xiaobai (陈小柏): "Research on a Vision-Based Continuous Sign Language Recognition System" (基于视觉的连续手语识别系统的研究), China Master's Theses Full-text Database *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716648A (en) * | 2019-10-22 | 2020-01-21 | 上海商汤智能科技有限公司 | Gesture control method and device |
CN110716648B (en) * | 2019-10-22 | 2021-08-24 | 上海商汤智能科技有限公司 | Gesture control method and device |
CN110730360A (en) * | 2019-10-25 | 2020-01-24 | 北京达佳互联信息技术有限公司 | Video uploading and playing methods and devices, client equipment and storage medium |
CN110826441A (en) * | 2019-10-25 | 2020-02-21 | 深圳追一科技有限公司 | Interaction method, interaction device, terminal equipment and storage medium |
CN111144287A (en) * | 2019-12-25 | 2020-05-12 | Oppo广东移动通信有限公司 | Audio-visual auxiliary communication method, device and readable storage medium |
CN111354362A (en) * | 2020-02-14 | 2020-06-30 | 北京百度网讯科技有限公司 | Method and device for assisting hearing-impaired communication |
CN113031464A (en) * | 2021-03-22 | 2021-06-25 | 北京市商汤科技开发有限公司 | Device control method, device, electronic device and storage medium |
CN113656644A (en) * | 2021-07-26 | 2021-11-16 | 北京达佳互联信息技术有限公司 | Gesture language recognition method and device, electronic equipment and storage medium |
CN113656644B (en) * | 2021-07-26 | 2024-03-15 | 北京达佳互联信息技术有限公司 | Gesture language recognition method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110322760B (en) | 2020-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288077B (en) | Method and related device for synthesizing speaking expression based on artificial intelligence | |
CN110322760A (en) | Voice data generation method, device, terminal and storage medium | |
US20200294488A1 (en) | Method, device and storage medium for speech recognition | |
CN110379430A (en) | Voice-based cartoon display method, device, computer equipment and storage medium | |
CN106575500B (en) | Method and apparatus for synthesizing speech based on facial structure | |
JP5843207B2 (en) | Intuitive computing method and system | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
CN111063342B (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN111524501B (en) | Voice playing method, device, computer equipment and computer readable storage medium | |
CN110992927B (en) | Audio generation method, device, computer readable storage medium and computing equipment | |
CN111105788B (en) | Sensitive word score detection method and device, electronic equipment and storage medium | |
CN112116904B (en) | Voice conversion method, device, equipment and storage medium | |
CN112735429B (en) | Method for determining lyric timestamp information and training method of acoustic model | |
CN111031386A (en) | Video dubbing method and device based on voice synthesis, computer equipment and medium | |
CN110162598A (en) | A kind of data processing method and device, a kind of device for data processing | |
CN110148406A (en) | A kind of data processing method and device, a kind of device for data processing | |
CN113750523A (en) | Motion generation method, device, equipment and storage medium for three-dimensional virtual object | |
CN113220590A (en) | Automatic testing method, device, equipment and medium for voice interaction application | |
CN114882862A (en) | Voice processing method and related equipment | |
CN111428079A (en) | Text content processing method and device, computer equipment and storage medium | |
CN109961802A (en) | Sound quality comparative approach, device, electronic equipment and storage medium | |
CN109829067B (en) | Audio data processing method and device, electronic equipment and storage medium | |
CN116580707A (en) | Method and device for generating action video based on voice | |
US12112743B2 (en) | Speech recognition method and apparatus with cascaded hidden layers and speech segments, computer device, and computer-readable storage medium | |
CN115394285A (en) | Voice cloning method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared |