CN110322760A - Voice data generation method, device, terminal and storage medium - Google Patents
Voice data generation method, device, terminal and storage medium
- Publication number
- CN110322760A (Application CN201910611471.9A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- type
- video frame
- target video
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7837—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
- G06F16/784—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B21/00—Teaching, or communicating with, the blind, deaf or mute
- G09B21/009—Teaching or communicating with deaf persons
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- Acoustics & Sound (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The disclosure relates to the field of Internet technology and provides a voice data generation method, device, terminal and storage medium. The method includes: obtaining at least one target video frame from a video to be processed; performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame; obtaining a target sentence based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and generating voice data corresponding to the target sentence. By playing the voice data, the content expressed by the sign language in the video can be understood, enabling barrier-free communication between hearing-impaired and hearing people. The video to be processed can be captured by an ordinary camera, so the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers at no additional cost, and can be easily popularized among hearing-impaired users.
Description
Technical field
This disclosure relates to the field of Internet technology, and in particular to a voice data generation method, device, terminal and storage medium.
Background
China has more than 20 million hearing-impaired people. In daily life they can only communicate with others through sign language or text, but most people do not understand sign language well. Hearing-impaired people therefore have to communicate by writing or by typing text on an electronic device, which significantly reduces communication efficiency.
At present, hearing-impaired people can also communicate with other users through certain motion-sensing devices equipped with a depth camera. Such a device captures the user's gesture motion through the depth camera, analyzes the gesture motion to obtain the corresponding text information, and displays the obtained text on a screen.
However, such motion-sensing devices are usually bulky and cannot be carried around, so this scheme still fails to achieve normal communication between hearing-impaired people and others.
Summary of the invention
The disclosure provides a voice data generation method, device, terminal and storage medium, to at least solve the problem in the related art that communication between hearing-impaired and hearing people is difficult. The technical solution of the disclosure is as follows:
According to a first aspect of embodiments of the present disclosure, a voice data generation method is provided, the method including:
obtaining at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
obtaining a target sentence based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
generating voice data corresponding to the target sentence according to the target sentence.
In a possible implementation, performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on the hand image of each target video frame, and obtaining the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame;
determining the gesture type corresponding to each target video frame based on the gesture shape of the target video frame and a correspondence between gesture shapes and gesture types.
In a possible implementation, before obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words, the method further includes:
when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to the consecutive target video frames.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
when the recognized gesture type is a target gesture type, obtaining, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
combining the at least one word to obtain the target sentence.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
each time a gesture type is recognized, obtaining the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and taking the word as the target sentence.
In a possible implementation, after generating the voice data corresponding to the target sentence according to the target sentence, the method further includes:
when the recognized gesture type is the target gesture type, performing grammar detection on the words corresponding to the target video frames between the first target video frame and the second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
when the grammar detection fails, regenerating a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range of the face image, obtaining timbre data corresponding to the age range based on the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining timbre data corresponding to the gender type based on the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining emotion data corresponding to the change speed of the gesture type based on the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes:
obtaining a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and a correspondence between character elements and pronunciations;
generating the voice data corresponding to the target sentence based on the pronunciation sequence.
In a possible implementation, obtaining the at least one target video frame from the video to be processed includes:
inputting the video to be processed into a convolutional neural network, and splitting the video to be processed into multiple video frames through the convolutional neural network;
for any video frame, when it is detected that the video frame includes a hand image, annotating the hand image and taking the video frame as a target video frame;
when it is detected that the video frame does not include a hand image, discarding the video frame.
According to a second aspect of embodiments of the present disclosure, a voice data generation device is provided, the device including:
an acquisition unit, configured to obtain at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
a recognition unit, configured to perform gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
a sentence generation unit, configured to obtain a target sentence based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
a voice data generation unit, configured to generate voice data corresponding to the target sentence according to the target sentence.
In a possible implementation, the recognition unit includes:
a gesture shape obtaining subunit, configured to perform gesture recognition on the hand image of each target video frame, and to obtain the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame;
a gesture type obtaining subunit, configured to determine the gesture type corresponding to each target video frame based on the gesture shape of the target video frame and a correspondence between gesture shapes and gesture types.
In a possible implementation, the device further includes:
a determination unit, configured to, when a target number of consecutive target video frames have the same gesture type, take that gesture type as the gesture type corresponding to the consecutive target video frames.
In a possible implementation, the sentence generation unit includes:
a word obtaining subunit, configured to, when the recognized gesture type is a target gesture type, obtain, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
a combining subunit, configured to combine the at least one word to obtain the target sentence.
In a possible implementation, the sentence generation unit is further configured to, each time a gesture type is recognized, obtain the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and take the word as the target sentence.
In a possible implementation, the device further includes:
a grammar detection unit, configured to, when the recognized gesture type is the target gesture type, perform grammar detection on the words corresponding to the target video frames between the first target video frame and the second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
the sentence generation unit is configured to, when the grammar detection fails, regenerate a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In a possible implementation, the voice data generation unit is configured to perform any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range of the face image, obtaining timbre data corresponding to the age range based on the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining timbre data corresponding to the gender type based on the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining emotion data corresponding to the change speed of the gesture type based on the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In a possible implementation, the voice data generation unit includes:
a pronunciation sequence obtaining subunit, configured to obtain a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and a correspondence between character elements and pronunciations;
a voice data obtaining subunit, configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
In a possible implementation, the acquisition unit includes:
an input subunit, configured to input the video to be processed into a convolutional neural network and split the video to be processed into multiple video frames through the convolutional neural network;
an annotation subunit, configured to, for any video frame, when it is detected that the video frame includes a hand image, annotate the hand image and take the video frame as a target video frame;
a discarding subunit, configured to, when it is detected that the video frame does not include a hand image, discard the video frame.
According to a third aspect of embodiments of the present disclosure, a terminal is provided, including:
one or more processors;
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method described in any one of the above aspects.
According to a fourth aspect of embodiments of the present disclosure, a server is provided, including:
one or more processors;
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method described in any one of the above aspects.
According to a fifth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided. When instructions in the storage medium are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method described in any one of the above aspects.
According to a sixth aspect of embodiments of the present disclosure, a computer program product including executable instructions is provided. When the instructions in the computer program product are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method described in any one of the above embodiments.
The technical solution provided by the embodiments of the present disclosure at least brings the following beneficial effects:
In the voice data generation method, device, terminal and storage medium provided by the embodiments of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the user's gesture types; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and voice data of the sentence is generated. By playing the voice data, the content expressed by the sign language in the video can be understood, enabling barrier-free communication between hearing-impaired and hearing people. The video to be processed can be captured by an ordinary camera, so the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers at no additional cost, and can be easily popularized among hearing-impaired users.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the disclosure.
Brief description of the drawings
The accompanying drawings herein are incorporated into and constitute a part of this specification, show embodiments consistent with the disclosure, and together with the specification serve to explain the principles of the disclosure; they do not constitute an improper limitation of the disclosure.
Fig. 1 is a flow chart of a voice data generation method according to an exemplary embodiment;
Fig. 2 is a flow chart of a voice data generation method according to an exemplary embodiment;
Fig. 3 is a schematic diagram of target video frames according to an exemplary embodiment;
Fig. 4 is a flow chart of a voice data generation method according to an exemplary embodiment;
Fig. 5 is a flow chart of another voice data generation method according to an exemplary embodiment;
Fig. 6 is a block diagram of a voice data generation device according to an exemplary embodiment;
Fig. 7 is a block diagram of another voice data generation device according to an exemplary embodiment;
Fig. 8 is a block diagram of a terminal according to an exemplary embodiment;
Fig. 9 is a block diagram of a server according to an exemplary embodiment.
Specific embodiment
In order to enable those of ordinary skill in the art to better understand the technical solutions of the disclosure, the technical solutions in the embodiments of the disclosure are described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "target", "second" and the like in the specification, claims and drawings of the disclosure are used to distinguish similar objects, and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the disclosure described herein can be implemented in an order other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the disclosure; rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
The embodiments of the present disclosure can be applied to any scenario in which sign language needs to be translated.
For example, in a live-streaming scenario, the streamer may be a hearing-impaired person. A terminal shoots a video of the streamer and uploads the video to the server of the live-streaming platform. The server analyzes and processes the sign language video, translates the sign language in the video into voice data, and delivers the voice data to viewing terminals, which play the voice data, so that viewers understand the meaning the streamer wants to express and normal communication between the streamer and viewers is achieved.
For another example, in a scenario where a hearing-impaired person communicates face to face with a hearing person, the hearing-impaired person can shoot a video of his or her own sign language with a terminal such as a mobile phone. The terminal analyzes and processes the sign language video, translates the sign language in the video into voice data, and plays the voice data, so that the other person can quickly understand the meaning the user wants to express.
In addition to the above scenarios, the method provided by the embodiments of the present disclosure can also be applied to other scenarios, for example a user watching a video shot by a hearing-impaired person, where the viewing terminal translates the sign language in the video into voice data; the embodiments of the present disclosure do not limit this.
Fig. 1 is a flow chart of a voice data generation method according to an exemplary embodiment. As shown in Fig. 1, the voice data generation method can be applied to a computer device, which can be a terminal such as a mobile phone or a computer, or a server associated with an application, and includes the following steps:
In step S11, at least one target video frame is obtained from a video to be processed, the target video frame being a video frame that includes a hand image.
In step S12, gesture recognition is performed on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame.
In step S13, a target sentence is obtained based on the at least one gesture type and a correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type.
In step S14, voice data corresponding to the target sentence is generated according to the target sentence.
In the voice data generation method, device, terminal and storage medium provided by the embodiments of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the user's gesture types; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and voice data of the sentence is generated. By playing the voice data, the content expressed by the sign language in the video can be understood, enabling barrier-free communication between hearing-impaired and hearing people. The video to be processed can be captured by an ordinary camera, so the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers at no additional cost, and can be easily popularized among hearing-impaired users.
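As a non-limiting illustration of the overall flow of steps S11 to S14, the following minimal Python sketch is provided. The callables detect_hand, classify_gesture and synthesize, and the gesture_to_words table, are placeholders standing in for the detection network, gesture classifier, gesture-word correspondence and speech synthesizer described in this disclosure; they are assumptions for illustration only.

```python
def generate_speech_from_sign_video(frames, detect_hand, classify_gesture,
                                    gesture_to_words, synthesize):
    """frames: decoded video frames; the callables stand in for the networks
    and tables described in steps S11-S14."""
    words, last_gesture = [], None
    for frame in frames:
        hand = detect_hand(frame)            # S11: keep frames with a hand image
        if hand is None:
            continue
        gesture = classify_gesture(hand)     # S12: gesture type of the frame
        if gesture != last_gesture and gesture in gesture_to_words:
            words.append(gesture_to_words[gesture][0])   # S13: gesture type -> word
        last_gesture = gesture
    target_sentence = "".join(words)         # Chinese text needs no spaces
    return synthesize(target_sentence)       # S14: voice data of the sentence
```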
In a possible implementation, performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on the hand image of each target video frame, and obtaining the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame;
determining the gesture type corresponding to each target video frame based on the gesture shape of the target video frame and a correspondence between gesture shapes and gesture types.
In a possible implementation, before obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words, the method further includes:
when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to the consecutive target video frames.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
when the recognized gesture type is a target gesture type, obtaining, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
combining the at least one word to obtain the target sentence.
In a possible implementation, obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words includes:
each time a gesture type is recognized, obtaining the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and taking the word as the target sentence.
In a possible implementation, after generating the voice data corresponding to the target sentence according to the target sentence, the method further includes:
when the recognized gesture type is the target gesture type, performing grammar detection on the words corresponding to the target video frames between the first target video frame and the second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
when the grammar detection fails, regenerating a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range of the face image, obtaining timbre data corresponding to the age range based on the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining timbre data corresponding to the gender type based on the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining emotion data corresponding to the change speed of the gesture type based on the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In a possible implementation, generating the voice data corresponding to the target sentence according to the target sentence includes:
obtaining a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and a correspondence between character elements and pronunciations;
generating the voice data corresponding to the target sentence based on the pronunciation sequence.
In a possible implementation, obtaining the at least one target video frame from the video to be processed includes:
inputting the video to be processed into a convolutional neural network model, and splitting the video to be processed into multiple video frames through the convolutional neural network model;
for any video frame, when it is detected that the video frame includes a hand image, annotating the hand image and taking the video frame as a target video frame;
when it is detected that the video frame does not include a hand image, discarding the video frame.
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the disclosure, which are not described in detail here.
Fig. 2 is a flow chart of a voice data generation method according to an exemplary embodiment. As shown in Fig. 2, this method can be applied to a computer device, which can be a terminal such as a mobile phone or a computer, or a server associated with an application. This embodiment is described using the server as the executing entity, and includes the following steps:
In step S21, the server obtains at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image.
The video to be processed may be a complete video uploaded after a terminal finishes shooting, or a video that is shot by a terminal and sent to the server in real time. The video to be processed is composed of a sequence of still images, each still image being a video frame.
A specific implementation of step S21 may be as follows: after obtaining the video to be processed, the server performs hand image detection on each video frame of the video to determine whether the video frame includes a hand image. When the video frame includes a hand image, the region where the hand image is located is annotated and the video frame is taken as a target video frame; when the video frame does not include a hand image, the video frame is discarded. By discarding useless video frames, the number of video frames to be processed subsequently is reduced, which in turn reduces the computational load of the server and improves processing speed.
The specific process by which the server determines whether a video frame includes a hand image can be implemented by a first network, which can be an SSD (Single Shot MultiBox Detector) network, an HMM (Hidden Markov Model) based network, or another convolutional neural network. Correspondingly, in a possible implementation of step S21, the server splits the video to be processed into multiple video frames; for any video frame, the server obtains the feature data of the video frame through the first network and determines whether the feature data includes target feature data, the target feature data being the feature data corresponding to a hand. When the feature data includes the target feature data, the position of the hand image is determined according to the position of the target feature data, the position of the hand image is marked with a rectangular box, and the target video frame with the rectangular box annotation is output; when the feature data does not include the target feature data, the video frame is discarded. By analyzing the video to be processed with a convolutional neural network, the video can be analyzed quickly and accurately.
Target video frames annotated with rectangular boxes may be as shown in Fig. 3, which shows three target video frames, the hand image in each target video frame being marked out with a rectangular box.
The first network can be obtained by training a convolutional neural network with training samples. For example, in the stage of training the convolutional neural network, a large number of pictures containing hand images can be prepared, and the hand images in these pictures are annotated, that is, the regions where the hand images are located are marked out with rectangular boxes. The annotated pictures are then used to train the convolutional neural network, yielding the trained first network.
It should be noted that this embodiment is described by taking the analysis of the video to be processed by the first network as an example. In some embodiments, the video to be processed can also be analyzed by other methods such as image scanning; the embodiments of the present disclosure do not limit the method used to analyze the video to be processed.
In step S22, the server performs gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame.
In this embodiment, the server can perform gesture recognition on the hand images of the at least one target video frame at either of the following times: (1) after obtaining all target video frames of the video to be processed, gesture recognition is performed on the hand images of the target video frames, which reduces running memory by dividing the processing of the video frames into two stages; (2) after obtaining each target video frame, gesture recognition is performed on the hand image of that target video frame, and after the gesture type of the target video frame is obtained, the step of obtaining the next target video frame is performed; by fully processing each video frame in turn, the real-time performance of communication is improved.
In addition, the specific process by which the server recognizes the hand image of the at least one target video frame may include: the server performs gesture recognition on the hand image of each target video frame and obtains the gesture shape of each target video frame based on the hand contour in the hand image of the target video frame; the gesture type corresponding to each target video frame is then determined based on the gesture shape of the target video frame and the correspondence between gesture shapes and gesture types.
In addition, the analysis of the hand image can be implemented by a second network, which can be an SSD network, an HMM-based network, or another convolutional neural network. Correspondingly, in a possible implementation of step S22, the server performs target detection with the first network to obtain the hand image, and tracks the hand image with the second network to obtain the gesture type corresponding to the hand image. That is, in the embodiments of the present disclosure, while the server classifies a gesture with the second network, it can also perform target detection on the next video frame with the first network; the two networks work together to obtain the gesture type classification, which speeds up gesture classification.
The training process of the second network may be as follows: a large number of pictures of different gesture shapes are prepared and labeled by class. For example, all pictures whose gesture type is "finger heart" are labeled 1, and all pictures whose gesture type is "good" are labeled 2. The labeled pictures are input into a convolutional neural network for training, and the trained second network is obtained.
In addition, the server can also implement the analysis of the hand image through the first network alone, that is, target detection and target classification are implemented by the same network. The server uses the first network to detect whether a video frame includes a hand image and, after a hand image is detected, performs gesture recognition on the hand image to obtain the corresponding gesture type. Only one network is needed to complete target detection and classification, so the algorithm for analyzing the video occupies less memory and is easy for a terminal to call.
It should be noted that, when the gesture type is recognized through the second network, the input to the second network can be the target video frame or the hand image in the target video frame; the embodiments of the present disclosure do not limit this.
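As a non-limiting illustration of obtaining the gesture shape from the hand contour in step S22, the following sketch (assuming OpenCV and a simple threshold-based segmentation, which is only a placeholder for real hand segmentation) extracts the largest contour of the hand region and passes it to a classifier; classify_shape stands in for the second network or the gesture shape/gesture type correspondence.

```python
import cv2  # assumption: OpenCV is available


def gesture_type_from_crop(hand_crop, classify_shape):
    """hand_crop: BGR image of the annotated hand region of a target video frame.
    classify_shape(contour) -> gesture type; a stand-in for the 'second network'."""
    gray = cv2.cvtColor(hand_crop, cv2.COLOR_BGR2GRAY)
    # Otsu threshold is only a placeholder for proper hand/background segmentation.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # [-2] keeps this working with both OpenCV 3.x and 4.x return conventions.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    hand_contour = max(contours, key=cv2.contourArea)  # largest blob = hand profile
    return classify_shape(hand_contour)
```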
In step S23, when a target number of consecutive target video frames have the same gesture type, the server takes that gesture type as the gesture type corresponding to the consecutive target video frames.
Since a video captures multiple video frames per second, the same gesture motion appears in multiple video frames when the user makes a gesture. In the process of changing from one gesture to another, the user may also produce motions corresponding to other gesture types. The gestures produced during such transitions last only a short time, while the sign-language gesture the user intends to make lasts relatively long. In order to determine which motion is the sign-language gesture the user intends and which is merely a transitional motion, when a target number of consecutive target video frames have the same gesture type, the server can take that gesture type as the gesture type corresponding to those consecutive target video frames. In this way, the server generates only one corresponding word or sentence for each gesture the user makes, avoids misrecognizing the intermediate gestures produced during gesture transitions, improves user experience and recognition accuracy, and also avoids generating multiple repeated words for a single motion of the user.
A specific implementation of step S23 may be as follows: after obtaining a gesture type, the server takes that gesture type as the gesture type to be determined, and then obtains the gesture type of the next target video frame. When the gesture type of the next target video frame is the same as the gesture type to be determined, the consecutive count of the gesture type to be determined is increased by 1, and the step of obtaining the gesture type of the next target video frame continues. When the gesture type of the next video frame differs from the gesture type to be determined, the server determines whether the consecutive count of the gesture type to be determined is less than the target number. If the consecutive count is not less than the target number, the gesture type to be determined is determined to be a valid gesture type, and the gesture type of the next video frame is taken as the new gesture type to be determined; if the consecutive count is less than the target number, the gesture type to be determined is determined to be an invalid gesture type, and the gesture type of the next target video frame is taken as the new gesture type to be determined.
The target number can be any value such as 10, 15 or 20, and can be determined by the number of video frames per second, by the speed at which the user's gestures change, or in other ways; the embodiments of the present disclosure do not limit this.
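As a non-limiting illustration of step S23, the following sketch keeps a gesture type only when it persists for at least a target number of consecutive target video frames, discarding short transitional gestures; the default threshold is an assumption chosen from the example values above.

```python
def stable_gesture_types(per_frame_types, target_count=10):
    """per_frame_types: gesture type of each consecutive target video frame.
    A gesture type is kept only if it appears in at least `target_count`
    consecutive frames; shorter runs are treated as transitional motions."""
    effective = []
    pending, run = None, 0
    for gesture in per_frame_types:
        if gesture == pending:
            run += 1
        else:
            if pending is not None and run >= target_count:
                effective.append(pending)     # the pending gesture was valid
            pending, run = gesture, 1
    if pending is not None and run >= target_count:
        effective.append(pending)             # flush the final run
    return effective
```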
In step s 24, when the gesture-type identified is target gesture-type, it is based on the corresponding hand of target video frame
The corresponding relationship of gesture type, gesture-type and word, server obtain between first object video frame and the second target video frame
The corresponding word of target video frame, first object video frame is that this identifies the target video frame of target gesture-type, the
Two target video frames are the preceding target video frame for once identifying target gesture-type.
The target gesture type can be a preset gesture type used to indicate that the statement of a sentence is complete. When the target gesture type is detected, it means that the user wants to indicate that the sentence has been fully expressed. In addition, one gesture type can correspond to at least one word.
The specific process by which the server obtains the words corresponding to the target video frames between the first target video frame and the second target video frame may be as follows: the server obtains the gesture types corresponding to the multiple consecutive target video frames, and obtains from a database the at least one word corresponding to each gesture type, the database storing gesture types in correspondence with the at least one word corresponding to each gesture type.
It should be noted that the embodiments of the present disclosure are described by taking the target gesture as indicating the completion of a sentence as an example. In some embodiments, a button can also be provided on the terminal that shoots the video, and the completion of a sentence is indicated by clicking the button, or the completion of a sentence can be indicated in other ways; the embodiments of the present disclosure do not limit the way in which the server determines whether a sentence is complete.
In step S25, the server combines the at least one word to obtain multiple sentences.
When the server obtains a single word, the word is directly taken as the sentence. When the server obtains multiple words, the specific process of generating sentences may be: obtaining multiple sentences by combining the multiple words in order; or retrieving a corpus based on the multiple words to obtain multiple sentences from the corpus, where the corpus contains a large number of real sentences.
In a possible implementation, the server obtains multiple sentences by combining the multiple words in order. The specific process may be as follows: the server combines the words corresponding to each gesture type according to the chronological order of the gesture types to obtain a sentence. Since some gesture types correspond to multiple words, the server needs to combine each word of such a gesture type with the words of the other gesture types, thereby obtaining multiple sentences. Since the word order of sign language is the same as the word order of spoken expression, the words corresponding to the gesture types can be directly permuted and combined in chronological order, which speeds up sentence generation while ensuring accuracy.
In another possible implementation, the server retrieves a corpus based on the multiple words to obtain multiple sentences from the corpus. The specific process may be as follows: the server stores a corpus locally; after obtaining the multiple words, the server combines them into a search term, retrieves the corpus, and obtains multiple sentences from the corpus, where each sentence includes the word corresponding to each gesture type. By searching for real sentences in the corpus, the fluency of the resulting sentences is ensured.
Since some gesture types correspond to multiple words, each word corresponding to such a gesture type needs to be combined with the words corresponding to the other gesture types to obtain multiple search terms. For each search term, at least one corresponding sentence is obtained from the corpus.
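As a non-limiting illustration of step S25, the following sketch enumerates every combination of candidate words in the chronological order of the gestures; gesture_to_words is an assumed in-memory table standing in for the database described above.

```python
from itertools import product


def candidate_sentences(gesture_sequence, gesture_to_words):
    """gesture_sequence: gesture types in chronological order (the segment
    between two occurrences of the target gesture type).
    gesture_to_words: dict mapping a gesture type to its candidate words."""
    word_lists = [gesture_to_words[g] for g in gesture_sequence
                  if g in gesture_to_words]
    if not word_lists:
        return []
    # One candidate sentence per combination of word choices, with word order
    # following the chronological order of the gestures.
    return ["".join(choice) for choice in product(*word_lists)]
```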
In step S26, the server scores each sentence and takes the sentence with the highest score as the target sentence.
The server can score each sentence according to conditions such as whether the sentence is fluent, whether it includes the word corresponding to each gesture type, and whether the order of the words in the sentence is consistent with the chronological order of the corresponding gesture types. Depending on how the sentences were generated, the server can score them according to different conditions, and can also combine any one or more of these conditions for scoring.
Taking the case where the server obtains multiple sentences by combining multiple words in order as an example, the server can score each sentence according to its fluency and take the sentence with the highest score as the target sentence. Since some gesture types correspond to multiple words whose meanings may differ considerably, the sentence is fluent when the word selected for a gesture type is the word the user intends, and may not be fluent when the selected word is not the one the user intends. By judging the fluency of the sentences, the word the user intends is selected from the multiple words corresponding to a gesture type, which improves the precision of sign language translation.
The server can judge whether a sentence is fluent based on an N-gram algorithm. The N-gram algorithm can judge whether every N adjacent words collocate well; the server can determine, based on the N-gram algorithm, the collocation degree of every N adjacent words in a sentence, and determine the fluency of the sentence based on these collocation degrees, where N can be any number such as 2, 3, 4 or 5, or can be the number of words contained in the sentence. The higher the collocation degree of adjacent words, the more fluent the sentence. Using the N-gram algorithm, the fluency of a sentence can be judged accurately, so that the sentence meeting the user's requirement is determined, which further improves the precision of sign language translation.
Taking the case where the server retrieves the corpus based on the multiple words and obtains multiple sentences from the corpus as an example, each sentence is scored based on the chronological order of the gesture types and the order of the words in the sentence: the more consistent the order of the words in the sentence is with the chronological order of the corresponding gesture types, the higher the score of the sentence. Since the sentences in the corpus are real sentences without word-order or logic problems, the sentences retrieved from the corpus are real sentences from daily life that do not need to be checked for word-order or logic problems; this better simulates communication between ordinary users and improves the effect of sign language translation. Only whether the order of the words in the sentence matches the chronological order of the gesture types needs to be verified, which simplifies the judgment process.
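As a non-limiting illustration of this order-consistency check, the following sketch scores a corpus-retrieved sentence by the fraction of adjacent required words that appear in the same order as their gestures occurred; using the first occurrence of each word is a simplifying assumption.

```python
def order_consistency(sentence, words_in_gesture_order):
    """sentence: a retrieved corpus sentence (a string).
    words_in_gesture_order: required words in the chronological order of the
    gestures. Returns a score in [0, 1]; higher means more consistent order."""
    positions = []
    for word in words_in_gesture_order:
        idx = sentence.find(word)
        if idx < 0:
            return 0.0            # a required word is missing from the sentence
        positions.append(idx)
    if len(positions) < 2:
        return 1.0
    in_order = sum(1 for a, b in zip(positions, positions[1:]) if a <= b)
    return in_order / (len(positions) - 1)
```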
In step S27, the server generates the voice data corresponding to the target sentence based on the target sentence.
The voice data is the audio data of the target sentence.
The specific implementation of step S27 may be as follows: the server obtains a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the correspondence between character elements and pronunciations, and generates the voice data corresponding to the target sentence based on the pronunciation sequence.
The specific process by which the server obtains the pronunciation sequence of the target sentence and generates the corresponding voice data may include the following steps: the server processes the target sentence by text normalization, converting the non-Chinese-character characters in the target sentence into Chinese characters to obtain a first target sentence; the server performs word segmentation and part-of-speech tagging on the first target sentence to obtain at least one segmented word and the part-of-speech result corresponding to each segmented word; based on the part-of-speech result of each segmented word and the correspondence with pronunciations, the pronunciation of each segmentation result is obtained; based on the pronunciation of each segmentation result, prosody prediction is performed on each segmentation result through a prosody model to obtain a pronunciation sequence with prosody tags; the server predicts acoustic parameters for each pronunciation unit in the pronunciation sequence using an acoustic model to obtain the acoustic parameters corresponding to each pronunciation unit; and the server converts the acoustic parameters corresponding to each pronunciation unit into the corresponding voice data. The acoustic model can be an LSTM (Long Short-Term Memory) network model.
By the way that the pronunciation of word segmentation result to be handled to rhythm model, the voice being subsequently generated can be made more lively,
Normal communication between two users of more preferable simulation, enhances user experience, improves sign language interpreter effect.
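As an illustration only, the following Python sketch traces the data flow of the front end described above. The `segmenter`, `pron_dict`, `prosody_model`, `acoustic_model` and `vocoder` objects are hypothetical trained components standing in for the modules named in the disclosure, and the toy digit-to-Chinese-numeral mapping is only a placeholder for full text regularization.

```python
# Toy regularization table: map ASCII digits to Chinese numerals.
DIGIT_MAP = str.maketrans("0123456789", "零一二三四五六七八九")

def normalize_text(sentence):
    # Placeholder for text regularization of non-Chinese-character tokens.
    return sentence.translate(DIGIT_MAP)

def synthesize(target_sentence, segmenter, pron_dict, prosody_model,
               acoustic_model, vocoder):
    """Schematic of the described text-to-speech pipeline (assumed APIs)."""
    # 1. Text regularization produces the first target sentence.
    normalized = normalize_text(target_sentence)

    # 2. Word segmentation with part-of-speech tagging.
    tokens = segmenter.segment(normalized)            # [(word, pos), ...]

    # 3. Pronunciation lookup keyed on (word, pos) to resolve polyphones.
    pronunciations = [pron_dict[(word, pos)] for word, pos in tokens]

    # 4. Prosody prediction adds break/stress tags to the pronunciation sequence.
    pron_seq = prosody_model.predict(tokens, pronunciations)

    # 5. Acoustic parameters per pronunciation unit (e.g. an LSTM acoustic
    #    model), then waveform generation.
    acoustic_params = [acoustic_model.predict(unit) for unit in pron_seq]
    return vocoder.to_waveform(acoustic_params)
```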
In addition, when generating the voice data, the state of the user may also be taken into account, so that voice data consistent with the state of the user is output. In one possible implementation, multiple expression types and tone information corresponding to each expression type are stored in the server. When the target video frame includes a face image, the server performs face recognition on the face image to obtain the expression type corresponding to the face image, and generates first voice data based on the expression type, the tone of the first voice data matching the expression type. For example, when the server detects that the expression type of the user is happy, first voice data with a more cheerful tone may be generated.
In another possible implementation, multiple age ranges and timbre data corresponding to each age range are stored in the server. When the target video frame includes a face image, the server performs face recognition on the face image to obtain the age range to which the face belongs, obtains the timbre data corresponding to that age range, and generates second voice data based on the timbre data, the timbre of the second voice data matching the age range. For example, when the server detects that the age range of the user is 5-10 years old, second voice data with a relatively childlike timbre is generated.
In another possible implementation, gender types and timbre data corresponding to each gender type are stored in the server. When the target video frame includes a face image, face recognition is performed on the face image to obtain the gender type corresponding to the face image; based on the gender type, the corresponding timbre data is obtained, and third voice data is generated based on the timbre data, the timbre of the third voice data matching the gender type. For example, when the server detects that the user is female, third voice data with a female timbre may be generated.
In another possible implementation, multiple change speeds and emotion data corresponding to each change speed are stored in the server. Based on the change speed of the gesture types, the server determines the corresponding emotion data and generates fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed. For example, when the gesture change speed of the user is fast, indicating that the user's mood is relatively excited, fourth voice data with a higher intonation is generated.
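A minimal sketch of how the detected user state might be mapped to synthesis parameters is given below; the lookup tables, parameter names and the gesture-speed threshold are illustrative assumptions, since the disclosure does not specify the stored values.

```python
# Illustrative lookup tables; the actual data stored on the server is not
# specified by the disclosure.
TONE_BY_EXPRESSION = {"happy": "cheerful", "sad": "soft", "neutral": "plain"}
TIMBRE_BY_AGE = {(0, 12): "child", (13, 59): "adult", (60, 120): "elderly"}
TIMBRE_BY_GENDER = {"female": "female_voice", "male": "male_voice"}

def voice_params(expression=None, age=None, gender=None, gesture_speed=None):
    """Map detected user state to synthesis parameters (assumed schema)."""
    params = {}
    if expression is not None:
        params["tone"] = TONE_BY_EXPRESSION.get(expression, "plain")
    if age is not None:
        for (low, high), timbre in TIMBRE_BY_AGE.items():
            if low <= age <= high:
                params["timbre"] = timbre
                break
    if gender is not None:
        params["timbre"] = TIMBRE_BY_GENDER.get(gender, params.get("timbre"))
    if gesture_speed is not None:
        # Faster signing is taken as higher emotional arousal -> higher pitch.
        params["pitch_scale"] = 1.2 if gesture_speed > 1.5 else 1.0
    return params
```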
Combining the above steps, the voice data generation method provided by the embodiment of the present disclosure is shown in Fig. 4: a hearing-impaired person performs a passage of sign language in front of a camera; the camera captures a video containing the sign language; a sign language recognition module analyzes the video to obtain multiple gesture types; a sign language translation module obtains the words corresponding to the gesture types and combines at least one word into a target sentence; and a speech synthesis module generates the voice data of the target sentence, which is played to a hearing person, so that normal communication between the hearing-impaired person and the hearing person is realized.
It should be noted that any one of the above four ways of generating voice data may be chosen, or several of them may be combined, and the user may also select a preferred timbre or tone for generating the voice data. The embodiment of the present disclosure merely illustrates ways of improving the sound effect and does not limit the specific manner of doing so.
In the voice data generation method provided by the embodiment of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the gesture types of the user; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words; and the voice data of the sentence is generated. By subsequently playing the voice data, the content expressed by the sign language in the video can be recognized, realizing barrier-free communication between hearing-impaired people and hearing people. The video to be processed can be shot by an ordinary camera, so the scheme does not depend on special equipment and can run directly on terminals such as mobile phones and computers without additional cost, making it easy to popularize among the hearing-impaired population.
In addition, effective and invalid gestures are distinguished by the duration for which a gesture is detected, which prevents the intermediate gestures produced while a gesture is changing from being misrecognized, improves the accuracy of sign language translation, and improves the user experience.
In addition, after obtaining multiple target sentences, the server may also score the multiple target sentences according to certain conditions and take the highest-scoring sentence as the target sentence, so that the target sentence better matches the user's intention, which improves the user experience and enhances the effect of sign language translation.
In addition, the server may also generate voice data consistent with the state of the user according to that state, so that the voice data better simulates communication between ordinary users and makes the communication process more vivid.
The embodiments illustrated in Fig. 2 to Fig. 4 above are described using the example in which the voice data corresponding to the words is generated after the user finishes expressing a complete sentence. In another possible embodiment, the server may generate the voice data corresponding to a gesture type in real time after the gesture type is obtained. This is further described below with reference to Fig. 5. Fig. 5 is a flowchart of a voice data generation method according to an exemplary embodiment; as shown in Fig. 5, the method is used in a server and includes the following steps:
In step S51, the server obtains at least one target video frame from the video to be processed, the at least one target video frame being a video frame that includes a hand image.
In step S52, the server performs gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame.
In step S53, when the gesture types of a target number of consecutive target video frames are identical, the server takes the identical gesture type as the gesture type corresponding to those consecutive target video frames.
Steps S51 to S53 are similar to steps S21 to S23 and are not described herein again.
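As an illustration only, the following Python sketch shows one way step S53 might work: a gesture type becomes effective only after it has been recognized in a target number of consecutive target video frames, which filters out the transient shapes produced while the hand moves from one gesture to the next. The threshold of 5 frames is an assumption of the example.

```python
def effective_gestures(frame_gesture_types, target_count=5):
    """Yield a gesture type only after it has been recognized in
    `target_count` consecutive target video frames (assumed threshold).

    Transient shapes that appear while the hand is changing gestures never
    reach the threshold and are therefore ignored.
    """
    current, run, emitted = None, 0, False
    for gesture in frame_gesture_types:
        if gesture == current:
            run += 1
        else:
            current, run, emitted = gesture, 1, False
        if run >= target_count and not emitted:
            emitted = True
            yield current
```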
In step S54, each time the server identifies a gesture type, it obtains the word corresponding to that gesture type based on the gesture type and the correspondence between gesture types and words, and takes the word as the target sentence.
Here one gesture type corresponds to one word. Because the word and the gesture type are in one-to-one correspondence, and the word order of sign language is the same as the spoken word order of hearing people, after the server determines a gesture type it can determine the unique word corresponding to that gesture type as the target sentence, and the word can accurately express the semantics of the sign language.
In step S55, the server generates, based on the target sentence, the voice data corresponding to the target sentence.
Step S55 is similar to step S27 and is not described herein again.
In step S56, when the identified gesture type is the target gesture type, the server performs grammar detection on the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time.
That is, when the user finishes expressing an intended sentence through sign language, the server, in addition to outputting the words in real time, arranges the words in chronological order to form a sentence and performs grammar detection on it to determine whether the sentence output in real time is accurate.
In step S57, when the grammar detection fails, the server regenerates a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
That is, when there is a grammar problem, the words are output again; this is similar to steps S24 to S26 and is not described herein again.
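A minimal sketch of the real-time embodiment described in steps S54 to S57 is given below: each effective gesture type is immediately converted to its word and spoken, and when the sentence-end (target) gesture type is recognized, the words accumulated since the previous sentence end are grammar-checked and, if the check fails, a new sentence is regenerated from those words. All callables here are hypothetical stand-ins for the modules described above.

```python
def realtime_pipeline(gesture_stream, word_of_gesture, is_sentence_end,
                      grammar_ok, regenerate_sentence, speak):
    """Sketch of the real-time embodiment (Fig. 5); all parameters are
    hypothetical components, not concrete APIs from the disclosure."""
    buffered_words = []
    for gesture_type in gesture_stream:          # effective gesture types
        if is_sentence_end(gesture_type):        # the "target gesture type"
            sentence = "".join(buffered_words)
            if not grammar_ok(sentence):
                # Grammar detection failed: regenerate a grammatical sentence
                # from the same words and output it again.
                speak(regenerate_sentence(buffered_words))
            buffered_words = []
        else:
            word = word_of_gesture[gesture_type]
            buffered_words.append(word)
            speak(word)                          # real-time per-word output
```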
It should be noted that, when the grammar detection passes, the analysis of the next video frame continues.
In step S58, the server generates, based on the new target sentence, the voice data corresponding to the new target sentence.
Step S58 is similar to step S27 and is not described herein again.
In the voice data generation method provided by the embodiment of the present disclosure, the voice data corresponding to an effective gesture type is output as soon as that gesture type is determined. Real-time translation increases the speed of translation and improves the communication experience between hearing-impaired people and hearing people, better simulating spoken communication between hearing people. Moreover, after a sentence has been output, the server may perform grammar detection on the words and, when there is a grammar problem, regenerate a sentence that conforms to the grammar, which improves the accuracy of the translation.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which are not described herein one by one.
Fig. 6 is a block diagram of a voice data generating apparatus according to an exemplary embodiment. Referring to Fig. 6, the apparatus includes an acquiring unit 601, a recognition unit 602, a sentence generation unit 603 and a voice data generation unit 604.
The acquiring unit 601 is configured to obtain at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
the recognition unit 602 is configured to perform gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame;
the sentence generation unit 603 is configured to obtain a target sentence based on the at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
the voice data generation unit 604 is configured to generate, according to the target sentence, the voice data corresponding to the target sentence.
In the voice data generating apparatus provided by the embodiment of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the gesture types of the user; the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words; and the voice data of the sentence is generated. By subsequently playing the voice data, the content expressed by the sign language in the video can be recognized, realizing barrier-free communication between hearing-impaired people and hearing people. The video to be processed can be shot by an ordinary camera, so the scheme does not depend on special equipment and can run directly on terminals such as mobile phones and computers without additional cost, making it easy to popularize among the hearing-impaired population.
In one possible implementation, as shown in Fig. 7, the recognition unit 602 includes:
a gesture shape obtaining subunit 6021, configured to perform gesture recognition on the hand image of each target video frame and obtain the gesture shape of each target video frame based on the hand contour in the hand image of each target video frame;
a gesture type obtaining subunit 6022, configured to determine the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the correspondence between gesture shapes and gesture types.
In one possible implementation, as shown in Fig. 7, the apparatus further includes:
a determination unit 605, configured to, when the gesture types of a target number of consecutive target video frames are identical, take the identical gesture type as the gesture type corresponding to those consecutive target video frames.
In one possible implementation, as shown in Fig. 7, the sentence generation unit 603 includes:
a word obtaining subunit 6031, configured to, when the identified gesture type is the target gesture type, obtain the words corresponding to the target video frames between a first target video frame and a second target video frame based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time;
a combining subunit 6032, configured to combine the at least one word to obtain the target sentence.
In one possible implementation, as shown in Fig. 7, the sentence generation unit 603 is further configured to, each time a gesture type is identified, obtain the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and take the word as the target sentence.
In one possible implementation, as shown in Fig. 7, the apparatus further includes:
a grammar detection unit 606, configured to, when the identified gesture type is the target gesture type, perform grammar detection on the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time;
the sentence generation unit 603 is configured to, when the grammar detection fails, regenerate a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In one possible implementation, as shown in Fig. 7, the voice data generation unit 604 is configured to perform any one of the following steps:
when the target video frame includes a face image, performing face recognition on the face image to obtain the expression type corresponding to the face image, and generating first voice data based on the expression type, the tone of the first voice data matching the expression type;
when the target video frame includes a face image, performing face recognition on the face image to obtain the age range to which the face belongs, obtaining the timbre data corresponding to the age range, and generating second voice data based on the timbre data, the timbre of the second voice data matching the age range;
when the target video frame includes a face image, performing face recognition on the face image to obtain the gender type corresponding to the face image, obtaining the timbre data corresponding to the gender type, and generating third voice data based on the timbre data, the timbre of the third voice data matching the gender type;
determining, based on the change speed of the gesture types, the emotion data corresponding to the change speed, and generating fourth voice data based on the emotion data, the tone of the fourth voice data matching the change speed.
In one possible implementation, as shown in Fig. 7, the voice data generation unit 604 includes:
a pronunciation sequence obtaining subunit 6041, configured to obtain the pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the correspondence between character elements and pronunciations;
a voice data obtaining subunit 6042, configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, as shown in Fig. 7, the acquiring unit 601 includes:
an input subunit 6011, configured to input the video to be processed into a convolutional neural network model, which splits the video to be processed into multiple video frames;
a marking subunit 6012, configured to, for any video frame, when it is detected that the video frame includes a hand image, annotate the hand image and take the video frame as a target video frame;
a discarding subunit 6013, configured to discard a video frame when it is detected that the video frame does not include a hand image.
It should be understood that the voice data generating apparatus provided by the above embodiment is illustrated, when generating voice data, only by the division into the above functional units; in practical applications, the above functions may be allocated to different functional units as needed, that is, the internal structure of the voice data generating apparatus may be divided into different functional units to complete all or part of the functions described above. In addition, the voice data generating apparatus provided by the above embodiment belongs to the same concept as the voice data generation method embodiments; its specific implementation process is detailed in the method embodiments and is not described herein again.
Fig. 8 is a structural block diagram of a terminal provided by an embodiment of the present disclosure. The terminal 800 is used to perform the steps performed by the terminal in the above embodiments and may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal or desktop terminal.
In general, the terminal 800 includes a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 801 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 802 may include one or more computer-readable storage media, which may be non-transitory. The memory 802 may also include a high-speed random access memory and a non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, and the at least one instruction is executed by the processor 801 to implement the voice data generation method provided by the method embodiments of the present application.
In some embodiments, the terminal 800 may optionally further include a peripheral device interface 803 and at least one peripheral device. The processor 801, the memory 802 and the peripheral device interface 803 may be connected by a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 803 by a bus, a signal line or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 804, a touch display screen 805, a camera component 806, an audio circuit 807, a positioning component 808 and a power supply 809.
The peripheral device interface 803 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 801 and the memory 802. In some embodiments, the processor 801, the memory 802 and the peripheral device interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802 and the peripheral device interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 804 is used to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices through electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 804 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may also include an NFC (Near Field Communication) related circuit, which is not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video and any combination thereof. When the display screen 805 is a touch display screen, the display screen 805 also has the ability to acquire touch signals on or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. At this time, the display screen 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 805, arranged on the front panel of the terminal 800; in other embodiments, there may be at least two display screens 805, respectively arranged on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display screen 805 may be a flexible display screen, arranged on a curved surface or a folded surface of the terminal 800. The display screen 805 may even be arranged in a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 805 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera component 806 is used to capture images or video. Optionally, the camera component 806 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera component 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals and input them to the processor 801 for processing, or input them to the radio frequency circuit 804 to realize voice communication. For stereo collection or noise reduction, there may be multiple microphones, respectively arranged at different parts of the terminal 800. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 to implement navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia or the Galileo system of the European Union.
The power supply 809 is used to supply power to the various components in the terminal 800. The power supply 809 may be an alternating current, a direct current, a disposable battery or a rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, the terminal 800 further includes one or more sensors 810. The one or more sensors 810 include, but are not limited to, an acceleration sensor 811, a gyroscope sensor 812, a pressure sensor 813, a fingerprint sensor 814, an optical sensor 815 and a proximity sensor 816.
The acceleration sensor 811 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 can be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 801 can control the touch display screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 can also be used to collect motion data of a game or the user.
The gyroscope sensor 812 can detect the body direction and rotation angle of the terminal 800, and can cooperate with the acceleration sensor 811 to collect the user's 3D motion on the terminal 800. Based on the data collected by the gyroscope sensor 812, the processor 801 can implement the following functions: motion sensing (for example, changing the UI according to the tilting operation of the user), image stabilization during shooting, game control and inertial navigation.
The pressure sensor 813 may be arranged on the side frame of the terminal 800 and/or the lower layer of the touch display screen 805. When the pressure sensor 813 is arranged on the side frame of the terminal 800, the user's grip signal on the terminal 800 can be detected, and the processor 801 performs left-hand/right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 813. When the pressure sensor 813 is arranged on the lower layer of the touch display screen 805, the processor 801 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 805. The operable controls include at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used to collect the user's fingerprint; the processor 801 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 801 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be arranged on the front, back or side of the terminal 800. When a physical button or a manufacturer's logo is arranged on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the manufacturer's logo.
The optical sensor 815 is used to collect ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch display screen 805 according to the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 805 is decreased. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera component 806 according to the ambient light intensity collected by the optical sensor 815.
The proximity sensor 816, also referred to as a distance sensor, is usually arranged on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually becomes smaller, the processor 801 controls the touch display screen 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually becomes larger, the processor 801 controls the touch display screen 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will understand that the structure shown in Fig. 8 does not constitute a limitation on the terminal 800, which may include more or fewer components than illustrated, combine certain components, or adopt a different component arrangement.
Fig. 9 is a block diagram of a server 900 according to an exemplary embodiment. The server 900 may vary greatly due to differences in configuration or performance, and may include one or more processors (central processing units, CPUs) 901 and one or more memories 902, where at least one instruction is stored in the memory 902 and is loaded and executed by the processor 901 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The server 900 may be used to perform the steps performed by the server in the above voice data generation method.
In an exemplary embodiment, a computer-readable storage medium is further provided. When the instructions in the storage medium are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method provided by the embodiments of the present disclosure.
In an exemplary embodiment, a computer program product including executable instructions is further provided. When the instructions in the computer program product are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method provided by the embodiments of the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or conventional technical means in the art not disclosed herein. The specification and embodiments are to be regarded as illustrative only, and the true scope and spirit of the present disclosure are indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A voice data generation method, characterized in that the method comprises:
obtaining at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame;
obtaining a target sentence based on the at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and
generating, according to the target sentence, the voice data corresponding to the target sentence.
2. The method according to claim 1, characterized in that performing gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame comprises:
performing gesture recognition on the hand image of each target video frame, and obtaining the gesture shape of each target video frame based on the hand contour in the hand image of each target video frame; and
determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the correspondence between gesture shapes and gesture types.
3. The method according to claim 2, characterized in that before obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words, the method further comprises:
when the gesture types of a target number of consecutive target video frames are identical, taking the identical gesture type as the gesture type corresponding to the consecutive target video frames.
4. The method according to claim 1, characterized in that obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words comprises:
when the identified gesture type is a target gesture type, obtaining, based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time; and
combining the at least one word to obtain the target sentence.
5. The method according to claim 1, characterized in that obtaining the target sentence based on the at least one gesture type and the correspondence between gesture types and words comprises:
each time a gesture type is identified, obtaining the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and taking the word as the target sentence.
6. The method according to claim 5, characterized in that after generating, according to the target sentence, the voice data corresponding to the target sentence, the method further comprises:
when the identified gesture type is a target gesture type, performing grammar detection on the words corresponding to the target video frames between a first target video frame and a second target video frame, the first target video frame being the target video frame in which the target gesture type is identified this time, and the second target video frame being the target video frame in which the target gesture type was identified the previous time; and
when the grammar detection fails, regenerating a new target sentence based on the words corresponding to the target video frames between the first target video frame and the second target video frame, the new target sentence including the at least one word.
7. A voice data generating apparatus, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain at least one target video frame from a video to be processed, the target video frame being a video frame that includes a hand image;
a recognition unit, configured to perform gesture recognition on the hand image of the at least one target video frame to obtain the gesture type corresponding to the at least one target video frame;
a sentence generation unit, configured to obtain a target sentence based on the at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and
a voice data generation unit, configured to generate, according to the target sentence, the voice data corresponding to the target sentence.
8. A terminal, characterized by comprising:
one or more processors; and
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method according to any one of claims 1 to 6.
9. A server, characterized by comprising:
one or more processors; and
one or more memories for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the voice data generation method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of a computer device, the computer device is enabled to perform the voice data generation method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910611471.9A CN110322760B (en) | 2019-07-08 | 2019-07-08 | Voice data generation method, device, terminal and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910611471.9A CN110322760B (en) | 2019-07-08 | 2019-07-08 | Voice data generation method, device, terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110322760A true CN110322760A (en) | 2019-10-11 |
CN110322760B CN110322760B (en) | 2020-11-03 |
Family
ID=68123138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910611471.9A Active CN110322760B (en) | 2019-07-08 | 2019-07-08 | Voice data generation method, device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110322760B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716648A (en) * | 2019-10-22 | 2020-01-21 | 上海商汤智能科技有限公司 | Gesture control method and device |
CN110730360A (en) * | 2019-10-25 | 2020-01-24 | 北京达佳互联信息技术有限公司 | Video uploading and playing methods and devices, client equipment and storage medium |
CN110826441A (en) * | 2019-10-25 | 2020-02-21 | 深圳追一科技有限公司 | Interaction method, interaction device, terminal equipment and storage medium |
CN111144287A (en) * | 2019-12-25 | 2020-05-12 | Oppo广东移动通信有限公司 | Audio-visual auxiliary communication method, device and readable storage medium |
CN111354362A (en) * | 2020-02-14 | 2020-06-30 | 北京百度网讯科技有限公司 | Method and device for assisting hearing-impaired communication |
CN113031464A (en) * | 2021-03-22 | 2021-06-25 | 北京市商汤科技开发有限公司 | Device control method, device, electronic device and storage medium |
CN113656644A (en) * | 2021-07-26 | 2021-11-16 | 北京达佳互联信息技术有限公司 | Gesture language recognition method and device, electronic equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101527092A (en) * | 2009-04-08 | 2009-09-09 | 西安理工大学 | Computer assisted hand language communication method under special session context |
CN101605399A (en) * | 2008-06-13 | 2009-12-16 | 英华达(上海)电子有限公司 | A kind of portable terminal and method that realizes Sign Language Recognition |
CN102096467A (en) * | 2010-12-28 | 2011-06-15 | 赵剑桥 | Light-reflecting type mobile sign language recognition system and finger-bending measurement method |
CN103116576A (en) * | 2013-01-29 | 2013-05-22 | 安徽安泰新型包装材料有限公司 | Voice and gesture interactive translation device and control method thereof |
CN108846378A (en) * | 2018-07-03 | 2018-11-20 | 百度在线网络技术(北京)有限公司 | Sign Language Recognition processing method and processing device |
CN109063624A (en) * | 2018-07-26 | 2018-12-21 | 深圳市漫牛医疗有限公司 | Information processing method, system, electronic equipment and computer readable storage medium |
CN109446876A (en) * | 2018-08-31 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Sign language information processing method, device, electronic equipment and readable storage medium storing program for executing |
CN109858357A (en) * | 2018-12-27 | 2019-06-07 | 深圳市赛亿科技开发有限公司 | A kind of gesture identification method and system |
CN109934091A (en) * | 2019-01-17 | 2019-06-25 | 深圳壹账通智能科技有限公司 | Auxiliary manner of articulation, device, computer equipment and storage medium based on image recognition |
- 2019-07-08 CN CN201910611471.9A patent/CN110322760B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101605399A (en) * | 2008-06-13 | 2009-12-16 | 英华达(上海)电子有限公司 | A kind of portable terminal and method that realizes Sign Language Recognition |
CN101527092A (en) * | 2009-04-08 | 2009-09-09 | 西安理工大学 | Computer assisted hand language communication method under special session context |
CN102096467A (en) * | 2010-12-28 | 2011-06-15 | 赵剑桥 | Light-reflecting type mobile sign language recognition system and finger-bending measurement method |
CN103116576A (en) * | 2013-01-29 | 2013-05-22 | 安徽安泰新型包装材料有限公司 | Voice and gesture interactive translation device and control method thereof |
CN108846378A (en) * | 2018-07-03 | 2018-11-20 | 百度在线网络技术(北京)有限公司 | Sign Language Recognition processing method and processing device |
CN109063624A (en) * | 2018-07-26 | 2018-12-21 | 深圳市漫牛医疗有限公司 | Information processing method, system, electronic equipment and computer readable storage medium |
CN109446876A (en) * | 2018-08-31 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Sign language information processing method, device, electronic equipment and readable storage medium storing program for executing |
CN109858357A (en) * | 2018-12-27 | 2019-06-07 | 深圳市赛亿科技开发有限公司 | A kind of gesture identification method and system |
CN109934091A (en) * | 2019-01-17 | 2019-06-25 | 深圳壹账通智能科技有限公司 | Auxiliary manner of articulation, device, computer equipment and storage medium based on image recognition |
Non-Patent Citations (1)
Title |
---|
Chen Xiaobai (陈小柏): "Research on a Vision-Based Continuous Sign Language Recognition System" (基于视觉的连续手语识别系统的研究), China Master's Theses Full-text Database *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716648A (en) * | 2019-10-22 | 2020-01-21 | 上海商汤智能科技有限公司 | Gesture control method and device |
CN110716648B (en) * | 2019-10-22 | 2021-08-24 | 上海商汤智能科技有限公司 | Gesture control method and device |
CN110730360A (en) * | 2019-10-25 | 2020-01-24 | 北京达佳互联信息技术有限公司 | Video uploading and playing methods and devices, client equipment and storage medium |
CN110826441A (en) * | 2019-10-25 | 2020-02-21 | 深圳追一科技有限公司 | Interaction method, interaction device, terminal equipment and storage medium |
CN111144287A (en) * | 2019-12-25 | 2020-05-12 | Oppo广东移动通信有限公司 | Audio-visual auxiliary communication method, device and readable storage medium |
CN111354362A (en) * | 2020-02-14 | 2020-06-30 | 北京百度网讯科技有限公司 | Method and device for assisting hearing-impaired communication |
CN113031464A (en) * | 2021-03-22 | 2021-06-25 | 北京市商汤科技开发有限公司 | Device control method, device, electronic device and storage medium |
CN113656644A (en) * | 2021-07-26 | 2021-11-16 | 北京达佳互联信息技术有限公司 | Gesture language recognition method and device, electronic equipment and storage medium |
CN113656644B (en) * | 2021-07-26 | 2024-03-15 | 北京达佳互联信息技术有限公司 | Gesture language recognition method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110322760B (en) | 2020-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288077B (en) | Method and related device for synthesizing speaking expression based on artificial intelligence | |
CN110322760A (en) | Voice data generation method, device, terminal and storage medium | |
US20200294488A1 (en) | Method, device and storage medium for speech recognition | |
CN110379430A (en) | Voice-based cartoon display method, device, computer equipment and storage medium | |
CN106575500B (en) | Method and apparatus for synthesizing speech based on facial structure | |
JP5843207B2 (en) | Intuitive computing method and system | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
CN111063342B (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN111524501B (en) | Voice playing method, device, computer equipment and computer readable storage medium | |
CN110992927B (en) | Audio generation method, device, computer readable storage medium and computing equipment | |
CN111105788B (en) | Sensitive word score detection method and device, electronic equipment and storage medium | |
CN112116904B (en) | Voice conversion method, device, equipment and storage medium | |
CN112735429B (en) | Method for determining lyric timestamp information and training method of acoustic model | |
CN111031386A (en) | Video dubbing method and device based on voice synthesis, computer equipment and medium | |
CN110162598A (en) | A kind of data processing method and device, a kind of device for data processing | |
CN110148406A (en) | A kind of data processing method and device, a kind of device for data processing | |
CN113750523A (en) | Motion generation method, device, equipment and storage medium for three-dimensional virtual object | |
CN113220590A (en) | Automatic testing method, device, equipment and medium for voice interaction application | |
CN114882862A (en) | Voice processing method and related equipment | |
CN111428079A (en) | Text content processing method and device, computer equipment and storage medium | |
CN109961802A (en) | Sound quality comparative approach, device, electronic equipment and storage medium | |
CN109829067B (en) | Audio data processing method and device, electronic equipment and storage medium | |
CN116580707A (en) | Method and device for generating action video based on voice | |
US12112743B2 (en) | Speech recognition method and apparatus with cascaded hidden layers and speech segments, computer device, and computer-readable storage medium | |
CN115394285A (en) | Voice cloning method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared |