CN110322760B - Voice data generation method, device, terminal and storage medium


Info

Publication number
CN110322760B
CN110322760B (application CN201910611471.9A)
Authority
CN
China
Prior art keywords
video frame
target
gesture type
target video
gesture
Prior art date
Legal status
Active
Application number
CN201910611471.9A
Other languages
Chinese (zh)
Other versions
CN110322760A (en)
Inventor
常兵虎
胡玉坤
车浩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910611471.9A
Publication of CN110322760A
Application granted
Publication of CN110322760B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F 16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 21/00 Teaching, or communicating with, the blind, deaf or mute
    • G09B 21/009 Teaching or communicating with deaf persons
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to a voice data generation method, apparatus, terminal and storage medium in the field of internet technology. The method includes: acquiring at least one target video frame from a video to be processed; performing gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame; obtaining a target sentence based on at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and generating voice data corresponding to the target sentence. By playing the voice data, a listener can understand what the sign language in the video is meant to express, enabling barrier-free communication between hearing-impaired people and hearing people. Because the video to be processed can be shot with an ordinary camera, the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can therefore be readily adopted by people with hearing disabilities.

Description

Voice data generation method, device, terminal and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for generating voice data.
Background
There are more than 20 million people with hearing impairment in China. In daily life they can only communicate with others through sign language or writing, but most people do not understand sign language well, so hearing-impaired people often have to communicate by handwriting or by typing text on electronic devices, which greatly reduces communication efficiency.
At present, hearing-impaired people can also communicate with other users through motion sensing (somatosensory) devices. Such a device is equipped with a depth camera, captures the user's gestures through the depth camera, analyzes the gestures to obtain the corresponding text, and displays the obtained text on a screen.
However, such somatosensory devices are bulky and cannot be carried around by a hearing-impaired person, so this scheme does not enable hearing-impaired people to communicate normally with others in everyday situations.
Disclosure of Invention
The present disclosure provides a voice data generation method, device, terminal and storage medium, to at least solve the problem in the related art of difficult communication between a hearing-impaired person and a hearing person. The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a voice data generation method, including:
acquiring at least one target video frame from a video to be processed, wherein the target video frame is a video frame comprising a hand image;
performing gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
obtaining a target sentence based on at least one gesture type and the corresponding relation between the gesture type and the words, wherein the target sentence comprises the words corresponding to the at least one gesture type;
and generating voice data corresponding to the target sentence according to the target sentence.
In one possible implementation manner, the performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on a hand image of each target video frame, and acquiring a gesture shape of each target video frame based on a hand contour in the hand image in each target video frame;
and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
In a possible implementation manner, before obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word, the method further includes:
and when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to those consecutive target video frames.
In one possible implementation manner, the obtaining a target sentence based on at least one gesture type and a corresponding relationship between the gesture type and a word includes:
when the recognized gesture type is a target gesture type, acquiring words corresponding to the target video frames between a first target video frame and a second target video frame based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, wherein the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
and combining the at least one word to obtain the target sentence.
In one possible implementation manner, the obtaining a target sentence based on at least one gesture type and a corresponding relationship between the gesture type and a word includes:
and when one gesture type is identified, acquiring words corresponding to the gesture type based on the gesture type and the corresponding relation between the gesture type and the words, and taking the words as the target sentence.
In a possible implementation manner, after the generating, according to the target sentence, voice data corresponding to the target sentence, the method further includes:
when the recognized gesture type is a target gesture type, carrying out grammar detection on words corresponding to a target video frame between a first target video frame and a second target video frame, wherein the first target video frame is the target video frame in which the target gesture type is recognized at this time, and the second target video frame is the target video frame in which the target gesture type is recognized at the previous time;
when the grammar detection fails, regenerating a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, wherein the new target sentence comprises the at least one word.
In a possible implementation manner, the generating, according to the target sentence, voice data corresponding to the target sentence includes any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
In one possible implementation manner, the generating, according to the target sentence, voice data corresponding to the target sentence includes:
acquiring a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and pronunciations;
and generating voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, the obtaining at least one target video frame from a video to be processed includes:
inputting the video to be processed into a convolutional neural network, and splitting the video to be processed into a plurality of video frames by the convolutional neural network;
for any video frame, when detecting that the video frame comprises a hand image, marking the hand image, and taking the video frame as a target video frame;
when detecting that no hand image is included in the video frame, discarding the video frame.
According to a second aspect of the embodiments of the present disclosure, there is provided a voice data generation apparatus, including: an acquisition unit configured to acquire at least one target video frame from a video to be processed, the target video frame being a video frame including a hand image;
a recognition unit configured to perform gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
a sentence generation unit configured to obtain a target sentence based on at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
and a voice data generation unit configured to generate voice data corresponding to the target sentence according to the target sentence.
In one possible implementation, the identification unit includes:
a gesture shape acquisition subunit configured to perform gesture recognition on a hand image of each target video frame, and acquire a gesture shape of each target video frame based on a hand contour in the hand image in the each target video frame;
a gesture type obtaining subunit configured to perform determining a gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relationship between the gesture shape and the gesture type.
In one possible implementation, the apparatus further includes:
the determining unit is configured to execute that the gesture types of continuous target video frames with the target number are the same, and take the same gesture type as the gesture type corresponding to the continuous target video frames.
In one possible implementation, the statement generation unit includes:
the word acquisition subunit is configured to, when the recognized gesture type is a target gesture type, acquire a word corresponding to a target video frame between a first target video frame and a second target video frame based on the gesture type corresponding to the target video frame and the corresponding relationship between the gesture type and the word, wherein the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type is recognized last time;
a combining subunit configured to perform combining the at least one word to obtain the target sentence.
In a possible implementation manner, the sentence generation unit is further configured to, when one gesture type is recognized, obtain a word corresponding to the gesture type based on the gesture type and the correspondence between the gesture type and words, and use the word as the target sentence.
In one possible implementation, the apparatus further includes:
the grammar detection unit is configured to execute grammar detection on words corresponding to a target video frame between a first target video frame and a second target video frame when the recognized gesture type is the target gesture type, wherein the first target video frame is the target video frame of which the target gesture type is recognized at this time, and the second target video frame is the target video frame of which the target gesture type is recognized at the previous time;
the sentence generating unit is configured to perform, when the syntax detection fails, regeneration of a new target sentence based on a word corresponding to a target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In one possible implementation, the voice data generating unit is configured to perform any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
In one possible implementation, the voice data generating unit includes:
a pronunciation sequence acquisition subunit configured to execute acquisition of a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relationship between the character elements and pronunciations;
and the voice data acquisition subunit is configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, the obtaining unit includes:
an input subunit configured to perform input of the video to be processed into a convolutional neural network, the convolutional neural network splitting the video to be processed into a plurality of video frames;
the annotation subunit is configured to perform annotation on a hand image when detecting that the hand image is included in any video frame, and take the video frame as a target video frame;
a discarding subunit configured to perform discarding the video frame when it is detected that no hand image is included in the video frame.
According to a third aspect of the embodiments of the present disclosure, there is provided a terminal, including:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the voice data generation method of any of the above aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a server, including:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the voice data generation method of any of the above aspects.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of a computer device, enable the computer device to perform the voice data generation method of any one of the above aspects.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising executable instructions that, when executed by a processor of a computer device, enable the computer device to perform the voice data generation method of any one of the above aspects.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the voice data generation method, the device, the terminal and the storage medium provided by the embodiment of the disclosure, the gesture type of the user is obtained by performing target detection and tracking on the video including the sign language, the sentence corresponding to the sign language is obtained through the corresponding relation between the gesture type and the words, the voice data of the sentence is generated, the content which the sign language in the video is required to express can be known through subsequently playing the voice data, and barrier-free communication between a hearing impaired person and a hearing-strengthened person is realized. The video to be processed can be shot by a common camera, so that the scheme does not depend on specific equipment, can directly run on terminals such as a mobile phone and a computer, has no extra cost, and can be well popularized among people with hearing impairment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a method of voice data generation in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of voice data generation in accordance with an exemplary embodiment;
FIG. 3 is a diagram illustrating a target video frame in accordance with an exemplary embodiment;
FIG. 4 is a flow chart illustrating a method of voice data generation in accordance with an exemplary embodiment;
FIG. 5 is a flow chart illustrating another method of speech data generation in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a speech data generation apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram illustrating another speech data generation apparatus in accordance with an illustrative embodiment;
FIG. 8 is a block diagram illustrating a terminal in accordance with an exemplary embodiment;
FIG. 9 is a block diagram illustrating a server in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The embodiment of the disclosure can be applied to any scene needing sign language translation.
For example, in a live broadcast scene, the anchor can be a hearing-impaired person, the terminal shoots a video of the anchor, the video is uploaded to a server associated with live broadcast software, the server analyzes and processes the sign language video, the sign language in the video is translated into voice data, the voice data is sent to the watching terminal, and the watching terminal plays the voice data, so that the semantic meaning which the anchor wants to express is known, and the normal communication between the anchor and the watching user is realized.
For example, in a scene of face-to-face communication between a hearing impaired person and a hearing impaired person, the hearing impaired person can shoot own sign language video through a terminal such as a mobile phone, analyze and process the sign language video through the terminal, translate sign language in the video into voice data, and play the voice data, so that other people can quickly know the semantic meaning which the user wants to express.
In addition to the above scenarios, the method provided by the embodiment of the present disclosure may also be applied to other scenarios that a user watches a video shot by a hearing-impaired person, and a watching terminal translates a sign language in the video into voice data, and the like.
Fig. 1 is a flowchart illustrating a voice data generating method according to an exemplary embodiment, and as shown in fig. 1, the voice data generating method may be applied to a computer device, where the computer device may be a terminal such as a mobile phone, a computer, or the like, or may be a server associated with an application, and includes the following steps:
in step S11, at least one target video frame is acquired from the video to be processed, the target video frame being a video frame including a hand image.
In step S12, gesture recognition is performed on the hand image of the at least one target video frame, so as to obtain a gesture type corresponding to the at least one target video frame.
In step S13, a target sentence is obtained based on the at least one gesture type and the corresponding relationship between the gesture type and the word, where the target sentence includes the word corresponding to the at least one gesture type.
In step S14, voice data corresponding to the target sentence is generated from the target sentence.
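For illustration only, the four steps S11 to S14 can be pictured as the small pipeline sketched below in Python. The helper functions and the gesture-to-word table are hypothetical stand-ins (assumptions, not part of the disclosure) for the detection network, the classification network, the correspondence table and the speech synthesizer described later.

```python
# Illustrative sketch of steps S11-S14; all helpers and the word table are hypothetical.
from typing import List, Optional

GESTURE_TO_WORDS = {"wave": ["hello"], "thumb_up": ["good", "great"]}  # placeholder table

def detect_hand(frame: dict) -> bool:
    # Stand-in for the first network: a frame is modelled as a toy dict with an optional "gesture" key.
    return "gesture" in frame

def classify_gesture(frame: dict) -> Optional[str]:
    # Stand-in for the second network: return the gesture type of the frame, if any.
    return frame.get("gesture")

def sentence_from_video(video_frames: List[dict]) -> str:
    # S11: keep only frames that contain a hand image (target video frames).
    target_frames = [f for f in video_frames if detect_hand(f)]
    # S12: one gesture type per target video frame.
    gesture_types = [g for g in map(classify_gesture, target_frames) if g]
    # S13: look up the word for each gesture type and join them into a target sentence.
    words = [GESTURE_TO_WORDS[g][0] for g in gesture_types if g in GESTURE_TO_WORDS]
    return " ".join(words)

# S14 would pass the sentence to a text-to-speech engine; see the synthesis sketch later on.
print(sentence_from_video([{"gesture": "wave"}, {}, {"gesture": "thumb_up"}]))  # prints "hello good"
```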
According to the voice data generation method, device, terminal and storage medium provided by the embodiments of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the user's gesture types, the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and voice data of the sentence is generated. By subsequently playing the voice data, a listener can understand what the sign language in the video is meant to express, enabling barrier-free communication between hearing-impaired people and hearing people. Because the video to be processed can be shot with an ordinary camera, the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can therefore be readily adopted by people with hearing impairment.
In one possible implementation manner, performing gesture recognition on a hand image of at least one target video frame to obtain a gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on the hand image of each target video frame, and acquiring a gesture shape of each target video frame based on a hand contour in the hand image in each target video frame;
and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
In one possible implementation manner, before obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word, the method further includes:
and when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to those consecutive target video frames.
In one possible implementation manner, obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word includes:
when the recognized gesture type is a target gesture type, acquiring words corresponding to a target video frame between a first target video frame and a second target video frame based on the gesture type corresponding to the target video frame and the corresponding relation between the gesture type and the words, wherein the first target video frame is the target video frame with the target gesture type recognized at this time, and the second target video frame is the target video frame with the target gesture type recognized at the previous time;
and combining at least one word to obtain the target sentence.
In one possible implementation manner, obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word includes:
and when one gesture type is identified, acquiring words corresponding to the gesture type based on the gesture type and the corresponding relation between the gesture type and the words, and taking the words as target sentences.
In a possible implementation manner, after generating, according to the target sentence, voice data corresponding to the target sentence, the method further includes:
when the recognized gesture type is the target gesture type, carrying out grammar detection on words corresponding to a target video frame between a first target video frame and a second target video frame, wherein the first target video frame is the target video frame of which the target gesture type is recognized at this time, and the second target video frame is the target video frame of which the target gesture type is recognized at the previous time;
and when the grammar detection fails, regenerating a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, wherein the new target sentence comprises at least one word.
In a possible implementation manner, generating voice data corresponding to the target sentence according to the target sentence includes any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises the face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data conforms to the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data accords with the change speed.
In one possible implementation manner, generating, according to the target sentence, voice data corresponding to the target sentence includes:
acquiring a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and the pronunciation;
and generating voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, obtaining at least one target video frame from a video to be processed includes:
inputting a video to be processed into a convolutional neural network model, and splitting the video to be processed into a plurality of video frames by the convolutional neural network model;
for any video frame, when detecting that the video frame comprises a hand image, marking the hand image, and taking the video frame as a target video frame;
and when the hand image is not included in the video frame, discarding the video frame.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 2 is a flowchart illustrating a voice data generating method according to an exemplary embodiment, and as shown in fig. 2, the method may be applied to a computer device, where the computer device may be a terminal such as a mobile phone and a computer, or may be a server associated with an application, and the embodiment takes the server as an execution subject and includes the following steps:
in step S21, the server acquires at least one target video frame from the video to be processed, the target video frame being a video frame including a hand image.
The video to be processed may be a complete video uploaded after the terminal finishes shooting, or a video shot by the terminal and sent to the server in real time. The video to be processed is composed of a sequence of still images, each of which is a video frame.
The specific implementation manner of the step S21 may be: after the server acquires the video to be processed, performing hand image detection on each video frame in the video to be processed, determining whether the video frame comprises a hand image, and marking an area where the hand image is located when the video frame comprises the hand image to obtain a target video frame; when the hand image is not included in the video frame, the video frame is discarded. By discarding a part of useless video frames, the number of video frames to be processed subsequently is reduced, the calculation amount of a server is further reduced, and the processing speed is improved.
The specific process by which the server determines whether a video frame includes a hand image may be implemented by a first network, which may be an SSD (Single Shot MultiBox Detector) network, an HMM (Hidden Markov Model) based network, or another convolutional neural network. Accordingly, in one possible implementation of step S21, the server splits the video to be processed into a plurality of video frames; for any video frame, the server extracts feature data of the video frame using the first network and determines whether the feature data includes target feature data, the target feature data being feature data corresponding to a hand. When the feature data includes the target feature data, the position of the hand image is determined from the position of the target feature data, the hand image is marked with a rectangular box, and the target video frame with the rectangular box mark is output. When the feature data does not include the target feature data, the video frame is discarded. Analyzing the video to be processed with a convolutional neural network allows the video to be analyzed quickly and accurately.
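As a concrete illustration of splitting the video and discarding frames without a hand image, the sketch below uses OpenCV only for frame extraction and marking; the hand detector itself is left as a hypothetical placeholder for the SSD-style first network, so its interface is an assumption rather than the disclosed implementation.

```python
import cv2  # OpenCV, used here only to split the video into frames and draw rectangles

def hand_boxes(frame) -> list:
    """Hypothetical first network: return bounding boxes (x, y, w, h) of detected hands."""
    raise NotImplementedError("plug in an SSD-style hand detector here")

def collect_target_frames(video_path: str):
    """Yield (frame, boxes) pairs for frames that contain a hand image; discard the rest."""
    capture = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:                      # end of the video to be processed
                break
            boxes = hand_boxes(frame)
            if not boxes:                   # no hand image: discard the frame
                continue
            for (x, y, w, h) in boxes:      # mark the hand image with a rectangular box
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            yield frame, boxes              # this is a target video frame
    finally:
        capture.release()

# Usage (once a real detector is plugged in):
# for frame, boxes in collect_target_frames("sign_language.mp4"): ...
```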
The target video frames with rectangular frame marks can be as shown in fig. 3, where fig. 3 shows 3 target video frames, and the hand image in each target video frame is marked by a rectangular frame.
The first network can be obtained by training the convolutional neural network by using the training sample. For example, in a stage of training a convolutional neural network by using a training sample, a large number of pictures including hand images may be prepared, and the hand images in the pictures are labeled, that is, regions where the hand images are located in the pictures are labeled by rectangular frames. And training the convolutional neural network by using the marked picture to obtain a trained first network.
It should be noted that, this embodiment is only described with the analysis of the video to be processed by the first network as an example, in some embodiments, the video to be processed may also be analyzed by other methods such as image scanning, and the method for analyzing the video to be processed in the embodiment of the present disclosure is not limited.
In step S22, the server performs gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame.
In this embodiment, the server may perform gesture recognition on the hand images of the at least one target video frame at either of the following times: (1) after all target video frames of the video to be processed have been acquired, gesture recognition is performed on the hand images of the target video frames; processing in these two separate stages reduces memory usage; (2) each time a target video frame is acquired, gesture recognition is performed on its hand image, and the next target video frame is acquired only after the gesture type of the current one has been obtained; processing each target video frame as soon as it arrives improves the real-time performance of communication.
In addition, the specific process of identifying the hand image of the at least one target video frame by the server can comprise the following processes: the server performs gesture recognition on the hand image of each target video frame, and acquires the gesture shape of each target video frame based on the hand contour in the hand image in each target video frame; and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
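A very simplified reading of "deriving the gesture shape from the hand contour and mapping it to a gesture type" is sketched below with an OpenCV contour descriptor and a toy lookup table. The solidity heuristic, the thresholds, and the table contents are assumptions for illustration; the disclosure itself performs this classification with a trained network, as described next.

```python
import cv2
import numpy as np

# Placeholder correspondence between coarse gesture shapes and gesture types.
SHAPE_TO_TYPE = {"open_hand": "greeting", "fist": "confirm"}

def gesture_shape(hand_crop: np.ndarray) -> str:
    """Derive a coarse gesture shape from the hand contour of a cropped hand image."""
    gray = cv2.cvtColor(hand_crop, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # OpenCV 4.x: findContours returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return "unknown"
    contour = max(contours, key=cv2.contourArea)            # largest contour = hand outline
    hull_area = cv2.contourArea(cv2.convexHull(contour))
    solidity = cv2.contourArea(contour) / hull_area if hull_area else 0.0
    # Toy rule: a fist is nearly convex (high solidity), spread fingers are not.
    return "fist" if solidity > 0.9 else "open_hand"

def gesture_type(hand_crop: np.ndarray) -> str:
    """Map the gesture shape to a gesture type via the correspondence table."""
    return SHAPE_TO_TYPE.get(gesture_shape(hand_crop), "unknown")
```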
In addition, the specific process by which the server analyzes the hand image may be implemented by a second network, which may be an SSD network, an HMM network, or another convolutional neural network. Accordingly, in one possible implementation of step S22, the server performs target detection using the first network to obtain the hand image, and tracks the hand image using the second network to obtain the gesture type corresponding to the hand image. That is, in the embodiment of the present disclosure, while the server classifies a gesture using the second network, it can already perform target detection on the next video frame using the first network; this joint processing by the two networks speeds up gesture classification.
The training process of the second network may be: preparing a large number of pictures with different gesture shapes, and classifying and labeling the pictures. For example, all pictures with gesture type "heart to heart" are numbered 1 and all pictures with gesture type "good" are numbered 2. And inputting the marked picture into a convolutional neural network for training to obtain a trained second network.
In addition, the analysis of the hand image by the server may also be implemented through the first network alone, that is, target detection and target classification are realized by the same network. The server detects through the first network whether the video frame includes a hand image, and after the hand image is detected, performs gesture recognition on it to obtain the corresponding gesture type. Because target detection and target classification are completed by a single network, the video analysis algorithm occupies little memory and is therefore easy to invoke on a terminal.
It should be noted that, when the gesture type is recognized through the second network, the input to the second network may be a target video frame or a hand image in the target video frame, which is not limited in this disclosure.
In step S23, when a target number of consecutive target video frames have the same gesture type, the server takes that gesture type as the gesture type corresponding to those consecutive target video frames.
When a video is shot, many video frames are captured every second, so the same gesture appears in multiple consecutive video frames when the user makes a sign. While transitioning from one gesture to the next, the user may also briefly produce movements that match other gesture types. Because such transitional movements last only a short time, whereas a deliberate sign is held relatively long, the server distinguishes the user's actual signs from transitional movements as follows: when a target number of consecutive target video frames have the same gesture type, the server takes that gesture type as the gesture type corresponding to those consecutive frames and generates only one corresponding word or sentence for the gesture. This avoids misrecognizing intermediate gestures produced during transitions, improves recognition accuracy and user experience, and prevents the server from generating multiple repeated words for a single action of the user.
A specific implementation of step S23 may be as follows: after the server obtains a gesture type, it treats that gesture type as the gesture type to be determined and then obtains the gesture type of the next target video frame. If the gesture type of the next target video frame is the same as the gesture type to be determined, the consecutive count of the gesture type to be determined is increased by 1 and the server continues with the next target video frame. If the gesture type of the next target video frame is different, the server checks whether the consecutive count of the gesture type to be determined has reached the target number: if it has, the gesture type to be determined is a valid gesture type; if it has not, the gesture type to be determined is an invalid gesture type. In either case, the gesture type of the next target video frame then becomes the new gesture type to be determined.
The target number may be any value such as 10, 15, or 20, and the target number may be determined by the number of video frames per second, or the gesture change speed of the user, or other manners, which is not limited in this disclosure.
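The stabilization rule of step S23 amounts to debouncing the per-frame predictions: a gesture type counts only once it has persisted for a target number of consecutive target video frames, and it is emitted only once per run. A minimal sketch, assuming a target number of 10:

```python
from typing import Iterable, Iterator

def stable_gestures(per_frame_types: Iterable[str], target_number: int = 10) -> Iterator[str]:
    """Yield a gesture type once it has persisted for `target_number` consecutive frames."""
    current, run_length, emitted = None, 0, False
    for gesture in per_frame_types:
        if gesture == current:
            run_length += 1
        else:                       # the gesture changed: start counting the new one
            current, run_length, emitted = gesture, 1, False
        if run_length >= target_number and not emitted:
            emitted = True          # emit each stable gesture only once
            yield gesture

# Frames 1-12 show "hello", frames 13-14 a transitional artefact, frames 15-26 "thanks".
frames = ["hello"] * 12 + ["noise"] * 2 + ["thanks"] * 12
print(list(stable_gestures(frames)))   # ['hello', 'thanks']
```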
In step S24, when the recognized gesture type is the target gesture type, the server obtains, based on the gesture type corresponding to each target video frame and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time and the second target video frame is the target video frame in which the target gesture type was recognized the previous time.
The target gesture type may be a preset gesture type used to indicate that a sentence is complete: when the target gesture type is detected, the user is signalling that the sentence has been fully expressed. In addition, one gesture type may correspond to at least one word.
The specific process of the server acquiring the words corresponding to the target video frame between the first target video frame and the second target video frame may be as follows: the server obtains gesture types corresponding to a plurality of continuous target video frames, and obtains at least one word corresponding to each gesture type from a database, wherein the database is used for correspondingly storing the gesture types and the at least one word corresponding to the gesture types.
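The bookkeeping of step S24 can be illustrated as splitting the stream of recognized gesture types at the sentence-end gesture and looking up the candidate words of every type in between. In the sketch below, an in-memory dictionary stands in for the database, and "sentence_end" is an assumed name for the target gesture type.

```python
from typing import Dict, List

# Placeholder for the database that stores each gesture type with its candidate words.
GESTURE_WORDS: Dict[str, List[str]] = {
    "me": ["I", "me"],
    "eat": ["eat", "meal"],
    "finish": ["finished", "already"],
}

def words_between_markers(gesture_stream: List[str], end_marker: str = "sentence_end") -> List[List[str]]:
    """Collect, per gesture, the candidate word lists seen before the sentence-end marker."""
    collected: List[List[str]] = []
    for gesture in gesture_stream:
        if gesture == end_marker:        # the user signalled that the sentence is complete
            break
        if gesture in GESTURE_WORDS:
            collected.append(GESTURE_WORDS[gesture])
    return collected

print(words_between_markers(["me", "eat", "finish", "sentence_end", "me"]))
# [['I', 'me'], ['eat', 'meal'], ['finished', 'already']]
```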
It should be noted that, in the embodiment of the present disclosure, the description is only given by taking an example of representing the completion of a sentence by the target gesture, in some embodiments, a button may be further disposed on the terminal for shooting the video, and the completion of a sentence is represented by clicking the button or by other means.
In step S25, the server combines at least one word to obtain a plurality of sentences.
When the server obtains only one word, that word is directly used as the sentence. When the server obtains multiple words, it may generate sentences in either of two ways: by combining the words in order to obtain multiple sentences; or by retrieving a corpus based on the words and obtaining multiple sentences from the corpus, the corpus containing a large number of real sentences.
In one possible implementation, the server combines the words in order to obtain multiple sentences. Specifically, the server combines one word per gesture type in the chronological order of the gesture types to obtain a sentence; because some gesture types correspond to several words, each word of such a gesture type is combined once with the words of the other gesture types, yielding multiple sentences. Since the word order of sign language matches the word order of spoken language, the words corresponding to the gesture types can be arranged directly in chronological order, which speeds up sentence generation while preserving accuracy.
In another possible implementation, the server retrieves sentences from a corpus based on the words. Specifically, the server stores a corpus locally; when it obtains multiple words, it combines them into search terms, retrieves the corpus, and obtains from it multiple sentences, each of which contains a word corresponding to every gesture type. Retrieving real sentences from the corpus ensures that the resulting sentences are fluent.
Because some gesture types correspond to several words, each word of such a gesture type is combined with the words of the other gesture types to obtain multiple search terms, and at least one sentence corresponding to each search term is retrieved from the corpus.
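Because each gesture type may contribute several candidate words, every combination of one candidate per gesture type yields one candidate sentence (or one search term for the corpus). A small enumeration sketch, reusing the candidate lists from the previous example:

```python
from itertools import product
from typing import List

def candidate_sentences(word_lists: List[List[str]]) -> List[str]:
    """Enumerate one sentence per combination of candidate words, in gesture-time order."""
    return [" ".join(choice) for choice in product(*word_lists)]

candidates = candidate_sentences([["I", "me"], ["eat", "meal"], ["finished", "already"]])
print(len(candidates))   # 2 * 2 * 2 = 8 candidate sentences / search terms
print(candidates[0])     # "I eat finished"
```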
In step S26, the server calculates a score for each sentence, and sets the sentence with the highest score as the target sentence.
The server can score each sentence according to criteria such as whether the sentence is fluent, whether it contains a word for every gesture type, and whether the order of the words in the sentence matches the order in which the corresponding gesture types appeared. Depending on how the sentences were generated, the server may score them against different criteria, and it may combine any one or more of these criteria.
Taking as an example the case where the server combines the words in order to obtain multiple sentences, the server may score each sentence according to its fluency and take the highest-scoring sentence as the target sentence. Because a gesture type may correspond to several words with different meanings, the sentence is fluent when the chosen word is the one the user intends to express, and may not be fluent otherwise. Judging fluency therefore selects, from the candidate words of each gesture type, the word the user intends to express, which improves the accuracy of sign language translation.
The server can judge whether a sentence is fluent based on an N-gram model, which evaluates how well every N adjacent words collocate. The server determines the collocation degree of every N adjacent words in the sentence and derives the fluency of the sentence from these collocation degrees, where N may be 2, 3, 4, 5, or even the number of words in the sentence; the better the adjacent words collocate, the more fluent the sentence. Using an N-gram model allows fluency to be judged accurately, so the sentence that matches the user's intention is selected, further improving the accuracy of sign language translation.
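The fluency judgment can be illustrated with a toy bigram scorer: each candidate sentence is scored by how often its adjacent word pairs appear in some reference text, and the highest-scoring candidate is kept. The reference sentences and the averaging below are illustrative assumptions, not the disclosed N-gram model.

```python
from collections import Counter
from typing import List

def bigram_counts(reference_sentences: List[str]) -> Counter:
    """Count adjacent word pairs in a small reference corpus."""
    counts: Counter = Counter()
    for sentence in reference_sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def fluency_score(sentence: str, counts: Counter) -> float:
    """Score a candidate sentence by the average frequency of its adjacent word pairs."""
    words = sentence.split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(counts[p] for p in pairs) / len(pairs)

reference = ["I already finished eating", "I want to eat", "I have finished my meal"]
counts = bigram_counts(reference)
candidates = ["I eat finished", "I finished eating"]
print(max(candidates, key=lambda s: fluency_score(s, counts)))   # "I finished eating"
```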
Taking as an example the case where the server retrieves multiple sentences from the corpus based on the words, the server can score each sentence according to how closely the order of the words in the sentence matches the order in which the gesture types appeared: the more similar the two orders, the higher the score. Because the sentences in the corpus are real sentences, they have no word-order or logic problems, so the sentences screened from the corpus need no further check of word order or logic; being sentences from daily life, they also better imitate communication between hearing users, which improves the sign language translation effect. Only the match between the word order and the order of appearance of the gesture types needs to be verified, which simplifies the judgment.
In step S27, the server generates speech data corresponding to the target sentence based on the target sentence.
Wherein, the voice data is the audio data of the target sentence.
The specific implementation process of the step S27 may be: the server acquires a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and the pronunciation, and generates voice data corresponding to the target sentence based on the pronunciation sequence.
The specific process by which the server obtains the pronunciation sequence of the target sentence and generates the corresponding voice data may include the following: the server processes the target sentence by text regularization, converting non-Chinese-character content into Chinese characters to obtain a first target sentence; the server performs word segmentation and part-of-speech tagging on the first target sentence to obtain at least one segmented word and its part-of-speech result; the pronunciation of each segmented word is obtained based on the correspondence between part-of-speech results and pronunciations; prosody prediction is performed on each segmented word through a prosody model, based on its pronunciation, to obtain a pronunciation sequence with prosody labels; the server predicts the acoustic parameters of each pronunciation unit in the pronunciation sequence using an acoustic model; and the server converts the acoustic parameters of each pronunciation unit into the corresponding voice data. The acoustic model may be an LSTM (Long Short-Term Memory) network model.
Processing the pronunciations of the segmented words with the prosody model makes the generated speech more natural, better imitates normal communication between two users, enhances the user experience, and improves the sign language translation effect.
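The synthesis chain described above (text regularization, word segmentation with part-of-speech tagging, pronunciation lookup, prosody prediction, acoustic parameter prediction) can be summarized as the data flow sketched below. Every stage is a placeholder standing in for the corresponding model (e.g. the prosody model and the LSTM acoustic model); this is not a working TTS engine.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PronUnit:
    word: str
    pos: str            # part-of-speech tag of the segmented word
    pronunciation: str  # e.g. pinyin with tone
    prosody: str        # prosodic boundary label added by the prosody model

def normalize(sentence: str) -> str:
    """Text regularization: spell out digits, symbols, etc. (placeholder)."""
    return sentence

def segment_and_tag(sentence: str) -> List[tuple]:
    """Word segmentation + POS tagging (placeholder for a real segmenter)."""
    return [(w, "n") for w in sentence.split()]

def to_pronunciation(word: str, pos: str) -> str:
    """Look up the pronunciation of a segmented word given its POS (placeholder)."""
    return word.lower()

def add_prosody(units: List[PronUnit]) -> List[PronUnit]:
    """Prosody model: attach boundary/stress labels to the pronunciation sequence (placeholder)."""
    for u in units:
        u.prosody = "B1"
    return units

def synthesize(sentence: str) -> List[PronUnit]:
    text = normalize(sentence)
    units = [PronUnit(w, pos, to_pronunciation(w, pos), "") for w, pos in segment_and_tag(text)]
    units = add_prosody(units)
    # An acoustic model (e.g. an LSTM) would map each unit to acoustic parameters,
    # and a vocoder would turn those parameters into the final speech waveform.
    return units

print(synthesize("I finished eating"))
```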
In addition, when generating the voice data, the server may also refer to the state of the user and output voice data that matches that state. In one possible implementation manner, a plurality of expression types and tone information corresponding to the expression types are stored in the server. When the target video frame comprises a face image, the server performs face recognition on the face image to obtain the expression type corresponding to the face image, and generates first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type. For example, when the server detects that the expression type of the user is happy, first voice data with a faster tone is generated.
In another possible implementation manner, a plurality of age ranges and timbre data corresponding to the age ranges are stored in the server. When the target video frame comprises a face image, the server performs face recognition on the face image to obtain the age range to which the face image belongs, obtains the timbre data corresponding to the age range, and generates second voice data based on the timbre data, wherein the timbre of the second voice data conforms to the age range. For example, when the server detects that the age range of the user is 5-10 years old, second voice data with a relatively childlike timbre is generated.
In another possible implementation manner, the server stores gender types and timbre data corresponding to the gender types. When the target video frame comprises a face image, the server performs face recognition on the face image to obtain the gender type corresponding to the face image, obtains the timbre data corresponding to the gender type, and generates third voice data based on the timbre data, wherein the timbre of the third voice data conforms to the gender type. For example, when the server detects that the user is female, third voice data with a female timbre is generated.
In another possible implementation manner, a plurality of change speeds and emotion data corresponding to the change speeds are stored in the server. Based on the change speed of the gesture type, the server determines the emotion data corresponding to the change speed, and generates fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed. For example, when the user's gestures change quickly, indicating that the user's emotion is relatively excited, fourth voice data with a higher tone is generated.
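For illustration only, the following Python sketch combines the four personalization cues above into a single voice profile; all of the lookup tables, thresholds and field names are assumptions, since the disclosure only states that such correspondences are stored on the server without giving their contents.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical correspondence tables; the disclosure only says such
# mappings are stored on the server, not what they contain.
EXPRESSION_TO_RATE = {"happy": 1.2, "sad": 0.8, "neutral": 1.0}
AGE_TO_TIMBRE = {(0, 12): "child", (13, 59): "adult", (60, 120): "elder"}
GENDER_TO_TIMBRE = {"female": "female_voice", "male": "male_voice"}

@dataclass
class VoiceProfile:
    rate: float = 1.0        # tone / speaking-rate factor
    timbre: str = "default"
    emotion: str = "calm"

def build_profile(expression: Optional[str],
                  age: Optional[int],
                  gender: Optional[str],
                  gesture_speed: float) -> VoiceProfile:
    """Combine the four personalization cues into one voice profile."""
    profile = VoiceProfile()
    if expression in EXPRESSION_TO_RATE:            # first voice data
        profile.rate = EXPRESSION_TO_RATE[expression]
    if age is not None:                             # second voice data
        for (low, high), timbre in AGE_TO_TIMBRE.items():
            if low <= age <= high:
                profile.timbre = timbre
    if gender in GENDER_TO_TIMBRE:                  # third voice data
        profile.timbre = GENDER_TO_TIMBRE[gender]
    # Fourth voice data: a fast gesture-change speed is read as excitement
    # and raises the tone; the 2.0 threshold is an assumption.
    if gesture_speed > 2.0:
        profile.emotion = "excited"
        profile.rate = max(profile.rate, 1.3)
    return profile

# Example: a happy 8-year-old girl signing quickly.
print(build_profile("happy", age=8, gender="female", gesture_speed=2.5))
```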
Integrating the above steps, the voice data generation method provided by the embodiment of the present disclosure proceeds as shown in Fig. 4: a hearing-impaired person performs a piece of sign language in front of a camera; the camera shoots a video including the sign language; sign language recognition and analysis are performed on the video through a sign language recognition module to obtain a plurality of gesture types; words corresponding to the gesture types are obtained through a sign language translation module, and at least one word is synthesized into a target sentence; voice data of the target sentence is generated through a voice synthesis module; and the voice data is played to a hearing person, thereby realizing normal communication between the hearing-impaired person and the hearing person.
It should be noted that any one or more of the above four ways of generating voice data may be selected and combined, and the user may also select a preferred tone or timbre for generating the voice data.
According to the voice data generation method provided by the embodiment of the disclosure, the gesture types of the user are obtained by performing target detection and tracking on the video including the sign language, the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and the voice data of the sentence is generated; the content that the sign language in the video is intended to express can then be learned by playing the voice data, realizing barrier-free communication between a hearing-impaired person and a hearing person. The video to be processed can be shot by a common camera, so the scheme does not depend on specific equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can be well popularized among people with hearing impairment.
In addition, effective gestures and invalid gestures are distinguished by detecting the duration of the gestures, so that intermediate gestures generated during gesture changes are prevented from being recognized by mistake, which improves the accuracy of sign language translation and the user experience.
In addition, after the server acquires the candidate sentences, their scores are calculated according to certain conditions, and the sentence with the highest score is used as the target sentence, so that the target sentence better meets the requirements of the user, improving the user experience and enhancing the sign language translation effect.
In addition, the server can also generate voice data which is consistent with the state of the user according to the state of the user, so that the voice data better simulates the communication between normal users, and the communication process is more vivid.
The embodiments shown in Figs. 2 to 4 above are described by taking as an example the case in which the voice data corresponding to a sentence is generated after the user finishes expressing the sentence. In another possible embodiment, after acquiring a gesture type, the server generates the voice data corresponding to that gesture type in real time. This is further described below with reference to Fig. 5. Fig. 5 is a flowchart illustrating a voice data generation method according to an exemplary embodiment; as shown in Fig. 5, the method is used in a server and includes the following steps:
in step S51, the server acquires at least one target video frame from the video to be processed, where the at least one target video frame is a video frame including a hand image.
In step S52, the server performs gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame.
In step S53, when the gesture types of a target number of consecutive target video frames are the same, the server takes that same gesture type as the gesture type corresponding to the consecutive target video frames.
Steps S51 to S53 are similar to steps S21 to S23, and are not repeated herein.
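For illustration only, the consecutive-frame judgment used in steps S23 and S53 may be sketched in Python as follows; the target number of 5 frames, the gesture-type labels and the generator-style interface are assumptions, and any run shorter than the target number is treated as a transitional, invalid gesture.

```python
from typing import Iterable, Iterator, Optional

def effective_gestures(frame_gestures: Iterable[str],
                       target_number: int = 5) -> Iterator[str]:
    """Yield a gesture type only after it has been recognized in at least
    `target_number` consecutive target video frames; shorter runs are
    treated as transitional (invalid) gestures and dropped."""
    pending: Optional[str] = None   # the gesture type to be determined
    run_length = 0
    for gesture in frame_gestures:
        if gesture == pending:
            run_length += 1
            continue
        # The pending gesture just ended: keep it only if its run was long enough.
        if pending is not None and run_length >= target_number:
            yield pending
        pending, run_length = gesture, 1
    if pending is not None and run_length >= target_number:
        yield pending

# Per-frame recognition results; the 2-frame transitional run is discarded.
frames = ["G_I"] * 6 + ["G_mid"] * 2 + ["G_want"] * 7 + ["G_eat"] * 5
print(list(effective_gestures(frames)))  # ['G_I', 'G_want', 'G_eat']
```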
In step S54, each time the server recognizes a gesture type, the server obtains the word corresponding to that gesture type based on the gesture type and the correspondence between gesture types and words, and takes the word as a target sentence.
One gesture type corresponds to one word, that is, words and gesture types are in one-to-one correspondence, and the word order of sign language is the same as the word order of the spoken language of a hearing person. Therefore, after the server determines a gesture type, the unique word corresponding to that gesture type can be determined as a target sentence, and the target sentence can accurately express the semantics of the sign language.
In step S55, the server generates speech data corresponding to the target sentence based on the target sentence.
The step S55 is similar to the step S27, and is not repeated here.
In step S56, when the recognized gesture type is the target gesture type, the server performs syntax detection on a word corresponding to a target video frame between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized last time.
When the user finishes expressing a sentence in sign language, the server can also arrange the words that were output in real time in chronological order to form a sentence, and perform grammar detection on that sentence to determine whether the sentence output in real time is accurate.
In step S57, when the syntax detection fails, the server regenerates a new target sentence based on the word corresponding to the target video frame between the first target video frame and the second target video frame, the new target sentence including at least one word.
That is, when there is a problem with the grammar, the sentence is regenerated and output again; this process is similar to steps S24 to S26 and will not be described again.
It should be noted that when the syntax detection passes, the server continues with the step of analyzing and processing the next video frame.
In step S58, the server generates speech data corresponding to the new target sentence based on the new target sentence.
The step S58 is similar to the step S27, and is not repeated here.
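For illustration only, the following Python sketch strings steps S54 to S58 together: each recognized word is spoken immediately, and when the sentence-end gesture arrives the accumulated words are grammar-checked and, if the check fails, a corrected sentence is regenerated and spoken. The callable parameters, the gesture labels and the toy grammar check are assumptions supplied by the caller, not part of the disclosure.

```python
from typing import Callable, Dict, List

def realtime_translate(gesture_stream: List[str],
                       gesture_to_word: Dict[str, str],
                       end_gesture: str,
                       grammar_ok: Callable[[str], bool],
                       regenerate: Callable[[List[str]], str],
                       speak: Callable[[str], None]) -> None:
    """Speak each recognized word as it arrives (steps S54-S55); at each
    sentence-end gesture, grammar-check the words accumulated since the
    previous end gesture and speak a regenerated sentence if the check
    fails (steps S56-S58)."""
    words_since_end: List[str] = []
    for gesture in gesture_stream:
        if gesture != end_gesture:
            word = gesture_to_word[gesture]
            words_since_end.append(word)
            speak(word)                         # real-time output per gesture
            continue
        sentence = "".join(words_since_end)
        if not grammar_ok(sentence):
            speak(regenerate(words_since_end))  # output the corrected sentence
        words_since_end = []                    # start the next sentence

# Toy usage with assumed gesture labels, mapping and checks.
realtime_translate(
    gesture_stream=["G1", "G2", "END"],
    gesture_to_word={"G1": "我", "G2": "吃饭"},
    end_gesture="END",
    grammar_ok=lambda s: "想" in s,        # fails for "我吃饭"
    regenerate=lambda words: "我想吃饭",
    speak=print,
)
```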
According to the voice data generation method provided by the embodiment of the disclosure, after an effective gesture type is determined, the voice data corresponding to the gesture type is output. Through real-time translation, the translation speed is increased, the communication experience between a hearing-impaired person and a hearing person is improved, and spoken communication between the hearing-impaired person and the hearing person can be better simulated. In addition, after the output of a sentence is finished, the server also performs grammar detection on the sentence, and when the grammar of the sentence has a problem, a sentence that conforms to the grammar is regenerated, thereby improving the accuracy of the translation.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
FIG. 6 is a block diagram illustrating a speech data generation apparatus according to an example embodiment. Referring to fig. 6, the apparatus includes an acquisition unit 601, a recognition unit 602, a sentence generation unit 603, and a voice data generation unit 604.
An acquisition unit 601 configured to perform acquisition of at least one target video frame from a video to be processed, the target video frame being a video frame including a hand image;
the recognition unit 602 is configured to perform gesture recognition on a hand image of the at least one target video frame, so as to obtain a gesture type corresponding to the at least one target video frame;
a sentence generating unit 603 configured to obtain a target sentence based on at least one gesture type and the correspondence between gesture types and words, where the target sentence includes the words corresponding to the at least one gesture type;
a voice data generating unit 604 configured to generate voice data corresponding to the target sentence according to the target sentence.
The voice data generation device provided by the embodiment of the disclosure obtains the gesture types of the user by performing target detection and tracking on the video including the sign language, obtains the sentence corresponding to the sign language through the correspondence between gesture types and words, and generates the voice data of the sentence; the content that the sign language in the video is intended to express can then be learned by playing the voice data, realizing barrier-free communication between hearing-impaired people and hearing people. The video to be processed can be shot by a common camera, so the scheme does not depend on specific equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can be well popularized among people with hearing impairment.
In one possible implementation, as shown in fig. 7, the identifying unit 602 includes:
a gesture shape acquisition subunit 6021 configured to perform gesture recognition on the hand image of each target video frame, and acquire a gesture shape of each target video frame based on a hand contour in the hand image in the each target video frame;
a gesture type obtaining subunit 6022 configured to perform determination of a gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relationship between the gesture shape and the gesture type.
In one possible implementation, as shown in fig. 7, the apparatus further includes:
the determining unit 605 is configured to perform, when the gesture types of the consecutive target video frames having the target number are the same, taking the same gesture type as the gesture type corresponding to the consecutive target video frames.
In one possible implementation, as shown in fig. 7, the statement generation unit 603 includes:
a word obtaining subunit 6031, configured to, when the recognized gesture type is the target gesture type, obtain, based on the gesture type corresponding to the target video frame and the corresponding relationship between the gesture type and the word, a word corresponding to the target video frame between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type is recognized last time;
a combining subunit 6032 configured to perform combining the at least one word to obtain the target sentence.
In one possible implementation manner, as shown in fig. 7, the sentence generation unit 603 is further configured to, each time a gesture type is recognized, obtain the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and take the word as the target sentence.
In one possible implementation, as shown in fig. 7, the apparatus further includes:
a syntax detecting unit 606 configured to perform syntax detection on a word corresponding to a target video frame between a first target video frame and a second target video frame when the recognized gesture type is the target gesture type, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized last time;
the sentence generation unit 603 is configured to perform, when the syntax detection fails, regeneration of a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In one possible implementation, as shown in fig. 7, the voice data generating unit 604 is configured to perform any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data is in accordance with the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data conforms to the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
In one possible implementation, as shown in fig. 7, the voice data generating unit 604 includes:
a pronunciation sequence acquisition subunit 6041 configured to acquire a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the correspondence between character elements and pronunciations;
a voice data acquisition subunit 6042 configured to perform generation of voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, as shown in fig. 7, the obtaining unit 601 includes:
an input subunit 6011 configured to perform inputting the video to be processed into a convolutional neural network model, where the convolutional neural network model splits the video to be processed into a plurality of video frames;
an annotation subunit 6012 configured to perform, for any video frame, when it is detected that a hand image is included in the video frame, annotating the hand image, and taking the video frame as a target video frame;
a discarding subunit 6013 configured to perform discarding the video frame when it is detected that the hand image is not included in the video frame.
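For illustration only, the frame splitting, hand-image filtering and discarding performed by the obtaining unit may be sketched in Python as follows; the use of OpenCV for frame extraction and the `detect_hand` callable standing in for the convolutional neural network are assumptions.

```python
from typing import Callable, Iterator, Tuple

import cv2           # OpenCV, assumed available for frame extraction
import numpy as np

def target_video_frames(video_path: str,
                        detect_hand: Callable[[np.ndarray], Tuple[bool, tuple]]
                        ) -> Iterator[Tuple[np.ndarray, tuple]]:
    """Split the video to be processed into frames and keep only the frames
    that contain a hand image, yielding each such target video frame
    together with the annotated hand bounding box."""
    cap = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:                           # end of the video to be processed
                break
            has_hand, box = detect_hand(frame)   # CNN stand-in
            if has_hand:
                yield frame, box                 # target video frame with annotation
            # frames without a hand image are simply discarded
    finally:
        cap.release()
```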
It should be noted that: in the voice data generating device provided in the above embodiment, when generating voice data, only the division of the above functional units is exemplified, and in practical applications, the above functions may be distributed by different functional units according to needs, that is, the internal structure of the voice data generating device may be divided into different functional units to complete all or part of the above described functions. In addition, the voice data generating apparatus and the voice data generating method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 8 is a block diagram of a terminal according to an embodiment of the present disclosure. The terminal 800 is used for executing the steps executed by the terminal in the above embodiments, and may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the speech data generation methods provided by method embodiments herein.
In some embodiments, the terminal 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a touch screen display 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above the surface of the display 805. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 805 may be one, providing the front panel of the terminal 800; in other embodiments, the display 805 may be at least two, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 800. Even further, the display 805 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 805 can be made of LCD (liquid crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 for navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 809 is used to provide power to various components in terminal 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power source 809 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the touch screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the terminal 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side bezel of terminal 800 and/or underneath touch display 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, the holding signal of the user to the terminal 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the touch display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used for collecting a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 814 may be disposed on the front, back, or side of terminal 800. When a physical button or a vendor Logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor Logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch screen 805 based on the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the touch display 805 to switch from the bright screen state to the dark screen state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the touch display 805 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 9 is a block diagram illustrating a server 900 according to an example embodiment. The server 900 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The server 900 may be configured to perform the steps performed by the server in the above-described voice data generation method.
In an exemplary embodiment, there is also provided a computer-readable storage medium, in which instructions, when executed by a processor of a computer device, enable the computer device to perform a voice data generation method provided by an embodiment of the present disclosure.
In an exemplary embodiment, there is also provided a computer program product comprising executable instructions, which when executed by a processor of a computer device, enable the computer device to perform the speech data generation method provided by the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (17)

1. A method of generating speech data, the method comprising:
acquiring at least one target video frame from a video to be processed, wherein the target video frame is a video frame comprising a hand image; performing gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame; obtaining a target sentence based on at least one gesture type and the corresponding relation between the gesture type and the words, wherein the target sentence comprises the words corresponding to the at least one gesture type; generating voice data corresponding to the target sentence according to the target sentence;
the obtaining of the target sentence based on at least one gesture type and the corresponding relationship between the gesture type and the word includes: when the recognized gesture type is a target gesture type, acquiring words corresponding to a target video frame between a first target video frame and a second target video frame based on the at least one gesture type and the corresponding relation between the gesture type and the words, wherein the first target video frame is the target video frame in which the target gesture type is recognized at this time, the second target video frame is the target video frame in which the target gesture type is recognized at the previous time, and the target gesture type is used for representing the completion of the expression of a sentence;
combining the obtained at least one word to obtain the target sentence;
before obtaining the target sentence based on at least one gesture type and the corresponding relationship between the gesture type and the word, the method further includes:
after one gesture type is obtained, the gesture type is used as the gesture type to be determined, and the gesture type of the next target video frame is obtained; when the gesture type of the next target video frame is the same as the gesture type to be determined, adding 1 to the continuous times of the gesture type to be determined, and continuing to execute the step of obtaining the gesture type of the next target video frame; when the gesture type of the next target video frame is different from the gesture type to be determined, determining whether the continuous times of the gesture type to be determined are greater than a target number, if the continuous times of the gesture type to be determined are not less than the target number, determining that the gesture type to be determined is an effective gesture type, taking the same gesture type as the gesture type corresponding to the continuous target video frame, and taking the gesture type of the next target video frame as the gesture type to be determined; if the number of times of occurrence of the gesture type to be determined is smaller than the target number, determining the gesture type to be determined as an invalid gesture type, and taking the gesture type of the next target video as the gesture type to be determined.
2. The method according to claim 1, wherein the performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame comprises:
performing gesture recognition on a hand image of each target video frame, and acquiring a gesture shape of each target video frame based on a hand contour in the hand image in each target video frame;
and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
3. The method of claim 1, wherein obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word comprises:
and when one gesture type is identified, acquiring words corresponding to the gesture type based on the gesture type and the corresponding relation between the gesture type and the words, and taking the words as the target sentence.
4. The method according to claim 3, wherein after generating the voice data corresponding to the target sentence according to the target sentence, the method further comprises:
when the recognized gesture type is the target gesture type, performing grammar detection on words corresponding to a target video frame between the first target video frame and the second target video frame;
when the grammar detection fails, regenerating a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
5. The method according to claim 1, wherein the generating the voice data corresponding to the target sentence according to the target sentence comprises any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
6. The method of claim 1, wherein the generating the voice data corresponding to the target sentence according to the target sentence comprises:
acquiring a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and pronunciations;
and generating voice data corresponding to the target sentence based on the pronunciation sequence.
7. The method according to claim 1, wherein said obtaining at least one target video frame from the video to be processed comprises:
inputting the video to be processed into a convolutional neural network, and splitting the video to be processed into a plurality of video frames by the convolutional neural network;
for any video frame, when detecting that the video frame comprises a hand image, marking the hand image, and taking the video frame as a target video frame;
when detecting that no hand image is included in the video frame, discarding the video frame.
8. An apparatus for generating speech data, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire at least one target video frame from a video to be processed, and the target video frame is a video frame comprising a hand image;
the recognition unit is configured to perform gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
the sentence generating unit is configured to execute corresponding relations between at least one gesture type and the gesture type and words to obtain a target sentence, and the target sentence comprises the words corresponding to the at least one gesture type;
a voice data generation unit configured to execute generation of voice data corresponding to the target sentence according to the target sentence;
the sentence generation unit includes: a word obtaining subunit, configured to, when the recognized gesture type is a target gesture type, obtain, based on the at least one gesture type and a corresponding relationship between the gesture type and a word, a word corresponding to a target video frame between a first target video frame and a second target video frame, where the first target video frame is a target video frame in which the target gesture type is recognized this time, the second target video frame is a target video frame in which the target gesture type is recognized last time, and the target gesture type is used to indicate that a sentence expression is completed; the combination subunit is configured to perform combination of the acquired at least one word to obtain the target sentence;
the apparatus is for: after one gesture type is obtained, the gesture type is used as the gesture type to be determined, and the gesture type of the next target video frame is obtained; when the gesture type of the next target video frame is the same as the gesture type to be determined, adding 1 to the continuous times of the gesture type to be determined, and continuing to execute the step of obtaining the gesture type of the next target video frame; when the gesture type of the next target video frame is different from the gesture type to be determined, determining whether the continuous times of the gesture type to be determined are greater than a target number, if the continuous times of the gesture type to be determined are not less than the target number, determining that the gesture type to be determined is an effective gesture type, taking the same gesture type as the gesture type corresponding to the continuous target video frame, and taking the gesture type of the next target video frame as the gesture type to be determined; if the number of times of occurrence of the gesture type to be determined is smaller than the target number, determining the gesture type to be determined as an invalid gesture type, and taking the gesture type of the next target video as the gesture type to be determined.
9. The apparatus of claim 8, wherein the identification unit comprises:
a gesture shape acquisition subunit configured to perform gesture recognition on a hand image of each target video frame, and acquire a gesture shape of each target video frame based on a hand contour in the hand image in the each target video frame;
a gesture type obtaining subunit configured to perform determining a gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relationship between the gesture shape and the gesture type.
10. The apparatus according to claim 8, wherein the sentence generation unit is further configured to, every time one gesture type is recognized, obtain a word corresponding to the gesture type based on the gesture type and a correspondence between the gesture type and the word, and take the word as the target sentence.
11. The apparatus of claim 10, further comprising:
a grammar detection unit configured to perform grammar detection on a word corresponding to a target video frame between the first target video frame and the second target video frame when the recognized gesture type is the target gesture type;
the sentence generating unit is configured to perform, when the syntax detection fails, regeneration of a new target sentence based on a word corresponding to a target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
12. The apparatus according to claim 8, wherein the voice data generating unit is configured to perform any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
13. The apparatus of claim 8, wherein the voice data generating unit comprises:
a pronunciation sequence acquisition subunit configured to execute acquisition of a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relationship between the character elements and pronunciations;
and the voice data acquisition subunit is configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
14. The apparatus of claim 8, wherein the obtaining unit comprises:
an input subunit configured to perform input of the video to be processed into a convolutional neural network, the convolutional neural network splitting the video to be processed into a plurality of video frames;
the annotation subunit is configured to perform annotation on a hand image when detecting that the hand image is included in any video frame, and take the video frame as a target video frame;
a discarding subunit configured to perform discarding the video frame when it is detected that no hand image is included in the video frame.
15. A terminal, comprising:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the speech data generation method of any of claims 1 to 7.
16. A server, comprising:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the speech data generation method of any of claims 1 to 7.
17. A computer-readable storage medium in which instructions, when executed by a processor of a computer device, enable the computer device to perform the speech data generation method of any one of claims 1 to 7.
CN201910611471.9A 2019-07-08 2019-07-08 Voice data generation method, device, terminal and storage medium Active CN110322760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910611471.9A CN110322760B (en) 2019-07-08 2019-07-08 Voice data generation method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN110322760A CN110322760A (en) 2019-10-11
CN110322760B true CN110322760B (en) 2020-11-03

Family

ID=68123138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910611471.9A Active CN110322760B (en) 2019-07-08 2019-07-08 Voice data generation method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110322760B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716648B (en) * 2019-10-22 2021-08-24 上海商汤智能科技有限公司 Gesture control method and device
CN110826441B (en) * 2019-10-25 2022-10-28 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN111144287B (en) * 2019-12-25 2023-06-09 Oppo广东移动通信有限公司 Audiovisual auxiliary communication method, device and readable storage medium
CN111354362A (en) * 2020-02-14 2020-06-30 北京百度网讯科技有限公司 Method and device for assisting hearing-impaired communication
CN113031464B (en) * 2021-03-22 2022-11-22 北京市商汤科技开发有限公司 Device control method, device, electronic device and storage medium
CN113656644B (en) * 2021-07-26 2024-03-15 北京达佳互联信息技术有限公司 Gesture language recognition method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101605399A (en) * 2008-06-13 2009-12-16 英华达(上海)电子有限公司 A kind of portable terminal and method that realizes Sign Language Recognition
CN101527092A (en) * 2009-04-08 2009-09-09 西安理工大学 Computer assisted hand language communication method under special session context
CN103116576A (en) * 2013-01-29 2013-05-22 安徽安泰新型包装材料有限公司 Voice and gesture interactive translation device and control method thereof
CN108846378A (en) * 2018-07-03 2018-11-20 百度在线网络技术(北京)有限公司 Sign Language Recognition processing method and processing device
CN109063624A (en) * 2018-07-26 2018-12-21 深圳市漫牛医疗有限公司 Information processing method, system, electronic equipment and computer readable storage medium
CN109446876B (en) * 2018-08-31 2020-11-06 百度在线网络技术(北京)有限公司 Sign language information processing method and device, electronic equipment and readable storage medium
CN109858357A (en) * 2018-12-27 2019-06-07 深圳市赛亿科技开发有限公司 A kind of gesture identification method and system
CN109934091A (en) * 2019-01-17 2019-06-25 深圳壹账通智能科技有限公司 Auxiliary manner of articulation, device, computer equipment and storage medium based on image recognition

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096467A (en) * 2010-12-28 2011-06-15 赵剑桥 Light-reflecting type mobile sign language recognition system and finger-bending measurement method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于视觉的连续手语识别系统的研究;陈小柏;《中国优秀硕士学位论文全文数据库》;20140531;第14-40页 *

Also Published As

Publication number Publication date
CN110322760A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN108615526B (en) Method, device, terminal and storage medium for detecting keywords in voice signal
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN110556127B (en) Method, device, equipment and medium for detecting voice recognition result
CN110572716B (en) Multimedia data playing method, device and storage medium
CN110992927B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN112735429B (en) Method for determining lyric timestamp information and training method of acoustic model
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN112116904B (en) Voice conversion method, device, equipment and storage medium
CN111370025A (en) Audio recognition method and device and computer storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN113220590A (en) Automatic testing method, device, equipment and medium for voice interaction application
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN109829067B (en) Audio data processing method and device, electronic equipment and storage medium
CN110837557B (en) Abstract generation method, device, equipment and medium
CN113409770A (en) Pronunciation feature processing method, pronunciation feature processing device, pronunciation feature processing server and pronunciation feature processing medium
CN110337030B (en) Video playing method, device, terminal and computer readable storage medium
CN112786025B (en) Method for determining lyric timestamp information and training method of acoustic model
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN111091807B (en) Speech synthesis method, device, computer equipment and storage medium
CN113744736A (en) Command word recognition method and device, electronic equipment and storage medium
CN113362836A (en) Vocoder training method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared