CN110322760B - Voice data generation method, device, terminal and storage medium


Info

Publication number
CN110322760B
CN110322760B (application CN201910611471.9A)
Authority
CN
China
Prior art keywords
video frame
target
gesture type
target video
gesture
Prior art date
Legal status
Active
Application number
CN201910611471.9A
Other languages
Chinese (zh)
Other versions
CN110322760A (en)
Inventor
常兵虎
胡玉坤
车浩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910611471.9A
Publication of CN110322760A
Application granted
Publication of CN110322760B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F 16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 21/00 Teaching, or communicating with, the blind, deaf or mute
    • G09B 21/009 Teaching or communicating with deaf persons
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to a voice data generation method, apparatus, terminal and storage medium in the field of internet technology. The method includes: acquiring at least one target video frame from a video to be processed; performing gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame; obtaining a target sentence based on at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type; and generating voice data corresponding to the target sentence. By playing the voice data, a listener can understand what the sign language in the video is meant to express, enabling barrier-free communication between hearing-impaired people and hearing people. Because the video to be processed can be shot with an ordinary camera, the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can therefore be readily adopted by people with hearing disabilities.

Description

Voice data generation method, device, terminal and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for generating voice data.
Background
There are more than 20 million people with hearing impairment in China. In daily life they can only communicate with others through sign language or writing, but most people do not understand sign language well, so hearing-impaired people often have to communicate by handwriting or by typing text on electronic devices, which greatly reduces communication efficiency.
At present, hearing-impaired people can also communicate with other users through motion sensing (somatosensory) devices. Such a device is equipped with a depth camera, captures the user's gestures through the depth camera, analyzes the gestures to obtain the corresponding text, and displays the obtained text on a screen.
However, such somatosensory devices are bulky and cannot be carried around by a hearing-impaired person, so this scheme does not enable hearing-impaired people to communicate normally with others in everyday situations.
Disclosure of Invention
The present disclosure provides a voice data generation method, device, terminal and storage medium, to at least solve the problem in the related art of difficult communication between a hearing-impaired person and a hearing person. The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a voice data generation method, including:
acquiring at least one target video frame from a video to be processed, wherein the target video frame is a video frame comprising a hand image;
performing gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
obtaining a target sentence based on at least one gesture type and the corresponding relation between the gesture type and the words, wherein the target sentence comprises the words corresponding to the at least one gesture type;
and generating voice data corresponding to the target sentence according to the target sentence.
In one possible implementation manner, the performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on a hand image of each target video frame, and acquiring a gesture shape of each target video frame based on a hand contour in the hand image in each target video frame;
and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
In a possible implementation manner, before obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word, the method further includes:
and when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to those consecutive target video frames.
In one possible implementation manner, the obtaining a target sentence based on at least one gesture type and a corresponding relationship between the gesture type and a word includes:
when the recognized gesture type is a target gesture type, acquiring words corresponding to the target video frames between a first target video frame and a second target video frame based on the gesture types corresponding to the target video frames and the correspondence between gesture types and words, wherein the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized the previous time;
and combining the at least one word to obtain the target sentence.
In one possible implementation manner, the obtaining a target sentence based on at least one gesture type and a corresponding relationship between the gesture type and a word includes:
and when one gesture type is identified, acquiring words corresponding to the gesture type based on the gesture type and the corresponding relation between the gesture type and the words, and taking the words as the target sentence.
In a possible implementation manner, after the generating, according to the target sentence, voice data corresponding to the target sentence, the method further includes:
when the recognized gesture type is a target gesture type, carrying out grammar detection on words corresponding to a target video frame between a first target video frame and a second target video frame, wherein the first target video frame is the target video frame in which the target gesture type is recognized at this time, and the second target video frame is the target video frame in which the target gesture type is recognized at the previous time;
when the grammar detection fails, regenerating a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, wherein the new target sentence comprises the at least one word.
In a possible implementation manner, the generating, according to the target sentence, voice data corresponding to the target sentence includes any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
In one possible implementation manner, the generating, according to the target sentence, voice data corresponding to the target sentence includes:
acquiring a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and pronunciations;
and generating voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, the obtaining at least one target video frame from a video to be processed includes:
inputting the video to be processed into a convolutional neural network, and splitting the video to be processed into a plurality of video frames by the convolutional neural network;
for any video frame, when detecting that the video frame comprises a hand image, marking the hand image, and taking the video frame as a target video frame;
when detecting that no hand image is included in the video frame, discarding the video frame.
According to a second aspect of the embodiments of the present disclosure, there is provided a voice data generation apparatus, including: an acquisition unit configured to acquire at least one target video frame from a video to be processed, the target video frame being a video frame including a hand image;
a recognition unit configured to perform gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
a sentence generation unit configured to obtain a target sentence based on at least one gesture type and the correspondence between gesture types and words, the target sentence including the word corresponding to the at least one gesture type;
and a voice data generation unit configured to generate voice data corresponding to the target sentence according to the target sentence.
In one possible implementation, the identification unit includes:
a gesture shape acquisition subunit configured to perform gesture recognition on a hand image of each target video frame, and acquire a gesture shape of each target video frame based on a hand contour in the hand image in the each target video frame;
a gesture type obtaining subunit configured to perform determining a gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relationship between the gesture shape and the gesture type.
In one possible implementation, the apparatus further includes:
the determining unit is configured to execute that the gesture types of continuous target video frames with the target number are the same, and take the same gesture type as the gesture type corresponding to the continuous target video frames.
In one possible implementation, the statement generation unit includes:
the word acquisition subunit is configured to, when the recognized gesture type is a target gesture type, acquire a word corresponding to a target video frame between a first target video frame and a second target video frame based on the gesture type corresponding to the target video frame and the corresponding relationship between the gesture type and the word, wherein the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type is recognized last time;
a combining subunit configured to perform combining the at least one word to obtain the target sentence.
In a possible implementation manner, the sentence generation unit is further configured to, when one gesture type is recognized, obtain a word corresponding to the gesture type based on the gesture type and the correspondence between the gesture type and words, and use the word as the target sentence.
In one possible implementation, the apparatus further includes:
the grammar detection unit is configured to execute grammar detection on words corresponding to a target video frame between a first target video frame and a second target video frame when the recognized gesture type is the target gesture type, wherein the first target video frame is the target video frame of which the target gesture type is recognized at this time, and the second target video frame is the target video frame of which the target gesture type is recognized at the previous time;
the sentence generating unit is configured to perform, when the syntax detection fails, regeneration of a new target sentence based on a word corresponding to a target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In one possible implementation, the voice data generating unit is configured to perform any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
In one possible implementation, the voice data generating unit includes:
a pronunciation sequence acquisition subunit configured to execute acquisition of a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relationship between the character elements and pronunciations;
and the voice data acquisition subunit is configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, the obtaining unit includes:
an input subunit configured to perform input of the video to be processed into a convolutional neural network, the convolutional neural network splitting the video to be processed into a plurality of video frames;
the annotation subunit is configured to perform annotation on a hand image when detecting that the hand image is included in any video frame, and take the video frame as a target video frame;
a discarding subunit configured to perform discarding the video frame when it is detected that no hand image is included in the video frame.
According to a third aspect of the embodiments of the present disclosure, there is provided a terminal, including:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the voice data generation method of any of the above aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a server, including:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the voice data generation method of any of the above aspects.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of a computer device, enable the computer device to perform the voice data generation method of any one of the above aspects.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising executable instructions that, when executed by a processor of a computer device, enable the computer device to perform the voice data generation method of any one of the above aspects.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the voice data generation method, the device, the terminal and the storage medium provided by the embodiment of the disclosure, the gesture type of the user is obtained by performing target detection and tracking on the video including the sign language, the sentence corresponding to the sign language is obtained through the corresponding relation between the gesture type and the words, the voice data of the sentence is generated, the content which the sign language in the video is required to express can be known through subsequently playing the voice data, and barrier-free communication between a hearing impaired person and a hearing-strengthened person is realized. The video to be processed can be shot by a common camera, so that the scheme does not depend on specific equipment, can directly run on terminals such as a mobile phone and a computer, has no extra cost, and can be well popularized among people with hearing impairment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a method of voice data generation in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of voice data generation in accordance with an exemplary embodiment;
FIG. 3 is a diagram illustrating a target video frame in accordance with an exemplary embodiment;
FIG. 4 is a flow chart illustrating a method of voice data generation in accordance with an exemplary embodiment;
FIG. 5 is a flow chart illustrating another method of speech data generation in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a speech data generation apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram illustrating another speech data generation apparatus in accordance with an illustrative embodiment;
FIG. 8 is a block diagram illustrating a terminal in accordance with an exemplary embodiment;
FIG. 9 is a block diagram illustrating a server in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The embodiment of the disclosure can be applied to any scene needing sign language translation.
For example, in a live broadcast scene, the anchor can be a hearing-impaired person, the terminal shoots a video of the anchor, the video is uploaded to a server associated with live broadcast software, the server analyzes and processes the sign language video, the sign language in the video is translated into voice data, the voice data is sent to the watching terminal, and the watching terminal plays the voice data, so that the semantic meaning which the anchor wants to express is known, and the normal communication between the anchor and the watching user is realized.
For example, in a scene of face-to-face communication between a hearing impaired person and a hearing impaired person, the hearing impaired person can shoot own sign language video through a terminal such as a mobile phone, analyze and process the sign language video through the terminal, translate sign language in the video into voice data, and play the voice data, so that other people can quickly know the semantic meaning which the user wants to express.
In addition to the above scenarios, the method provided by the embodiment of the present disclosure may also be applied to other scenarios that a user watches a video shot by a hearing-impaired person, and a watching terminal translates a sign language in the video into voice data, and the like.
Fig. 1 is a flowchart illustrating a voice data generating method according to an exemplary embodiment, and as shown in fig. 1, the voice data generating method may be applied to a computer device, where the computer device may be a terminal such as a mobile phone, a computer, or the like, or may be a server associated with an application, and includes the following steps:
in step S11, at least one target video frame is acquired from the video to be processed, the target video frame being a video frame including a hand image.
In step S12, gesture recognition is performed on the hand image of the at least one target video frame, so as to obtain a gesture type corresponding to the at least one target video frame.
In step S13, a target sentence is obtained based on the at least one gesture type and the corresponding relationship between the gesture type and the word, where the target sentence includes the word corresponding to the at least one gesture type.
In step S14, voice data corresponding to the target sentence is generated from the target sentence.
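For illustration only, the four steps S11 to S14 can be pictured as the small pipeline sketched below in Python. The helper functions and the gesture-to-word table are hypothetical stand-ins (assumptions, not part of the disclosure) for the detection network, the classification network, the correspondence table and the speech synthesizer described later.

```python
# Illustrative sketch of steps S11-S14; all helpers and the word table are hypothetical.
from typing import List, Optional

GESTURE_TO_WORDS = {"wave": ["hello"], "thumb_up": ["good", "great"]}  # placeholder table

def detect_hand(frame: dict) -> bool:
    # Stand-in for the first network: a frame is modelled as a toy dict with an optional "gesture" key.
    return "gesture" in frame

def classify_gesture(frame: dict) -> Optional[str]:
    # Stand-in for the second network: return the gesture type of the frame, if any.
    return frame.get("gesture")

def sentence_from_video(video_frames: List[dict]) -> str:
    # S11: keep only frames that contain a hand image (target video frames).
    target_frames = [f for f in video_frames if detect_hand(f)]
    # S12: one gesture type per target video frame.
    gesture_types = [g for g in map(classify_gesture, target_frames) if g]
    # S13: look up the word for each gesture type and join them into a target sentence.
    words = [GESTURE_TO_WORDS[g][0] for g in gesture_types if g in GESTURE_TO_WORDS]
    return " ".join(words)

# S14 would pass the sentence to a text-to-speech engine; see the synthesis sketch later on.
print(sentence_from_video([{"gesture": "wave"}, {}, {"gesture": "thumb_up"}]))  # prints "hello good"
```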
According to the voice data generation method, device, terminal and storage medium provided by the embodiments of the present disclosure, target detection and tracking are performed on a video containing sign language to obtain the user's gesture types, the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and voice data of the sentence is generated. By subsequently playing the voice data, a listener can understand what the sign language in the video is meant to express, enabling barrier-free communication between hearing-impaired people and hearing people. Because the video to be processed can be shot with an ordinary camera, the scheme does not depend on special equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can therefore be readily adopted by people with hearing impairment.
In one possible implementation manner, performing gesture recognition on a hand image of at least one target video frame to obtain a gesture type corresponding to the at least one target video frame includes:
performing gesture recognition on the hand image of each target video frame, and acquiring a gesture shape of each target video frame based on a hand contour in the hand image in each target video frame;
and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
In one possible implementation manner, before obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word, the method further includes:
and when a target number of consecutive target video frames have the same gesture type, taking that gesture type as the gesture type corresponding to those consecutive target video frames.
In one possible implementation manner, obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word includes:
when the recognized gesture type is a target gesture type, acquiring words corresponding to a target video frame between a first target video frame and a second target video frame based on the gesture type corresponding to the target video frame and the corresponding relation between the gesture type and the words, wherein the first target video frame is the target video frame with the target gesture type recognized at this time, and the second target video frame is the target video frame with the target gesture type recognized at the previous time;
and combining at least one word to obtain the target sentence.
In one possible implementation manner, obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word includes:
and when one gesture type is identified, acquiring words corresponding to the gesture type based on the gesture type and the corresponding relation between the gesture type and the words, and taking the words as target sentences.
In a possible implementation manner, after generating, according to the target sentence, voice data corresponding to the target sentence, the method further includes:
when the recognized gesture type is the target gesture type, carrying out grammar detection on words corresponding to a target video frame between a first target video frame and a second target video frame, wherein the first target video frame is the target video frame of which the target gesture type is recognized at this time, and the second target video frame is the target video frame of which the target gesture type is recognized at the previous time;
and when the grammar detection fails, regenerating a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, wherein the new target sentence comprises at least one word.
In a possible implementation manner, generating voice data corresponding to the target sentence according to the target sentence includes any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises the face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data conforms to the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data accords with the change speed.
In one possible implementation manner, generating, according to the target sentence, voice data corresponding to the target sentence includes:
acquiring a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and the pronunciation;
and generating voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, obtaining at least one target video frame from a video to be processed includes:
inputting a video to be processed into a convolutional neural network model, and splitting the video to be processed into a plurality of video frames by the convolutional neural network model;
for any video frame, when detecting that the video frame comprises a hand image, marking the hand image, and taking the video frame as a target video frame;
and when the hand image is not included in the video frame, discarding the video frame.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 2 is a flowchart illustrating a voice data generating method according to an exemplary embodiment, and as shown in fig. 2, the method may be applied to a computer device, where the computer device may be a terminal such as a mobile phone and a computer, or may be a server associated with an application, and the embodiment takes the server as an execution subject and includes the following steps:
in step S21, the server acquires at least one target video frame from the video to be processed, the target video frame being a video frame including a hand image.
The video to be processed may be a complete video uploaded after the terminal finishes shooting, or a video shot by the terminal and sent to the server in real time. The video to be processed is composed of a sequence of still images, each of which is a video frame.
The specific implementation manner of the step S21 may be: after the server acquires the video to be processed, performing hand image detection on each video frame in the video to be processed, determining whether the video frame comprises a hand image, and marking an area where the hand image is located when the video frame comprises the hand image to obtain a target video frame; when the hand image is not included in the video frame, the video frame is discarded. By discarding a part of useless video frames, the number of video frames to be processed subsequently is reduced, the calculation amount of a server is further reduced, and the processing speed is improved.
The specific process by which the server determines whether a video frame includes a hand image may be implemented by a first network, which may be an SSD (Single Shot MultiBox Detector) network, an HMM (Hidden Markov Model) based network, or another convolutional neural network. Accordingly, in one possible implementation of step S21, the server splits the video to be processed into a plurality of video frames; for any video frame, the server extracts feature data of the video frame using the first network and determines whether the feature data includes target feature data, the target feature data being feature data corresponding to a hand. When the feature data includes the target feature data, the position of the hand image is determined from the position of the target feature data, the hand image is marked with a rectangular box, and the target video frame with the rectangular box mark is output. When the feature data does not include the target feature data, the video frame is discarded. Analyzing the video to be processed with a convolutional neural network allows the video to be analyzed quickly and accurately.
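As a concrete illustration of splitting the video and discarding frames without a hand image, the sketch below uses OpenCV only for frame extraction and marking; the hand detector itself is left as a hypothetical placeholder for the SSD-style first network, so its interface is an assumption rather than the disclosed implementation.

```python
import cv2  # OpenCV, used here only to split the video into frames and draw rectangles

def hand_boxes(frame) -> list:
    """Hypothetical first network: return bounding boxes (x, y, w, h) of detected hands."""
    raise NotImplementedError("plug in an SSD-style hand detector here")

def collect_target_frames(video_path: str):
    """Yield (frame, boxes) pairs for frames that contain a hand image; discard the rest."""
    capture = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:                      # end of the video to be processed
                break
            boxes = hand_boxes(frame)
            if not boxes:                   # no hand image: discard the frame
                continue
            for (x, y, w, h) in boxes:      # mark the hand image with a rectangular box
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            yield frame, boxes              # this is a target video frame
    finally:
        capture.release()

# Usage (once a real detector is plugged in):
# for frame, boxes in collect_target_frames("sign_language.mp4"): ...
```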
The target video frames with rectangular frame marks can be as shown in fig. 3, where fig. 3 shows 3 target video frames, and the hand image in each target video frame is marked by a rectangular frame.
The first network can be obtained by training the convolutional neural network by using the training sample. For example, in a stage of training a convolutional neural network by using a training sample, a large number of pictures including hand images may be prepared, and the hand images in the pictures are labeled, that is, regions where the hand images are located in the pictures are labeled by rectangular frames. And training the convolutional neural network by using the marked picture to obtain a trained first network.
It should be noted that, this embodiment is only described with the analysis of the video to be processed by the first network as an example, in some embodiments, the video to be processed may also be analyzed by other methods such as image scanning, and the method for analyzing the video to be processed in the embodiment of the present disclosure is not limited.
In step S22, the server performs gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame.
In this embodiment, the server may perform gesture recognition on the hand images of the at least one target video frame at either of the following times: (1) after all target video frames of the video to be processed have been acquired, gesture recognition is performed on the hand images of the target video frames; processing in these two separate stages reduces memory usage; (2) each time a target video frame is acquired, gesture recognition is performed on its hand image, and the next target video frame is acquired only after the gesture type of the current one has been obtained; processing each target video frame as soon as it arrives improves the real-time performance of communication.
In addition, the specific process of identifying the hand image of the at least one target video frame by the server can comprise the following processes: the server performs gesture recognition on the hand image of each target video frame, and acquires the gesture shape of each target video frame based on the hand contour in the hand image in each target video frame; and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
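A very simplified reading of "deriving the gesture shape from the hand contour and mapping it to a gesture type" is sketched below with an OpenCV contour descriptor and a toy lookup table. The solidity heuristic, the thresholds, and the table contents are assumptions for illustration; the disclosure itself performs this classification with a trained network, as described next.

```python
import cv2
import numpy as np

# Placeholder correspondence between coarse gesture shapes and gesture types.
SHAPE_TO_TYPE = {"open_hand": "greeting", "fist": "confirm"}

def gesture_shape(hand_crop: np.ndarray) -> str:
    """Derive a coarse gesture shape from the hand contour of a cropped hand image."""
    gray = cv2.cvtColor(hand_crop, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # OpenCV 4.x: findContours returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return "unknown"
    contour = max(contours, key=cv2.contourArea)            # largest contour = hand outline
    hull_area = cv2.contourArea(cv2.convexHull(contour))
    solidity = cv2.contourArea(contour) / hull_area if hull_area else 0.0
    # Toy rule: a fist is nearly convex (high solidity), spread fingers are not.
    return "fist" if solidity > 0.9 else "open_hand"

def gesture_type(hand_crop: np.ndarray) -> str:
    """Map the gesture shape to a gesture type via the correspondence table."""
    return SHAPE_TO_TYPE.get(gesture_shape(hand_crop), "unknown")
```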
In addition, the specific process by which the server analyzes the hand image may be implemented by a second network, which may be an SSD network, an HMM network, or another convolutional neural network. Accordingly, in one possible implementation of step S22, the server performs target detection using the first network to obtain the hand image, and tracks the hand image using the second network to obtain the gesture type corresponding to the hand image. That is, in the embodiment of the present disclosure, while the server classifies a gesture using the second network, it can already perform target detection on the next video frame using the first network; this joint processing by the two networks speeds up gesture classification.
The training process of the second network may be: preparing a large number of pictures with different gesture shapes, and classifying and labeling the pictures. For example, all pictures with gesture type "heart to heart" are numbered 1 and all pictures with gesture type "good" are numbered 2. And inputting the marked picture into a convolutional neural network for training to obtain a trained second network.
In addition, the analysis of the hand image by the server may also be implemented through the first network alone, that is, target detection and target classification are realized by the same network. The server detects through the first network whether the video frame includes a hand image, and after the hand image is detected, performs gesture recognition on it to obtain the corresponding gesture type. Because target detection and target classification are completed by a single network, the video analysis algorithm occupies little memory and is therefore easy to invoke on a terminal.
It should be noted that, when the gesture type is recognized through the second network, the input to the second network may be a target video frame or a hand image in the target video frame, which is not limited in this disclosure.
In step S23, when a target number of consecutive target video frames have the same gesture type, the server takes that gesture type as the gesture type corresponding to those consecutive target video frames.
When a video is shot, many video frames are captured every second, so the same gesture appears in multiple consecutive video frames when the user makes a sign. While transitioning from one gesture to the next, the user may also briefly produce movements that match other gesture types. Because such transitional movements last only a short time, whereas a deliberate sign is held relatively long, the server distinguishes the user's actual signs from transitional movements as follows: when a target number of consecutive target video frames have the same gesture type, the server takes that gesture type as the gesture type corresponding to those consecutive frames and generates only one corresponding word or sentence for the gesture. This avoids misrecognizing intermediate gestures produced during transitions, improves recognition accuracy and user experience, and prevents the server from generating multiple repeated words for a single action of the user.
A specific implementation of step S23 may be as follows: after the server obtains a gesture type, it treats that gesture type as the gesture type to be determined and then obtains the gesture type of the next target video frame. If the gesture type of the next target video frame is the same as the gesture type to be determined, the consecutive count of the gesture type to be determined is increased by 1 and the server continues with the next target video frame. If the gesture type of the next target video frame is different, the server checks whether the consecutive count of the gesture type to be determined has reached the target number: if it has, the gesture type to be determined is a valid gesture type; if it has not, the gesture type to be determined is an invalid gesture type. In either case, the gesture type of the next target video frame then becomes the new gesture type to be determined.
The target number may be any value such as 10, 15, or 20, and the target number may be determined by the number of video frames per second, or the gesture change speed of the user, or other manners, which is not limited in this disclosure.
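The stabilization rule of step S23 amounts to debouncing the per-frame predictions: a gesture type counts only once it has persisted for a target number of consecutive target video frames, and it is emitted only once per run. A minimal sketch, assuming a target number of 10:

```python
from typing import Iterable, Iterator

def stable_gestures(per_frame_types: Iterable[str], target_number: int = 10) -> Iterator[str]:
    """Yield a gesture type once it has persisted for `target_number` consecutive frames."""
    current, run_length, emitted = None, 0, False
    for gesture in per_frame_types:
        if gesture == current:
            run_length += 1
        else:                       # the gesture changed: start counting the new one
            current, run_length, emitted = gesture, 1, False
        if run_length >= target_number and not emitted:
            emitted = True          # emit each stable gesture only once
            yield gesture

# Frames 1-12 show "hello", frames 13-14 a transitional artefact, frames 15-26 "thanks".
frames = ["hello"] * 12 + ["noise"] * 2 + ["thanks"] * 12
print(list(stable_gestures(frames)))   # ['hello', 'thanks']
```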
In step S24, when the recognized gesture type is the target gesture type, the server obtains, based on the gesture type corresponding to each target video frame and the correspondence between gesture types and words, the words corresponding to the target video frames between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time and the second target video frame is the target video frame in which the target gesture type was recognized the previous time.
The target gesture type may be a preset gesture type used to indicate that a sentence is complete: when the target gesture type is detected, the user is signalling that the sentence has been fully expressed. In addition, one gesture type may correspond to at least one word.
The specific process of the server acquiring the words corresponding to the target video frame between the first target video frame and the second target video frame may be as follows: the server obtains gesture types corresponding to a plurality of continuous target video frames, and obtains at least one word corresponding to each gesture type from a database, wherein the database is used for correspondingly storing the gesture types and the at least one word corresponding to the gesture types.
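The bookkeeping of step S24 can be illustrated as splitting the stream of recognized gesture types at the sentence-end gesture and looking up the candidate words of every type in between. In the sketch below, an in-memory dictionary stands in for the database, and "sentence_end" is an assumed name for the target gesture type.

```python
from typing import Dict, List

# Placeholder for the database that stores each gesture type with its candidate words.
GESTURE_WORDS: Dict[str, List[str]] = {
    "me": ["I", "me"],
    "eat": ["eat", "meal"],
    "finish": ["finished", "already"],
}

def words_between_markers(gesture_stream: List[str], end_marker: str = "sentence_end") -> List[List[str]]:
    """Collect, per gesture, the candidate word lists seen before the sentence-end marker."""
    collected: List[List[str]] = []
    for gesture in gesture_stream:
        if gesture == end_marker:        # the user signalled that the sentence is complete
            break
        if gesture in GESTURE_WORDS:
            collected.append(GESTURE_WORDS[gesture])
    return collected

print(words_between_markers(["me", "eat", "finish", "sentence_end", "me"]))
# [['I', 'me'], ['eat', 'meal'], ['finished', 'already']]
```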
It should be noted that, in the embodiment of the present disclosure, the description is only given by taking an example of representing the completion of a sentence by the target gesture, in some embodiments, a button may be further disposed on the terminal for shooting the video, and the completion of a sentence is represented by clicking the button or by other means.
In step S25, the server combines at least one word to obtain a plurality of sentences.
When the server obtains only one word, that word is directly used as the sentence. When the server obtains multiple words, it may generate sentences in either of two ways: by combining the words in order to obtain multiple sentences; or by retrieving a corpus based on the words and obtaining multiple sentences from the corpus, the corpus containing a large number of real sentences.
In one possible implementation, the server combines the words in order to obtain multiple sentences. Specifically, the server combines one word per gesture type in the chronological order of the gesture types to obtain a sentence; because some gesture types correspond to several words, each word of such a gesture type is combined once with the words of the other gesture types, yielding multiple sentences. Since the word order of sign language matches the word order of spoken language, the words corresponding to the gesture types can be arranged directly in chronological order, which speeds up sentence generation while preserving accuracy.
In another possible implementation, the server retrieves sentences from a corpus based on the words. Specifically, the server stores a corpus locally; when it obtains multiple words, it combines them into search terms, retrieves the corpus, and obtains from it multiple sentences, each of which contains a word corresponding to every gesture type. Retrieving real sentences from the corpus ensures that the resulting sentences are fluent.
Because some gesture types correspond to several words, each word of such a gesture type is combined with the words of the other gesture types to obtain multiple search terms, and at least one sentence corresponding to each search term is retrieved from the corpus.
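Because each gesture type may contribute several candidate words, every combination of one candidate per gesture type yields one candidate sentence (or one search term for the corpus). A small enumeration sketch, reusing the candidate lists from the previous example:

```python
from itertools import product
from typing import List

def candidate_sentences(word_lists: List[List[str]]) -> List[str]:
    """Enumerate one sentence per combination of candidate words, in gesture-time order."""
    return [" ".join(choice) for choice in product(*word_lists)]

candidates = candidate_sentences([["I", "me"], ["eat", "meal"], ["finished", "already"]])
print(len(candidates))   # 2 * 2 * 2 = 8 candidate sentences / search terms
print(candidates[0])     # "I eat finished"
```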
In step S26, the server calculates a score for each sentence, and sets the sentence with the highest score as the target sentence.
The server can score each sentence according to criteria such as whether the sentence is fluent, whether it contains a word for every gesture type, and whether the order of the words in the sentence matches the order in which the corresponding gesture types appeared. Depending on how the sentences were generated, the server may score them against different criteria, and it may combine any one or more of these criteria.
Taking as an example the case where the server combines the words in order to obtain multiple sentences, the server may score each sentence according to its fluency and take the highest-scoring sentence as the target sentence. Because a gesture type may correspond to several words with different meanings, the sentence is fluent when the chosen word is the one the user intends to express, and may not be fluent otherwise. Judging fluency therefore selects, from the candidate words of each gesture type, the word the user intends to express, which improves the accuracy of sign language translation.
The server can judge whether a sentence is fluent based on an N-gram model, which evaluates how well every N adjacent words collocate. The server determines the collocation degree of every N adjacent words in the sentence and derives the fluency of the sentence from these collocation degrees, where N may be 2, 3, 4, 5, or even the number of words in the sentence; the better the adjacent words collocate, the more fluent the sentence. Using an N-gram model allows fluency to be judged accurately, so the sentence that matches the user's intention is selected, further improving the accuracy of sign language translation.
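The fluency judgment can be illustrated with a toy bigram scorer: each candidate sentence is scored by how often its adjacent word pairs appear in some reference text, and the highest-scoring candidate is kept. The reference sentences and the averaging below are illustrative assumptions, not the disclosed N-gram model.

```python
from collections import Counter
from typing import List

def bigram_counts(reference_sentences: List[str]) -> Counter:
    """Count adjacent word pairs in a small reference corpus."""
    counts: Counter = Counter()
    for sentence in reference_sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def fluency_score(sentence: str, counts: Counter) -> float:
    """Score a candidate sentence by the average frequency of its adjacent word pairs."""
    words = sentence.split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(counts[p] for p in pairs) / len(pairs)

reference = ["I already finished eating", "I want to eat", "I have finished my meal"]
counts = bigram_counts(reference)
candidates = ["I eat finished", "I finished eating"]
print(max(candidates, key=lambda s: fluency_score(s, counts)))   # "I finished eating"
```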
Taking as an example the case where the server retrieves multiple sentences from the corpus based on the words, the server can score each sentence according to how closely the order of the words in the sentence matches the order in which the gesture types appeared: the more similar the two orders, the higher the score. Because the sentences in the corpus are real sentences, they have no word-order or logic problems, so the sentences screened from the corpus need no further check of word order or logic; being sentences from daily life, they also better imitate communication between hearing users, which improves the sign language translation effect. Only the match between the word order and the order of appearance of the gesture types needs to be verified, which simplifies the judgment.
In step S27, the server generates speech data corresponding to the target sentence based on the target sentence.
Wherein, the voice data is the audio data of the target sentence.
The specific implementation process of the step S27 may be: the server acquires a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and the pronunciation, and generates voice data corresponding to the target sentence based on the pronunciation sequence.
The specific process by which the server obtains the pronunciation sequence of the target sentence and generates the corresponding voice data may include the following: the server processes the target sentence by text regularization, converting non-Chinese-character content into Chinese characters to obtain a first target sentence; the server performs word segmentation and part-of-speech tagging on the first target sentence to obtain at least one segmented word and its part-of-speech result; the pronunciation of each segmented word is obtained based on the correspondence between part-of-speech results and pronunciations; prosody prediction is performed on each segmented word through a prosody model, based on its pronunciation, to obtain a pronunciation sequence with prosody labels; the server predicts the acoustic parameters of each pronunciation unit in the pronunciation sequence using an acoustic model; and the server converts the acoustic parameters of each pronunciation unit into the corresponding voice data. The acoustic model may be an LSTM (Long Short-Term Memory) network model.
Processing the pronunciations of the segmented words with the prosody model makes the generated speech more natural, better imitates normal communication between two users, enhances the user experience, and improves the sign language translation effect.
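The synthesis chain described above (text regularization, word segmentation with part-of-speech tagging, pronunciation lookup, prosody prediction, acoustic parameter prediction) can be summarized as the data flow sketched below. Every stage is a placeholder standing in for the corresponding model (e.g. the prosody model and the LSTM acoustic model); this is not a working TTS engine.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PronUnit:
    word: str
    pos: str            # part-of-speech tag of the segmented word
    pronunciation: str  # e.g. pinyin with tone
    prosody: str        # prosodic boundary label added by the prosody model

def normalize(sentence: str) -> str:
    """Text regularization: spell out digits, symbols, etc. (placeholder)."""
    return sentence

def segment_and_tag(sentence: str) -> List[tuple]:
    """Word segmentation + POS tagging (placeholder for a real segmenter)."""
    return [(w, "n") for w in sentence.split()]

def to_pronunciation(word: str, pos: str) -> str:
    """Look up the pronunciation of a segmented word given its POS (placeholder)."""
    return word.lower()

def add_prosody(units: List[PronUnit]) -> List[PronUnit]:
    """Prosody model: attach boundary/stress labels to the pronunciation sequence (placeholder)."""
    for u in units:
        u.prosody = "B1"
    return units

def synthesize(sentence: str) -> List[PronUnit]:
    text = normalize(sentence)
    units = [PronUnit(w, pos, to_pronunciation(w, pos), "") for w, pos in segment_and_tag(text)]
    units = add_prosody(units)
    # An acoustic model (e.g. an LSTM) would map each unit to acoustic parameters,
    # and a vocoder would turn those parameters into the final speech waveform.
    return units

print(synthesize("I finished eating"))
```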
In addition, when generating the voice data, the server may also refer to the state of the user and output voice data that matches that state. In one possible implementation manner, a plurality of expression types and tone information corresponding to the expression types are stored in the server. When the target video frame comprises a face image, the server performs face recognition on the face image to obtain the expression type corresponding to the face image, and generates first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type. For example, when the server detects that the expression type of the user is happy, first voice data with a faster tone is generated.
In another possible implementation manner, a plurality of age ranges and timbre data corresponding to the age ranges are stored in the server. When the target video frame comprises a face image, the server performs face recognition on the face image to obtain the age range to which the face image belongs, obtains the timbre data corresponding to the age range, and generates second voice data based on the timbre data, wherein the timbre of the second voice data conforms to the age range. For example, when the server detects that the age range of the user is 5-10 years old, second voice data with a relatively childlike timbre is generated.
In another possible implementation manner, the server stores gender types and timbre data corresponding to the gender types. When the target video frame comprises a face image, the server performs face recognition on the face image to obtain the gender type corresponding to the face image, obtains the timbre data corresponding to the gender type, and generates third voice data based on the timbre data, wherein the timbre of the third voice data conforms to the gender type. For example, when the server detects that the user is female, third voice data with a female timbre is generated.
In another possible implementation manner, a plurality of change speeds and emotion data corresponding to the change speeds are stored in the server. Based on the change speed of the gesture type, the server determines the emotion data corresponding to the change speed, and generates fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed. For example, when the user's gestures change quickly, indicating that the user's emotion is relatively excited, fourth voice data with a higher tone is generated.
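For illustration only, the following Python sketch combines the four personalization cues above into a single voice profile; all of the lookup tables, thresholds and field names are assumptions, since the disclosure only states that such correspondences are stored on the server without giving their contents.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical correspondence tables; the disclosure only says such
# mappings are stored on the server, not what they contain.
EXPRESSION_TO_RATE = {"happy": 1.2, "sad": 0.8, "neutral": 1.0}
AGE_TO_TIMBRE = {(0, 12): "child", (13, 59): "adult", (60, 120): "elder"}
GENDER_TO_TIMBRE = {"female": "female_voice", "male": "male_voice"}

@dataclass
class VoiceProfile:
    rate: float = 1.0        # tone / speaking-rate factor
    timbre: str = "default"
    emotion: str = "calm"

def build_profile(expression: Optional[str],
                  age: Optional[int],
                  gender: Optional[str],
                  gesture_speed: float) -> VoiceProfile:
    """Combine the four personalization cues into one voice profile."""
    profile = VoiceProfile()
    if expression in EXPRESSION_TO_RATE:            # first voice data
        profile.rate = EXPRESSION_TO_RATE[expression]
    if age is not None:                             # second voice data
        for (low, high), timbre in AGE_TO_TIMBRE.items():
            if low <= age <= high:
                profile.timbre = timbre
    if gender in GENDER_TO_TIMBRE:                  # third voice data
        profile.timbre = GENDER_TO_TIMBRE[gender]
    # Fourth voice data: a fast gesture-change speed is read as excitement
    # and raises the tone; the 2.0 threshold is an assumption.
    if gesture_speed > 2.0:
        profile.emotion = "excited"
        profile.rate = max(profile.rate, 1.3)
    return profile

# Example: a happy 8-year-old girl signing quickly.
print(build_profile("happy", age=8, gender="female", gesture_speed=2.5))
```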
Integrating the above steps, the voice data generation method provided by the embodiment of the present disclosure proceeds as shown in Fig. 4: a hearing-impaired person performs a piece of sign language in front of a camera; the camera shoots a video including the sign language; sign language recognition and analysis are performed on the video through a sign language recognition module to obtain a plurality of gesture types; words corresponding to the gesture types are obtained through a sign language translation module, and at least one word is synthesized into a target sentence; voice data of the target sentence is generated through a voice synthesis module; and the voice data is played to a hearing person, thereby realizing normal communication between the hearing-impaired person and the hearing person.
It should be noted that any one or more of the above four ways of generating voice data may be selected and combined, and the user may also select a preferred tone or timbre for generating the voice data.
According to the voice data generation method provided by the embodiment of the disclosure, the gesture types of the user are obtained by performing target detection and tracking on the video including the sign language, the sentence corresponding to the sign language is obtained through the correspondence between gesture types and words, and the voice data of the sentence is generated; the content that the sign language in the video is intended to express can then be learned by playing the voice data, realizing barrier-free communication between a hearing-impaired person and a hearing person. The video to be processed can be shot by a common camera, so the scheme does not depend on specific equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can be well popularized among people with hearing impairment.
In addition, effective gestures and invalid gestures are distinguished by detecting the duration of the gestures, so that intermediate gestures generated during gesture changes are prevented from being recognized by mistake, which improves the accuracy of sign language translation and the user experience.
In addition, after the server acquires the candidate sentences, their scores are calculated according to certain conditions, and the sentence with the highest score is used as the target sentence, so that the target sentence better meets the requirements of the user, improving the user experience and enhancing the sign language translation effect.
In addition, the server can also generate voice data which is consistent with the state of the user according to the state of the user, so that the voice data better simulates the communication between normal users, and the communication process is more vivid.
The embodiments shown in Figs. 2 to 4 above are described by taking as an example the case in which the voice data corresponding to a sentence is generated after the user finishes expressing the sentence. In another possible embodiment, after acquiring a gesture type, the server generates the voice data corresponding to that gesture type in real time. This is further described below with reference to Fig. 5. Fig. 5 is a flowchart illustrating a voice data generation method according to an exemplary embodiment; as shown in Fig. 5, the method is used in a server and includes the following steps:
in step S51, the server acquires at least one target video frame from the video to be processed, where the at least one target video frame is a video frame including a hand image.
In step S52, the server performs gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame.
In step S53, when the gesture types of a target number of consecutive target video frames are the same, the server takes that same gesture type as the gesture type corresponding to the consecutive target video frames.
Steps S51 to S53 are similar to steps S21 to S23, and are not repeated herein.
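For illustration only, the consecutive-frame judgment used in steps S23 and S53 may be sketched in Python as follows; the target number of 5 frames, the gesture-type labels and the generator-style interface are assumptions, and any run shorter than the target number is treated as a transitional, invalid gesture.

```python
from typing import Iterable, Iterator, Optional

def effective_gestures(frame_gestures: Iterable[str],
                       target_number: int = 5) -> Iterator[str]:
    """Yield a gesture type only after it has been recognized in at least
    `target_number` consecutive target video frames; shorter runs are
    treated as transitional (invalid) gestures and dropped."""
    pending: Optional[str] = None   # the gesture type to be determined
    run_length = 0
    for gesture in frame_gestures:
        if gesture == pending:
            run_length += 1
            continue
        # The pending gesture just ended: keep it only if its run was long enough.
        if pending is not None and run_length >= target_number:
            yield pending
        pending, run_length = gesture, 1
    if pending is not None and run_length >= target_number:
        yield pending

# Per-frame recognition results; the 2-frame transitional run is discarded.
frames = ["G_I"] * 6 + ["G_mid"] * 2 + ["G_want"] * 7 + ["G_eat"] * 5
print(list(effective_gestures(frames)))  # ['G_I', 'G_want', 'G_eat']
```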
In step S54, each time the server recognizes a gesture type, the server obtains the word corresponding to that gesture type based on the gesture type and the correspondence between gesture types and words, and takes the word as a target sentence.
One gesture type corresponds to one word, that is, words and gesture types are in one-to-one correspondence, and the word order of sign language is the same as the word order of the spoken language of a hearing person. Therefore, after the server determines a gesture type, the unique word corresponding to that gesture type can be determined as a target sentence, and the target sentence can accurately express the semantics of the sign language.
In step S55, the server generates speech data corresponding to the target sentence based on the target sentence.
The step S55 is similar to the step S27, and is not repeated here.
In step S56, when the recognized gesture type is the target gesture type, the server performs syntax detection on a word corresponding to a target video frame between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized last time.
When the user finishes expressing a sentence in sign language, the server can also arrange the words that were output in real time in chronological order to form a sentence, and perform grammar detection on that sentence to determine whether the sentence output in real time is accurate.
In step S57, when the syntax detection fails, the server regenerates a new target sentence based on the word corresponding to the target video frame between the first target video frame and the second target video frame, the new target sentence including at least one word.
That is, when there is a problem with the grammar, the sentence is regenerated and output again; this process is similar to steps S24 to S26 and will not be described again.
It should be noted that when the syntax detection passes, the server continues with the step of analyzing and processing the next video frame.
In step S58, the server generates speech data corresponding to the new target sentence based on the new target sentence.
The step S58 is similar to the step S27, and is not repeated here.
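For illustration only, the following Python sketch strings steps S54 to S58 together: each recognized word is spoken immediately, and when the sentence-end gesture arrives the accumulated words are grammar-checked and, if the check fails, a corrected sentence is regenerated and spoken. The callable parameters, the gesture labels and the toy grammar check are assumptions supplied by the caller, not part of the disclosure.

```python
from typing import Callable, Dict, List

def realtime_translate(gesture_stream: List[str],
                       gesture_to_word: Dict[str, str],
                       end_gesture: str,
                       grammar_ok: Callable[[str], bool],
                       regenerate: Callable[[List[str]], str],
                       speak: Callable[[str], None]) -> None:
    """Speak each recognized word as it arrives (steps S54-S55); at each
    sentence-end gesture, grammar-check the words accumulated since the
    previous end gesture and speak a regenerated sentence if the check
    fails (steps S56-S58)."""
    words_since_end: List[str] = []
    for gesture in gesture_stream:
        if gesture != end_gesture:
            word = gesture_to_word[gesture]
            words_since_end.append(word)
            speak(word)                         # real-time output per gesture
            continue
        sentence = "".join(words_since_end)
        if not grammar_ok(sentence):
            speak(regenerate(words_since_end))  # output the corrected sentence
        words_since_end = []                    # start the next sentence

# Toy usage with assumed gesture labels, mapping and checks.
realtime_translate(
    gesture_stream=["G1", "G2", "END"],
    gesture_to_word={"G1": "我", "G2": "吃饭"},
    end_gesture="END",
    grammar_ok=lambda s: "想" in s,        # fails for "我吃饭"
    regenerate=lambda words: "我想吃饭",
    speak=print,
)
```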
According to the voice data generation method provided by the embodiment of the disclosure, after an effective gesture type is determined, the voice data corresponding to the gesture type is output. Through real-time translation, the translation speed is increased, the communication experience between a hearing-impaired person and a hearing person is improved, and spoken communication between the hearing-impaired person and the hearing person can be better simulated. In addition, after the output of a sentence is finished, the server also performs grammar detection on the sentence, and when the grammar of the sentence has a problem, a sentence that conforms to the grammar is regenerated, thereby improving the accuracy of the translation.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
FIG. 6 is a block diagram illustrating a speech data generation apparatus according to an example embodiment. Referring to fig. 6, the apparatus includes an acquisition unit 601, a recognition unit 602, a sentence generation unit 603, and a voice data generation unit 604.
An acquisition unit 601 configured to perform acquisition of at least one target video frame from a video to be processed, the target video frame being a video frame including a hand image;
the recognition unit 602 is configured to perform gesture recognition on a hand image of the at least one target video frame, so as to obtain a gesture type corresponding to the at least one target video frame;
a sentence generating unit 603 configured to obtain a target sentence based on at least one gesture type and the correspondence between gesture types and words, where the target sentence includes the words corresponding to the at least one gesture type;
a voice data generating unit 604 configured to generate voice data corresponding to the target sentence according to the target sentence.
The voice data generation device provided by the embodiment of the disclosure obtains the gesture types of the user by performing target detection and tracking on the video including the sign language, obtains the sentence corresponding to the sign language through the correspondence between gesture types and words, and generates the voice data of the sentence; the content that the sign language in the video is intended to express can then be learned by playing the voice data, realizing barrier-free communication between hearing-impaired people and hearing people. The video to be processed can be shot by a common camera, so the scheme does not depend on specific equipment, can run directly on terminals such as mobile phones and computers, incurs no extra cost, and can be well popularized among people with hearing impairment.
In one possible implementation, as shown in fig. 7, the identifying unit 602 includes:
a gesture shape acquisition subunit 6021 configured to perform gesture recognition on the hand image of each target video frame, and acquire a gesture shape of each target video frame based on a hand contour in the hand image in the each target video frame;
a gesture type obtaining subunit 6022 configured to perform determination of a gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relationship between the gesture shape and the gesture type.
In one possible implementation, as shown in fig. 7, the apparatus further includes:
the determining unit 605 is configured to perform, when the gesture types of the consecutive target video frames having the target number are the same, taking the same gesture type as the gesture type corresponding to the consecutive target video frames.
In one possible implementation, as shown in fig. 7, the statement generation unit 603 includes:
a word obtaining subunit 6031, configured to, when the recognized gesture type is the target gesture type, obtain, based on the gesture type corresponding to the target video frame and the corresponding relationship between the gesture type and the word, a word corresponding to the target video frame between a first target video frame and a second target video frame, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type is recognized last time;
a combining subunit 6032 configured to perform combining the at least one word to obtain the target sentence.
In one possible implementation manner, as shown in fig. 7, the sentence generation unit 603 is further configured to, each time a gesture type is recognized, obtain the word corresponding to the gesture type based on the gesture type and the correspondence between gesture types and words, and take the word as the target sentence.
In one possible implementation, as shown in fig. 7, the apparatus further includes:
a syntax detecting unit 606 configured to perform syntax detection on a word corresponding to a target video frame between a first target video frame and a second target video frame when the recognized gesture type is the target gesture type, where the first target video frame is the target video frame in which the target gesture type is recognized this time, and the second target video frame is the target video frame in which the target gesture type was recognized last time;
the sentence generation unit 603 is configured to perform, when the syntax detection fails, regeneration of a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
In one possible implementation, as shown in fig. 7, the voice data generating unit 604 is configured to perform any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data is in accordance with the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data conforms to the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
In one possible implementation, as shown in fig. 7, the voice data generating unit 604 includes:
a pronunciation sequence acquisition subunit 6041 configured to acquire a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the correspondence between character elements and pronunciations;
a voice data acquisition subunit 6042 configured to perform generation of voice data corresponding to the target sentence based on the pronunciation sequence.
In one possible implementation, as shown in fig. 7, the obtaining unit 601 includes:
an input subunit 6011 configured to perform inputting the video to be processed into a convolutional neural network model, where the convolutional neural network model splits the video to be processed into a plurality of video frames;
an annotation subunit 6012 configured to perform, for any video frame, when it is detected that a hand image is included in the video frame, annotating the hand image, and taking the video frame as a target video frame;
a discarding subunit 6013 configured to perform discarding the video frame when it is detected that the hand image is not included in the video frame.
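For illustration only, the frame splitting, hand-image filtering and discarding performed by the obtaining unit may be sketched in Python as follows; the use of OpenCV for frame extraction and the `detect_hand` callable standing in for the convolutional neural network are assumptions.

```python
from typing import Callable, Iterator, Tuple

import cv2           # OpenCV, assumed available for frame extraction
import numpy as np

def target_video_frames(video_path: str,
                        detect_hand: Callable[[np.ndarray], Tuple[bool, tuple]]
                        ) -> Iterator[Tuple[np.ndarray, tuple]]:
    """Split the video to be processed into frames and keep only the frames
    that contain a hand image, yielding each such target video frame
    together with the annotated hand bounding box."""
    cap = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:                           # end of the video to be processed
                break
            has_hand, box = detect_hand(frame)   # CNN stand-in
            if has_hand:
                yield frame, box                 # target video frame with annotation
            # frames without a hand image are simply discarded
    finally:
        cap.release()
```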
It should be noted that: in the voice data generating device provided in the above embodiment, when generating voice data, only the division of the above functional units is exemplified, and in practical applications, the above functions may be distributed by different functional units according to needs, that is, the internal structure of the voice data generating device may be divided into different functional units to complete all or part of the above described functions. In addition, the voice data generating apparatus and the voice data generating method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 8 is a block diagram of a terminal according to an embodiment of the present disclosure. The terminal 800 is used for executing the steps executed by the terminal in the above embodiments, and may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the speech data generation methods provided by method embodiments herein.
In some embodiments, the terminal 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a touch screen display 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above the surface of the display 805. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 805 may be one, providing the front panel of the terminal 800; in other embodiments, the display 805 may be at least two, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 800. Even further, the display 805 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 805 can be made of LCD (liquid crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 for navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 809 is used to provide power to various components in terminal 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power source 809 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the touch screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the terminal 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side bezel of terminal 800 and/or underneath touch display 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, the holding signal of the user to the terminal 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the touch display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 814 is used for collecting a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 814 may be disposed on the front, back, or side of terminal 800. When a physical button or a vendor Logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor Logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch screen 805 based on the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the touch display 805 to switch from the bright screen state to the dark screen state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the touch display 805 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 9 is a block diagram illustrating a server 900 according to an example embodiment. The server 900 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The server 900 may be configured to perform the steps performed by the server in the above-described voice data generation method.
In an exemplary embodiment, there is also provided a computer-readable storage medium, in which instructions, when executed by a processor of a computer device, enable the computer device to perform a voice data generation method provided by an embodiment of the present disclosure.
In an exemplary embodiment, there is also provided a computer program product comprising executable instructions, which when executed by a processor of a computer device, enable the computer device to perform the speech data generation method provided by the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (17)

1. A method of generating speech data, the method comprising:
acquiring at least one target video frame from a video to be processed, wherein the target video frame is a video frame comprising a hand image; performing gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame; obtaining a target sentence based on at least one gesture type and the corresponding relation between the gesture type and the words, wherein the target sentence comprises the words corresponding to the at least one gesture type; generating voice data corresponding to the target sentence according to the target sentence;
the obtaining of the target sentence based on at least one gesture type and the corresponding relationship between the gesture type and the word includes: when the recognized gesture type is a target gesture type, acquiring words corresponding to a target video frame between a first target video frame and a second target video frame based on the at least one gesture type and the corresponding relation between the gesture type and the words, wherein the first target video frame is the target video frame in which the target gesture type is recognized at this time, the second target video frame is the target video frame in which the target gesture type is recognized at the previous time, and the target gesture type is used for representing the completion of the expression of a sentence;
combining the obtained at least one word to obtain the target sentence;
before obtaining the target sentence based on at least one gesture type and the corresponding relationship between the gesture type and the word, the method further includes:
after one gesture type is obtained, the gesture type is used as the gesture type to be determined, and the gesture type of the next target video frame is obtained; when the gesture type of the next target video frame is the same as the gesture type to be determined, adding 1 to the continuous times of the gesture type to be determined, and continuing to execute the step of obtaining the gesture type of the next target video frame; when the gesture type of the next target video frame is different from the gesture type to be determined, determining whether the continuous times of the gesture type to be determined are greater than a target number, if the continuous times of the gesture type to be determined are not less than the target number, determining that the gesture type to be determined is an effective gesture type, taking the same gesture type as the gesture type corresponding to the continuous target video frame, and taking the gesture type of the next target video frame as the gesture type to be determined; if the number of times of occurrence of the gesture type to be determined is smaller than the target number, determining the gesture type to be determined as an invalid gesture type, and taking the gesture type of the next target video as the gesture type to be determined.
2. The method according to claim 1, wherein the performing gesture recognition on the hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame comprises:
performing gesture recognition on a hand image of each target video frame, and acquiring a gesture shape of each target video frame based on a hand contour in the hand image in each target video frame;
and determining the gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relation between the gesture shape and the gesture type.
3. The method of claim 1, wherein obtaining the target sentence based on at least one gesture type and a corresponding relationship between the gesture type and the word comprises:
and when one gesture type is identified, acquiring words corresponding to the gesture type based on the gesture type and the corresponding relation between the gesture type and the words, and taking the words as the target sentence.
4. The method according to claim 3, wherein after generating the voice data corresponding to the target sentence according to the target sentence, the method further comprises:
when the recognized gesture type is the target gesture type, performing grammar detection on words corresponding to a target video frame between the first target video frame and the second target video frame;
when the grammar detection fails, regenerating a new target sentence based on a word corresponding to the target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
5. The method according to claim 1, wherein the generating the voice data corresponding to the target sentence according to the target sentence comprises any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
6. The method of claim 1, wherein the generating the voice data corresponding to the target sentence according to the target sentence comprises:
acquiring a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relation between the character elements and pronunciations;
and generating voice data corresponding to the target sentence based on the pronunciation sequence.
7. The method according to claim 1, wherein said obtaining at least one target video frame from the video to be processed comprises:
inputting the video to be processed into a convolutional neural network, and splitting the video to be processed into a plurality of video frames by the convolutional neural network;
for any video frame, when detecting that the video frame comprises a hand image, marking the hand image, and taking the video frame as a target video frame;
when detecting that no hand image is included in the video frame, discarding the video frame.
8. An apparatus for generating speech data, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire at least one target video frame from a video to be processed, and the target video frame is a video frame comprising a hand image;
the recognition unit is configured to perform gesture recognition on a hand image of the at least one target video frame to obtain a gesture type corresponding to the at least one target video frame;
the sentence generating unit is configured to execute corresponding relations between at least one gesture type and the gesture type and words to obtain a target sentence, and the target sentence comprises the words corresponding to the at least one gesture type;
a voice data generation unit configured to execute generation of voice data corresponding to the target sentence according to the target sentence;
the sentence generation unit includes: a word obtaining subunit, configured to, when the recognized gesture type is a target gesture type, obtain, based on the at least one gesture type and a corresponding relationship between the gesture type and a word, a word corresponding to a target video frame between a first target video frame and a second target video frame, where the first target video frame is a target video frame in which the target gesture type is recognized this time, the second target video frame is a target video frame in which the target gesture type is recognized last time, and the target gesture type is used to indicate that a sentence expression is completed; the combination subunit is configured to perform combination of the acquired at least one word to obtain the target sentence;
the apparatus is for: after one gesture type is obtained, the gesture type is used as the gesture type to be determined, and the gesture type of the next target video frame is obtained; when the gesture type of the next target video frame is the same as the gesture type to be determined, adding 1 to the continuous times of the gesture type to be determined, and continuing to execute the step of obtaining the gesture type of the next target video frame; when the gesture type of the next target video frame is different from the gesture type to be determined, determining whether the continuous times of the gesture type to be determined are greater than a target number, if the continuous times of the gesture type to be determined are not less than the target number, determining that the gesture type to be determined is an effective gesture type, taking the same gesture type as the gesture type corresponding to the continuous target video frame, and taking the gesture type of the next target video frame as the gesture type to be determined; if the number of times of occurrence of the gesture type to be determined is smaller than the target number, determining the gesture type to be determined as an invalid gesture type, and taking the gesture type of the next target video as the gesture type to be determined.
9. The apparatus of claim 8, wherein the identification unit comprises:
a gesture shape acquisition subunit configured to perform gesture recognition on a hand image of each target video frame, and acquire a gesture shape of each target video frame based on a hand contour in the hand image in the each target video frame;
a gesture type obtaining subunit configured to perform determining a gesture type corresponding to each target video frame based on the gesture shape of each target video frame and the corresponding relationship between the gesture shape and the gesture type.
10. The apparatus according to claim 8, wherein the sentence generation unit is further configured to, every time one gesture type is recognized, obtain a word corresponding to the gesture type based on the gesture type and a correspondence between the gesture type and the word, and take the word as the target sentence.
11. The apparatus of claim 10, further comprising:
a grammar detection unit configured to perform grammar detection on a word corresponding to a target video frame between the first target video frame and the second target video frame when the recognized gesture type is the target gesture type;
the sentence generating unit is configured to perform, when the syntax detection fails, regeneration of a new target sentence based on a word corresponding to a target video frame between the first target video frame and the second target video frame, the new target sentence including the at least one word.
12. The apparatus according to claim 8, wherein the voice data generating unit is configured to perform any one of the following steps:
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an expression type corresponding to the face image, and generating first voice data based on the expression type, wherein the tone of the first voice data conforms to the expression type;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain an age range to which the face image belongs, acquiring tone data corresponding to the age range based on the age range, and generating second voice data based on the tone data, wherein the tone of the second voice data conforms to the age range;
when the target video frame comprises a face image, carrying out face recognition on the face image to obtain a gender type corresponding to the face image, acquiring tone data corresponding to the gender type based on the gender type, and generating third voice data based on the tone data, wherein the tone of the third voice data accords with the gender type;
determining emotion data corresponding to the change speed based on the change speed of the gesture type, and generating fourth voice data based on the emotion data, wherein the tone of the fourth voice data conforms to the change speed.
13. The apparatus of claim 8, wherein the voice data generating unit comprises:
a pronunciation sequence acquisition subunit configured to execute acquisition of a pronunciation sequence corresponding to the target sentence based on the character elements in the target sentence and the corresponding relationship between the character elements and pronunciations;
and the voice data acquisition subunit is configured to generate the voice data corresponding to the target sentence based on the pronunciation sequence.
14. The apparatus of claim 8, wherein the obtaining unit comprises:
an input subunit configured to perform input of the video to be processed into a convolutional neural network, the convolutional neural network splitting the video to be processed into a plurality of video frames;
the annotation subunit is configured to perform annotation on a hand image when detecting that the hand image is included in any video frame, and take the video frame as a target video frame;
a discarding subunit configured to perform discarding the video frame when it is detected that no hand image is included in the video frame.
15. A terminal, comprising:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the speech data generation method of any of claims 1 to 7.
16. A server, comprising:
one or more processors;
one or more memories for storing the one or more processor-executable instructions;
wherein the one or more processors are configured to perform the speech data generation method of any of claims 1 to 7.
17. A computer-readable storage medium in which instructions, when executed by a processor of a computer device, enable the computer device to perform the speech data generation method of any one of claims 1 to 7.
CN201910611471.9A 2019-07-08 2019-07-08 Voice data generation method, device, terminal and storage medium Active CN110322760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910611471.9A CN110322760B (en) 2019-07-08 2019-07-08 Voice data generation method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN110322760A CN110322760A (en) 2019-10-11
CN110322760B true CN110322760B (en) 2020-11-03

Family

ID=68123138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910611471.9A Active CN110322760B (en) 2019-07-08 2019-07-08 Voice data generation method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110322760B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716648B (en) * 2019-10-22 2021-08-24 上海商汤智能科技有限公司 Gesture control method and device
CN110826441B (en) * 2019-10-25 2022-10-28 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN111144287B (en) * 2019-12-25 2023-06-09 Oppo广东移动通信有限公司 Audiovisual auxiliary communication method, device and readable storage medium
CN111354362A (en) * 2020-02-14 2020-06-30 北京百度网讯科技有限公司 Method and device for assisting hearing-impaired communication
CN113031464B (en) * 2021-03-22 2022-11-22 北京市商汤科技开发有限公司 Device control method, device, electronic device and storage medium
CN113656644B (en) * 2021-07-26 2024-03-15 北京达佳互联信息技术有限公司 Gesture language recognition method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101605399A (en) * 2008-06-13 2009-12-16 英华达(上海)电子有限公司 A kind of portable terminal and method that realizes Sign Language Recognition
CN101527092A (en) * 2009-04-08 2009-09-09 西安理工大学 Computer assisted hand language communication method under special session context
CN103116576A (en) * 2013-01-29 2013-05-22 安徽安泰新型包装材料有限公司 Voice and gesture interactive translation device and control method thereof
CN108846378A (en) * 2018-07-03 2018-11-20 百度在线网络技术(北京)有限公司 Sign Language Recognition processing method and processing device
CN109063624A (en) * 2018-07-26 2018-12-21 深圳市漫牛医疗有限公司 Information processing method, system, electronic equipment and computer readable storage medium
CN109446876B (en) * 2018-08-31 2020-11-06 百度在线网络技术(北京)有限公司 Sign language information processing method and device, electronic equipment and readable storage medium
CN109858357A (en) * 2018-12-27 2019-06-07 深圳市赛亿科技开发有限公司 A kind of gesture identification method and system
CN109934091A (en) * 2019-01-17 2019-06-25 深圳壹账通智能科技有限公司 Auxiliary manner of articulation, device, computer equipment and storage medium based on image recognition

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096467A (en) * 2010-12-28 2011-06-15 赵剑桥 Light-reflecting type mobile sign language recognition system and finger-bending measurement method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于视觉的连续手语识别系统的研究;陈小柏;《中国优秀硕士学位论文全文数据库》;20140531;第14-40页 *

Also Published As

Publication number Publication date
CN110322760A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN108615526B (en) Method, device, terminal and storage medium for detecting keywords in voice signal
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN110556127B (en) Method, device, equipment and medium for detecting voice recognition result
CN110572716B (en) Multimedia data playing method, device and storage medium
CN110992927B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN112735429B (en) Method for determining lyric timestamp information and training method of acoustic model
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN112116904B (en) Voice conversion method, device, equipment and storage medium
CN111370025A (en) Audio recognition method and device and computer storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN113220590A (en) Automatic testing method, device, equipment and medium for voice interaction application
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN109829067B (en) Audio data processing method and device, electronic equipment and storage medium
CN110837557B (en) Abstract generation method, device, equipment and medium
CN113409770A (en) Pronunciation feature processing method, pronunciation feature processing device, pronunciation feature processing server and pronunciation feature processing medium
CN110337030B (en) Video playing method, device, terminal and computer readable storage medium
CN112786025B (en) Method for determining lyric timestamp information and training method of acoustic model
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN111091807B (en) Speech synthesis method, device, computer equipment and storage medium
CN113744736A (en) Command word recognition method and device, electronic equipment and storage medium
CN113362836A (en) Vocoder training method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared