CN113835522A - Sign language video generation, translation and customer service method, device and readable medium - Google Patents


Info

Publication number
CN113835522A
Authority
CN
China
Prior art keywords
sign language
video data
language
action
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111060002.6A
Other languages
Chinese (zh)
Inventor
胡立
綦金玮
王琪
张邦
潘攀
徐盈辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202111060002.6A
Publication of CN113835522A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B21/00 Teaching, or communicating with, the blind, deaf or mute
    • G09B21/04 Devices for conversing with the deaf-blind
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/51 Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5166 Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing in combination with interactive voice response systems or voice portals, e.g. as front-ends

Abstract

Embodiments of the present application provide sign language video generation, translation and customer service methods, a device and a readable medium. The method includes: acquiring voice data collected by an audio input component; analyzing the voice data and determining sign language parameters according to the analysis result, where the sign language parameters include limb action parameters and facial action parameters; driving limb actions of an avatar according to the limb action parameters and facial actions of the avatar according to the facial action parameters, and generating corresponding sign language video data; and outputting the sign language video data containing the avatar. Because the rendered avatar performs both the limb actions and the facial actions of the sign language, a real, continuous and natural sign language expression process can be restored more faithfully and the expressive effect of the avatar is improved; sign language video data of the avatar is generated and output, so that sign language video data can be obtained conveniently.

Description

Sign language video generation, translation and customer service method, device and readable medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a sign language video generating method, a sign language translating method, a sign language customer service method, a terminal device, and a machine-readable medium.
Background
Hearing-impaired and deaf-mute people usually communicate through sign language, a hand-based language by which hearing-impaired or non-speaking people interact and communicate with one another.
However, very few people in daily life have mastered sign language, so it is difficult for hearing-impaired people, deaf-mute people and the like to communicate with others, which affects their daily lives.
Disclosure of Invention
The embodiment of the present application provides a sign language video generation method for conveniently generating video data containing an avatar.
Correspondingly, the embodiment of the present application also provides a sign language translation method, a sign language customer service method, a sign language communication method, a sign language teaching method, an electronic device and a machine-readable medium, so as to ensure the implementation and application of the above method.
In order to solve the above problem, an embodiment of the present application discloses a sign language video generation method, including: acquiring voice data collected by an audio input component; analyzing the voice data and determining sign language parameters according to the analysis result, where the sign language parameters include limb action parameters and facial action parameters; driving limb actions of an avatar according to the limb action parameters and facial actions of the avatar according to the facial action parameters, and generating corresponding sign language video data; and outputting the sign language video data containing the avatar.
Optionally, analyzing the voice data and determining the sign language parameters according to the analysis result includes: performing speech recognition on the voice data to determine corresponding text data; determining a sign language vocabulary sequence according to the text data, and obtaining limb action parameters corresponding to the sign language vocabulary sequence; and determining keywords and emotion information according to the text data, and determining facial action parameters matching the keywords and the emotion information.
Optionally, driving the limb actions of the avatar according to the limb action parameters and the facial actions of the avatar according to the facial action parameters to generate the corresponding sign language video data includes: determining a limb action sequence of the avatar according to the limb action parameters, and determining a transitional action between every two limb actions; linking the limb action sequence according to the transitional actions and driving the avatar to perform the corresponding limb actions; determining a lip action based on the lip action parameters and a facial expression based on the facial expression parameters; driving the avatar to perform the lip action and the corresponding facial expression; and fusing the limb actions, lip actions and facial expressions of the avatar according to time information to generate the corresponding sign language video data.
Optionally, the method further includes: and analyzing the speech rate information according to the speech data, and adjusting the action speed of the virtual image according to the speech rate information.
Optionally, acquiring the voice data collected by the audio input component includes: invoking the audio input component through a sign language translation page to collect the voice data; and outputting the sign language video data of the avatar includes: playing the sign language video data of the avatar on the sign language translation page.
Optionally, the method further includes: displaying an indication element in the sign language translation page, wherein the indication element is used for indicating input and output states; the indication element comprises at least one of: text indication elements, dynamic indication elements, color indication elements.
Optionally, the method further includes: determining service sign language video data containing the avatar corresponding to service information, where the content type of the service information includes at least one of: prompt information and common scene phrases; and playing the service sign language video data in the sign language translation page when a service condition is detected to be met.
Optionally, the method further includes: calling an image acquisition component on a sign language page, and acquiring video data of a user through the image acquisition component; the outputting sign language video data containing the avatar, comprising: and displaying the collected video data and the sign language video data of the virtual image through the sign language page.
Optionally, the method further includes: detecting the collected video data when sign language video data of the virtual image is played on a sign language page; and when the preset gesture is detected in the collected video data, pausing the playing of the sign language video data containing the virtual image.
Optionally, the method further includes: when the sign language video data of the virtual image is played to a target position, displaying a display element corresponding to the target position in a sign language page, wherein the target position is determined according to a keyword, and the display element comprises at least one of the following elements: background elements, image elements and emotion elements.
The embodiment of the application also discloses a sign language teaching method, including: providing a sign language teaching page; collecting first sign language video data through an image capture component and displaying the first sign language video data in a sign language input area of the sign language teaching page, where the first sign language video data is video data of a sign language user performing sign language according to target teaching information; uploading the first sign language video data; receiving second sign language video data corresponding to the target teaching information, where the second sign language video data is generated by an avatar performing sign language actions, the sign language actions of the avatar drive limb actions according to limb action parameters and facial actions according to facial action parameters, and the limb action parameters and facial action parameters are determined according to the target teaching information; and displaying the second sign language video data in a sign language output area of the sign language teaching page so that the sign language user can learn sign language.
Optionally, the method further includes: displaying an error prompt in the first sign language video data; and/or magnifying a target sign language action in the second sign language video data to prompt the sign language user.
The embodiment of the application also discloses a sign language translation method, including: providing a sign language translation page; collecting first sign language video data through an image capture component and displaying the first sign language video data in a sign language input area of the sign language translation page; obtaining sign language translation information corresponding to the first sign language video data and outputting the sign language translation information through the sign language translation page; collecting voice data through an audio input component; obtaining second sign language video data synthesized from the collected voice data, where the second sign language video data is generated by an avatar performing sign language actions, the sign language actions of the avatar drive limb actions according to limb action parameters and facial actions according to facial action parameters, and the limb action parameters and facial action parameters are determined by analyzing the voice data; and displaying the second sign language video data in a sign language output area of the sign language translation page.
Optionally, the method further includes: displaying an indication element in the sign language translation page, wherein the indication element is used for indicating the input and output states of the sign language; the indication element comprises at least one of: text indication elements, dynamic indication elements, color indication elements.
The embodiment of the application discloses a sign language customer service method, including: providing a sign language customer service page; collecting first sign language video data through an image capture component and displaying the first sign language video data in a sign language input area of the sign language customer service page; determining sign language translation information corresponding to the first sign language video data so as to output the sign language translation information in the customer service page; receiving second sign language video data synthesized according to service reply information of the customer service, where the second sign language video data is generated by an avatar performing sign language actions, the sign language actions of the avatar drive limb actions according to limb action parameters and facial actions according to facial action parameters, and the limb action parameters and facial action parameters are determined by analyzing the service reply information; and displaying the second sign language video data in a sign language output area of the sign language customer service page.
The embodiment of the application discloses a sign language communication method, including: providing a video call page; collecting first video data through an image capture component and displaying the first video data in a local-end display area of the video call page, where the first video data includes first sign language video data; displaying sign language translation information of the first sign language video data in the local-end display area of the video call page; receiving second sign language video data synthesized according to communication information of the opposite end, where the second sign language video data is generated by an avatar performing sign language actions, the sign language actions of the avatar drive limb actions according to limb action parameters and facial actions according to facial action parameters, and the limb action parameters and facial action parameters are determined by analyzing the communication information; and displaying the second sign language video data in an opposite-end display area of the video call page.
The embodiment of the application discloses an electronic device, including: a processor; and a memory having executable code stored thereon, which, when executed, causes the processor to perform a method as in any one of the embodiments of the present application.
The embodiment of the present application also discloses one or more machine-readable media having executable code stored thereon, which, when executed, causes a processor to perform a method as in any one of the embodiments of the present application.
Compared with the prior art, the embodiment of the application has the following advantages:
in the embodiment of the present application, voice data can be collected, limb action parameters and facial action parameters can be analyzed based on the voice data, and the limb actions of the avatar can then be driven according to the limb action parameters and the facial actions of the avatar according to the facial action parameters. Rendering the avatar to perform the limb actions and facial actions of the sign language can better restore a real, continuous and natural sign language expression process and improve the expressive effect of the avatar; sign language video data of the avatar is generated and output, so that sign language video data can be obtained conveniently.
Drawings
FIG. 1A is a schematic diagram of a sign language video generation scenario according to an embodiment of the present application;
FIG. 1B is a flow chart of steps of an embodiment of a sign language video generation method of the present application;
FIG. 2 is a diagram illustrating an example of an avatar action generation process according to an embodiment of the present application;
FIG. 3 is a flow chart of the steps of an embodiment of a bi-directional sign language translation method of the present application;
FIG. 4 is a diagram illustrating an example of a sign language translation page according to an embodiment of the present application;
FIG. 5A is a flow chart of steps of an embodiment of a sign language customer service method of the present application;
FIG. 5B is a diagram illustrating another sign language translation scenario according to an embodiment of the present application;
FIGS. 6A and 6B are schematic diagrams of examples of an indicating element according to embodiments of the present application;
FIG. 7 is an interaction diagram of an embodiment of a barrier-free communication method of the present application;
FIG. 8 is a flow chart of steps of an embodiment of a sign language teaching method of the present application;
fig. 9 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The embodiment of the application can be applied to scenes that the virtual image generates sign language videos, for example, the virtual image executes sign language, the virtual image is used as a host, a customer service and other scenes needing sign language service, and in the sign language videos of the virtual image, the virtual image can be driven to execute corresponding various body actions and face actions based on requirements, the body actions include actions executed by limbs, bodies and the like, and the face actions include various sign language related actions such as expressions and lip actions. The virtual image can simulate a real person to execute sign language actions, and can drive limb actions and face actions of the virtual image based on multi-modal characteristics such as semantic characteristics, emotion characteristics, speech speed characteristics and the like, so that the virtual image can execute diversified actions. In the scene of sign language translation, live broadcast, customer service and the like, the corresponding action can be executed by driving the virtual image based on interactive messages, reply messages and the like in real time, and video data is generated, so that interaction is realized.
The method can be applied to various scenarios that require sign language translation. For example, in scenarios where target users such as hearing-impaired and deaf-mute people communicate face to face while shopping, seeking medical treatment or obtaining legal services, the embodiment of the application can provide a sign language translation service: a sign language translation page is provided, voice data to be translated is collected, and sign language video data of an avatar performing sign language actions is translated and output. The embodiment of the application does not require a third-party user to act as an interpreter; natural language data such as voice and text is automatically recognized, its semantics are analyzed, sign language parameters including limb action parameters and facial action parameters are obtained, and the avatar is then driven to perform the sign language actions and generate corresponding sign language video data, so that target users such as hearing-impaired and deaf-mute people can see the corresponding sign language and understand what non-sign-language users say. A user can execute the translation method of the embodiment of the application on various electronic devices such as a mobile phone, a tablet or a computer. Natural language can be understood as a language that evolves naturally with culture, that is, a spoken language, such as Chinese, English, French or Japanese, or a dialect of a language, such as Cantonese, Minnan or Shanghainese.
The electronic device of the embodiment of the application can be provided with an image capture component, a display component, an audio input/output component and the like, such as a camera, a display, a microphone and a speaker, so that image, video and audio data can be collected and played. In the embodiment of the application, voice data can be collected through an audio input component such as a microphone, its semantics and speech rate are then analyzed, multimodal information such as emotion is combined to determine action parameters, and the avatar is driven to perform limb actions and facial expressions and generate corresponding video data. The avatar is a figure that simulates a human body through information technology based on parameters such as the form and function of the human body; for example, a character model is built with 3D technology in combination with parameters such as the human body's form, and an avatar obtained through such simulation technology may also be referred to as a digital person, a virtual character and the like. The avatar can be driven to perform actions based on various parameters of human form, limbs, posture and the like, so that actions are simulated, corresponding video data is generated, and interaction is achieved through the actions performed by the avatar.
In the embodiment of the application, a sign language action database of the avatar is preset. The action database includes sign language parameters and attribute information of the avatar. The sign language parameters can include limb action parameters and facial action parameters, where the facial action parameters include lip action parameters and facial expression parameters; the sign language parameters can be parameters corresponding to limb bones, limb muscles, facial bones, facial muscles, expressions and the like, thereby providing rich action information for the avatar. Driving parameters such as bones and muscles can be set based on information about the bones and muscles of the human body, so that the actions of the avatar are more consistent with those of a real user and the complexity and richness of the actions are improved. For example, the facial skeleton driving parameters and facial expression parameters are expression parameters determined by simulating facial bones and muscles, and the limb skeleton and limb muscle driving parameters can simulate the bone and muscle movement of the limbs, thereby producing sign language actions. The attribute information is attribute information of an action, such as an action label, so that the action corresponding to a vocabulary item can be searched by label, enabling quick matching of actions. The attribute information may also include other detail parameters, such as emotion tags; for example, a stronger emotion may correspond to an increased motion amplitude, and the motion parameters can be adjusted based on these detail parameters. For example, in a sign language translation scenario, the attribute information of a motion parameter may correspond to a sign language action tag, and one sign language action may correspond to a series of motion parameters, so that one sign language vocabulary item can drive a series of sign language motions. Executing an action is a process of motion change, so the attribute information of a motion parameter may also include detail parameters of the motion change, such as the motion range and time range of the skeletal drive. For example, if a greeting action is raising and waving a hand, it corresponds to driving parameters of bones such as the arm and hand, together with the corresponding motion range and time. Moreover, one action may be performed after another, so continuous execution of actions can also be achieved by adjusting the starting position according to the end of the previous action.
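As a rough illustration of how such an action database might be organised, the following Python sketch models an entry keyed by its action label, with an emotion tag scaling the motion amplitude. All class and field names are assumptions made for this example and are not taken from the application itself.

```python
# Illustrative sketch (not the actual format used in this application) of a
# sign language action database: entries are looked up by action label, and an
# emotion tag can scale the limb motion amplitude as described above.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SignAction:
    label: str                         # action label used for lookup by vocabulary item
    limb_skeleton: List[float]         # limb bone/muscle driving parameters
    lip_params: List[float]            # lip action parameters
    expression_params: List[float]     # facial expression parameters
    duration_ms: int                   # time range of the motion change
    emotion_scale: Dict[str, float] = field(default_factory=dict)  # amplitude factor per emotion tag


class SignActionDatabase:
    def __init__(self) -> None:
        self._entries: Dict[str, SignAction] = {}

    def add(self, entry: SignAction) -> None:
        self._entries[entry.label] = entry

    def lookup(self, word: str, emotion: str = "neutral") -> SignAction:
        """Return the action for a sign language vocabulary item, with its
        limb parameters scaled by the emotion tag."""
        entry = self._entries[word]
        scale = entry.emotion_scale.get(emotion, 1.0)
        return SignAction(
            label=entry.label,
            limb_skeleton=[p * scale for p in entry.limb_skeleton],
            lip_params=entry.lip_params,
            expression_params=entry.expression_params,
            duration_ms=entry.duration_ms,
            emotion_scale=entry.emotion_scale,
        )
```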
Referring to fig. 1A, a scene for driving an avatar to perform sign language action and generate video data is shown, which may perform the steps of fig. 1B as follows.
Step 102, acquiring voice data collected by the audio input assembly.
Voice data is collected on the terminal device through the audio input unit. Accordingly, the input mode can be determined based on the requirement, such as input through voice in a face-to-face communication scene of sign language users and non-sign language users. As another example, during a video interaction between a sign language user and a non-sign language user, voice data may be received or extracted from the video.
And 104, analyzing the voice data, and determining sign language parameters according to the analysis result, wherein the sign language parameters comprise limb action parameters and face action parameters.
As shown in fig. 2, corresponding text data can be obtained from the voice data by speech recognition. Natural Language Processing (NLP) is then performed on the text data to obtain a corresponding sign language vocabulary sequence, and sign language parameters, such as limb action parameters, are determined based on the sign language vocabulary sequence, for example by obtaining from the sign language action database the parameters corresponding to each vocabulary item. When sign language is expressed, certain keywords can be assisted by lip movements, so keywords can be extracted from the text data and emotion information can be analyzed; the keywords are related to the sign language vocabulary, and facial action parameters are then determined based on the keywords and the emotion information. The facial action parameters include lip action parameters and facial expression parameters, so that information such as the avatar's lip movements and facial expressions can be determined correspondingly. A keyword recognition algorithm analyzes whether each sign language vocabulary item is a keyword; the keyword recognition algorithm can train a recognition model on labeled text data using approaches such as neural networks and machine learning. Text data is thus input, the keywords corresponding to lip movements are output, and for each keyword the time point at which the lip movement needs to be generated is found through its timestamp. The emotion type in the text can be analyzed by an existing emotion recognition algorithm to obtain emotion information. Therefore, analyzing the voice data and determining the sign language parameters according to the analysis result includes: performing speech recognition on the voice data to determine corresponding text data; determining a sign language vocabulary sequence according to the text data and obtaining limb action parameters corresponding to the sign language vocabulary sequence; and determining keywords and emotion information according to the text data and determining facial action parameters matching the keywords and the emotion information.
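A minimal sketch of this analysis flow is shown below, assuming the speech-recognition, NLP, keyword and emotion models are supplied as callables on a `models` object and that the action database exposes a `lookup` method; these names are placeholders for illustration only, not part of the application.

```python
# Minimal sketch of the analysis step: speech -> text -> sign vocabulary
# sequence -> limb parameters, plus keywords and emotion -> facial parameters.
def analyze_voice(voice_data, models, action_db):
    text = models.recognize_speech(voice_data)        # speech recognition -> text data
    gloss_sequence = models.text_to_gloss(text)       # NLP: text -> sign language vocabulary sequence
    limb_params = [action_db.lookup(word) for word in gloss_sequence]

    keywords = models.extract_keywords(text)          # keywords to be assisted by lip movements
    emotion = models.classify_emotion(text)           # emotion recognition
    face_params = {
        "lip": [models.lip_params(word) for word in keywords],   # lip action parameters
        "expression": models.expression_params(emotion),         # facial expression parameters
    }
    return limb_params, face_params
```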
In the embodiment of the application, speech rate analysis can further be performed on the voice data to determine a speech rate parameter, and the limb movements of the avatar are adjusted based on the speech rate parameter. The speech rate can be analyzed from the timing between morphemes in the voice data, or a dedicated speech rate analysis model can be provided, for example a model trained with neural network or machine learning approaches on the expression speed of words and phrases in a large number of real-person sign language videos. After the speech rate parameter is obtained, the expression speed of the limb movements can be adjusted based on the speech rate parameter.
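The following sketch shows one possible speed adjustment, under the assumption that each limb action carries a nominal duration in milliseconds (as in the database sketch above) and that an assumed reference speech rate is used for comparison.

```python
# Sketch: compress or stretch action playback time according to the measured
# speech rate; reference_rate is an assumed baseline in words per second.
def adjust_for_speech_rate(limb_actions, words_per_second, reference_rate=2.5):
    factor = max(words_per_second / reference_rate, 1e-6)
    for action in limb_actions:
        action.duration_ms = int(action.duration_ms / factor)  # faster speech -> shorter actions
    return limb_actions
```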
And 106, driving the limb action of the virtual image according to the limb action parameters, driving the face action of the virtual image according to the face action parameters, and generating corresponding sign language video data.
Wherein, the limb action of the virtual image is driven according to the limb action parameters, comprising: determining a limb action sequence of the virtual image according to the limb action parameters, and determining transitional action between two limb actions; and connecting the limb action sequence according to the transitional action, and driving the virtual image to execute the corresponding limb action. As shown in fig. 2, for a limb action, a sequence of limb action parameters may be determined based on the vocabulary sequence, thereby obtaining a limb action sequence. And transitional actions between any two limb actions can be determined, so that the connection of the limb actions is smoother and more natural. And generating transition actions among the vocabularies by an action connection algorithm aiming at the body actions corresponding to the plurality of sign language vocabularies obtained by query. One way may be to build a neural network model, inputting the body movements a and b of the two sign language words to be connected, and outputting as an intermediate transitional movement, so that the movement from movement a to b continues to follow naturally. The model is obtained by training with continuous sign language motion data. And then connecting the limb action sequence according to the transition action, and driving the virtual image to execute the corresponding limb action.
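A sketch of this linking step is given below; `predict_transition` is a hypothetical stand-in for the transition-generation network mentioned above, producing an intermediate action between each pair of adjacent limb actions so the motion flows naturally from one to the next.

```python
# Sketch: interleave predicted transitional actions between adjacent limb
# actions before driving the avatar.
def link_limb_actions(actions, predict_transition):
    if not actions:
        return []
    linked = [actions[0]]
    for previous, current in zip(actions, actions[1:]):
        linked.append(predict_transition(previous, current))  # transitional action between a and b
        linked.append(current)
    return linked
```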
The driving the face motion of the avatar according to the face motion parameters includes: a lip action may be determined based on the lip action parameters and a facial expression based on the facial expression parameters, and then the avatar may be driven to perform the lip action and the corresponding facial expression. In the embodiment of the application, each limb action, lip language action, facial expression and the like can determine the timestamp based on the corresponding vocabulary, keywords, emotion information and the like, so that the virtual image is driven to execute the corresponding limb action, lip language action, facial expression and the like at the corresponding time point by combining the timestamp, and the actions are fused to form the overall action of the virtual image and generate video data. Therefore, the limb action sequence of the virtual image can be determined according to the limb action parameters, and the transition action between two limb actions is determined, wherein each limb action determines a time stamp according to the corresponding vocabulary; connecting the limb action sequence according to the transitional action, driving the virtual image to execute the corresponding limb action according to the time stamp, and determining the time stamp information of each limb action after obtaining the continuous limb action parameters. E.g., which time point vocabulary 1 starts, which time point vocabulary 1 ends, which time point vocabulary 2 starts, and which time point vocabulary 2 ends … …. Wherein the linking of actions can be implemented based on a corresponding action linking algorithm. The lip language action and the time stamp can be determined based on the lip language action parameters, the facial expression and the time stamp can be determined based on the facial expression parameters, then the virtual image is driven to execute the lip language action and the corresponding facial expression according to the time stamp, the virtual image can be determined to execute the lip language based on a lip language driving algorithm, and the virtual image is driven to generate the facial expression based on an expression generating algorithm. And the limb action, the lip language action and the facial expression for virtual use are fused according to the time stamp, so that the overall action of the virtual image is realized. In the embodiment of the application, corresponding algorithms can be set for realizing the limb actions, the lip language actions and the facial expressions, corresponding action driving models can be trained based on the neural network, the machine learning and other modes, and the actions can be executed based on the action driving models.
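The timestamp-based fusion can be pictured as three timed channels sampled per output frame, as in the sketch below; each channel is assumed to be a list of (start_ms, end_ms, params) segments derived from the vocabulary, keyword and emotion timestamps, and `render_frame` is a hypothetical stand-in for the avatar rendering step.

```python
# Sketch: per frame, pick the active segment of each channel (limb, lip,
# expression) and hand the combined state to a renderer.
def fuse_channels(limb, lip, expression, render_frame, total_ms, fps=30):
    def active(channel, t):
        return next((params for start, end, params in channel if start <= t < end), None)

    frames = []
    step_ms = 1000 // fps
    for t in range(0, total_ms, step_ms):
        frames.append(render_frame(
            limb=active(limb, t),
            lip=active(lip, t),
            expression=active(expression, t),
            time_ms=t,
        ))
    return frames
```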
And step 108, outputting sign language video data containing the virtual image.
Accordingly, sign language video data of the avatar may be determined according to the voice data, and then the sign language video data of the avatar may be output. A sign language page can be provided, and voice data can be acquired through the sign language page; and playing sign language video data of the virtual image on the sign language page. The sign language page is a sign language video display page, wherein the name of the sign language page can be adjusted based on an application scene, if the sign language page is the sign language translation page in a sign language translation scene, if the sign language page is the sign language customer service page in a customer service scene, and if the sign language page is the sign language teaching page in a sign language teaching scene, the name can be specifically adjusted according to requirements.
Therefore, voice data can be collected, the body action parameters and the face action parameters are analyzed based on the voice data, then the body action of the virtual image can be driven according to the body action parameters, the face action of the virtual image is driven according to the face action parameters, the virtual image is rendered to execute the body action and the face action of the sign language, a real, continuous and natural sign language expression process can be better restored, the expression effect of the virtual image is improved, sign language video data of the virtual image is generated and output, and sign language video data can be conveniently obtained.
On the basis of the embodiment, the image acquisition equipment such as a camera and the like can acquire sign language video data, and then sign language identification is carried out on the sign language video data through the sign language identification model, so that automatic translation aiming at the sign language is realized, and corresponding sign language translation information is obtained. And then, the terminal equipment is adopted to output sign language translation information, so that a non-sign language user can understand the meaning expressed by the sign language user conveniently. For example, a sign language user uses a mobile phone to perform sign language translation, an image acquisition device such as a camera of the mobile phone acquires sign language video data, can also acquire synthesized sign language video data, and can display the sign language video data on the mobile phone, so that the sign language user can conveniently check own sign language state, and the sign language user can conveniently acquire translated sign language video data to understand words of other users. According to the embodiment of the application, a third-party user is not needed to be used as translation, sign language of target users such as hearing-impaired people and deaf-dumb people is automatically recognized, and translated voice, text and other data are output; correspondingly, data such as languages, texts and the like can be received, translated into sign language, the virtual image is determined to execute the sign language, and the sign language video of the virtual image is played to a target user, so that interaction between the target user of the sign language and a non-sign language user can be realized.
In the embodiment of the application, the sign language page can also be a page in various forms, for example, the sign language page can display sign language video data of a generated virtual image and can also display collected video data of a user. Therefore, in a bidirectional translation scene, sign language video data of a user can be collected and then translated into natural language to be output, and sign language video data of a virtual image generated after the natural language is translated into sign language can also be collected and then output on a sign language page. Therefore, in the embodiment of the application, the image acquisition component can be called on the sign language page, and the video data of the user is acquired through the image acquisition component; the outputting sign language video data containing the avatar, comprising: and displaying the collected video data and the sign language video data of the virtual image through the sign language page. When the sign language page is started to be displayed, an image acquisition component of the terminal, such as a camera, can be called to acquire videos, and acquired video data are displayed on the sign language page. The captured video data may be video data containing a user, such as video data of a user performing sign language with sign language, and the like. The collected video data and the sign language video data of the virtual image can be correspondingly displayed on the sign language page, and other required information can be displayed, such as text data of natural language corresponding to the sign language video data, various prompt information, indication information and the like, and can be determined according to requirements.
In the embodiment of the application, when the sign language video data of the avatar is played, some gestures can be preset to pause the playing of the sign language video data containing the avatar, so that a user can understand and timely know the requirements of the user. The preset gesture may be a default pause gesture, or a user-defined pause gesture, or a gesture expressing the intentions of stopping, pausing and the like in sign language, and specifically may be set according to the requirement, for example, the preset gesture is a stop gesture in which one hand is horizontal and the other hand is vertically located below the horizontal hand, or the preset gesture is a gesture in which the palm changes from opening to fist making and the like. Detecting the collected video data when the sign language video data of the virtual image is played on the sign language page; and when the preset gesture is detected in the collected video data, pausing the playing of the sign language video data containing the virtual image. And then, sign language detection can be carried out on the collected video data, and sign language translation processing is executed after the sign language is detected. After the sign language of the user is judged and collected, the hand language action can be translated, and the corresponding text data and/or voice data of the natural language can be obtained and used as translation information to be output on a sign language page, and the sign language video data to be replied subsequently can be determined based on the sign language.
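The pause-on-gesture behaviour can be sketched as a simple monitoring loop, as below; `detect_gesture` and the `player` object are hypothetical stand-ins for the gesture recognition model and the video playback control.

```python
# Sketch: while the avatar's sign language video plays, check captured frames
# for the preset pause gesture and pause playback once it is seen.
def monitor_for_pause(captured_frames, player, detect_gesture, pause_gesture="stop"):
    for frame in captured_frames:
        if not player.is_playing():
            break
        if detect_gesture(frame) == pause_gesture:
            player.pause()
            break
```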
In the embodiment of the present application, the keywords may also be configured based on the keywords corresponding to displayable elements in the page, for example, the keywords related to the background, such as geographic keywords like sea, grassland, scenic spots, etc., the keywords like sunset, stars, and weather, and the keywords related to the background may correspond to corresponding background elements. But also keywords related to the character, such as an avatar, apparel for the avatar, etc., which correspond to the character elements. But also keywords related to emotion, such as happy, angry, and heart injury keywords, which correspond to emotional elements. In generating the sign language video data, a target position may be set based on the keyword, such as adding an anchor point, a logo, and the like at a time point corresponding to the avatar expression keyword to determine the target position, so as to display a display element corresponding to the target position in a sign language page when the sign language video data of the avatar is played to the target position, wherein the target position is determined according to the keyword, and the display element includes at least one of: background elements, image elements and emotion elements. For example, the avatar expresses happy and happy avatars in sign language, and the effect of smiling face, flower spreading and the like can be achieved in the sign language page. For another example, when speaking the current sea-related content, the background of the avatar in the sign language video data may be replaced by sea, or displayed on the screen using sea-related elements, so as to make the viewing user understand the current content better. In the embodiment of the application, the display elements can be preset in the sign language video data and can also be set locally in the terminal equipment, so that when the sign language video data played locally reaches the target position, the display elements are called to be displayed, and if the display elements are superposed to the sign language video data to be displayed, the display elements can be determined according to requirements.
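One way to picture the anchoring of display elements to keywords is sketched below: keyword timestamps found during analysis become anchor points in the sign language video, and an assumed keyword-to-element table selects the background, character or emotion element to show when playback reaches each anchor. The mapping and helper names are illustrative only.

```python
# Sketch: build anchor points from keyword timestamps and query which display
# elements are due at a given playback time.
KEYWORD_ELEMENTS = {                      # assumed mapping for illustration
    "sea": ("background", "sea_backdrop"),
    "happy": ("emotion", "smiley_and_flowers"),
}

def build_anchors(keyword_timestamps):
    anchors = []
    for keyword, time_ms in keyword_timestamps:
        if keyword in KEYWORD_ELEMENTS:
            element_type, element = KEYWORD_ELEMENTS[keyword]
            anchors.append({"time_ms": time_ms, "type": element_type, "element": element})
    return sorted(anchors, key=lambda a: a["time_ms"])

def due_elements(time_ms, anchors, already_shown):
    """Return elements whose anchor point has been reached but which have not
    been displayed yet."""
    return [a for a in anchors
            if a["time_ms"] <= time_ms and a["element"] not in already_shown]
```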
On the basis of the above embodiments, the embodiments of the present application also provide an example of bidirectional translation.
Referring to FIG. 3, a flowchart illustrating the steps of an embodiment of a bidirectional sign language translation method of the present application is shown.
Step 300, providing a sign language translation page, wherein the sign language translation page comprises: a sign language input area (or first area) and a sign language output area (or second area).
The service end can provide a sign language translation page, and the sign language translation page is used for executing sign language translation. Thus, in some embodiments, sign language video data may be displayed in the sign language translation page. For example, when the camera collects sign language video data, the collected sign language video data is displayed in a sign language translation page. In the embodiment of the application, prompt information can be displayed in the sign language translation page, for example, the prompt information aiming at the shooting position is used for reminding a sign language user, the shooting of sign language videos is carried out in a specified area, and the problem that the translation is inaccurate due to incomplete shooting is avoided. The prompt information aiming at the shooting position comprises at least one of the following text prompt information, line prompt information and the like.
To recognize the sign language of the sign language user more accurately, a sign language recognition area can be arranged on the sign language translation page; the sign language recognition area keeps the user's signing within the capture area of the image capture component, reducing the recognition failure rate. Correspondingly, prompt information for the sign language recognition area can be provided to indicate the input position. The prompt information of the sign language recognition area can take various forms, such as a text prompt that asks the sign language user to adjust posture or stay in the middle of the capture area. It can also be a line prompt, for example a human-shaped outline indicating where the user's body should be located, so that the sign language is fully captured; different forms can also be combined, for example a text prompt asking the user to keep the body within a dashed frame.
In step 310, first sign language video data is captured by the image capture component. The first sign language video data of the sign language user can be collected through an image capture component such as a local camera, for example the front camera of a mobile phone. The sign language video data includes at least facial images and sign language images, and both are used for sign language recognition. The sign language video data can be recognized sentence by sentence, using semantically complete sentences as the unit of translation.
Step 312, displaying the collected first sign language video data in the sign language input area.
And step 314, acquiring sign language translation information corresponding to the first hand language video data. The sign language video data can be subjected to sign language recognition according to a sign language recognition model, and corresponding sign language translation information is determined, wherein the sign language translation information is determined according to a sign language recognition result of an image frame set corresponding to a sentence break node, and the sentence break node is obtained by performing sentence break detection on the sign language video data. The sign language translation information includes sign language recognition text and/or sign language translation speech.
The embodiment of the application can detect and translate the collected first sign language video data in real time. Feature extraction and sentence-break detection can be performed synchronously: feature extraction extracts sign language features, such as structural features of the sign language, from each image frame of the sign language video data and places the extracted features into a buffer queue. The sentence-break detection module examines each frame of the sign language video data in turn and judges whether a sign language action is present; if the frames without sign language action satisfy the sentence-break condition, a sentence-break node is determined to exist. After a sentence-break node is detected, the sign language feature set in the buffer queue is input into a temporal model and the buffer queue is emptied. Feature extraction and sentence-break detection then continue until the acquisition of sign language video data ends, for example when no sign language action is detected for a continuous period. For a sign language feature set input from the buffer queue into the temporal model, the corresponding sign language vocabulary items and their temporal order can be detected, so that a sign language vocabulary sequence is output and then input into a conversion model, which converts the sign language vocabulary sequence into natural language text. The embodiment of the application can further include an error correction model, which checks whether the sign language recognition text is a correct natural language sentence; if not, the text is corrected into a natural language sentence. The sign language recognition text can then be input into a TTS model and converted into speech translation information, thereby obtaining the sign language translation information corresponding to the sign language video data. The feature extraction and recognition for the sign language video data can be completed entirely on the terminal device or on the server side; alternatively, feature extraction can be performed on the terminal device, sign language recognition on the server side, and the translation result finally fed back to the terminal device, as determined by the requirements.
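The streaming loop described above can be sketched as follows; `extract_features`, `has_sign_action` and `run_temporal_model` are hypothetical stand-ins for the feature-extraction, sentence-break-detection and temporal models, and the idle-frame threshold is an assumed example of a sentence-break condition.

```python
# Sketch: buffer per-frame sign features, detect sentence-break nodes from
# runs of frames without sign action, and send each buffered set to the
# temporal model.
def recognize_stream(frames, extract_features, has_sign_action,
                     run_temporal_model, break_after_idle_frames=15):
    buffer, idle, vocab_sequences = [], 0, []
    for frame in frames:
        if has_sign_action(frame):
            buffer.append(extract_features(frame))
            idle = 0
        else:
            idle += 1
            if idle >= break_after_idle_frames and buffer:    # sentence-break node
                vocab_sequences.append(run_temporal_model(buffer))
                buffer = []                                   # empty the buffer queue
    if buffer:                                                # end of acquisition
        vocab_sequences.append(run_temporal_model(buffer))
    return vocab_sequences
```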
In the embodiment of the application, sign language features can be extracted through various feature extraction models, and the feature extraction models can be trained models of various machine learning, neural networks and the like. In some other examples, the feature extraction model may also be a sign language visual structured model used to extract sign language structured features from sign language video data. The sign language visual structured model can be used for feature extraction and model training based on visual structured information. The structural information may be information describing or expressing a transaction or an object, for example, the visual structural information may be information describing visual structural features, such as the shape, contour, color, texture, and the like of the object, and the specific structural features may be determined according to the application scenario. In the embodiment Of the application, visual structured elements can be extracted based on sign language video data, and the visual structured elements refer to fine-grained structured visual cue information related to sign language, such as Region Of Interest (ROI), human body posture key point (pos), fuzzy classification information Of hand regions, and the like. Then, a multi-task convolutional neural network can be adopted to simultaneously perform tasks such as object detection, attitude estimation, fuzzy detection and the like.
In one example, structured elements such as nodes, connections, components, etc. of sign language can be structurally modeled and identified based on a spatial structured model. The spatial information required by the spatial structural model comprises spatial structural elements such as nodes, connections and components of the space, and the three spatial structural elements can be analyzed through the spatial structural model. The nodes (nodes) comprise motion nodes and position nodes, and the position nodes are used for describing image coordinates Node (x, y) of the nodes in the 2D space. The motion node is used for expressing the image coordinates of the node in a 2D space and the offset from a reference node, wherein the reference node refers to a reference node corresponding to the motion node, for example, a node of the corresponding motion node at a static position is a reference node, such as a reference node of an elbow, a reference node of a wrist and the like. The connection (Joint) describes the 2D space vector relationship between the moving nodes, such as the angle and distance between the moving nodes. The component (Part) comprises sign language related components, such as three components of a head (R0), a left hand (R1) and a right hand (R2). The parts contain rich information, for example, the head contains various facial organs and expression expressions, and the left hand and the right hand can express different gestures, orientations and other information. For the space structure model, the image can be quantized in a 2D space, the positions of nodes in the 2D space are defined, and the like. And learning the relation of each node in the space by combining the information such as the weight of each node in all the nodes, and the like, such as describing the spatial structural characteristics through the nodes, the connection among the nodes and the components. Therefore, the dominant characteristics in the sign language video data can be obtained based on the structural model, and the sign language can be described more accurately. The sign language visual structural model can learn the vector relation and the spatial feature expression among key points, connections and components in the 2D image space based on the spatial structural model. In the embodiment of the application, the time sequence order of the sign language can be determined by the time sequence model for the data set of the sign language features, for example, the time sequence spatial feature modeling is performed based on the time sequence structured model to obtain stable sign language time sequence features.
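A data-structure sketch of these spatial structured elements (nodes, connections/joints and parts) is given below; the field names are illustrative only and not taken from the application.

```python
# Sketch of the spatial structured elements: nodes with 2D coordinates and an
# offset from their reference node, joints describing the vector relation
# between motion nodes, and parts grouping nodes and joints.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    x: float                                      # image coordinate in 2D space
    y: float
    ref_offset: Tuple[float, float] = (0.0, 0.0)  # offset from the reference node (motion nodes)


@dataclass
class Joint:
    angle: float                                  # 2D spatial vector relation between motion nodes
    distance: float


@dataclass
class Part:
    name: str                                     # e.g. head (R0), left hand (R1), right hand (R2)
    nodes: List[Node] = field(default_factory=list)
    joints: List[Joint] = field(default_factory=list)
```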
After a data set of sign language features is obtained, sign language recognition can be carried out on the basis of the sign language feature set to obtain a corresponding sign language vocabulary sequence; analyzing the sign language vocabulary sequence according to natural language rules, and determining a sign language identification text corresponding to the natural language; and generating sign language translation information according to the sign language identification text. The method comprises the steps of carrying out sign language recognition on a hand language feature set to obtain corresponding sign language vocabularies, determining the time sequence of the sign language vocabularies based on the time sequence relation of the features to obtain a sign language vocabulary sequence, and then analyzing the sign language vocabulary sequence according to natural language rules, wherein the corresponding natural language rules can be determined based on different languages, so that the semantics of the sign language vocabularies are organized to obtain sign language recognition texts corresponding to the corresponding natural languages. In the embodiment of the application, the feature extraction and recognition processes for the sign language video data can be completed at the terminal equipment or the server side, the feature extraction can also be performed at the terminal equipment side, then the sign language recognition is performed at the server side, and finally the translation result is fed back to the terminal equipment side, which can be specifically determined according to the requirements.
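The translation stage can be sketched as a short pipeline, as below; `gloss_to_text`, `correct_sentence` and `text_to_speech` are hypothetical stand-ins for the conversion, error-correction and TTS models mentioned above.

```python
# Sketch: sign vocabulary sequence -> natural language sentence -> optional
# correction -> text and speech translation information.
def translate_gloss(gloss_sequence, gloss_to_text, correct_sentence, text_to_speech):
    sentence = gloss_to_text(gloss_sequence)   # conversion model: gloss sequence -> natural language
    sentence = correct_sentence(sentence)      # error-correction model
    audio = text_to_speech(sentence)           # TTS model
    return {"text": sentence, "speech": audio}
```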
And step 316, outputting sign language translation information through the sign language translation page. Therefore, the acquisition, the recognition and the translation of sign language data can be realized, the meaning of the sign language is output, and the sign language user can know the meaning of the sign language user. The sign language recognition text can be displayed in the sign language translation page on the terminal equipment side, and sign language translation voice can also be played through the terminal equipment and can be specifically determined according to requirements.
Step 320, voice data is collected via the audio input component. The non-sign-language user can provide input by voice; for example, in a medical scenario the user may say "did you bring your medical insurance card", and the device collects the voice data through an audio input component such as a microphone. Second sign language video data synthesized from the collected voice data can then be obtained, where the second sign language video data is video data in which the avatar performs sign language according to the semantics of the voice data; this specifically includes steps 322 and 324. In other examples, the input may be provided as text; this example takes voice input as an example, and if the input is text, step 324 may be performed.
Step 322, analyzing the voice data, and determining sign language parameters according to the analysis result.
Step 324, driving the limb action of the avatar according to the limb action parameters, and driving the face action of the avatar according to the face action parameters, to generate second sign language video data containing the avatar.
Voice recognition is performed on the voice data to determine the corresponding text data; a sign language vocabulary sequence is determined from the text data, and the limb action parameters corresponding to the sign language vocabulary sequence are obtained; keywords and emotion information are determined from the text data, and facial action parameters matching the keywords and emotion information are determined. A limb action sequence of the avatar is determined according to the limb action parameters, and the transition actions between adjacent limb actions are determined; the limb action sequence is connected through the transition actions, and the avatar is driven to execute the corresponding limb actions; a lip language action is determined based on the lip language action parameters, and a facial expression is determined based on the facial expression parameters; the avatar is driven to perform the lip language action and the corresponding facial expression; and the limb actions, lip language actions and facial expressions of the avatar are fused according to the time information to generate the corresponding sign language video data. Speech rate information can also be analyzed from the voice data, and the action speed of the avatar adjusted accordingly. The details are similar to the corresponding processes in the above embodiments and are not repeated here.
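The following is a minimal sketch of this driving flow under stated assumptions: the ASR function, the vocabulary lookup, the action bank and the renderer are placeholder callables invented for illustration, not the patent's actual models.

```python
# Sketch of the driving flow in steps 322-324; every component here is a
# dummy stand-in (assumption), wired together to show the data flow only.
def synthesize_sign_video(voice_data, asr, sign_lexicon, action_bank,
                          emotion_model, renderer, speech_rate=1.0):
    text = asr(voice_data)                                  # speech recognition -> text
    words = sign_lexicon(text)                              # text -> sign language vocabulary sequence
    limb_actions = [action_bank[w] for w in words]          # limb action parameters per word
    face_params = emotion_model(text)                       # keywords + emotion -> lip/expression parameters

    timeline = []
    for i, action in enumerate(limb_actions):
        if i > 0:                                           # transition action between two limb actions
            timeline.append(("transition", limb_actions[i - 1], action))
        timeline.append(("limb", action))

    # fuse limb, lip and expression tracks on a shared timeline, scaled by speech rate
    return renderer(timeline, face_params, speed=speech_rate)

# Toy usage with dummy components.
video = synthesize_sign_video(
    voice_data=b"...",
    asr=lambda v: "hello welcome",
    sign_lexicon=lambda t: t.split(),
    action_bank={"hello": "wave", "welcome": "open-palms"},
    emotion_model=lambda t: {"lip": t, "expression": "smile"},
    renderer=lambda tl, fp, speed: {"frames": len(tl), "face": fp, "speed": speed},
)
print(video)
```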
Step 326, displaying the second sign language video data in the sign language output area of the sign language translation page. The sign language user can thus watch the second sign language video data of the avatar displayed in the sign language output area and understand what other users, such as non-sign-language users, have said, so that communication is achieved. In addition, the text data corresponding to the input can also be displayed in the sign language translation page, for example in the sign language output area.
In the embodiment of the present application, the sign language user may also be referred to as a first user, and the non-sign language user may also be referred to as a second user.
Users who use sign language generally fall into several categories, such as hearing-impaired users, speech-impaired users who cannot speak, or users with both impairments, who may be called deaf-mute users. In the embodiments of the present application, a user who performs sign language is referred to as a sign language user. The sign language translation page can be configured for the specific situation: for a speech-impaired user, translation of sign language alone may be provided; for hearing-impaired users, deaf-mute users and the like, translation of natural language into sign language can also be provided in the sign language translation page, that is, the natural language is translated into sign language, the avatar is driven to perform the sign language and the video data is synthesized, which facilitates bidirectional communication between sign language users and other users. This can be configured according to the users' needs, and the embodiment of the application is not limited in this respect. For a bidirectional translation scenario, the sign language translation page includes a sign language input area and a sign language output area, as in the example of the sign language translation page shown in fig. 4. The sign language input area is used to display the collected sign language video data, in which the user performing the sign language is a real user; the sign language output area is used to display the synthesized sign language video data. The collected sign language video data is played in the sign language input area of the sign language translation page, and the synthesized sign language video data is played in the sign language output area, where the synthesized sign language video data is video data in which an avatar performs sign language, the sign language performed by the avatar being determined according to the input information. A non-sign-language user can input information by voice or text; semantic analysis is performed on the input information, the information is translated into sign language based on its semantics, the avatar is driven to perform the sign language, the hand actions and/or facial expressions of the avatar are driven, and the corresponding sign language video data is synthesized and then displayed in the sign language output area, so that the sign language user can watch the avatar performing the sign language and understand the meaning expressed by the other user. Thus, through the above examples of translating sign language into natural language and natural language into sign language, the sign language translation page of the embodiment of the present application can provide automatic sign language translation: for a sign language video, the translated natural language can be output as voice, text and the like, while for sign language translated from natural language, the avatar can be driven to perform it and display it as a corresponding sign language video for the sign language user to view. The synthesized sign language video data in the embodiment of the present application is sign language video data synthesized with an avatar (also referred to as a digital person).
The avatar is a virtual character obtained by simulating a human body through information technology, based on parameters such as human form and function; for example, a character is modeled with 3D technology in combination with parameters of the human form. The avatar obtained through such simulation technology may also be called a digital person. The avatar can be driven to perform actions based on various parameters of human form, limbs, posture and the like, so that sign language actions are simulated, sign language is performed by the avatar, and the corresponding video data is generated for sign language interaction.
In the embodiment of the application, barrier-free interaction based on sign language can be applied to various scenes. For example, it can be applied to face-to-face communication with sign language users, such as registration, payment, collecting medicine and consultation in a medical scene; to face-to-face shopping communication of sign language users in shopping scenes such as shopping malls, supermarkets and markets; and to scenarios in which legal services are provided to sign language users. The barrier-free communication can also be applied to communication between sign language users and remote users, making remote communication convenient for sign language users. For example, in a shopping scenario a merchant may provide a sign language translation service through a device, and when a sign language user enters the shopping environment, such as a store, the translation control in the lead page can be triggered to enter the sign language translation page. In another example, in a medical registration scenario, a hospital may provide the device at the registration window, and a sign language user may trigger a translation instruction by himself to enter the sign language translation page.
In some scenarios the sign languages used by sign language users may also differ; for example, there are differences between the sign languages of different countries, and certain differences between natural sign language and standard sign language. The barrier-free interaction of the embodiment of the present application can therefore also provide a sign language translation service between sign language users who use different sign languages, facilitating communication between them. For translation between different sign language users, sign language video data can be collected separately by the front and rear cameras of one device, or collected by different devices and then transmitted to and processed by a server to realize the interaction.
On the basis of the above embodiments, the embodiments of the present application further provide a customer service scenario in which a sign language user interacts with a non-sign language user, as shown in fig. 5A and 5B.
Step 502, providing a sign language customer service page.
The service page may provide a sign language translation entry to the user so that the sign language service page may be entered based on the sign language translation entry.
And step 504, acquiring first sign language video data through an image acquisition component, and displaying the first sign language video data in a sign language input area of the sign language customer service page.
In this embodiment, sign language video data is collected on the sign language user (first device) side and text data is collected on the non-sign-language user (second device) side; synthesized sign language video data is generated based on the text data and sent to the first device, so that the sign language user can watch it. Correspondingly, the sign language recognition text obtained by translating the collected sign language video data of the sign language user is fed back to the customer service agent's second device. The device provides a sign language translation page comprising a sign language input area and a sign language output area. Taking the first device as the sign language user's device and the second device as the non-sign-language user's device as an example, the translation page is a customer service page, such as the customer service page of a shopping application or the service page of a medical consultation page. The first device collects the first sign language video data through the image acquisition component, displays it in the sign language input area, and uploads it to the server.
Step 506, determining sign language translation information corresponding to the first sign language video data so as to output the sign language translation information in a customer service page.
The embodiment of the application can detect and translate the collected first sign language video data in real time. The sign language translation information can be determined according to the sign language recognition results of the image frame sets corresponding to sentence-break nodes, the sentence-break nodes being obtained by performing sentence-break detection on the sign language video data. Feature extraction and sentence-break detection can be performed synchronously: feature extraction extracts sign language features, such as the structural features of the sign language, from each image frame of the sign language video data and places the extracted features into a buffer queue, while the sentence-break detection module examines each frame of the sign language video data, judges in turn whether a sign language action is present, and determines that a sentence-break node exists when the image frames without sign language action satisfy the sentence-break condition. After a sentence-break node is detected, the sign language feature set in the buffer queue is input into the time sequence model and the buffer queue is emptied. Feature extraction and sentence-break detection then continue until the acquisition of sign language video data ends, which may mean that no sign language action has been detected for a continuous period. For a sign language feature set input from the buffer queue into the time sequence model, the corresponding sign language vocabularies can be detected and their temporal order determined, so that a sign language vocabulary sequence is output and input into a conversion model, which converts the sign language vocabulary sequence into natural language text. The embodiment of the application can further include an error correction model, which checks the sign language recognition text and judges whether it is a correct natural language sentence; if not, error correction is performed and the sign language recognition text is adjusted into a natural language sentence. The sign language recognition text can then be input into a TTS model and converted into voice translation information, thereby obtaining the sign language translation information corresponding to the sign language video data. The feature extraction and recognition processes for the sign language video data can be completed on a terminal device or on the server side; alternatively, feature extraction can be performed on the terminal device side, sign language recognition on the server side, and the translation result finally fed back to the terminal device, as determined by requirements.
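The buffering and sentence-break logic described above can be sketched as follows; the per-frame action detector, the feature extractor and the break threshold are placeholders assumed for illustration.

```python
# Minimal sketch of the sentence-break detection loop: features accumulate in a
# buffer queue, and a run of action-free frames flushes one sentence's features.
from collections import deque

def segment_sign_video(frames, extract_features, has_sign_action, break_after=15):
    buffer = deque()
    idle = 0
    for frame in frames:
        if has_sign_action(frame):
            buffer.append(extract_features(frame))
            idle = 0
        else:
            idle += 1
            if idle >= break_after and buffer:   # sentence-break condition met
                yield list(buffer)               # feature set for one sentence
                buffer.clear()
    if buffer:                                   # end of capture
        yield list(buffer)

# Toy usage: 1 marks a frame with sign action, 0 an idle frame.
frames = [1] * 5 + [0] * 20 + [1] * 3
for sentence in segment_sign_video(frames, extract_features=lambda f: f,
                                   has_sign_action=lambda f: f == 1):
    print(len(sentence))
```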
Step 508, receiving second sign language video data synthesized according to the service reply information of the customer service, wherein the second sign language video data is generated by driving the avatar to execute actions according to action parameters, and the action parameters are determined by analyzing the service reply information.
The service reply information can be voice data or text data, and voice recognition can be performed on voice data to determine the corresponding text data. For the text data, a sign language vocabulary sequence is determined and the limb action parameters corresponding to the sign language vocabulary sequence are obtained; keywords and emotion information are determined from the text data, and facial action parameters matching the keywords and emotion information are determined. A limb action sequence of the avatar is determined according to the limb action parameters, and the transition actions between adjacent limb actions are determined; the limb action sequence is connected through the transition actions and the avatar is driven to execute the corresponding limb actions; a lip language action is determined based on the lip language action parameters and a facial expression based on the facial expression parameters; the avatar is driven to perform the lip language action and the corresponding facial expression; and the limb actions, lip language actions and facial expressions of the avatar are fused according to the time information to generate the corresponding sign language video data.
Step 510, displaying the second sign language video data in a sign language output area of the sign language customer service page.
The second device receives the service reply information, such as the text data of the service reply, and uploads the text data to the server. The server performs semantic recognition on the text data and synthesizes the second sign language video data: sign language parameters are determined according to the text data, and second sign language video data containing the avatar is generated according to the sign language parameters. The server then sends the second sign language video data to the first device, so that the sign language user can watch the corresponding sign language service and be provided with the required service.
In an embodiment of the application, the sign language translation page may provide a language selection control for selecting a target language. The target language may include various sign languages and various natural languages. Since the sign languages of different countries differ to some extent, a sign language selection control can be provided for selecting among different kinds of sign language, such as Chinese sign language and English sign language; the different kinds of sign language can be understood as the sign languages of different countries, and may also include standard sign language and natural sign language, natural sign language being sign language that has formed naturally. The language selection control may also include a natural language selection control for selecting the natural language of the translation, such as Chinese, English, French and dialects, which facilitates use by various users. Language options are displayed in response to triggering of the language selection control in the sign language translation page, and the selected target language is determined in response to triggering of a language option.
In the embodiment of the application, the input and output modes can be adjusted as required; for example, an input adjustment control and an output adjustment control are provided on the page, and different input and output modes can be switched through the corresponding control. The switching of input and output modes can also be triggered through gestures: the input mode can be adjusted according to a first gesture operation, the input mode including a voice input mode, a text input mode and/or a video input mode, and the output mode can be adjusted according to a second gesture operation, the output mode including a voice output mode, a text output mode and/or a video output mode. The gesture of this embodiment may be a default gesture or a custom gesture, and a sign language gesture indicating switching may also be used as the first or second gesture operation, so that after the gesture operation is detected the input or output mode can be adjusted accordingly, for example switching from sign language input to voice input, or from voice output to text output, as determined by requirements. In response to an output adjustment instruction, the output mode of the sign language translation information is adjusted, the output mode including a voice output mode, a text output mode and/or a video output mode; the output adjustment instruction can be generated based on the second gesture operation, or based on triggering of the output mode adjustment control provided on the page.
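A possible shape of this mode switching is sketched below; the gesture names and the mode sets are assumptions made for illustration only.

```python
# Sketch of input/output mode switching driven by two gesture operations.
INPUT_MODES = ("voice", "text", "video")
OUTPUT_MODES = ("voice", "text", "video")

class TranslationPageModes:
    def __init__(self):
        self.input_mode = "video"      # sign language video input by default
        self.output_mode = "voice"

    def on_gesture(self, gesture: str):
        # first gesture operation adjusts input, second adjusts output
        if gesture == "switch_input":
            i = INPUT_MODES.index(self.input_mode)
            self.input_mode = INPUT_MODES[(i + 1) % len(INPUT_MODES)]
        elif gesture == "switch_output":
            o = OUTPUT_MODES.index(self.output_mode)
            self.output_mode = OUTPUT_MODES[(o + 1) % len(OUTPUT_MODES)]

page = TranslationPageModes()
page.on_gesture("switch_input")
print(page.input_mode, page.output_mode)
```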
The embodiment of the application can be applied to various service scenes, so that the sign language translation page can also provide various service information, and the information types of the service information comprise: at least one of service text data, service voice data, and service sign language video data; the content type of the service information includes at least one of: prompt information and scene commonly used phrases. That is, the service information may be output in the form of sign language, voice, text, etc., and the content corresponding to the service information may be various kinds of prompt information, commonly used words of scenes, etc.
The service information includes prompt information, which may be prompt information for various events, such as waiting prompts, failure prompts and operation prompts. For example, a waiting prompt may ask the sign language user, in the form of sign language video, text and the like, to wait for translation or for input data, or ask other users to wait in the form of voice, text and the like. Failure prompt information can notify the corresponding user, through voice, text, sign language video and other forms, of the current fault, such as a network problem, inability to translate or translation failure. Operation prompt information can prompt the corresponding user, through voice, text, sign language video and other forms, to perform operations such as starting translation, ending translation or switching languages. Input prompts can also be given: for example, if the sign language user moves out of the sign language recognition area a prompt can be shown, and other users can be prompted if their speech is too quiet.
The scene commonly used phrases can be related to the translated scene, for example, in a shopping scene, the scene commonly used phrases can be commonly used phrases related to shopping, such as welcome phrases, price replies, commodity introductions, shopping inquiries and the like; also as in the medical scene, commonly used terms for symptoms, insurance, etc.; and as in legal service scenarios, for queries about basic information of users, etc. In short, the common expressions of the scene can be predetermined based on the actually applied scene, and corresponding data such as text, voice, sign language video and the like can be obtained.
The service information is information used in the scene's service, such as frequently used information and necessary prompt information. It can therefore be stored locally on the device in advance, and each piece of service information can be associated with service conditions, such as prompting conditions and scene conditions, determined in combination with the specific usage scene; when a service condition is detected to be met, the corresponding service information is output.
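A small sketch of serving locally stored service information when its condition is met follows; the conditions, content types and file names are illustrative assumptions.

```python
# Sketch: locally stored service information keyed by service conditions.
SERVICE_INFO = [
    {"condition": lambda ctx: ctx.get("waiting"),
     "type": "sign_video", "content": "please_wait.mp4"},
    {"condition": lambda ctx: ctx.get("network_error"),
     "type": "text", "content": "Translation unavailable: network problem."},
    {"condition": lambda ctx: ctx.get("scene") == "shopping" and ctx.get("just_entered"),
     "type": "sign_video", "content": "welcome.mp4"},
]

def check_service_info(ctx: dict):
    """Return the contents whose service condition is met in the given context."""
    return [item["content"] for item in SERVICE_INFO if item["condition"](ctx)]

print(check_service_info({"scene": "shopping", "just_entered": True}))
```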
On the basis of the above embodiments, the embodiments of the present application may further determine scene information and derive scene parameters from it, so as to assist sign language translation through the scene parameters; the required service information, such as scene common phrases, can also be determined based on the scene information and scene parameters. For example, scene parameters such as the name, tags and attributes of the scene may be determined from the scene information, and translation may be assisted based on these parameters, for example by invoking the corresponding sign language database. The scene information can be determined in at least one of the following ways. The background of the collected sign language video data can be analyzed to determine the corresponding scene information: through visual processing the background, such as outdoor or indoor, shopping mall or tourist attraction, can be analyzed and the scene information determined from it. Ambient sound data can be acquired through the audio input component and the corresponding scene information determined from it: the ambient sound in collected voice or video data can be analyzed to determine the environment the user is currently in. The collected voice data can be analyzed to determine scene information, where the analysis may include content analysis, ambient sound analysis and the like. Position information can be acquired and the scene information determined from it, for example by obtaining position information from the terminal device and determining whether the current position is a school, hospital, shopping mall and so on. The page visited before the translation page can be used: since the translation page can be entered from other pages, the page visited before entering the translation page can be taken as the target page and the scene information analyzed from it, for example a payment page, a shopping page or the customer service page of a shopping application. The applications that have been run can be used: applications run on the device can be detected and the scene information determined from their type, function and so on, such as shopping applications, social applications or instant messaging applications, where the run applications include the application in which the sign language translation page is located as well as other applications running in the background or foreground, as determined by requirements. Time information can also be acquired and the scene information, such as day, night, working day or holiday, determined from it according to requirements.
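One way to combine these signals into scene parameters is sketched below; the signal names, the simple voting scheme and the output fields are assumptions for illustration, not the patent's method.

```python
# Sketch: merge scene signals from several sources into scene parameters.
def infer_scene(signals: dict) -> dict:
    votes = {}
    for source in ("video_background", "ambient_sound", "speech_content",
                   "location", "previous_page", "running_app", "time_of_day"):
        label = signals.get(source)
        if label:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return {"name": "generic", "tags": []}
    name = max(votes, key=votes.get)           # most supported scene label
    return {"name": name, "tags": sorted(votes)}   # scene parameters: name + tags

print(infer_scene({"location": "hospital", "previous_page": "registration",
                   "ambient_sound": "hospital"}))
```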
In the embodiment of the application, the scene parameters can be obtained by integrating the scene information determined by the dimensions, so that the processes of sign language translation, sign language synthesis and the like can be assisted based on the scene parameters.
In the embodiment of the application, the sign language translation page further comprises an exit control, and an exit instruction is received according to the triggering of the exit control in the sign language translation page; and closing the sign language translation page according to the quit instruction. If shopping is finished, registration for medical treatment is carried out, and the like, an ending control can be triggered, the sign language translation page is closed, and the guidance page is returned. Thereby providing sign language services to users in various scenarios and assisting in interaction with sign language users.
In the embodiment of the application, each area is further provided with an indication element for indicating the input and output state of the current area. This can be implemented in various forms. For example, the indication element may be an interface icon whose colors indicate the input and output states, such as red for the input state, green for the output state and gray for the idle state with neither input nor output. The indication element may also be a dynamic element that indicates different input and output states through dynamic effects; an example of such a dynamic element is an indicator light, which can indicate different input and output states through different apertures, for example by dynamically enlarging or shrinking the aperture while input or output is in progress, and can also prompt in combination with different colors, text and the like. Indication elements can be arranged in the sign language input area and the sign language output area respectively, indicating the input and output state of that area or of the other area; alternatively, an indication element can be displayed in the translation page that indicates the user currently providing input or output through different colors, dynamic effects, text and so on. Accordingly, an indication element for indicating the input and output state may be displayed in the translation page, the indication element comprising at least one of a text indication element, a dynamic indication element and a color indication element. As in the example of fig. 6A, the sub-figures show the dynamic effect of an indication element in a breathing-light style: when there is input or output, the indication element shows the dynamic effect by gradually enlarging and shrinking the aperture to indicate that input or output is in progress; when the local user is inputting it displays "A" and the color changes from dark to light, and when the other party is inputting it displays "B" and the color changes from light to dark. In another example, shown in fig. 6B, a breathing-light indication element is provided that is gray in the idle state and lights up, displayed in the breathing-light style, when there is input or output. In a bidirectional translation scene, the user who is inputting or outputting can be represented by text displayed on the indication element, for example "A" for user A, "B" for user B and "C" for the avatar, so that the user performing input or output is indicated intuitively. For example, when user A is detected to be inputting or outputting, "A" may be displayed by the indication element while a dynamic change or a color change indicates that user A is inputting or outputting; when the other party is detected to be inputting or outputting, "B" or "C" may be displayed, with a dynamic change or color change indicating that user B is inputting or that avatar C is outputting.
As another example, when the avatar outputs sign language, the indication element on the second interface may display information such as a short name, a nickname, a code number, and the like of the avatar, such as "nine", and indicate that the avatar is outputting sign language through dynamic change or color change.
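The indicator behavior described above can be summarized in a small state sketch; the state names, colors and labels here are assumptions for illustration only.

```python
# Sketch of the indication element's states and labels; values are illustrative.
from dataclasses import dataclass

@dataclass
class IndicatorState:
    label: str        # "A", "B", or an avatar nickname such as "nine"
    mode: str         # "input", "output", or "idle"

    def color(self) -> str:
        return {"input": "red", "output": "green", "idle": "gray"}[self.mode]

    def breathing(self) -> bool:
        # breathing-light effect (aperture grows and shrinks) only while active
        return self.mode in ("input", "output")

print(IndicatorState("A", "input").color(), IndicatorState("C", "idle").breathing())
```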
In the embodiment of the application, the sign language translation page further comprises an indication tag, which can be used to indicate the input state, the conversation time, service information and the like. In one example, the indication tag may be located at the boundary between the sign language input area and the sign language output area and may be used to indicate various required information, for example displaying service information such as prompt information, scene common phrases and recommendation information corresponding to the scene. It may also present other information in combination with the indication element, such as the input state and the duration of the current translation. The indication tag can display different information through different colors, icons, text and the like, and when switching between different pieces of information it can prompt through a corresponding switching pattern, such as flip switching, zoom switching or shutter switching, so that the change of information is indicated. An indication tag is displayed in the sign language translation page, and switching between different indication tags is performed through set switching patterns.
The following provides an embodiment of barrier-free sign language communication based on interaction between a device and a server. A video communication page with a sign language translation function is provided, through which remote users can communicate without barriers; the two users may be a sign language user and a non-sign-language user.
Referring to fig. 7, an interaction diagram of another barrier-free communication method embodiment of the present application is shown. As shown in fig. 7, both sign language users and non-sign language users interact through video, where sign language video data is collected on the sign language user (first device) side and voice data is collected on the non-sign language user (second device) side. The following steps can be specifically executed:
Step 700, a device provides a video communication page, the video communication page including a home-terminal display area and an opposite-terminal display area; here the home-terminal display area is taken as the sign language input area and the opposite-terminal display area as the sign language output area. Take the first device as the sign language user's device and the second device as the non-sign-language user's device as an example. The sign language translation page is, for example, the video communication page of an Instant Messaging (IM) application.
In step 702, a first device acquires first video data through an image acquisition component. The first video data comprises first sign language video data.
Step 704, the first device displays the first video data in the home terminal display area of the video call page.
Step 706, the first device uploads the collected first sign language video data to the server.
And step 708, the server side performs sign language recognition on the sign language video data according to the sign language recognition model and determines the corresponding sign language translation information, wherein the sign language translation information is determined according to the sign language recognition results of the image frame sets corresponding to sentence-break nodes, and the sentence-break nodes are obtained by performing sentence-break detection on the sign language video data. The sign language recognition and translation process is similar to that in the above embodiments, so the description is not repeated; reference may be made to the corresponding discussion above.
And step 710, the server side delivers the collected first sign language video data and the sign language translation information. The server can send at least one of the sign language translation voice and the sign language recognition text synthesized in the sign language translation information to the first device; whether to return sign language translation information to the first device may be determined based on various conditions such as the sign language user's settings and the network conditions. For the second device, the server may return at least one of the synthesized sign language translation voice and the sign language recognition text so that the user of the second device can understand what the sign language user has expressed; the collected sign language video data may of course also be fed back to the second device based on settings, network conditions and the like.
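The feedback decision described above might be expressed as in the sketch below; the setting names and payload labels are assumptions for illustration, not fields defined by the patent.

```python
# Sketch: decide what the server returns to each device based on user settings
# and network conditions.
def plan_feedback(settings: dict, network_ok: bool) -> dict:
    to_first = []                                   # sign language user's device
    if settings.get("echo_translation", True):
        to_first.append("sign_recognition_text")
    to_second = ["sign_translation_speech", "sign_recognition_text"]
    if network_ok and settings.get("forward_video", False):
        to_second.append("sign_video_data")         # raw captured sign video is optional
    return {"first_device": to_first, "second_device": to_second}

print(plan_feedback({"forward_video": True}, network_ok=True))
```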
If this is applied to a communication scene in which sign language is translated into natural language in one direction only, the server side feeds the sign language video data and the sign language translation information back to the second device, so that the sign language video data can be displayed on the second device and the corresponding sign language translation information output, allowing the sign language user and the non-sign-language user to interact. For example, the sign language user may be a speech-impaired user who can hear and understand what the non-sign-language user says but cannot speak, and therefore needs to communicate in sign language.
If the communication scene is to be translated in two directions of sign language and natural language, the natural language of the non-sign language user is translated into the sign language, and the following steps can be executed:
Step 712, the audio input component of the second device collects voice data.
Step 714, the second device uploads the collected voice data to the server.
If the second device collects video data, the video data can be directly transmitted to the server, and the server can separate voice data from the video data for translation.
And step 716, the server generates second sign language video data according to the collected voice data. The second sign language video data is generated by the avatar executing sign language actions; the sign language actions of the avatar drive the limb actions according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and face action parameters are determined by analyzing the communication information, here the collected voice data.
Voice recognition is performed on the voice data to determine the corresponding text data; a sign language vocabulary sequence is determined from the text data, and the limb action parameters corresponding to the sign language vocabulary sequence are obtained; keywords and emotion information are determined from the text data, and facial action parameters matching the keywords and emotion information are determined. A limb action sequence of the avatar is determined according to the limb action parameters, and the transition actions between adjacent limb actions are determined; the limb action sequence is connected through the transition actions and the avatar is driven to execute the corresponding limb actions; a lip language action is determined based on the lip language action parameters and a facial expression based on the facial expression parameters; the avatar is driven to perform the lip language action and the corresponding facial expression; and the limb actions, lip language actions and facial expressions of the avatar are fused according to the time information to generate the corresponding sign language video data. Speech rate information can also be analyzed from the voice data and the action speed of the avatar adjusted accordingly. The details are similar to the corresponding processes in the above embodiments and are not repeated here.
Step 718, the server sends the second sign language video data to the first device.
The server side sends the synthesized second sign language video data to the first device; the text data and the collected voice data may also be sent to the first device. For the second device, whether to feed back the synthesized sign language video data, the text data and the collected voice data may be determined based on settings, network conditions and the like.
And step 720, the first device displays the second sign language video data in the sign language output area.
So that the sign language user can perform barrier-free communication with the non-sign language user through the sign language translation page.
In the embodiment of the application, sign language video data is translated, and during the translation process the sign language recognition result can be fed back to the sign language user so that the sign language user can confirm whether it is accurate; if it is not accurate, the text can be adjusted through a corresponding adjustment control, and candidate suggestions can be given during the adjustment. In addition, in the process of translating natural language into sign language, after the avatar's sign language video data has been shown to the sign language user, a prompt can be given that the sign language video data has finished playing and the sign language user can be asked to confirm whether the meaning of the preceding avatar sign language was understood; if not, a translation adjustment control can be provided together with candidate texts, so that the avatar's sign language video data can be adjusted based on the candidate texts, improving the accuracy of the translation.
On the basis of the above embodiments, the present application embodiment further provides a sign language teaching method, as shown in fig. 8.
Step 802, providing a sign language teaching page.
And step 804, displaying target teaching information on the sign language teaching page.
Step 806, collecting first sign language video data through an image collection component, and displaying the first sign language video data in a sign language input area of the sign language teaching page, wherein the first sign language video data is video data of sign language users executing sign language according to the target teaching information.
The sign language teaching page includes a sign language input area and a sign language output area, the latter displaying the standard sign language of the avatar for teaching comparison. Target teaching information can therefore be displayed on the sign language teaching page; the target teaching information can be text data, and in some examples voice data can also be used. The target teaching information is the information for which the user needs to input sign language. The corresponding user performs the sign language based on the target teaching information, and the device collects the user's first sign language video data through the image acquisition component.
Step 808, uploading the first sign language video data.
Step 810, receiving synthesized second sign language video data corresponding to the first sign language video data, wherein the second sign language video data is generated by the avatar executing sign language actions, the sign language actions of the avatar drive the limb actions according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and face action parameters are determined by analyzing the target teaching information.
And step 812, displaying the second sign language video data in a sign language output area of the sign language teaching page so that sign language users can learn sign language.
Sentence-break detection can be performed on the first sign language video data in real time, and the sign language feature set corresponding to a sentence-break node is uploaded to the server side; the server side can perform detection and temporal recognition based on the sign language feature set to obtain a sign language vocabulary sequence, convert it into a natural language sentence based on natural language rules to obtain the sign language recognition text, and correct the sign language recognition text in combination with the error correction module.
The target teaching information can be text data or voice data, and voice recognition can be performed on voice data to determine the corresponding text data. The text data is analyzed to determine a sign language vocabulary sequence, and the limb action parameters corresponding to the sign language vocabulary sequence are obtained; keywords and emotion information are determined from the text data, and facial action parameters matching the keywords and emotion information are determined. A limb action sequence of the avatar is determined according to the limb action parameters, and the transition actions between adjacent limb actions are determined; the limb action sequence is connected through the transition actions and the avatar is driven to execute the corresponding limb actions; a lip language action is determined based on the lip language action parameters and a facial expression based on the facial expression parameters; the avatar is driven to perform the lip language action and the corresponding facial expression; and the limb actions, lip language actions and facial expressions of the avatar are fused according to the time information to generate the corresponding sign language video data. Speech rate information can also be analyzed from the voice data and the action speed of the avatar adjusted accordingly. The details are similar to the corresponding processes in the above embodiments and are not repeated here. The sign language teaching video data can be generated in advance in the above manner based on the target teaching information.
If there are problems with the sign language actions of the sign language user in the first sign language video data, such as wrong or non-standard sign language actions, the second sign language video data of the avatar can be compared with the first sign language video data to determine the sign language information to be corrected. A correction mark can then be added to the second sign language video data or the first sign language video data based on the sign language information to be corrected; for example, a correction mark is added to the first sign language video data so that an error prompt is displayed based on the mark, indicating that the sign language action is wrong or non-standard. A correction mark can also be added to the second sign language video data, for example by highlighting in the second sign language video data the sign language action that is problematic in the first sign language video data, so as to prompt the sign language user; corresponding prompt information, such as the position of the error and the correct way to perform the action, can also be displayed, so that the first sign language video data and the standard second sign language video data can be shown on the device for comparison. The user can also determine the sign language actions that need correction based on the correction marks in the sign language video data.
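The comparison step could look like the sketch below, assuming aligned per-segment features and a distance function; the feature representation, alignment and threshold are placeholders, not the patent's actual comparison method.

```python
# Sketch: compare aligned segments of the learner video and the standard avatar
# video, and emit correction marks where the distance exceeds a threshold.
def find_corrections(learner_feats, standard_feats, distance, threshold=0.35):
    marks = []
    for i, (lf, sf) in enumerate(zip(learner_feats, standard_feats)):
        d = distance(lf, sf)
        if d > threshold:
            marks.append({"segment": i, "score": d})   # correction mark for this segment
    return marks

# Toy usage with scalar "features" and absolute difference as the distance.
learner = [0.1, 0.9, 0.2]
standard = [0.1, 0.3, 0.2]
print(find_corrections(learner, standard, distance=lambda a, b: abs(a - b)))
```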
The embodiment of the application can extract multi-modal features from text and voice, which better matches the driving signals of real limb actions. Action types are decomposed, and intentional actions retrieved by query are fused with natural emotional actions generated by a depth model. Only the motion data of a limited number of sign language words needs to be recorded and refined, and the digital sign language motions for arbitrary word combinations and continuous sentences, including the motions of the face and limbs, are synthesized adaptively.
Through speech rate analysis and the action linking algorithm, a real, continuous and natural sign language expression process can be better restored, improving the expressiveness of the avatar. A uniform time stamp is generated from the limb action results, and the lip language and expression results are produced synchronously, further improving the expressiveness of the avatar's sign language.
The embodiment of the application is based on an image acquisition component such as a camera, so sign language data of a sign language user can be collected and sign language recognition completed without wearing any other equipment. In the above processing, an AI (Artificial Intelligence) visual algorithm analyzes the sign language actions in real time and recognizes the sign language words, without a large number of sign language words having to be recorded in advance as matching material. In the embodiment of the application, the sign language recognition algorithm supports sign languages with various characteristics and can capture other sign language features, including the face and limbs, so that the sign language is better understood and the accuracy of sign language recognition is improved. Based on the sentence-break model, sign language can be recognized and translated in real time in units of sentences, improving translation efficiency. The natural language of the sign language can be adjusted based on a natural language NLP model, wrong translation results can be filtered and corrected in combination with the error correction module, and text can be converted to speech based on the NLP translation model, making it convenient for users to obtain translation information through multiple channels. By extracting structural elements from the visual images and performing structural modeling and learning, the learning ability of the sign language recognition network is enhanced explicitly and the final recognition accuracy is improved. The detailed structural elements can also support customized technical services, such as automatic sentence break and analysis of specific action categories, improving accuracy.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
On the basis of the above embodiments, the present embodiment further provides a sign language video generating device, which is applied to a terminal device.
And the acquisition module is used for acquiring the voice data acquired by the audio input assembly. The generating module is used for analyzing the voice data and determining sign language parameters according to an analysis result, wherein the sign language parameters comprise limb action parameters and face action parameters; and driving the limb action of the virtual image according to the limb action parameters, and driving the face action of the virtual image according to the face action parameters to generate corresponding sign language video data. And the output module is used for outputting sign language video data containing the virtual image.
The generating module is used for performing voice recognition on the voice data and determining the corresponding text data; determining a sign language vocabulary sequence according to the text data and acquiring the limb action parameters corresponding to the sign language vocabulary sequence; and determining keywords and emotion information according to the text data and determining the facial action parameters matching the keywords and the emotion information.
The generation module is used for determining a limb action sequence of the virtual image according to the limb action parameters and determining transitional action between two limb actions; connecting the limb action sequence according to the transitional action, and driving the virtual image to execute the corresponding limb action; determining a lip language action based on the lip language action parameters and determining a facial expression based on the facial expression parameters; driving the avatar to perform a lip language action and a corresponding facial expression; and fusing the limb actions, lip language actions and facial expressions of the virtual image according to the time information to generate corresponding sign language video data.
The generating module is further configured to analyze speech rate information according to the speech data, and adjust the action speed of the virtual image according to the speech rate information.
The output module is also used for providing a sign language translation page; playing sign language video data of the virtual image on the sign language translation page; and the acquisition module is used for calling an audio input component through the sign language translation page and acquiring voice data.
The output module is further used for displaying an indication element in the sign language translation page, wherein the indication element is used for indicating input and output states; the indication element comprises at least one of: text indication elements, dynamic indication elements, color indication elements.
The output module may be further configured to determine the service sign language video data containing the avatar that corresponds to the service information, the content type of the service information including at least one of prompt information and scene common phrases, and to play the service sign language video data in the sign language translation page when the service condition is detected to be met.
Under a sign language scenario: the acquisition module is used for acquiring information to be translated; the sign language video generation module is used for determining sign language video data of the virtual image according to the information to be translated, wherein the sign language video data is generated by driving the virtual image to execute actions according to action parameters, and the action parameters are generated according to the analysis result of the information to be translated; and the output module is used for outputting the sign language video data of the virtual image.
The acquisition module is used for acquiring information to be translated through the sign language translation page; and the output module is used for playing the sign language video data of the virtual image on the sign language translation page. The output module is used for displaying an indication element in the translation page, wherein the indication element is used for indicating input and output states; the indication element comprises at least one of: text indication elements, dynamic indication elements, color indication elements.
In an alternative embodiment, a bidirectional sign language translation apparatus is provided: the output module is used for providing a sign language translation page; displaying the first sign language video data in a sign language input area of the sign language translation page; acquiring the sign language translation information corresponding to the first sign language video data, wherein the sign language translation information is determined according to the sign language recognition results of the image frame sets corresponding to sentence-break nodes, and the sentence-break nodes are obtained by performing sentence-break detection on the sign language video data; outputting the sign language translation information through the sign language translation page; acquiring second sign language video data synthesized corresponding to the collected voice data, wherein the second sign language video data is generated by the avatar executing sign language actions, the sign language actions of the avatar drive the limb actions according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and face action parameters are determined by analyzing the voice data; and displaying the second sign language video data in a sign language output area of the sign language translation page.
The acquisition module is used for acquiring the first sign language video data through the image acquisition component and collecting the voice data through the audio input component.
In an alternative embodiment, a sign language customer service device is provided: the output module is used for providing a sign language customer service page; displaying the first sign language video data in a sign language input area of the sign language customer service page; determining the sign language translation information corresponding to the first sign language video data so as to output it in the customer service page, wherein the sign language translation information is determined according to the sign language recognition results of the image frame sets corresponding to sentence-break nodes, and the sentence-break nodes are obtained by performing sentence-break detection on the sign language video data; receiving second sign language video data synthesized according to the service reply information of the customer service, wherein the second sign language video data is generated by the avatar executing sign language actions, the sign language actions of the avatar drive the limb actions according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and face action parameters are determined by analyzing the service reply information; and displaying the second sign language video data in a sign language output area of the sign language customer service page.
The acquisition module is used for acquiring first finger language video data through the image acquisition assembly.
In an alternative embodiment, a sign language communication device is provided: the output module is used for providing a video call page; displaying first video data in a local display area of the video call page, wherein the first video data comprises first sign language video data; displaying sign language translation information of the first sign language video data in the local display area of the video call page, wherein the sign language translation information is determined according to a sign language recognition result of an image frame set corresponding to a sentence break node, and the sentence break node is obtained by performing sentence break detection on the sign language video data; receiving second sign language video data synthesized according to communication information of an opposite terminal, wherein the second sign language video data is generated by driving the virtual image to perform sign language actions, the limb actions of the virtual image being driven according to the limb action parameters and the face actions according to the face action parameters, the limb action parameters and the face action parameters being determined by analyzing the communication information, and the communication information comprising at least one of text information, voice information, and video information; and displaying the second sign language video data in an opposite-end display area of the video call page.
The acquisition module is used for acquiring first video data through the image acquisition assembly.
In an alternative embodiment, a sign language teaching apparatus is provided: the output module is used for providing a sign language teaching page; displaying target teaching information on the sign language teaching page; displaying the first sign language video data in a sign language input area of the sign language teaching page, wherein the first sign language video data is video data of a sign language user performing sign language according to the target teaching information; receiving sign language translation information corresponding to the first sign language video data and synthesized second sign language video data, wherein the sign language translation information is determined according to the sign language recognition result of the image frame set corresponding to a sentence break node, the sentence break node is obtained by performing sentence break detection on the sign language video data, the second sign language video data is generated by driving the virtual image to perform sign language actions, the limb actions of the virtual image being driven according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and the face action parameters are determined according to the target teaching information; and displaying the second sign language video data in a sign language output area of the sign language teaching page so that the sign language user can learn sign language.
Optionally, the output module is further configured to display an error prompt in the first sign language video data; and/or to magnify a target sign language action in the second sign language video data so as to prompt the sign language user.
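A minimal sketch of how an error prompt for the teaching page might be derived, assuming a recognizer that returns the learner's sign vocabulary sequence; find_sign_errors and the example gloss strings are hypothetical and only illustrate comparing the recognized sequence against the target teaching information.

```python
# Minimal sketch, assuming the learner's first sign language video has already been recognized.
from typing import List, Optional

def find_sign_errors(recognized: List[str], target: List[str]) -> List[int]:
    """Return indices of target sign words that were missed or signed differently,
    which could drive an error prompt or a magnified replay of the target action."""
    errors: List[int] = []
    for i, expected in enumerate(target):
        actual: Optional[str] = recognized[i] if i < len(recognized) else None
        if actual != expected:
            errors.append(i)
    return errors

# Example: target teaching information "HELLO THANK-YOU", learner signed "HELLO SORRY"
print(find_sign_errors(["HELLO", "SORRY"], ["HELLO", "THANK-YOU"]))  # -> [1]
```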
The acquisition module is used for acquiring the first sign language video data through the image acquisition assembly and uploading the first sign language video data.
In conclusion, after the sign language video data is collected, sentence break detection can be performed on its frame images in real time, with each frame image detected as it arrives. Semantic translation can therefore be performed on the sign language video data sentence by sentence: the sign language recognition result of the image frame set corresponding to each sentence break node is determined, and the sign language translation information is determined according to that recognition result, achieving real-time translation of sign language. The sign language translation information is then output, which facilitates sign language translation.
The embodiments of the present application are based on image acquisition components such as cameras, so sign language data can be collected and sign language recognition completed without the user wearing any additional equipment. In the above processing, sign language actions are analyzed in real time by an AI (Artificial Intelligence) vision algorithm and sign language words are recognized, without a large number of sign language words having to be recorded in advance as matching material. In the embodiments of the present application, the sign language recognition algorithm supports sign languages with various characteristics and can capture further sign language features, including the face and limbs, so that the sign language is better understood and the accuracy of sign language recognition is improved. Based on the sentence break model, sign language can be recognized and translated in real time in units of sentences, improving translation efficiency. The method can adjust the recognized sign language into natural language based on an NLP (Natural Language Processing) model, can filter out and correct wrong translation results in combination with an error correction module, and can convert text to speech based on the NLP translation model, making it convenient for users to obtain the translation information through multiple channels.
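The recognition, NLP adjustment, error correction, and text-to-speech path summarized above can be pictured with the following sketch, in which each stage is an injected callable; none of the names refer to a real library, and the composition is an assumption for illustration only.

```python
# Minimal sketch, assuming each stage of the pipeline is supplied by the caller.
from typing import Callable, List

def sign_to_speech(frames: List[object],
                   recognize: Callable[[List[object]], str],
                   to_natural_language: Callable[[str], str],
                   correct: Callable[[str], str],
                   text_to_speech: Callable[[str], bytes]) -> bytes:
    """Recognize one sentence of sign language and return synthesized speech for it."""
    raw_gloss = recognize(frames)              # AI vision algorithm output for the frame set
    sentence = to_natural_language(raw_gloss)  # NLP model adjusts the gloss into natural language
    sentence = correct(sentence)               # error correction module filters wrong results
    return text_to_speech(sentence)            # audio output alongside the displayed text
```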
The present application further provides a non-transitory readable storage medium, where one or more modules (programs) are stored; when the one or more modules are applied to a device, the device may be caused to execute the instructions of the method steps in the present application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform the methods as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the present disclosure may be implemented as an apparatus, which may include electronic devices such as a terminal device, a server (cluster), etc. within a data center, using any suitable hardware, firmware, software, or any combination thereof, in a desired configuration. Fig. 9 schematically illustrates an example apparatus 900 that may be used to implement various embodiments described herein.
For one embodiment, fig. 9 illustrates an example apparatus 900 having one or more processors 902, a control module (chipset) 904 coupled to at least one of the processor(s) 902, a memory 906 coupled to the control module 904, a non-volatile memory (NVM)/storage 908 coupled to the control module 904, one or more input/output devices 910 coupled to the control module 904, and a network interface 912 coupled to the control module 904.
The processor 902 may include one or more single-core or multi-core processors, and the processor 902 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 900 can be a terminal device, a server (cluster), or the like as described in this embodiment.
In some embodiments, apparatus 900 may include one or more computer-readable media (e.g., memory 906 or NVM/storage 908) having instructions 914, and one or more processors 902, coupled to the one or more computer-readable media, that are configured to execute the instructions 914 to implement modules that perform the actions described in this disclosure.
For one embodiment, control module 904 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 902 and/or any suitable device or component in communication with control module 904.
The control module 904 may include a memory controller module to provide an interface to the memory 906. The memory controller module may be a hardware module, a software module, and/or a firmware module.
The memory 906 may be used, for example, to load and store data and/or instructions 914 for the device 900. For one embodiment, memory 906 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 906 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 904 may include one or more input/output controllers to provide an interface to the NVM/storage 908 and input/output device(s) 910.
For example, NVM/storage 908 may be used to store data and/or instructions 914. NVM/storage 908 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 908 may include storage resources that are physically part of the device on which apparatus 900 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 908 may be accessible over a network via input/output device(s) 910.
Input/output device(s) 910 may provide an interface for apparatus 900 to communicate with any other suitable device; input/output device(s) 910 may include communication components, audio components, sensor components, and so forth. Network interface 912 may provide an interface for apparatus 900 to communicate over one or more networks, and apparatus 900 may communicate wirelessly with one or more components of a wireless network in accordance with any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 902 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the control module 904. For one embodiment, at least one of the processor(s) 902 may be packaged together with logic for one or more controller(s) of the control module 904 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with logic for one or more controller(s) of the control module 904. For one embodiment, at least one of the processor(s) 902 may be integrated on the same die with logic of one or more controllers of the control module 904 to form a system on a chip (SoC).
In various embodiments, the apparatus 900 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, apparatus 900 may have more or fewer components and/or different architectures. For example, in some embodiments, device 900 includes one or more cameras, keyboards, Liquid Crystal Display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, Application Specific Integrated Circuits (ASICs), and speakers.
The detection device may adopt a main control chip as the processor or control module; sensor data, position information, and the like may be stored in the memory or the NVM/storage device; a sensor group may serve as the input/output device; and the communication interface may include the network interface.
Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The sign language video generation method, sign language translation method, sign language customer service method, sign language communication method, sign language teaching method, terminal device, and machine-readable medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and core idea of the present application. Meanwhile, for a person skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (12)

1. A sign language video generation method, the method comprising:
acquiring voice data collected by an audio input assembly;
analyzing the voice data, and determining sign language parameters according to an analysis result, wherein the sign language parameters comprise limb action parameters and face action parameters;
driving limb actions of the virtual image according to the limb action parameters, driving face actions of the virtual image according to the face action parameters, and generating corresponding sign language video data;
outputting the sign language video data containing the virtual image.
2. The method of claim 1, wherein the analyzing the voice data and determining sign language parameters according to the analysis result comprises:
performing voice recognition on the voice data, and determining corresponding text data;
determining a sign language vocabulary sequence according to the text data, and acquiring limb action parameters corresponding to the sign language vocabulary sequence;
and determining keywords and emotion information according to the text data, and determining face action parameters matching the keywords and the emotion information.
3. The method of claim 2, wherein the driving limb actions of the virtual image according to the limb action parameters, driving face actions of the virtual image according to the face action parameters, and generating corresponding sign language video data comprises:
determining a limb action sequence of the virtual image according to the limb action parameters, and determining a transitional action between two limb actions;
connecting the limb action sequence according to the transitional action, and driving the virtual image to execute the corresponding limb action;
determining a lip language action based on the lip language action parameters and determining a facial expression based on the facial expression parameters;
driving the avatar to perform a lip language action and a corresponding facial expression;
and fusing the limb actions, lip language actions and facial expressions of the virtual image according to the time information to generate corresponding sign language video data.
4. The method of claim 3, further comprising:
analyzing speech rate information according to the voice data, and adjusting the action speed of the virtual image according to the speech rate information.
5. The method of claim 1, further comprising:
calling an image acquisition component on a sign language page, and acquiring video data of a user through the image acquisition component;
the outputting sign language video data containing the virtual image comprises: displaying the collected video data and the sign language video data of the virtual image through the sign language page.
6. The method of claim 5, further comprising:
detecting the collected video data when sign language video data of the virtual image is played on a sign language page;
and when the preset gesture is detected in the collected video data, pausing the playing of the sign language video data containing the virtual image.
7. The method of claim 1, further comprising:
when the sign language video data of the virtual image is played to a target position, displaying a display element corresponding to the target position in a sign language page, wherein the target position is determined according to a keyword, and the display element comprises at least one of the following elements: background elements, image elements and emotion elements.
8. A sign language translation method, the method comprising:
providing a sign language translation page;
acquiring first sign language video data through an image acquisition assembly, and displaying the first sign language video data in a sign language input area of the sign language translation page;
acquiring sign language translation information corresponding to the first sign language video data, and outputting the sign language translation information through the sign language translation page;
voice data is collected through an audio input assembly;
acquiring second sign language video data synthesized from the collected voice data, wherein the second sign language video data is generated by driving the virtual image to perform sign language actions, the limb actions of the virtual image being driven according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and the face action parameters being determined by analyzing the voice data;
and displaying the second sign language video data in a sign language output area of the sign language translation page.
9. The method of claim 8, further comprising:
pausing the playing of the second sign language video data when a preset gesture is detected in the collected first sign language video data;
and triggering translation processing of the first sign language video data when it is detected that a sign language action exists in the first sign language video data.
10. A sign language customer service method is characterized by comprising the following steps:
providing a sign language customer service page;
acquiring first sign language video data through an image acquisition assembly, and displaying the first sign language video data in a sign language input area of the sign language customer service page;
determining sign language translation information corresponding to the first sign language video data so as to output the sign language translation information in a customer service page;
receiving second sign language video data synthesized according to service reply information of customer service, wherein the second sign language video data is generated by driving the virtual image to perform sign language actions, the limb actions of the virtual image being driven according to the limb action parameters and the face actions according to the face action parameters, and the limb action parameters and the face action parameters being determined by analyzing the service reply information;
and displaying the second sign language video data in a sign language output area of the sign language customer service page.
11. An electronic device, comprising: a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1-10.
12. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the method of any of claims 1-10.
CN202111060002.6A 2021-09-10 2021-09-10 Sign language video generation, translation and customer service method, device and readable medium Pending CN113835522A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111060002.6A CN113835522A (en) 2021-09-10 2021-09-10 Sign language video generation, translation and customer service method, device and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111060002.6A CN113835522A (en) 2021-09-10 2021-09-10 Sign language video generation, translation and customer service method, device and readable medium

Publications (1)

Publication Number Publication Date
CN113835522A 2021-12-24

Family

ID=78958923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111060002.6A Pending CN113835522A (en) 2021-09-10 2021-09-10 Sign language video generation, translation and customer service method, device and readable medium

Country Status (1)

Country Link
CN (1) CN113835522A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427910A (en) * 2018-01-30 2018-08-21 浙江凡聚科技有限公司 Deep-neural-network AR sign language interpreters learning method, client and server
CN109960813A (en) * 2019-03-18 2019-07-02 维沃移动通信有限公司 A kind of interpretation method, mobile terminal and computer readable storage medium
CN110647636A (en) * 2019-09-05 2020-01-03 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110728191A (en) * 2019-09-16 2020-01-24 北京华捷艾米科技有限公司 Sign language translation method, and MR-based sign language-voice interaction method and system
CN110807388A (en) * 2019-10-25 2020-02-18 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110826441A (en) * 2019-10-25 2020-02-21 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110931042A (en) * 2019-11-14 2020-03-27 北京欧珀通信有限公司 Simultaneous interpretation method and device, electronic equipment and storage medium
CN112650831A (en) * 2020-12-11 2021-04-13 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708648A (en) * 2022-03-07 2022-07-05 杭州易视通科技有限公司 Sign language recognition method and system based on artificial intelligence
WO2023197949A1 (en) * 2022-04-15 2023-10-19 华为技术有限公司 Chinese translation method and electronic device
CN114895817A (en) * 2022-05-24 2022-08-12 北京百度网讯科技有限公司 Interactive information processing method, and training method and device of network model
CN114895817B (en) * 2022-05-24 2023-08-04 北京百度网讯科技有限公司 Interactive information processing method, network model training method and device
CN115239855A (en) * 2022-06-23 2022-10-25 安徽福斯特信息技术有限公司 Virtual sign language anchor generation method, device and system based on mobile terminal
WO2024008047A1 (en) * 2022-07-04 2024-01-11 阿里巴巴(中国)有限公司 Digital human sign language broadcasting method and apparatus, device, and storage medium
CN115457981A (en) * 2022-09-05 2022-12-09 安徽康佳电子有限公司 Method for facilitating hearing-impaired person to watch video and television based on method
CN115484493A (en) * 2022-09-09 2022-12-16 深圳市小溪流科技有限公司 Real-time intelligent streaming media system for converting IPTV audio and video into virtual sign language video in real time
CN116843805A (en) * 2023-06-19 2023-10-03 上海奥玩士信息技术有限公司 Method, device, equipment and medium for generating virtual image containing behaviors
CN116843805B (en) * 2023-06-19 2024-03-19 上海奥玩士信息技术有限公司 Method, device, equipment and medium for generating virtual image containing behaviors
CN116977499A (en) * 2023-09-21 2023-10-31 粤港澳大湾区数字经济研究院(福田) Combined generation method of facial and body movement parameters and related equipment
CN116977499B (en) * 2023-09-21 2024-01-16 粤港澳大湾区数字经济研究院(福田) Combined generation method of facial and body movement parameters and related equipment

Similar Documents

Publication Publication Date Title
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
JP7408048B2 (en) Anime character driving method and related device based on artificial intelligence
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
Ahmed et al. Deaf talk using 3D animated sign language: A sign language interpreter using Microsoft's kinect v2
Hong et al. Real-time speech-driven face animation with expressions using neural networks
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
Martins et al. Accessible options for deaf people in e-learning platforms: technology solutions for sign language translation
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
CN110598576A (en) Sign language interaction method and device and computer medium
Oliveira et al. Automatic sign language translation to improve communication
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
KR20120120858A (en) Service and method for video call, server and terminal thereof
Gibbon et al. Audio-visual and multimodal speech-based systems
CN113822187A (en) Sign language translation, customer service, communication method, device and readable medium
CN113851029B (en) Barrier-free communication method and device
Zhang et al. Teaching chinese sign language with a smartphone
Kanvinde et al. Bidirectional sign language translation
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
Rastgoo et al. A survey on recent advances in Sign Language Production
Rastgoo et al. All You Need In Sign Language Production
Putra et al. Designing translation tool: Between sign language to spoken text on kinect time series data using dynamic time warping
JP2017182261A (en) Information processing apparatus, information processing method, and program
CN116088675A (en) Virtual image interaction method, related device, equipment, system and medium
JP6754154B1 (en) Translation programs, translation equipment, translation methods, and wearable devices
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination