CN113822186A - Sign language translation, customer service, communication method, device and readable medium - Google Patents


Info

Publication number
CN113822186A
CN113822186A (application number CN202111059974.3A)
Authority
CN
China
Prior art keywords
sign language
video data
language
information
sign
Prior art date
Legal status
Pending
Application number
CN202111059974.3A
Other languages
Chinese (zh)
Inventor
程荣亮
王琪
张邦
潘攀
徐盈辉
Current Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority claimed from CN202111059974.3A
Publication of CN113822186A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B21/00 - Teaching, or communicating with, the blind, deaf or mute
    • G09B21/04 - Devices for conversing with the deaf-blind

Abstract

Embodiments of the application provide a sign language translation method, a sign language customer service method, a sign language communication method, corresponding devices, and a readable medium, so that sign language translation can be carried out conveniently. The method comprises: acquiring sign language video data; performing sign language recognition on the sign language video data with a sign language recognition model and determining corresponding sign language translation information from the recognition result, wherein the sign language recognition model extracts sign language structural features from the sign language video data and performs sign language recognition based on those features; and outputting the sign language translation information. Explicit sign language structural features strengthen the learning ability of the sign language recognition network and improve the final recognition accuracy, and the sign language translation information is output so that sign language translation can be performed conveniently.

Description

Sign language translation, customer service, communication method, device and readable medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a sign language translation method, a sign language customer service method, a sign language communication method, a sign language teaching method, a terminal device, and a machine-readable medium.
Background
Hearing-impaired and deaf-mute people usually communicate by sign language, a visual language of hand gestures through which people with hearing or speech impairments interact and communicate with one another.
However, very few people in daily life have mastered sign language, so it is difficult for hearing-impaired people, deaf-mute people and the like to communicate with others, which affects their daily lives.
Disclosure of Invention
The embodiment of the application provides a sign language translation method, which is used for conveniently translating sign languages.
Correspondingly, the embodiment of the application also provides a sign language customer service method, a sign language communication method, a sign language teaching method, electronic equipment and a machine readable medium, which are used for ensuring the realization and application of the method.
In order to solve the above problem, an embodiment of the present application discloses a sign language translation method, including: acquiring sign language video data; performing sign language recognition on the sign language video data according to a sign language recognition model, and determining corresponding sign language translation information according to a sign language recognition result, wherein the sign language recognition model is used for extracting sign language structural features of the sign language video data and performing sign language recognition according to the sign language structural features; and outputting the sign language translation information.
Optionally, the determining the corresponding sign language translation information according to the sign language recognition result includes: taking sign language identification texts in the sign language identification results as sign language translation information; and/or performing voice synthesis by adopting the sign language recognition text in the sign language recognition result, and taking the synthesized sign language translation voice as sign language translation information.
Optionally, the method further includes: providing a sign language translation page; and playing the sign language video data in the sign language translation page. The outputting of the sign language translation information comprises: displaying the sign language recognition text in the sign language translation page, and/or playing the sign language translation voice based on the sign language translation page.
Optionally, the method further includes: displaying language selectable items in response to triggering of a language selection control in the sign language translation page; in response to a trigger for a language selectable item, a selected target language is determined, the target language being a language in which sign language video data is translated.
Optionally, in response to an output adjustment instruction, adjusting an output mode of the sign language translation information, where the output mode includes: a voice output mode, a text output mode and/or a video output mode.
Optionally, the sign language translation page includes a sign language input area and a sign language output area, and playing the sign language video data in the sign language translation page includes: playing the sign language video data in a sign language input area of the sign language translation page; the method further comprises the following steps: and playing synthesized sign language video data in a sign language output area of the sign language translation page, wherein the synthesized sign language video data is video data for executing sign language by adopting an avatar, and the sign language executed by the avatar is determined according to input information.
Optionally, the performing sign language recognition on the sign language video data according to the sign language recognition model includes: performing feature extraction on the sign language video data through a sign language visual structural model to determine sign language structural features; and identifying the sign language structural features through a sign language feature identification model to obtain a sign language identification text.
Optionally, the scene information is determined based on the setting condition, and the scene parameter is determined according to the scene information, so as to assist sign language translation through the scene parameter.
The embodiment of the application also discloses a sign language translation method, which comprises the following steps: providing a sign language translation page; acquiring first sign language video data through an image acquisition component, and displaying the first sign language video data in a sign language input area of the sign language translation page; acquiring sign language translation information corresponding to the first sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the sign language video data through a sign language recognition model and performing sign language recognition processing; outputting the sign language translation information through the sign language translation page; collecting voice data through an audio input component; acquiring second sign language video data synthesized from the collected voice data, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the voice data; and displaying the second sign language video data in a sign language output area of the sign language translation page.
The embodiment of the application discloses a sign language customer service method, which comprises the following steps: providing a sign language customer service page; acquiring first sign language video data through an image acquisition component, and displaying the first sign language video data in a sign language input area of the sign language customer service page; determining sign language translation information corresponding to the first sign language video data so as to output the sign language translation information in the customer service page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing; receiving second sign language video data synthesized according to service reply information of the customer service, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the service reply information; and displaying the second sign language video data in a sign language output area of the sign language customer service page.
The embodiment of the application discloses a sign language communication method, which comprises the following steps: providing a video call page; acquiring first video data through an image acquisition component, and displaying the first video data in a home-terminal display area of the video call page, wherein the first video data comprises first sign language video data; displaying sign language translation information of the first sign language video data in the home-terminal display area of the video call page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing; receiving second sign language video data synthesized according to communication information of an opposite terminal, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the communication information, and the communication information comprises at least one of text information, voice information and video information; and displaying the second sign language video data in an opposite-terminal display area of the video call page.
The embodiment of the application discloses a sign language teaching method, which comprises the following steps: providing a sign language teaching page; displaying target teaching information on the sign language teaching page; acquiring first sign language video data through an image acquisition component, and displaying the first sign language video data in a sign language input area of the sign language teaching page, wherein the first sign language video data is video data of a sign language user performing sign language according to the target teaching information; uploading the first sign language video data; receiving sign language translation information corresponding to the first sign language video data and synthesized second sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing, and the second sign language video data is sign language teaching video data in which an avatar performs the sign language of the target teaching information; and displaying the second sign language video data in a sign language output area of the sign language teaching page so that the sign language user can learn sign language.
The embodiment of the application also discloses an electronic device, comprising: a processor; and a memory having executable code stored thereon which, when executed, causes the processor to perform a method as in any one of the embodiments of the present application.
The embodiments of the present application also disclose one or more machine-readable media having executable code stored thereon which, when executed, causes a processor to perform a method as in any one of the embodiments of the present application.
Compared with the prior art, the embodiment of the application has the following advantages:
in the embodiments of the application, sign language video data is acquired, sign language recognition is performed on the sign language video data according to a sign language recognition model, and corresponding sign language translation information is determined, wherein the sign language recognition model extracts sign language structural features from the sign language video data and performs sign language recognition according to those features. Explicit sign language structural features strengthen the learning ability of the sign language recognition network and improve the final recognition accuracy, and the sign language translation information is output, so that sign language translation can be carried out conveniently.
Drawings
FIG. 1 is a schematic diagram of a sign language translation scenario according to an embodiment of the present application;
FIG. 2 is a flow chart of steps of an embodiment of a method for training a sign language recognition model of the present application;
FIG. 3 is a schematic diagram of an example of a spatial structure according to an embodiment of the present application;
FIG. 4 is a flow chart of the steps of an embodiment of a sign language translation method of the present application;
FIG. 5A is a diagram illustrating an example of a sign language translation page according to an embodiment of the present application;
FIG. 5B is a flowchart illustrating steps of a sign language customer service method according to an embodiment of the present application;
FIG. 5C is a diagram illustrating another sign language translation scenario according to an embodiment of the present application;
FIGS. 6A and 6B are schematic diagrams of examples of an indicating element according to embodiments of the present application;
FIG. 7 is a flow chart of steps in another sign language translation method embodiment of the present application;
FIG. 8 is a flow chart of steps in another sign language translation method embodiment of the present application;
FIG. 9 is a flowchart illustrating the steps of an embodiment of a bidirectional sign language translation method of the present application;
FIG. 10 is an interaction diagram of an embodiment of a method of unobstructed communication of the present application;
FIG. 11 is a flow chart of steps of an embodiment of a sign language teaching method of the present application;
fig. 12 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The method and device can be applied to various scenarios that require sign language translation. For example, when target users such as hearing-impaired or deaf-mute people communicate face to face while shopping, seeking medical treatment or obtaining legal services, the embodiments of the application can provide a sign language translation service: a translation page is provided, sign language video data to be translated is collected, and the translation is then produced and output. No third-party interpreter is needed; the sign language of target users such as hearing-impaired and deaf-mute people is recognized automatically, and the translated voice, text and other data are output. Users can execute the translation method of the embodiments on various electronic devices such as mobile phones, tablets and computers.
The electronic device of the embodiment of the application can be provided with an image acquisition component, a display component, an audio input/output component and the like, such as a camera, a display, a microphone and a speaker, so that image, video and audio data can be collected and played. In the embodiment of the application, sign language video data can be collected through an image acquisition device such as a camera, and sign language recognition is then performed on the sign language video data through the sign language recognition model, so that automatic translation of the sign language is realized and corresponding sign language translation information is obtained. The terminal device then outputs the sign language translation information, so that a non-sign-language user can understand the meaning expressed by the sign language user. As shown in fig. 1, in an example of a sign language translation scenario, a sign language user uses a mobile phone to perform sign language translation: an image acquisition device such as the camera of the mobile phone captures sign language video data, and the video captured in real time can be displayed on the mobile phone so that the sign language user can conveniently check their own signing. Sign language recognition is then performed on the sign language video data through the sign language recognition model to realize automatic translation of the sign language and obtain the corresponding sign language translation information; the translated text can be displayed on the screen of the mobile phone, and the translated voice can be played by the mobile phone, so that a non-sign-language user can understand the meaning of the sign language.
The sign language recognition model comprises a sign language visual structured model and a sign language feature recognition model. The sign language visual structured model is used to extract sign language structural features from sign language video data, and the sign language feature recognition model is used to recognize semantics based on the sign language structural features and translate the sign language into a sign language recognition text. The sign language recognition model can therefore be formed by two models, a sign language visual structured model and a sign language feature recognition model, or the two can serve as sub-models of the sign language recognition model. Each model can be trained in advance, and the sign language in subsequent sign language video data can then be translated automatically based on the trained sign language visual structured model and sign language feature recognition model, making it convenient for non-sign-language users to communicate with other users.

In the embodiment of the application, the sign language recognition model may adopt a neural network, machine learning or other models; for example, a convolutional neural network can be used to train the sign language recognition model. The sign language visual structured model can perform feature extraction and model training based on visual structured information. Structured information is information that describes or expresses a thing or an object; visual structured information describes visual structural features such as the shape, contour, color and texture of an object, and the specific structural features can be determined according to the application scenario.

In the embodiment of the application, visual structured elements can be extracted from the sign language video data. Visual structured elements refer to fine-grained structured visual cue information related to sign language, such as Regions Of Interest (ROI), human body pose keypoints, and blur classification information of the hand region. A multi-task convolutional neural network can then be adopted to simultaneously perform tasks such as object detection, pose estimation and blur detection.
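The following is a minimal sketch, in Python with PyTorch, of how the two sub-models described above could be composed into a single recognition model. The layer choices, feature dimension and vocabulary size are illustrative assumptions, not the implementation disclosed by the application.

```python
import torch
import torch.nn as nn

class SignVisualStructuredModel(nn.Module):
    """Extracts per-frame sign language structural features from video frames."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # a small frame encoder standing in for the structural-element extractor
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # temporal modelling over the per-frame features
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)          # (B*T, 3, H, W)
        feats = self.frame_encoder(frames).view(b, t, -1)
        feats, _ = self.temporal(feats)       # (B, T, feat_dim) structural features
        return feats

class SignFeatureRecognitionModel(nn.Module):
    """Maps structural features to a sequence of sign vocabulary logits."""
    def __init__(self, feat_dim=256, vocab_size=1000):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        return self.classifier(feats)         # (B, T, vocab_size)

class SignRecognitionModel(nn.Module):
    """Composition of the two sub-models, as described in the embodiment."""
    def __init__(self):
        super().__init__()
        self.visual = SignVisualStructuredModel()
        self.recognizer = SignFeatureRecognitionModel()

    def forward(self, video):
        return self.recognizer(self.visual(video))
```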
The training process of a sign language visual structured model is shown in fig. 2:
step 202, sign language video data is acquired.
In the embodiment of the application, sign language video data can be collected in advance in a model training stage, for example, various types of video data with sign language translation, sign language video data collected by electronic equipment such as a mobile phone and the like can be collected. Sign language video data is taken as sample data.
And 204, performing feature extraction on the sign language video data, and determining corresponding visual structural element information.
Feature extraction can be performed on the sign language video data, and the extracted information is the visual structured element information. The visual structured element information can be of multiple types, such as region-of-interest, pose and blur types, so an extraction mode can be determined according to the type, and visual structured element information of the corresponding type can be extracted from the sign language video data in that mode. For example, a multi-task convolutional neural network can be trained, with the extraction of each type of element as one task, so that the multi-task convolutional neural network simultaneously performs object detection, pose estimation, blur detection and other tasks and extracts the corresponding visual structured element information.
A multi-task convolutional neural network can be adopted to perform an object detection task on the sign language video data and determine the element information of the corresponding regions of interest; to perform a pose estimation task on the sign language video data and determine the element information of the corresponding human body pose keypoints; and to perform a blur detection task on the sign language video data and determine the element information of the corresponding blur classification of the hand region. The above is the multi-task convolutional neural network manner; in practice, feature extraction may instead be performed per task by training a corresponding convolutional neural network or other neural network model for each task, and other manners, such as analyzing each image in the video, may also be adopted to extract the corresponding features. The embodiment of the present application does not limit this.
For the object detection task, the regions of interest in a sign language image can first be located. For sign language, expression comes mainly from sign language actions and facial expressions, so the face or head region, the body region, the hand region and the like can be used as regions of interest, and the corresponding element information is then extracted from them. The pose estimation task mainly detects the signing posture of the user, including body posture, limb posture and hand posture; the posture is detected to determine keypoint information, yielding the element information of the human body pose keypoints. For example, the keypoints may include keypoints of the shoulders, elbows and the like, as well as facial keypoints such as the mouth. For the blur detection task, the detected area is the sign-language-related area, mainly the hand region; blur detection can be performed on the hand region, the motion information, shape information and the like of the hand can be detected, and the corresponding classification information is determined as the element information of the blur classification of the hand region.
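As a rough illustration of the multi-task extraction described above, the sketch below shows a shared backbone with three heads: region-of-interest detection, pose keypoint regression and hand-region blur classification. The backbone, head shapes and class counts are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskStructuredElementNet(nn.Module):
    """Shared backbone with three task heads: region-of-interest detection,
    pose keypoint estimation, and blur classification of the hand region."""
    def __init__(self, num_rois=3, num_keypoints=16, num_blur_classes=4):
        super().__init__()
        self.num_rois = num_rois
        self.num_keypoints = num_keypoints
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # (B, 64)
        )
        self.roi_head = nn.Linear(64, num_rois * 4)           # head/body/hand boxes
        self.pose_head = nn.Linear(64, num_keypoints * 2)     # (x, y) per keypoint
        self.blur_head = nn.Linear(64, num_blur_classes)      # hand-region blur class

    def forward(self, frame):                                 # frame: (B, 3, H, W)
        shared = self.backbone(frame)
        return {
            "rois": self.roi_head(shared).view(-1, self.num_rois, 4),
            "keypoints": self.pose_head(shared).view(-1, self.num_keypoints, 2),
            "blur_logits": self.blur_head(shared),
        }
```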
And step 206, training the sign language visual structural model according to the visual structural element information to obtain a corresponding sign language visual structural model.
The sign language visual structured model comprises a spatial structured model and a time sequence structured model. The sign language visual structured model can be formed jointly by the spatial structured model and the time sequence structured model, or it can be a single model in which the spatial structured model and the time sequence structured model are sub-models; the embodiment of the application does not limit this. Sign language visual structured modeling therefore includes spatial structured modeling and time sequence structured modeling. The time sequence structured model performs time sequence modeling by establishing time sequence information between associated frames, and the spatial structured model structurally models three structural elements: nodes, connections and parts. Training the sign language visual structured model according to the visual structured element information to obtain the corresponding sign language visual structured model includes: determining time sequence information and spatial information according to the visual structured element information; training the time sequence structured model according to the time sequence information, and training the spatial structured model according to the spatial information; and determining the sign language visual structured model according to the time sequence structured model and the spatial structured model.
The visual structured element information can be analyzed to determine the element information related to time sequence, so as to obtain the time sequence information. For example, time sequence information between associated frames can be determined from the element information of the human body pose keypoints based on the front-to-back association of the motion, and change information of the relevant regions can be obtained from the element information of the regions of interest to determine the time sequence information between associated frames. Spatial information can likewise be determined from the visual structured element information: the element information is analyzed to determine the elements related to the spatial structure, including the keypoints (or points of interest) in the various regions and the like, so as to obtain the corresponding spatial information. The time sequence structured model can then be trained based on the time sequence information, and the spatial structured model based on the spatial information. Model training generally includes two processes, forward propagation and backward propagation: the corresponding element information is input into the model for processing to obtain the corresponding visual structural features, and a reverse parameter is then determined based on those features, for example by comparing the visual structural features with pre-labeled comparison features to determine an adjusted reverse parameter, or by determining a feedback function based on the visual structural features and taking the feedback function as the reverse parameter; the model is then adjusted based on the reverse parameter.
Wherein, training the time sequence structured model according to the time sequence information comprises: inputting the time sequence information into a time sequence structured model for processing to obtain time sequence structured characteristics; and determining a reverse parameter by adopting the time sequence structural characteristics, and adjusting the parameters of the time sequence structural model according to the reverse parameter to obtain the trained time sequence structural model. Training a spatial structured model according to the spatial information, comprising: inputting the spatial information into a spatial structural model for processing to obtain spatial structural characteristics; and determining a reverse parameter by adopting the spatial structural characteristics, and adjusting the parameters of the spatial structural model according to the reverse parameter to obtain the trained spatial structural model.
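A minimal sketch in Python with PyTorch of the forward/backward training procedure described above, applicable to either the spatial or the time sequence structured model. The Adam optimizer, the mean-squared-error loss against pre-labeled comparison features, and the hyperparameters are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn

def train_structured_model(model, data_loader, epochs=10, lr=1e-3):
    """Generic forward/backward training loop for a structured model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for element_info, labeled_features in data_loader:
            features = model(element_info)              # forward pass
            loss = criterion(features, labeled_features)
            optimizer.zero_grad()
            loss.backward()                             # backward pass (reverse parameters)
            optimizer.step()                            # adjust model parameters
    return model
```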
In the embodiment of the application, the spatial information includes the three spatial structural elements of the space, namely nodes, connections and parts, which can be analyzed through the spatial structured model. As shown in fig. 3, a Node includes motion nodes and position nodes. A position node describes the image coordinates Node(x, y) of the node in 2D space. A motion node expresses the image coordinates of the node in 2D space together with its offset from a reference node, where the reference node is the node corresponding to the motion node at a static position, such as the reference node of the elbow or of the wrist. A connection (Joint) describes the 2D spatial vector relationship between motion nodes, such as the angle and distance between them. A part (Part) is a sign-language-related component, such as the three components head (R0), left hand (R1) and right hand (R2). The parts contain rich information: for example, the head contains the facial organs and expressions, and the left and right hands can express different gestures, orientations and other information. For the spatial structured model, the image can be quantized in 2D space, the positions of the nodes in 2D space defined, and the relation of each node in space learned in combination with information such as the weight of each node among all nodes, so that the spatial structural features are described through the nodes, the connections between nodes, and the parts. In fig. 3, the position nodes include Node0, Node1, Node2, Node3, Node4, Node5, Node6, Node11, Node12, Node13, Node14 and Node15, and the motion nodes include Node7, Node8, Node9 and Node10.
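The three spatial structural elements of fig. 3 could be represented with simple data structures such as the following sketch; the field names and units are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    """Position node: 2D image coordinates. A motion node additionally carries
    an offset from its reference (static) node, e.g. the elbow or wrist."""
    x: float
    y: float
    offset: Optional[Tuple[float, float]] = None   # None for pure position nodes

@dataclass
class Joint:
    """Connection: 2D spatial vector relationship between two motion nodes."""
    angle: float      # angle between the nodes (degrees, assumed unit)
    distance: float   # distance between the nodes (pixels, assumed unit)

@dataclass
class Part:
    """Sign-language-related component: head (R0), left hand (R1), right hand (R2)."""
    name: str                  # "R0", "R1" or "R2"
    nodes: Tuple[Node, ...]    # nodes belonging to this component
```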
Therefore, explicit features in the sign language video data can be obtained based on the structured model, so that the sign language can be described more accurately. Based on the spatial structured model, the sign language visual structured model can learn the vector relationships and spatial feature expression among keypoints, connections and parts in the 2D image space, and based on the time sequence structured model it can perform time sequence spatial feature modeling to obtain stable sign language time sequence features.
And 208, training a sign language feature recognition model according to the sign language structural features.
After the sign language visual structured model comprising the above spatial structured model and time sequence structured model is obtained, subsequent processing can be based on the sign language structural features output by the sign language visual structured model. Learning of sign language words and sentences can then be performed on the basis of the sign language structural features to obtain a word sequence in the grammatical order of sign language. Since the word order of sign language differs greatly from normal Chinese word order, the word order is adjusted to generate a normal language sequence and obtain the final sign language recognition text. The sign language feature recognition model can therefore be trained based on the sign language structural features; the training process is similar to that described above and is not repeated here.
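A minimal sketch of the recognition step just described: per-frame logits from the feature recognition model are decoded into a word sequence in sign language order, and the word order is then adjusted by a separate component. The greedy, CTC-style decoding and the `reorder_model` stand-in are assumptions, not the disclosed implementation.

```python
import torch

def decode_gloss_sequence(logits: torch.Tensor, idx_to_gloss, blank_id=0):
    """Greedily decode per-frame vocabulary logits (T, vocab) into a gloss
    sequence in sign-language word order, collapsing repeats and blanks."""
    ids = logits.argmax(dim=-1).tolist()
    glosses, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            glosses.append(idx_to_gloss[i])
        prev = i
    return glosses

def reorder_to_natural_language(glosses, reorder_model):
    """Adjust sign-language word order into natural-language word order.
    `reorder_model` stands in for a learned reordering component and is
    hypothetical here."""
    return reorder_model(glosses)
```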
A sign language recognition model may be determined based on the sign language structured model and the sign language feature recognition model. In this embodiment of the application, the sign language recognition model may be located on the terminal device side, or may be located on the server side, or the sign language structural model is set on the terminal device side, and the sign language feature recognition model is set on the server side, so that the sign language structural feature is extracted on the terminal device side, and then the sign language structural feature is uploaded to the server side, and the sign language text is recognized on the server side. The method can be determined according to actual requirements, and the embodiment of the present application is not limited thereto.
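Where the sign language structured model sits on the terminal device and the feature recognition model on the server, the client-side flow could look roughly like the sketch below; the endpoint URL, JSON payload format and response field are hypothetical.

```python
import json
import urllib.request
import torch

def recognize_on_device_and_server(video, visual_model, server_url):
    """Extract sign language structural features on the terminal device, then
    upload them so the server-side feature recognition model returns the text."""
    with torch.no_grad():
        features = visual_model(video)                  # on-device feature extraction
    payload = json.dumps({"features": features.squeeze(0).tolist()}).encode("utf-8")
    request = urllib.request.Request(
        server_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:   # server-side recognition
        return json.loads(response.read())["sign_language_text"]
```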
According to the method, the learning capability of the sign language recognition network is enhanced explicitly by adopting a structural element extraction, structural modeling and learning method aiming at the visual image, and the final recognition precision is improved. And the detailed structural elements can provide customized technical services, such as automatic sentence break, specific action category analysis and the like, so that the accuracy is improved.
Referring to FIG. 4, a flowchart illustrating steps of an embodiment of a sign language translation method of the present application is shown.
Step 402, sign language video data is acquired.
The terminal device can collect sign language video data through an image acquisition component such as a camera, and the server side can receive the sign language video data collected by the terminal device. The sign language video data includes at least a face image and a sign language image, both of which are used for sign language recognition. The sign language video data can be recognized sentence by sentence, with the semantically translated sentence as the unit of recognition.
The server side can provide a sign language translation page, and the sign language translation page is used for performing sign language translation. Thus, in some embodiments, sign language video data may be displayed in the sign language translation page; for example, while the camera collects sign language video data, the collected sign language video data is displayed in the sign language translation page. In the embodiment of the application, prompt information can also be displayed in the sign language translation page, for example prompt information for the shooting position, to remind the sign language user to shoot the sign language video in a specified area and avoid inaccurate translation caused by incomplete shooting. The prompt information for the shooting position includes at least one of text prompt information, line prompt information and the like.
To recognize the sign language of the sign language user more accurately, a sign language recognition area can be arranged on the sign language translation page. The sign language recognition area keeps the sign language of the user within the acquisition area of the image acquisition component, thereby reducing the recognition failure rate. Correspondingly, prompt information for the sign language recognition area can be set to indicate the input position area. The prompt information for the sign language recognition area can take various forms, such as text prompt information that prompts the sign language user to adopt a posture and stay in the middle of the acquisition area, or line prompt information, for example a human-shaped outline that indicates the area where the user's body should be located so that the sign language can be captured. Several kinds of information can also be combined, for example text prompting the user to keep their body inside a dashed frame.
And 404, performing sign language recognition on the sign language video data according to a sign language recognition model, and determining corresponding sign language translation information according to a sign language recognition result, wherein the sign language recognition model is used for extracting sign language structural features of the sign language video data and performing sign language recognition according to the sign language structural features.
If the terminal equipment side has the sign language recognition model, the sign language video data can be input into the sign language recognition model, sign language recognition is carried out on the sign language video data by adopting the sign language recognition model, and corresponding sign language translation information is determined. If the sign language recognition model is located at the server side, the sign language video data collected by the terminal equipment can be sent to the server side, so that the sign language video data are input into the sign language recognition model at the server side, sign language recognition is carried out on the sign language video data by adopting the sign language recognition model, and corresponding sign language translation information is determined. The sign language recognition model comprises a sign language structural model and a sign language feature recognition model, so that if the sign language structural model is arranged on the side of the terminal device, the sign language feature recognition model is arranged on the server side, so that the sign language structural feature is extracted on the side of the terminal device, the sign language structural feature is uploaded to the server side, and the sign language text is recognized on the server side.
The sign language structured model can recognize sign language structural features based on the sign language structural elements in the sign language video data, where the sign language structural features include sign language time sequence features and sign language spatial features. The sign language structural elements can be extracted from the sign language video data directly by the sign language structured model, or they can be extracted by other models and then input into the sign language structured model for feature recognition. The processing of the sign language video data by the sign language structured model is similar to the training process and is not repeated here. The sign language structural features are then input into the sign language feature recognition model, which recognizes the words and order of the semantics expressed by the sign language structural features; the result is then adjusted according to the word order of the target natural language to obtain a sign language recognition text as the sign language recognition result. The sign language recognition text may be determined as the sign language translation information. In other embodiments, the sign language recognition text may also be used to synthesize speech data based on text-to-speech (TTS) technology to obtain sign language translation voice, which is added to the sign language translation information. Natural language can be understood as a language that evolves naturally with a culture, that is, a spoken language, such as Chinese, English, French or Japanese, or a dialect of a language, such as Cantonese, Minnan or Shanghainese. In the embodiment of the application, sign language translation can also be realized by combining multi-dimensional sign language feature data such as expression data and emotion data.
And step 406, outputting the sign language translation information.
For the service side, the sign language translation information can be output to the terminal equipment so as to be displayed on the terminal equipment side. The sign language recognition text can be displayed in the sign language translation page on the terminal equipment side, and sign language translation voice can also be played through the terminal equipment and can be specifically determined according to requirements.
In this way, sign language video data is obtained, sign language recognition is performed on the sign language video data according to the sign language recognition model, and the corresponding sign language translation information is determined. The sign language recognition model extracts the sign language structural features of the sign language video data and performs sign language recognition according to those features, which explicitly strengthens the learning ability of the sign language recognition network and improves the final recognition accuracy; the sign language translation information is then output, so that sign language translation can be carried out conveniently.
Users who use sign language generally fall into several cases, such as hearing-impaired users, speech-impaired users who cannot speak, or users with both impairments, who may be called deaf-mute users. In the embodiments of the present application, a user who performs sign language is referred to as a sign language user. The sign language translation page can be configured for the specific situation: for example, for a speech-impaired user, only sign language translation may be provided, while for hearing-impaired users, deaf-mute users and the like, translation of natural language into sign language can also be provided in the sign language translation page, that is, the natural language is translated into sign language, an avatar is driven to perform the sign language, and the video data is synthesized, which facilitates bidirectional communication between sign language users and other users. This can be configured according to user requirements, and the embodiment of the application is not limited to this.

For a bidirectional translation scenario, the sign language translation page includes a sign language input area and a sign language output area, as in the example of a sign language translation page shown in fig. 5A. The sign language input area displays the collected sign language video data, in which the user performing the sign language is a real user, and the sign language output area displays the synthesized sign language video data. The sign language video data is played in the sign language input area of the sign language translation page, and the synthesized sign language video data is played in the sign language output area, where the synthesized sign language video data is video data in which an avatar performs sign language, the sign language performed by the avatar being determined according to the input information.

A non-sign-language user can input information by voice or text; semantic analysis is performed on the input information, the information is translated into sign language based on its semantics, and the avatar is driven to perform the sign language, driving the hand actions and/or facial expressions of the avatar to synthesize the corresponding sign language video data. The sign language video data can then be displayed in the sign language output area, so that by driving the avatar to perform sign language, the sign language user can watch it and understand the meaning expressed by the other user. Thus, through the above examples of translating sign language into natural speech and natural speech into sign language, the sign language translation page of the embodiment of the application can provide automatic translation of sign language: for a sign language video, the translated natural language can be output as voice, text and the like, and for sign language translated from natural language, an avatar can be driven to perform it and display it as the corresponding sign language video for the sign language user to view. The synthesized sign language video data in the embodiment of the present application is sign language video data synthesized with an avatar (also referred to as a digital person).
The avatar is a character that simulates a human body through information technology based on parameters such as the form and function of the human body; for example, a character is modeled with 3D technology in combination with parameters of the human form. An avatar obtained through such simulation technology may also be referred to as a digital person or a virtual character. The avatar can be driven to perform actions based on various parameters of the human form, limbs, posture and the like, so as to simulate sign language actions, perform sign language through the avatar, and generate the corresponding video data for sign language interaction.
In the embodiment of the application, barrier-free interaction aiming at sign language can be applied to various scenes. For example, in a scene of face-to-face communication with sign language users, such as various communication processes of sign language users for registration, payment, medicine taking, inquiry and the like in a medical scene; the method is also applied to face-to-face shopping exchange in shopping scenes such as shopping malls, supermarkets and markets of sign language users; as well as for providing legal service scenarios for sign language users. The barrier-free communication can also be applied to the communication process of sign language users and remote users, and the sign language users can conveniently communicate remotely. For example, in a shopping scenario, a merchant may provide sign language translation services through a device, and when a sign language user enters a shopping environment, such as a store or other merchant, a translation control in a lead page may be triggered to enter a sign language translation page. In another example, in a medical registration scenario, a hospital may provide the device in a registration window, and a sign language user may trigger a translation instruction into a sign language translation page by himself.
In some scenarios, the sign language used by the sign language users may also be different, for example, there is a difference between sign languages in different countries, or there is a certain difference between a natural sign language and a standard sign language, and so on, so that the barrier-free interaction of the embodiment of the present application may also provide a sign language translation service between sign language users using different sign languages, thereby facilitating communication between sign language users. For the translation of different sign language users, sign language video data can be respectively collected by the front camera and the rear camera of one device, and the sign language video data can be transmitted and processed based on a server after being respectively collected by different devices, so that interaction is realized.
On the basis of the above embodiments, the embodiments of the present application further provide a customer service scenario in which a sign language user interacts with a non-sign language user, as shown in fig. 5B and 5C.
Step 502, providing a sign language customer service page.
A service page may provide the user with a sign language translation entry, through which the sign language customer service page can be entered.
And step 504, acquiring first sign language video data through an image acquisition component, and displaying the first sign language video data in a sign language input area of the sign language customer service page.
The method comprises collecting sign language video data on the sign language user (first device) side and collecting text data on the non-sign-language user (second device) side, synthesizing sign language video data based on the text data, and sending the synthesized sign language video data to the first device so that the sign language user can watch it. Correspondingly, the sign language recognition text obtained by translating the collected sign language video data of the sign language user is fed back to the second device of the customer service. The device provides a sign language translation page, and the sign language translation page comprises a sign language input area and a sign language output area. Taking the first device as the device of the sign language user and the second device as the device of the non-sign-language user as an example, the translation page is a customer service page, such as the customer service page of a shopping application or the service page of a medical consultation page. The first device collects the first sign language video data through the image acquisition component, displays the collected first sign language video data in the sign language input area, and uploads the collected first sign language video data to the server.
Step 506, determining sign language translation information corresponding to the first sign language video data to output the sign language translation information in the customer service page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and executing sign language recognition processing.
The server side performs sign language recognition on the collected sign language video data to obtain the corresponding sign language translation information, such as a sign language recognition text, and the sign language recognition text can be sent to the second device so that the customer service can view the text message on the service page. Sign language recognition is performed on the sign language video data according to the sign language recognition model: for example, feature extraction is performed on the sign language video data through the sign language visual structured model to determine the sign language structural features, and the sign language structural features are then recognized through the sign language feature recognition model to obtain the sign language recognition text.
Step 508, receiving second sign language video data synthesized according to the service reply information of the customer service, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the service reply information.
And 510, displaying the second sign language video data in a sign language output area of the sign language customer service page.
The second device receives the service reply information, such as the text data of the service reply, and uploads the text data to the server. The server performs semantic recognition on the text data and synthesizes the second sign language video data: sign language parameters are determined according to the text data, and second sign language video data containing the avatar is generated according to the sign language parameters. The server side then sends the second sign language video data to the first device, so that the sign language user can watch the corresponding sign language service and be provided with the required service.
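A rough sketch of the server-side flow just described: parse the reply text, map its semantics to sign language parameters, then drive the avatar to render sign video frames. The collaborators (`semantic_parser`, `sign_lexicon`, `avatar_renderer`) are hypothetical stand-ins, not components disclosed by the application.

```python
from typing import List

def synthesize_sign_video(reply_text: str, semantic_parser, sign_lexicon,
                          avatar_renderer) -> List:
    """Synthesize avatar sign language video data from a customer-service reply."""
    semantics = semantic_parser(reply_text)               # semantic recognition
    sign_parameters = [sign_lexicon[token] for token in semantics
                       if token in sign_lexicon]          # hand/expression parameters
    frames = []
    for params in sign_parameters:
        frames.extend(avatar_renderer(params))            # drive the avatar per sign
    return frames                                         # to be encoded as video data
```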
In an embodiment of the application, the sign language translation page may provide a language selection control, and the language selection control is used for selecting a target language. The target language may include various sign languages and various natural languages. Different sign languages of different countries have certain differences, so that sign language selection controls can be provided, and the sign language selection controls are used for selecting different kinds of sign languages, such as Chinese sign language, English sign language and the like, wherein the different kinds of sign languages can be understood as sign languages of different countries, and can also comprise standard sign languages and natural sign languages, and the natural sign languages refer to sign languages formed naturally. The language selection control may also include a natural language selection control for selecting a translated natural language, such as chinese, english, french, and dialects, for example, thereby facilitating use by various users. Displaying language selectable items in response to triggering of a language selection control in the sign language translation page; in response to a trigger for a language selectable item, a selected target language is determined.
In the embodiment of the application, the required input and output modes can be adjusted as needed. For example, an input adjustment control and an output adjustment control can be arranged on the page, and different input and output modes can be switched by operating the corresponding control. In addition, switching of the input and output modes can be triggered through gestures: the input mode can be adjusted according to a first gesture operation, where the input mode includes a voice input mode, a text input mode and/or a video input mode; and the output mode can be adjusted according to a second gesture operation, where the output mode includes a voice output mode, a text output mode and/or a video output mode. The gesture of this embodiment may be a default gesture or a custom gesture, and a sign indicating switching may also be used as the first gesture operation or the second gesture operation, so that after the gesture operation is detected, the input or output mode is adjusted based on it, for example switching from sign language input to voice input, or from voice output to text output, as determined by the requirements. In response to an output adjustment instruction, the output mode of the sign language translation information is adjusted, where the output mode includes a voice output mode, a text output mode and/or a video output mode. The output adjustment instruction can be generated based on the second gesture operation, or based on triggering of the output mode adjustment control provided by the page.
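A minimal sketch of gesture-driven output mode switching; the gesture names in the mapping are illustrative assumptions, and custom gestures could be registered in the same way.

```python
from enum import Enum

class OutputMode(Enum):
    VOICE = "voice"
    TEXT = "text"
    VIDEO = "video"

# Mapping of recognized switching gestures to output modes (names are illustrative).
GESTURE_TO_OUTPUT_MODE = {
    "switch_to_voice": OutputMode.VOICE,
    "switch_to_text": OutputMode.TEXT,
    "switch_to_video": OutputMode.VIDEO,
}

def handle_output_gesture(gesture_name: str, current_mode: OutputMode) -> OutputMode:
    """Adjust the output mode when a second gesture operation is detected;
    unrecognized gestures leave the current mode unchanged."""
    return GESTURE_TO_OUTPUT_MODE.get(gesture_name, current_mode)
```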
The embodiment of the application can be applied to various service scenes, so that the sign language translation page can also provide various service information, and the information types of the service information comprise: at least one of service text data, service voice data, and service sign language video data; the content type of the service information includes at least one of: prompt information and scene commonly used phrases. That is, the service information may be output in the form of sign language, voice, text, etc., and the content corresponding to the service information may be various kinds of prompt information, commonly used words of scenes, etc.
The service information includes prompt information, and the prompt information may be prompt information for various events, such as waiting prompts, failure prompts, and operation prompts. For example, a waiting prompt may prompt the sign language user, in the form of sign language video, text and the like, to wait for the translation or to input data, or prompt the other user, in the form of voice, text and the like, to wait for the translation or to input data. For a failure prompt, the corresponding user can be informed through voice, text, sign language video and other forms that a failure has occurred, such as a network problem, inability to translate, or a failed translation. An operation prompt can prompt the corresponding user, through voice, text, sign language video and other forms, to perform operations such as starting translation, ending translation, or switching languages. Input prompts can also be given: for example, if the sign language user moves out of the sign language recognition area, a prompt can be shown, and another user can likewise be prompted if their voice is too quiet.
The scene common phrases may be related to the scene being translated. For example, in a shopping scene, they may be phrases commonly used in shopping, such as welcome phrases, price replies, commodity introductions and shopping inquiries; in a medical scene, they may be common phrases about symptoms, insurance and the like; and in a legal service scene, they may be queries about the basic information of users, and so on. In short, the scene common phrases can be predetermined based on the actual application scene, and the corresponding text, voice, sign language video and other data obtained.
The service information is information used within the service scene, such as frequently used information and necessary prompt information. Therefore, the service information can be stored locally on the device in advance, and each piece of service information can correspond to a service condition, such as a prompting condition or a scene condition, determined in combination with the specific usage scene; when the service condition is detected to be met, the corresponding service information is output.
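The condition-to-information mapping could be sketched as follows; the condition names, texts and returned fields are illustrative assumptions, not the patent's data model.

```python
# Sketch: locally stored service information keyed by a service condition.
SERVICE_INFO = {
    "waiting":       {"type": "prompt",        "text": "Please wait, translating..."},
    "network_error": {"type": "prompt",        "text": "Network problem, translation failed."},
    "out_of_frame":  {"type": "prompt",        "text": "Please stay inside the recognition area."},
    "shop_welcome":  {"type": "common_phrase", "text": "Welcome! How can I help you?"},
}

def on_service_condition(condition: str, output_mode: str = "text"):
    """Return the locally stored service information when a condition is detected to be met."""
    info = SERVICE_INFO.get(condition)
    if info is None:
        return None
    # In practice the text could also be rendered as speech or as an avatar sign language clip,
    # depending on the configured output mode.
    return {"mode": output_mode, "content": info["text"], "kind": info["type"]}
```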
On the basis of the above embodiments, the embodiment of the application may further determine scene information and determine scene parameters based on the scene information, so as to assist sign language translation through the scene parameters. The required service information, such as scene common phrases, can also be determined based on the scene information and scene parameters. For example, scene parameters such as the name, tags and attributes of the scene may be determined from the scene information, and sign language translation may be assisted by these parameters, for example by invoking the corresponding sign language database. The scene information may be determined in at least one of the following ways. The background of the collected sign language video data can be analyzed to determine the corresponding scene information; for example, visual processing can determine whether the background is outdoor or indoor, a shopping mall or a tourist attraction. Environment sound data can be acquired through the audio input component, and the corresponding scene information determined from it; the ambient sound in the collected voice or video data can be analyzed to determine the environment the user is currently in. The collected voice data can be analyzed to determine the corresponding scene information; this analysis may include content analysis, ambient sound analysis and the like. Position information can be acquired and the scene information determined from it; for example, the position information obtained from the terminal device can indicate that the current position is a school, a hospital, a shopping mall and so on. The target page visited before the translation page can be determined and the scene information derived from it; since the translation page can be entered from other pages, the page visited before entering the translation page can be taken as the target page, for example a payment page, a shopping page, or the customer service page of a shopping application. The running applications can be determined and the scene information derived from their type, function and so on, such as shopping applications, social applications or instant messaging applications; the running applications include the application in which the sign language translation page is located as well as other applications running in the background or foreground, as determined by the requirements. Time information can also be acquired and scene information, such as day, night, working day or holiday, determined from it according to the requirements.
In the embodiment of the application, the scene parameters can be obtained by integrating the scene information determined across the above dimensions, so that processes such as sign language translation and sign language synthesis can be assisted by the scene parameters.
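A rough sketch of fusing these signals into scene parameters is shown below; every signal source, the voting rule and the returned fields are assumptions made only to illustrate combining the dimensions listed above (video background, ambient sound, location, previous page, running application, time).

```python
# Sketch: combine scene signals into scene parameters used to assist translation.
def derive_scene_parameters(background_label=None, ambient_label=None,
                            location_label=None, previous_page=None,
                            running_app=None, local_time=None) -> dict:
    votes = [v for v in (background_label, ambient_label, location_label,
                         previous_page, running_app) if v]
    # Pick the most frequently suggested scene as the scene name.
    scene_name = max(set(votes), key=votes.count) if votes else "general"
    return {
        "name": scene_name,                           # e.g. "hospital", "shopping_mall"
        "tags": sorted(set(votes)),
        "time_of_day": local_time,                    # e.g. "daytime", "holiday"
        "sign_language_db": f"{scene_name}_phrases",  # scene common-phrase database to load
    }
```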
In the embodiment of the application, the sign language translation page further includes an exit control; an exit instruction is received when the exit control in the sign language translation page is triggered, and the sign language translation page is closed according to the exit instruction. For example, after shopping is finished or a hospital registration is completed, the exit control can be triggered to close the sign language translation page and return to the guidance page. Sign language services can thus be provided to users in various scenes, assisting interaction with sign language users.
In the embodiment of the application, each area is further provided with an indication element, and the indication element is used for indicating the input and output state of the current area. This can be implemented in various forms. For example, the indication element may be an interface icon, with the input state and output state indicated by different colors, such as red for the input state, green for the output state, and gray for the idle state with no input or output. The indication element may also be a dynamic element, which indicates different input and output states through dynamic effects. One example of such a dynamic element is an indicator light, which can indicate different input and output states through different apertures: during input or output, the aperture is dynamically enlarged or reduced to show that input or output is currently in progress, and this can be combined with different colors, text and the like. Indication elements can be arranged in the sign language input area and the sign language output area respectively, so as to indicate the input and output state of that area as well as of the other area. Alternatively, a single indication element can be displayed in the translation page, prompting the user who is currently inputting or outputting through different colors, dynamic effects, text and the like. Accordingly, an indication element for indicating the input and output state may be displayed in the translation page; the indication element includes at least one of: a text indication element, a dynamic indication element, and a color indication element. In the example of fig. 6A, the sub-figures show the dynamic effect of an indication element in a breathing-light pattern: when there is input or output, the indication element gradually enlarges and reduces its aperture to indicate that input or output is being performed. When the local user is inputting, "A" is displayed and the color changes from dark to light; when the counterpart is inputting or outputting, "B" is displayed and the color changes from light to dark. In another example, as shown in fig. 6B, an indication element in a breathing-light pattern is provided, which is gray in the idle state and lights up in a breathing-light pattern when there is input or output. In a bidirectional translation scene, the user who is inputting or outputting can be indicated by displaying text on the indication element, such as "A" for user A, "B" for user B, and "C" for the avatar, so that the user performing input or output can be indicated intuitively. For example, when it is detected that user A performs input or output, "A" may be displayed on the indication element, with a dynamic change or color change indicating that user A is inputting or outputting. As another example, when it is detected that the counterpart performs input or output, "B" or "C" may be displayed on the indication element, with a dynamic change or color change indicating that counterpart user B is inputting or that avatar C is outputting.
As another example, when the avatar outputs sign language, the indication element on the second interface may display information such as a short name, a nickname, a code number, and the like of the avatar, such as "nine", and indicate that the avatar is outputting sign language through dynamic change or color change.
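A tiny state model for such an indication element is sketched below; the colors, labels and animation flag are assumptions used only to show how idle, input and output states might be represented.

```python
# Sketch: state of a breathing-light style indication element.
from dataclasses import dataclass

@dataclass
class IndicatorState:
    label: str = ""          # "A" local user, "B" counterpart, "C"/nickname for the avatar
    color: str = "gray"      # gray = idle state
    animating: bool = False  # breathing-light aperture animation on/off

def update_indicator(event: str, who: str) -> IndicatorState:
    """Map an input/output event to the indicator state shown in the translation page."""
    if event == "input":
        return IndicatorState(label=who, color="red", animating=True)
    if event == "output":
        return IndicatorState(label=who, color="green", animating=True)
    return IndicatorState()  # no input or output: idle
```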
In the embodiment of the application, the sign language translation page further includes an indication tag, and the indication tag can be used for indicating the input state, the conversation duration, service information and the like. In one example, the indication tag may be located at the boundary of the sign language input area and the sign language output area and may be used to indicate various desired information. For example, service information can be displayed on the indication tag, indicating various prompt information, scene common phrases, recommendation information corresponding to the scene, and the like. Other information may also be presented in conjunction with the indication element, such as the input state and the duration of the current translation. The indication tag can display different information through different colors, icons, text and the like, and when switching between different pieces of information, a corresponding transition pattern, such as a flip, zoom or shutter transition, can be used to signal the change of information. An indication tag is displayed in the sign language translation page, and switching between different indication tags is performed with the set transition pattern.
On the basis of the above embodiments, the present application further provides a sign language translation method, which is applied to a terminal device side and can perform sign language translation based on a sign language translation page.
Referring to FIG. 7, a flowchart illustrating steps of another sign language translation method embodiment of the present application is shown.
Step 702, providing a sign language translation page.
A translation guide page may be provided, which may serve as the home page of the sign language translation service to guide the user into the translation page. The translation guide page thus provides a translation control, and a translation instruction may be received based on a trigger of the translation control in the translation guide page. In other scenarios, the translation function may be provided by a dedicated application, for example through an icon of the application or a function button on an application page serving as a translation entry, so that the translation instruction may be generated by triggering the translation entry. For example, a translation guide page or translation entry may be provided in various types of applications, such as messaging applications, payment applications, social applications and service applications, to facilitate the use of sign language in various scenarios.
Step 704, collecting sign language video data through an image collection component, and displaying the sign language video data in the sign language translation page. In this way, the sign language user can view his or her own signing in the sign language translation page and confirm whether it is fully captured.
Step 706, acquiring sign language translation information corresponding to the sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the sign language video data through a sign language recognition model and executing sign language recognition processing.
If a sign language recognition model is available on the terminal device side, the sign language video data can be input into the sign language recognition model, sign language recognition performed on the sign language video data with the model, and the corresponding sign language translation information determined. Alternatively, the sign language structural model may be arranged on the terminal device side and the sign language feature recognition model on the server side, so that the sign language structural features are extracted on the terminal device side, uploaded to the server, and the sign language text recognized on the server side. The sign language structural model can identify sign language structural features based on sign language structural elements in the sign language video data, where the sign language structural features include sign language temporal features and sign language spatial features. The sign language structural elements can be extracted directly from the sign language video data by the sign language structural model, or extracted by other models and then input into the sign language structural model for feature recognition. The processing of the sign language video data by the sign language structural model is similar to the training process and is therefore not repeated here. The sign language structural features are then input into the sign language feature recognition model, which recognizes the words and the sequence of the semantics expressed by the sign language structural features; the result is then adjusted according to the word order of the target natural language to obtain the sign language recognition text. The sign language recognition text may be determined as the sign language translation information. In other embodiments, the sign language recognition text may also be used to synthesize speech data to obtain sign language translation speech, which is added to the sign language translation information.
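The split deployment described above could look roughly like the following sketch, in which structural features are extracted on the device and only the compact features are uploaded; the model interface, endpoint URL and payload format are assumptions, not a real API.

```python
# Sketch: on-device structural feature extraction with server-side recognition.
import json
import urllib.request

def translate_on_device(frames, structural_model, server_url="https://example.com/sign/recognize"):
    # 1. Extract explicit sign language structural features (temporal + spatial) locally.
    features = structural_model.extract(frames)  # assumed interface of the structural model

    # 2. Upload the features instead of raw video and let the server-side
    #    feature recognition model produce the sign language recognition text.
    payload = json.dumps({"features": features}).encode("utf-8")
    request = urllib.request.Request(server_url, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read())
    return result.get("recognized_text", "")
```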
Step 708, outputting the sign language translation information through the sign language translation page. The sign language recognition text can be displayed in the sign language translation page on the terminal equipment side, and sign language translation voice can also be played through the terminal equipment and can be specifically determined according to requirements.
On the basis of the above embodiment, the embodiment of the application further provides a sign language translation method, which is applied to a server and can perform sign language translation based on the sign language translation page.
Referring to FIG. 8, a flow diagram of the steps of another sign language translation method embodiment of the present application is shown.
At step 802, sign language video data is received. The server can receive the sign language video data collected by the terminal device. The sign language video data includes at least a face image and a sign language image, where the face image and the sign language image are used for sign language recognition. The sign language video data can be recognized sentence by sentence, using semantically translatable sentences as the unit of recognition.
And 804, extracting the features of the sign language video data by adopting a sign language visual structural model, and determining the sign language structural features.
And 806, identifying the sign language structural features through the sign language feature identification model to obtain a sign language identification text.
If the sign language recognition model is located on the server side, the sign language video data collected by the terminal device can be sent to the server, input into the sign language recognition model on the server side, and recognized there, and the corresponding sign language translation information determined. The sign language structural model can identify sign language structural features based on sign language structural elements in the sign language video data, where the sign language structural features include sign language temporal features and sign language spatial features. The sign language structural elements can be extracted directly from the sign language video data by the sign language structural model, or extracted by other models and then input into the sign language structural model for feature recognition. The processing of the sign language video data by the sign language structural model is similar to the training process and is therefore not repeated here. The sign language structural features are then input into the sign language feature recognition model, which recognizes the words and the sequence of the semantics expressed by the sign language structural features; the result is then adjusted according to the word order of the target natural language to obtain the sign language recognition text. The sign language recognition text may be determined as the sign language translation information. In other embodiments, the sign language recognition text may also be used to synthesize speech data to obtain sign language translation speech, which is added to the sign language translation information.
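A minimal sketch of this two-stage server-side pipeline is shown below; all class and method names are placeholders standing in for whatever structural, recognition and reordering models are actually deployed.

```python
# Sketch: structural model -> feature recognition model -> natural-language word order.
def recognize_sign_language(video_frames, structural_model, feature_recognizer, reorderer):
    # Stage 1: extract temporal and spatial sign language structural features.
    structural_features = structural_model.extract(video_frames)

    # Stage 2: recognize the words and their sign-order sequence from the features.
    sign_order_words = feature_recognizer.decode(structural_features)

    # Stage 3: adjust to the word order of the target natural language.
    recognition_text = reorderer.to_natural_language(sign_order_words)
    return recognition_text
```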
And step 808, feeding back the sign language translation information. For the service side, the sign language translation information can be sent to the terminal equipment so as to be displayed on the terminal equipment side.
On the basis of the above embodiments, the embodiments of the present application also provide an example of bidirectional translation.
Referring to FIG. 9, a flowchart illustrating the steps of an embodiment of a bidirectional sign language translation method of the present application is shown.
Step 900, providing a sign language translation page, where the sign language translation page includes: a sign language input area (or first area) and a sign language output area (or second area).
Step 910, collecting first sign language video data through an image collection component. The first sign language video data of the sign language user can be collected through an image collection component such as a local camera, for example the front camera of a mobile phone.
Step 912, displaying the collected first sign language video data in the sign language input area.
Step 914, obtaining sign language translation information corresponding to the first sign language video data. Sign language recognition is performed on the sign language video data according to a sign language recognition model, and the corresponding sign language translation information is determined, where the sign language recognition model is used for extracting sign language structural features of the sign language video data and performing sign language recognition according to the sign language structural features. The sign language translation information includes sign language recognition text and/or sign language translation speech.
If a sign language recognition model is available on the terminal device side, the sign language video data can be input into the sign language recognition model, sign language recognition performed on the sign language video data with the model, and the corresponding sign language translation information determined. If the sign language recognition model is located on the server side, the sign language video data collected by the terminal device can be sent to the server, input into the sign language recognition model on the server side, and recognized there, and the corresponding sign language translation information determined. The sign language recognition model includes a sign language structural model and a sign language feature recognition model, so the sign language structural model may also be arranged on the terminal device side and the sign language feature recognition model on the server side; the sign language structural features are then extracted on the terminal device side, uploaded to the server, and the sign language text recognized on the server side. The sign language structural model can identify sign language structural features based on sign language structural elements in the sign language video data, where the sign language structural features include sign language temporal features and sign language spatial features. The sign language structural elements can be extracted directly from the sign language video data by the sign language structural model, or extracted by other models and then input into the sign language structural model for feature recognition. The processing of the sign language video data by the sign language structural model is similar to the training process and is therefore not repeated here. The sign language structural features are then input into the sign language feature recognition model, which recognizes the words and the sequence of the semantics expressed by the sign language structural features; the result is then adjusted according to the word order of the target natural language to obtain the sign language recognition text. The sign language recognition text may be determined as the sign language translation information. In other embodiments, the sign language recognition text may also be used to synthesize speech data to obtain sign language translation speech, which is added to the sign language translation information.
Step 916, outputting the sign language translation information through the sign language translation page. In this way, collection, recognition and translation of sign language data can be realized and the meaning of the sign language output, so that other users can understand what the sign language user means. The sign language recognition text can be displayed in the sign language translation page on the terminal device side, and the sign language translation speech can also be played by the terminal device, as determined by the requirements.
Step 920, collecting voice data through an audio input component. A non-sign-language user can provide input through voice, for example by saying "did you bring your medical insurance card" in a medical scene, and the device collects the voice data through an audio input component such as a microphone. Second sign language video data synthesized from the collected voice data can then be obtained, where the second sign language video data is video data of the avatar performing sign language according to the semantics of the voice data; this specifically includes steps 922 and 924. In other examples, the input may be provided through text; this example uses voice input, and if the input is text, step 924 may be performed directly.
And step 922, performing voice recognition on the collected voice data, and determining corresponding text data.
Step 924, determining sign language parameters according to the text data, and generating second sign language video data containing the virtual image according to the sign language parameters.
Emotion information is recognized from the collected voice data, and expression parameters are determined according to the emotion information; generating the second sign language video data containing the avatar according to the sign language parameters then includes: generating second sign language video data containing the avatar according to the sign language parameters and the expression parameters. Voice recognition is performed on the collected voice data to obtain the corresponding text data. Emotion recognition can also be performed on the collected voice data; for example, emotion information can be recognized based on information such as volume, speech rate and vocabulary, and the corresponding expression parameters determined from the emotion information. If the emotion information is angry, happy, excited and the like, the expression parameters of the corresponding emotion can be determined accordingly. The avatar is then driven to perform the sign language based on the sign language parameters and the expression parameters; the avatar may be generated by 3D modeling, and is driven by the sign language parameters and expression parameters to perform the sign language actions and the corresponding expressions, mouth shapes and so on, generating the second sign language video data.
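The synthesis path could be sketched as follows, assuming placeholder components for speech recognition, sign language parameter generation, emotion classification and avatar rendering; it only illustrates the flow just described, not any concrete model.

```python
# Sketch: speech -> text -> sign language parameters + emotion -> avatar sign language video.
def synthesize_avatar_sign_video(voice_data, asr, sign_param_model, emotion_model, avatar):
    text = asr.transcribe(voice_data)                # speech recognition of the collected voice
    sign_params = sign_param_model.from_text(text)   # sign language action parameters

    emotion = emotion_model.classify(voice_data)          # e.g. "angry", "happy", "excited"
    expression_params = emotion_model.expression_for(emotion)  # facial expression / mouth shape

    # Drive the 3D-modeled avatar with both parameter sets and render the clip.
    return avatar.render(sign_params=sign_params, expression_params=expression_params)
```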
Step 926, displaying the second sign language video data in the sign language output area of the sign language translation page. In this way, the sign language user can watch the second sign language video data of the avatar displayed in the sign language output area, and thus understand what other users, such as non-sign-language users, have said, enabling communication. In addition, the text data corresponding to the input can also be displayed in the sign language translation page, for example in the sign language output area.
In the embodiment of the present application, the sign language user may also be referred to as a first user, and the non-sign language user may also be referred to as a second user.
The following provides an embodiment of barrier-free sign language communication based on the interaction between devices and a server. A video communication page with a sign language translation function is provided, on which remote users can communicate without barriers; the two users can be a sign language user and a non-sign-language user.
Referring to fig. 10, an interaction diagram of another barrier-free communication method embodiment of the present application is shown. As shown in fig. 10, both sign language users and non-sign language users interact through video, where sign language video data is collected on the sign language user (first device) side and voice data is collected on the non-sign language user (second device) side. The following steps can be specifically executed:
Step 1000, a device provides a video communication page, where the video communication page includes a home-terminal display area and an opposite-terminal display area. As an example, the home-terminal display area is taken as the sign language input area and the opposite-terminal display area as the sign language output area, the first device is the device of the sign language user, and the second device is the device of the non-sign-language user. For example, the sign language translation page is a video communication page of an Instant Messaging (IM) application.
Step 1002, a first device acquires first video data through an image acquisition component. The first video data comprises first sign language video data.
Step 1004, the first device displays the first video data in a local display area of the video call page.
Step 1006, the first device uploads the collected first sign language video data to the server.
And 1008, the service side performs sign language recognition on the sign language video data according to a sign language recognition model, and determines corresponding sign language translation information, wherein the sign language recognition model is used for extracting sign language structural features of the sign language video data and performing sign language recognition according to the sign language structural features. The sign language translation information comprises synthesized sign language translation voice and sign language identification text. And performing feature extraction on the sign language video data by adopting a sign language visual structural model to determine the sign language structural features. And identifying the sign language structural features through a sign language feature identification model to obtain a sign language identification text.
Step 1010, the server sends the collected first sign language video data and the sign language translation information. The server can send at least one of the sign language translation speech and the sign language recognition text synthesized in the sign language translation information to the first device; whether to return the sign language translation information can be determined based on various factors, such as the sign language user's settings and the network conditions. For the second device, the server can return at least one of the synthesized sign language translation speech and the sign language recognition text, so that the user of the second device can understand what the sign language user has expressed. The collected sign language video data can, of course, also be fed back to the second device based on settings, network conditions and the like.
If applied to a communication scene in which sign language is translated unidirectionally into natural language, the server feeds back the sign language video data and the sign language translation information to the second device, so that the sign language video data can be displayed on the second device together with the corresponding sign language translation information, allowing interaction between the sign language user and the non-sign-language user. For example, the sign language user may be able to understand what the non-sign-language user says but is unable to speak, and therefore needs to communicate in sign language.
If the communication scene requires bidirectional translation between sign language and natural language, the natural language of the non-sign-language user is translated into sign language, and the following steps can be performed:
at step 1012, the audio input component of the second device collects voice data.
And 1014, the second device uploads the acquired voice data to the server.
If the second device collects video data, the video data can be directly transmitted to the server, and the server can separate voice data from the video data for translation.
And step 1016, the server generates synthesized sign language video data according to the collected voice data.
The server can perform voice recognition on the voice data and determine corresponding text data. And determining sign language parameters according to the text data, recognizing emotion information according to the collected voice data, and determining expression parameters according to the emotion information. And generating synthesized sign language video data containing virtual images according to the sign language parameters and the expression parameters.
In step 1018, the server sends the synthesized sign language video data to the first device.
The server sends the synthesized sign language video data to the first device; the text data and the collected voice data may also be sent to the first device. For the second device, whether to feed back the synthesized sign language video data, the text data and the collected voice data may be determined based on settings, network conditions and the like.
Step 1020, the first device displays the synthesized sign language video data in the sign language output area.
So that the sign language user can perform barrier-free communication with the non-sign language user through the sign language translation page.
In the embodiment of the application, sign language video data is translated, and during translation the sign language recognition result can be fed back to the sign language user, so that the sign language user can confirm whether the recognition is accurate; if not, the text can be adjusted through a corresponding adjustment control, and candidate suggestions can be given during the adjustment. In addition, in the process of translating natural language into sign language, after the sign language video data of the avatar is shown to the sign language user, a prompt can indicate that the output of the sign language video data is complete and ask whether the sign language user understood the avatar's preceding sign language. If not, a translation adjustment control can be provided together with candidate texts, so that the avatar's sign language video data can be regenerated based on the candidate texts, improving the accuracy of the translation.
On the basis of the above embodiments, the present application embodiment further provides a sign language teaching method, as shown in fig. 11.
Step 1102, provide a sign language teaching page.
And 1104, displaying target teaching information on the sign language teaching page.
Step 1106, acquiring first sign language video data through an image acquisition component, and displaying the first sign language video data in a sign language input area of the sign language teaching page, wherein the first sign language video data is video data of sign language users executing sign language according to the target teaching information.
The sign language teaching page includes a sign language input area and a sign language output area, the latter for displaying the standard sign language of the avatar for teaching comparison. Target teaching information can therefore be displayed on the sign language teaching page; the target teaching information can be text data, and in some examples voice data can also be used. The target teaching information is the information for which the user needs to input sign language. The user performs sign language based on the target teaching information, and the device collects the user's first sign language video data through the image collection component.
Step 1108, uploading the first sign language video data.
Step 1110, receiving sign language translation information corresponding to the first sign language video data and synthesized second sign language video data, where the sign language translation information is obtained by extracting sign language structural features from the first sign language video data through a sign language recognition model and executing sign language recognition processing, and the second sign language video data is sign language teaching video data of the avatar performing the target teaching information.
And 1112, displaying the second sign language video data in a sign language output area of the sign language teaching page so that sign language users can learn sign languages.
The first sign language video data is uploaded to the server, and the server performs feature extraction on the sign language video data through the sign language visual structural model to determine the sign language structural features; the sign language structural features are recognized through the sign language feature recognition model to obtain the sign language recognition text. Based on the sign language recognition text, it is determined whether the recognition is consistent with the target teaching information, and thus whether the user's sign language is correct. If there is a problem with the user's sign language, such as an error or a non-standard sign, the second sign language video data of the avatar can be compared with the first sign language video data to determine the sign language information to be corrected. A correction marker can then be added to the second or first sign language video data based on the sign language information to be corrected, so that the first sign language video data can be displayed on the device against the standard second sign language video data, and the user can determine the sign language actions that need correction from the correction markers in the sign language video data.
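An illustrative sketch of this teaching check is given below; the use of a word-level alignment, the assumption that video segments align with recognized words, and the correction-marker structure are all assumptions made for the example, not the patent's method.

```python
# Sketch: compare the learner's recognized signing with the target teaching text
# and mark the segments that need correction.
import difflib

def check_teaching_result(recognized_text: str, target_text: str,
                          user_video_segments, standard_video_segments):
    """Return whether the learner's signing matches the target, plus correction markers."""
    if recognized_text.strip() == target_text.strip():
        return {"correct": True, "corrections": []}

    # Align the recognized words with the target words to locate wrong or missing signs.
    matcher = difflib.SequenceMatcher(None, recognized_text.split(), target_text.split())
    corrections = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            corrections.append({
                "learner_segment": user_video_segments[i1:i2],       # clips to highlight
                "standard_segment": standard_video_segments[j1:j2],  # avatar reference clips
                "expected": " ".join(target_text.split()[j1:j2]),
            })
    return {"correct": False, "corrections": corrections}
```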
At present, there are some implicit feature recognition methods that use a large-scale sign language data set to train a sign language recognition neural network. Such methods use the neural network to implicitly learn the feature representation of sign language actions; this is a black-box learning process, and the precision cannot be improved in a targeted way for detailed information, specific actions or specific categories. Compared with the low recognition precision of directly learning sign language actions through the neural network, the embodiment of the application introduces a visual structuring method that explicitly and purposefully customizes and tunes the neural network learning by extracting fine-grained structured information, thereby realizing a sign language recognition technology with high precision and strong generalization.
According to the method, the learning capability of the sign language recognition network is explicitly enhanced by extracting structural elements from the visual images and by structural modeling and learning, improving the final recognition precision. The fine-grained structural elements can also support customized technical services, such as automatic sentence segmentation and analysis of specific action categories, further improving accuracy.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
On the basis of the above embodiments, the present embodiment further provides a sign language translation apparatus, which is applied to a terminal device.
And the acquisition module is used for acquiring sign language video data. And the sign language identification module is used for carrying out sign language identification on the sign language video data according to a sign language identification model and determining corresponding sign language translation information according to a sign language identification result, wherein the sign language identification model is used for extracting sign language structural features of the sign language video data and carrying out sign language identification according to the sign language structural features. And the output module is used for outputting the sign language translation information.
The sign language recognition module is used for adopting a sign language recognition text in a sign language recognition result as sign language translation information; and/or performing voice synthesis by adopting the sign language recognition text in the sign language recognition result, and taking the synthesized sign language translation voice as sign language translation information.
The output module is also used for providing a sign language translation page; playing the sign language video data in the sign language translation page; displaying sign language identification text in the sign language translation page, and/or playing the sign language translation audio based on the sign language translation page.
The adjusting module is used for responding to the trigger of a language selection control in the sign language translation page and displaying language selectable items; in response to a trigger for a language selectable item, a selected target language is determined, the target language being a language in which sign language video data is translated.
The adjusting module is configured to adjust an output mode of the sign language translation information in response to an output adjusting instruction, where the output mode includes: a voice output mode, a text output mode and/or a video output mode.
The sign language translation page includes a sign language input area and a sign language output area, and the output module is further used for playing the sign language video data in the sign language input area of the sign language translation page, and playing synthesized sign language video data in the sign language output area of the sign language translation page, where the synthesized sign language video data is video data of an avatar performing sign language, and the sign language performed by the avatar is determined according to the input information.
The sign language identification module is used for extracting the features of the sign language video data through a sign language visual structural model and determining the sign language structural features; and identifying the sign language structural features through a sign language feature identification model to obtain a sign language identification text.
And the auxiliary module is used for determining scene information based on set conditions and determining scene parameters according to the scene information so as to assist sign language translation through the scene parameters.
In an alternative embodiment, a bidirectional sign language translation apparatus is provided: the output module is used for providing a sign language translation page; displaying first sign language video data in a sign language input area of the sign language translation page; acquiring sign language translation information corresponding to the first sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the sign language video data through a sign language recognition model and executing sign language recognition processing; outputting the sign language translation information through the sign language translation page; acquiring second sign language video data synthesized from the collected voice data, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the voice data; and displaying the second sign language video data in a sign language output area of the sign language translation page.
The acquisition module is used for acquiring the first sign language video data through the image acquisition assembly, and for collecting voice data through the audio input assembly.
In an alternative embodiment, a sign language customer service apparatus is provided: the output module is used for providing a sign language customer service page; displaying the first sign language video data in a sign language input area of the sign language customer service page; determining sign language translation information corresponding to the first sign language video data so as to output the sign language translation information in the customer service page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and executing sign language recognition processing; receiving second sign language video data synthesized according to service reply information of the customer service, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the service reply information; and displaying the second sign language video data in a sign language output area of the sign language customer service page.
The acquisition module is used for acquiring the first sign language video data through the image acquisition assembly.
In an alternative embodiment, a sign language communication apparatus is provided: the output module is used for providing a video communication page; displaying the first video data in a home-terminal display area of the video call page, wherein the first video data comprises first sign language video data; displaying sign language translation information of the first sign language video data in the home-terminal display area of the video call page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and executing sign language recognition processing; receiving second sign language video data synthesized according to communication information of the opposite terminal, wherein the second sign language video data is video data of an avatar performing sign language according to the semantics of the communication information, and the communication information comprises at least one of text information, voice information and video information; and displaying the second sign language video data in an opposite-terminal display area of the video call page.
The acquisition module is used for acquiring first video data through the image acquisition assembly.
In an alternative embodiment, a sign language teaching apparatus is provided: the output module is used for providing a sign language teaching page; displaying target teaching information on the sign language teaching page; displaying the first sign language video data in a sign language input area of the sign language teaching page, wherein the first sign language video data is video data of a sign language user performing sign language according to the target teaching information; receiving sign language translation information corresponding to the first sign language video data and synthesized second sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and executing sign language recognition processing, and the second sign language video data is sign language teaching video data of the avatar performing the target teaching information; and displaying the second sign language video data in a sign language output area of the sign language teaching page so that sign language users can learn sign language.
The acquisition module is used for acquiring the first sign language video data through the image acquisition assembly and uploading the first sign language video data.
In summary, sign language video data is obtained, sign language recognition is performed on the sign language video data according to a sign language recognition model, and the corresponding sign language translation information is determined, where the sign language recognition model is used for extracting sign language structural features of the sign language video data and performing sign language recognition according to the sign language structural features. The explicit structural features enhance the learning capability of the sign language recognition network and improve the final recognition precision, and the sign language translation information is output, so that sign language translation can be performed conveniently.
Compared with the low recognition precision of directly and implicitly learning sign language actions through a neural network, the embodiment of the application introduces a visual structuring method that explicitly and purposefully customizes and tunes the neural network learning by extracting fine-grained structured information, thereby realizing a high-precision, highly generalizable sign language recognition technology.
The present application further provides a non-transitory, readable storage medium, where one or more modules (programs) are stored, and when the one or more modules are applied to a device, the device may execute instructions (instructions) of method steps in this application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform the methods as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the present disclosure may be implemented as an apparatus, which may include electronic devices such as a terminal device, a server (cluster), etc. within a data center, using any suitable hardware, firmware, software, or any combination thereof, in a desired configuration. Fig. 12 schematically illustrates an example apparatus 1200 that can be used to implement various embodiments described herein.
For one embodiment, fig. 12 illustrates an example apparatus 1200 having one or more processors 1202, a control module (chipset) 1204 coupled to at least one of the processor(s) 1202, a memory 1206 coupled to the control module 1204, a non-volatile memory (NVM)/storage 1208 coupled to the control module 1204, one or more input/output devices 1210 coupled to the control module 1204, and a network interface 1212 coupled to the control module 1204.
The processor 1202 may include one or more single-core or multi-core processors, and the processor 1202 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1200 can be used as a terminal device, a server (cluster), or the like in the embodiments of the present application.
In some embodiments, the apparatus 1200 may include one or more computer-readable media (e.g., the memory 1206 or NVM/storage 1208) having instructions 1214 and one or more processors 1202 in combination with the one or more computer-readable media and configured to execute the instructions 1214 to implement modules to perform the actions described in this disclosure.
For one embodiment, the control module 1204 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1202 and/or to any suitable device or component in communication with the control module 1204.
The control module 1204 may include a memory controller module to provide an interface to the memory 1206. The memory controller module may be a hardware module, a software module, and/or a firmware module.
Memory 1206 may be used, for example, to load and store data and/or instructions 1214 for apparatus 1200. For one embodiment, memory 1206 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 1206 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 1204 may include one or more input/output controllers to provide an interface to the NVM/storage 1208 and the input/output device(s) 1210.
For example, NVM/storage 1208 may be used to store data and/or instructions 1214. NVM/storage 1208 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more hard disk drive(s) (HDD (s)), one or more Compact Disc (CD) drive(s), and/or one or more Digital Versatile Disc (DVD) drive (s)).
The NVM/storage 1208 may include storage resources physically part of the device on which the apparatus 1200 is installed, or it may be accessible by the device and may not necessarily be part of the device. For example, the NVM/storage 1208 may be accessed over a network via the input/output device(s) 1210.
Input/output device(s) 1210 may provide an interface for apparatus 1200 to communicate with any other suitable device; input/output devices 1210 may include communication components, audio components, sensor components, and the like. The network interface 1212 may provide an interface for the device 1200 to communicate over one or more networks, and the device 1200 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, such as accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, 5G, etc., or a combination thereof.
For one embodiment, at least one of the processor(s) 1202 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the control module 1204. For one embodiment, at least one of the processor(s) 1202 may be packaged together with logic for one or more controllers of the control module 1204 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1202 may be integrated on the same die with logic for one or more controller(s) of the control module 1204. For one embodiment, at least one of the processor(s) 1202 may be integrated on the same die with logic of one or more controllers of the control module 1204 to form a system on a chip (SoC).
In various embodiments, the apparatus 1200 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, the apparatus 1200 may have more or fewer components and/or different architectures. For example, in some embodiments, device 1200 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
The detection device can adopt a main control chip as a processor or a control module, sensor data, position information and the like are stored in a memory or an NVM/storage device, a sensor group can be used as an input/output device, and a communication interface can comprise a network interface.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises that element.
The sign language translation method and apparatus, the terminal device, and the machine-readable medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (14)

1. A sign language translation method, the method comprising:
acquiring sign language video data;
performing sign language recognition on the sign language video data according to a sign language recognition model, and determining corresponding sign language translation information according to a sign language recognition result, wherein the sign language recognition model is used for extracting sign language structural features of the sign language video data and performing sign language recognition according to the sign language structural features;
and outputting the sign language translation information.
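As an informal illustration of the flow recited in claim 1 (not the patented implementation; all class and function names below are hypothetical stubs), the pipeline can be sketched as follows:

```python
# Minimal sketch of the claim 1 flow: acquire sign language video, run a
# recognition model that first extracts structural features and then
# recognizes signs from them, and output the translation information.
# All names and the stub logic are illustrative only.

from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SignRecognitionResult:
    text: str          # recognized sign language text
    confidence: float  # placeholder confidence score


class SignLanguageRecognitionModel:
    """Stand-in for the sign language recognition model."""

    def extract_structural_features(self, frames: Sequence[bytes]) -> List[List[float]]:
        # e.g. per-frame hand/arm keypoints and facial landmarks (stubbed)
        return [[0.0] * 16 for _ in frames]

    def recognize(self, features: List[List[float]]) -> SignRecognitionResult:
        # decode the feature sequence into text (stubbed)
        return SignRecognitionResult(text="hello", confidence=0.9)


def translate_sign_language(frames: Sequence[bytes]) -> str:
    model = SignLanguageRecognitionModel()
    features = model.extract_structural_features(frames)  # sign language structural features
    result = model.recognize(features)                     # sign language recognition
    return result.text                                     # sign language translation information


if __name__ == "__main__":
    print(translate_sign_language([b"frame-0", b"frame-1"]))
```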
2. The method of claim 1, wherein the determining corresponding sign language translation information according to the sign language recognition result comprises:
taking the sign language recognition text in the sign language recognition result as the sign language translation information; and/or
performing voice synthesis using the sign language recognition text in the sign language recognition result, and taking the synthesized sign language translation voice as the sign language translation information.
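Claim 2 allows the translation information to be the recognized text, synthesized speech, or both. A hedged sketch follows; the synthesizer is a placeholder, not a real TTS API:

```python
# Sketch of claim 2: the translation information may be the recognized
# text and/or speech synthesized from that text. synthesize_speech is a
# placeholder; any text-to-speech engine could stand in here.

from typing import Dict, Union


def synthesize_speech(text: str) -> bytes:
    # placeholder TTS: a real engine would return encoded audio
    return text.encode("utf-8")


def build_translation_info(recognized_text: str,
                           want_text: bool = True,
                           want_voice: bool = True) -> Dict[str, Union[str, bytes]]:
    info: Dict[str, Union[str, bytes]] = {}
    if want_text:
        info["text"] = recognized_text                        # translation text
    if want_voice:
        info["voice"] = synthesize_speech(recognized_text)    # translation voice
    return info
```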
3. The method of claim 2, further comprising:
providing a sign language translation page;
playing the sign language video data in the sign language translation page;
the outputting the sign language translation information comprises: displaying the sign language recognition text in the sign language translation page, and/or playing the sign language translation voice based on the sign language translation page.
4. The method of claim 3, further comprising:
displaying language selectable items in response to triggering of a language selection control in the sign language translation page;
in response to triggering of a language selectable item, determining a selected target language, the target language being the language into which the sign language video data is translated.
5. The method of claim 3, further comprising:
adjusting an output mode of the sign language translation information in response to an output adjustment instruction, wherein the output mode includes: a voice output mode, a text output mode, and/or a video output mode.
6. The method of claim 3, wherein the sign language translation page includes a sign language input area and a sign language output area,
the playing the sign language video data in the sign language translation page comprises: playing the sign language video data in the sign language input area of the sign language translation page;
and the method further comprises: playing synthesized sign language video data in the sign language output area of the sign language translation page, wherein the synthesized sign language video data is video data in which an avatar performs sign language, and the sign language performed by the avatar is determined according to input information.
7. The method of claim 1, wherein the performing sign language recognition on the sign language video data according to the sign language recognition model comprises:
performing feature extraction on the sign language video data through a sign language visual structural model to determine the sign language structural features;
and recognizing the sign language structural features through a sign language feature recognition model to obtain a sign language recognition text.
8. The method of claim 1, further comprising:
determining scene information based on set conditions, and determining scene parameters according to the scene information so as to assist sign language translation through the scene parameters.
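One plausible reading of claim 8, sketched with an invented scene-to-vocabulary mapping: scene information selected by set conditions yields scene parameters that bias recognition toward domain vocabulary.

```python
# Illustrative only: scene information determined by set conditions is
# mapped to scene parameters (here, a preferred vocabulary) that can be
# used to assist sign language translation. The mapping is made up.

from typing import Dict, List

SCENE_VOCABULARY: Dict[str, List[str]] = {
    "customer_service": ["order", "refund", "delivery"],
    "teaching": ["lesson", "practice", "correction"],
}


def scene_parameters(scene_information: str) -> Dict[str, List[str]]:
    return {"preferred_vocabulary": SCENE_VOCABULARY.get(scene_information, [])}
```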
9. A sign language translation method, the method comprising:
providing a sign language translation page;
acquiring first sign language video data through an image acquisition assembly, and displaying the first sign language video data in a sign language input area of the sign language translation page;
acquiring sign language translation information corresponding to the first sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing;
outputting the sign language translation information through the sign language translation page;
collecting voice data through an audio input assembly;
acquiring second sign language video data synthesized from the collected voice data, wherein the second sign language video data is video data in which an avatar performs sign language according to the semantics of the voice data;
and displaying the second sign language video data in a sign language output area of the sign language translation page.
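Claim 9 runs both directions on one page: captured sign language video is translated, and captured voice is turned into avatar sign language video. A minimal event-handler sketch with stubbed backend calls; none of these functions come from the disclosure:

```python
# Sketch of the bidirectional flow of claim 9. The three helpers stand in
# for the recognition backend, the avatar synthesis backend, and the page
# rendering layer; all of them are placeholders.

from typing import Sequence


def recognize_and_translate(frames: Sequence[bytes]) -> str:
    return "translated text"              # placeholder for the recognition model


def synthesize_avatar_sign_video(audio: bytes) -> bytes:
    return b"avatar-sign-language-video"  # placeholder for avatar synthesis


def show_in_area(area: str, content) -> None:
    print(area, content)                  # placeholder for rendering on the page


def on_sign_video_captured(frames: Sequence[bytes]) -> None:
    # forward direction: first sign language video -> translation information
    show_in_area("sign_input_area", frames)
    show_in_area("translation_output", recognize_and_translate(frames))


def on_voice_captured(audio: bytes) -> None:
    # reverse direction: collected voice -> second sign language video
    show_in_area("sign_output_area", synthesize_avatar_sign_video(audio))
```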
10. A sign language customer service method, the method comprising:
providing a sign language customer service page;
acquiring first sign language video data through an image acquisition assembly, and displaying the first sign language video data in a sign language input area of the sign language customer service page;
determining sign language translation information corresponding to the first sign language video data, so as to output the sign language translation information in a customer service page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing;
receiving second sign language video data synthesized according to service reply information of customer service, wherein the second sign language video data is video data in which an avatar performs sign language according to the semantics of the service reply information;
and displaying the second sign language video data in a sign language output area of the sign language customer service page.
11. A sign language communication method, the method comprising:
providing a video communication page;
acquiring first video data through an image acquisition assembly, and displaying the first video data in a local-end display area of the video communication page, wherein the first video data comprises first sign language video data;
displaying sign language translation information of the first sign language video data in the local-end display area of the video communication page, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing;
receiving second sign language video data synthesized according to communication information of an opposite end, wherein the second sign language video data is video data in which an avatar performs sign language according to the semantics of the communication information, and the communication information comprises at least one of text information, voice information, and video information;
and displaying the second sign language video data in an opposite-end display area of the video communication page.
12. A sign language teaching method, the method comprising:
providing a sign language teaching page;
displaying target teaching information on the sign language teaching page;
acquiring first sign language video data through an image acquisition assembly, and displaying the first sign language video data in a sign language input area of the sign language teaching page, wherein the first sign language video data is video data of a sign language user performing sign language according to the target teaching information;
uploading the first sign language video data;
receiving sign language translation information corresponding to the first sign language video data and synthesized second sign language video data, wherein the sign language translation information is obtained by extracting sign language structural features of the first sign language video data through a sign language recognition model and performing sign language recognition processing, and the second sign language video data is sign language teaching video data in which an avatar performs sign language according to the target teaching information;
and displaying the second sign language video data in a sign language output area of the sign language teaching page, so that the sign language user can learn sign language.
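Claims 10 to 12 reuse one pattern: some source information (a customer-service reply, peer communication content, or teaching content) drives synthesis of a second sign language video in which an avatar performs the signs. A generic sketch of that shared step; the synthesis call is a stub, not a disclosed API:

```python
# Shared pattern of claims 10-12: whatever the source information is
# (service reply, peer communication content, or teaching content), it is
# rendered as avatar sign language video for the sign language output area.

from typing import Union


def synthesize_avatar_sign_video(information: Union[str, bytes]) -> bytes:
    # a real system would drive an avatar according to the semantics of the
    # information; here we just return a placeholder video payload
    return b"avatar-sign-language-video"


def render_second_sign_video(source_information: Union[str, bytes]) -> bytes:
    video = synthesize_avatar_sign_video(source_information)
    # displayed in the sign language output area of the respective page
    return video
```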
13. An electronic device, comprising: a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1-12.
14. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the method of any of claims 1-12.
CN202111059974.3A 2021-09-10 2021-09-10 Sign language translation, customer service, communication method, device and readable medium Pending CN113822186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111059974.3A CN113822186A (en) 2021-09-10 2021-09-10 Sign language translation, customer service, communication method, device and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111059974.3A CN113822186A (en) 2021-09-10 2021-09-10 Sign language translation, customer service, communication method, device and readable medium

Publications (1)

Publication Number Publication Date
CN113822186A 2021-12-21

Family

ID=78921859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111059974.3A Pending CN113822186A (en) 2021-09-10 2021-09-10 Sign language translation, customer service, communication method, device and readable medium

Country Status (1)

Country Link
CN (1) CN113822186A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136986A (en) * 2011-12-02 2013-06-05 深圳泰山在线科技有限公司 Sign language identification method and sign language identification system
CN103116576A (en) * 2013-01-29 2013-05-22 安徽安泰新型包装材料有限公司 Voice and gesture interactive translation device and control method thereof
CN104809927A (en) * 2015-04-23 2015-07-29 苏州锟恩电子科技有限公司 Gesture interactive learning machine
US10176366B1 (en) * 2017-11-01 2019-01-08 Sorenson Ip Holdings Llc Video relay service, communication system, and related methods for performing artificial intelligence sign language translation services in a video relay service environment
WO2019214456A1 (en) * 2018-05-11 2019-11-14 深圳双猴科技有限公司 Gesture language translation system and method, and server
CN110826441A (en) * 2019-10-25 2020-02-21 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN112668463A (en) * 2020-12-25 2021-04-16 株洲手之声信息科技有限公司 Chinese sign language translation method and system based on scene recognition

Similar Documents

Publication Publication Date Title
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
US10438586B2 (en) Voice dialog device and voice dialog method
US9824687B2 (en) System and terminal for presenting recommended utterance candidates
US20190188903A1 (en) Method and apparatus for providing virtual companion to a user
KR102167760B1 (en) Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model
WO2022161298A1 (en) Information generation method and apparatus, device, storage medium, and program product
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
Oliveira et al. Automatic sign language translation to improve communication
US9525841B2 (en) Imaging device for associating image data with shooting condition information
Yuan et al. Large scale sign language interpretation
CN113822187A (en) Sign language translation, customer service, communication method, device and readable medium
CN113851029B (en) Barrier-free communication method and device
JP2007272534A (en) Apparatus, method and program for complementing ellipsis of word
WO2017036516A1 (en) Externally wearable treatment device for medical application, voice-memory system, and voice-memory-method
JP7370050B2 (en) Lip reading device and method
CN113409770A (en) Pronunciation feature processing method, pronunciation feature processing device, pronunciation feature processing server and pronunciation feature processing medium
JP2017182261A (en) Information processing apparatus, information processing method, and program
CN116088675A (en) Virtual image interaction method, related device, equipment, system and medium
CN113822186A (en) Sign language translation, customer service, communication method, device and readable medium
JP6754154B1 (en) Translation programs, translation equipment, translation methods, and wearable devices
CN113780013A (en) Translation method, translation equipment and readable medium
CN108346423B (en) Method and device for processing speech synthesis model
KR102460553B1 (en) Method, device and computer program for providing sign language response using neural network in avehicle
US20240096093A1 (en) Ai-driven augmented reality mentoring and collaboration
Asskour et al. Design & Development Of Mapping Tool For The Blind Or Partially Sighted

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination