WO2024008047A1 - Digital human sign language broadcasting method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2024008047A1
WO2024008047A1, PCT/CN2023/105575, CN2023105575W
Authority
WO
WIPO (PCT)
Prior art keywords
sign language
language text
text
digital
digital human
Prior art date
Application number
PCT/CN2023/105575
Other languages
French (fr)
Chinese (zh)
Inventor
韩玉洁
谭启敏
吴淑明
张家硕
张泽旋
周靖坤
祖新星
王琪
Original Assignee
阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Publication of WO2024008047A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • The present disclosure relates to the field of information technology, and in particular to a digital human sign language broadcasting method, apparatus, device and storage medium.
  • Multimedia information usually includes text, audio, video, etc.
  • Hearing-impaired people rely on sign language that is cognitively appropriate to them. Therefore, there is a need to convert natural language speech and text information into sign language so that it can be understood by hearing-impaired people.
  • The inventor of this application found that, for the same sentence, a normal person's speaking rate is usually faster than the rate of a digital human's sign language movements. If the digital human's sign language movements are required to align in time with the speaker's speech, the sign language movements must be sped up, or the playback speed of the video of those movements must be increased, with the result that hearing-impaired people cannot see the sign language movements clearly.
  • The present disclosure provides a digital human sign language broadcasting method, apparatus, device and storage medium, which give the digital human more time to perform each sign language action, thereby ensuring that hearing-impaired people can clearly see every sign language movement.
  • Embodiments of the present disclosure provide a digital human sign language broadcasting method, including: obtaining multimedia information; determining the natural language text corresponding to the multimedia information; translating the natural language text into a first sign language text; performing semantic simplification processing on the first sign language text to obtain a second sign language text; and driving the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • an embodiment of the present disclosure provides a digital human sign language broadcasting device, including:
  • an acquisition module, used to obtain multimedia information;
  • a determination module, used to determine the natural language text corresponding to the multimedia information;
  • a translation module, used to translate the natural language text into a first sign language text;
  • a processing module, used to perform semantic simplification processing on the first sign language text to obtain a second sign language text;
  • a driving module, used to drive the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • an electronic device, including: a processor and a memory, wherein a computer program is stored in the memory and is configured to be executed by the processor to implement the method described in the first aspect.
  • embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the first aspect.
  • The digital human sign language broadcasting method, apparatus, device and storage medium provided by the embodiments of the present disclosure translate the natural language text used by normal people into a first sign language text, and perform semantic simplification processing on the first sign language text to obtain a second sign language text. The digital human is then driven according to the second sign language text, so that it expresses the corresponding sign language movements through the body. Because the second sign language text obtained by semantically simplifying the first sign language text can contain fewer action names, the digital human performs fewer sign language movements in the same amount of time than it would for the first sign language text. This gives the digital human more time to perform each sign language movement, thereby ensuring that hearing-impaired people can clearly see each one.
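The end-to-end flow summarized above can be sketched as a small pipeline. This is an illustrative sketch only: every function here (`determine_natural_text`, `translate_to_first_sign_text`, and so on) is a hypothetical stub standing in for the ASR, machine translation, semantic simplification, and rendering components, not an API from the disclosure.

```python
# Hypothetical stand-ins for the five steps of the disclosed method.
def determine_natural_text(multimedia: str) -> str:
    # Assume text information; audio would instead go through ASR.
    return multimedia

def translate_to_first_sign_text(natural_text: str) -> str:
    # Stubbed machine-translation output (example gloss from this embodiment).
    return "Follow/guide/route/go/stay/this/don't"

def simplify_to_second_sign_text(first_sign_text: str) -> str:
    # Stubbed semantic simplification: fewer action names remain.
    return "Follow/guide/route/go"

def drive_digital_human(second_sign_text: str) -> str:
    # In practice this drives body, mouth shape, and expression synthesis.
    return f"digital human performs: {second_sign_text}"

info = "Exit the venue according to the guided route and do not stay in the audience area"
natural = determine_natural_text(info)
first = translate_to_first_sign_text(natural)
second = simplify_to_second_sign_text(first)
print(drive_digital_human(second))
```

The point of the sketch is only the ordering of stages: simplification happens on the sign language text, after translation and before the digital human is driven.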
  • Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • Figure 5 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure.
  • Figure 6 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure.
  • Figure 7 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of an operator user interface provided by another embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of an operator user interface provided by another embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a digital human sign language broadcasting device provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure.
  • As noted above, a normal person's speaking rate is usually faster than the rate of a digital human's sign language movements. If the digital human's sign language movements are required to align in time with the speaker's speech, the movements must be sped up, or the playback speed of the video of those movements must be increased, so that hearing-impaired people cannot see the sign language movements clearly. To address this problem, embodiments of the present disclosure provide a digital human sign language broadcasting method, which is introduced below with reference to specific embodiments.
  • Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure.
  • the method can be executed by a digital human sign language broadcasting device, which can be implemented in the form of software and/or hardware.
  • The device can be configured in an electronic device, such as a server or a terminal, where the terminal may specifically be a mobile phone, a computer, a tablet computer, etc.
  • the digital human sign language broadcasting method described in this embodiment can be applied to the application scenario shown in Figure 2.
  • the application scenario includes a terminal 21 and a server 22, where the server 22 can obtain multimedia information from other terminals or other servers, and generate a sign language animation of a digital person signing based on the multimedia information.
  • the server 22 can send the sign language animation of the digital person signing to the terminal 21.
  • the terminal 21 can be a terminal for the hearing-impaired, so that the hearing-impaired can understand the meaning expressed by the multimedia information.
  • The method is described in detail below with reference to Figure 2. As shown in Figure 1, the specific steps of the method are as follows:
  • the server 22 can obtain multimedia information from other terminals or other servers, and the multimedia information can be text information, audio information, or audio and video information.
  • the audio information may be a real-time audio stream or an offline audio file.
  • the audio and video information can be real-time audio and video streams, or can be offline audio and video files.
  • the terminal 23 can send a live audio and video stream to the server 22 in real time.
  • the server 22 can not only forward the live audio and video stream to the terminal 21, but also send a video stream of digital people signing to the terminal 21.
  • The digital human uses sign language to express the meaning of the audio signals or subtitles in the live audio and video stream, so that hearing-impaired people can follow the online live broadcast.
  • the server 24 sends the live TV program to the server 22 in real time.
  • the live TV program is sent to the server 22 in the form of streaming media.
  • the digital person generated by the server 22 can assist the hearing-impaired to watch the live TV program.
  • The server 22 can also generate multimedia information, such as film and television information, education and training videos, etc., so that hearing-impaired people can watch them with the help of the digital human generated by the server 22.
  • hearing-impaired people and normal people can also conduct online or offline meetings through their respective terminals.
  • Assume that terminal 21 is a terminal used by a hearing-impaired person and terminal 23 is a terminal used by a normal person.
  • The hearing-impaired person conducts a remote online meeting with the normal person through their respective terminals.
  • The terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22.
  • The server 22 generates a video stream of the digital human signing based on the meaning expressed by the normal person, and sends this video stream to the terminal 21 in real time, to help the hearing-impaired person understand what the normal person says. Alternatively, the hearing-impaired person and the normal person hold an offline meeting through their respective terminals.
  • For example, the hearing-impaired person and the normal person are in the same conference room; the terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22.
  • The server 22 translates the normal person's natural language into sign language movements in real time and streams the video of the digital human signing to the terminal 21, so that the hearing-impaired person can understand in real time what the normal person says.
  • the terminal 21 can also be a large screen in public places such as airports, train stations, sports venues, etc.
  • The terminal 21 can play videos of the digital human signing, so that hearing-impaired people in public places can access current information anytime and anywhere. It can be understood that the method described in this embodiment is not limited to these scenarios; it can also be applied to other application scenarios, which will not be described again here.
  • the text information can be used as the natural language text corresponding to the multimedia information.
  • The natural language text corresponding to the multimedia information may be text converted from the audio information using Automatic Speech Recognition (ASR) technology.
  • the audio and video information can be parsed to extract the audio components from the audio and video information, and use ASR technology to convert the audio components into text.
  • the text can be used as the natural language text corresponding to the multimedia information.
  • Sign language forms meanings or words through gestures, using changes in gesture to simulate images or syllables. It is the hand language with which people with hearing or speech impairments communicate and exchange ideas. Because sign language is a visual language, it differs greatly from natural language text in word usage and grammatical rules. For example, "Exit the venue according to the guided route and do not stay in the audience area" is a natural language text, and the corresponding sign language text is "Follow/guide/route/go/stay/this/don't". Therefore, the natural language text needs to be translated into a sign language text; here, the sign language text translated from the natural language text is recorded as the first sign language text.
  • "Follow/guide/route/go/stay/this/don't" can thus be used as a first sign language text.
  • the first sign language text consists of multiple action names, and adjacent action names are separated by "/".
  • Each action name can correspond to a coherent sign language action, that is, different action names are used to distinguish different sign language actions.
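Because adjacent action names are delimited by "/", recovering the individual sign language actions from a sign language text is a simple split. A minimal sketch using the example gloss from this embodiment:

```python
first_sign_text = "Follow/guide/route/go/stay/this/don't"

# Each "/"-separated token is one action name, i.e. one coherent sign language action.
action_names = first_sign_text.split("/")

print(len(action_names))  # 7 sign language actions before simplification
print(action_names[0])    # Follow
```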
  • This embodiment proposes a solution: after obtaining a first sign language text such as "Follow/guide/route/go/stay/this/don't", semantic simplification processing is performed on the first sign language text to obtain a second sign language text.
  • For example, the second sign language text is "Follow/guide/route/go". Assume the time required for a normal person to say "Exit the venue according to the guided route and do not stay in the audience area" is recorded as t1. Before the first sign language text is semantically simplified, the digital human needs to perform 7 sign language actions within the duration t1; after simplification, it only needs to perform 4 sign language actions within the same duration t1. The digital human therefore has more time to perform each sign language movement, ensuring that hearing-impaired people can clearly see each one.
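The benefit of simplification can be made concrete with a little arithmetic: the sentence duration t1 is fixed, so fewer action names means more time per action. A quick check with an assumed, purely illustrative t1 of 4.2 seconds:

```python
t1 = 4.2  # assumed duration (seconds) of the spoken sentence; illustrative only

actions_before = 7  # "Follow/guide/route/go/stay/this/don't"
actions_after = 4   # "Follow/guide/route/go"

time_per_action_before = t1 / actions_before
time_per_action_after = t1 / actions_after

print(round(time_per_action_before, 2))  # 0.6 seconds per sign language action
print(round(time_per_action_after, 2))   # 1.05 seconds per sign language action
```

Within the same t1, each action's time budget grows from t1/7 to t1/4, a 75% increase, without speeding up playback.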
  • The server 22 can drive the digital human according to each action name in the second sign language text, so that the digital human expresses the sign language movement corresponding to each action name through its limbs, such as its hands.
  • Optionally, driving the digital human according to the second sign language text so that the digital human expresses the corresponding sign language movements through the body includes: driving the digital human according to the second sign language text so that the digital human expresses the sign language movements corresponding to the second sign language text through the body, with the digital human's mouth shape and expression consistent with the second sign language text.
  • For example, while the server 22 drives the digital human according to each action name in the second sign language text, it can also control the digital human's mouth shape to be consistent with the second sign language text.
  • For example, while the digital human performs the sign language action corresponding to "follow", the digital human's mouth shape is consistent with "follow".
  • The digital human's expression can also be controlled. For example, while the digital human expresses the sign language movements corresponding to the second sign language text, its expression can remain serious.
  • The embodiment of the present disclosure translates the natural language text used by normal people into a first sign language text and performs semantic simplification processing on the first sign language text to obtain a second sign language text. The digital human is then driven according to the second sign language text, so that it expresses the corresponding sign language movements through the body. Because the second sign language text can contain fewer action names than the first sign language text, the digital human performs fewer sign language movements in the same amount of time, giving it more time for each movement and ensuring that hearing-impaired people can clearly see each one.
  • FIG. 5 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the specific steps of this method are as follows:
  • the natural language text can also be semantically simplified.
  • Drawing on the behavior of human sign language translation experts during translation, the natural language text can be semantically understood: key information is extracted and invalid or redundant information is filtered out, yielding a simplified natural language text, for example, "Exit the venue according to the guided route".
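A toy sketch of this key-information extraction, with the redundant clause hard-coded for illustration (a real system would identify it through semantic understanding, which is beyond this sketch):

```python
def simplify_natural_text(text: str, redundant_clauses: list) -> str:
    """Keep key information by deleting clauses marked as redundant."""
    for clause in redundant_clauses:
        text = text.replace(clause, "")
    # Tidy up leftover separators and whitespace.
    return " ".join(text.replace(" ,", ",").split()).strip(" ,")

original = "Exit the venue according to the guided route and do not stay in the audience area"
simplified = simplify_natural_text(original, ["and do not stay in the audience area"])
print(simplified)  # Exit the venue according to the guided route
```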
  • Translating the natural language text, or the simplified natural language text, into the first sign language text can be achieved through machine translation, also called automatic translation, which is the process of using a computer to convert one language (the source language) into another language (the target language).
  • In this case, the second sign language text is "Follow/guide/go", which makes the second sign language text more concise.
  • the natural language text can be recorded as the original text, and the first sign language text and the second sign language text can be recorded as the translation text respectively.
  • Optionally, driving the digital human according to the second sign language text includes: if the multimedia information is a non-real-time audio file or audio and video file, obtaining the start time and end time of each audio signal in the file, and adjusting, according to the start time and end time, the speed at which the digital human expresses sign language movements, so that the sign language movements expressed by the digital human and the audio signal are aligned on the time axis.
  • For example, the server 22 can obtain each audio signal from the audio file or audio and video file; each audio signal can be the audio of one natural language sentence. Further, the server 22 can calculate the start time and end time of each audio signal, recorded as the start and end points on the time axis. For each audio signal, the server 22 can adjust the speed at which the digital human expresses sign language movements according to that signal's start and end time, that is, automatically adapt the sign language broadcast speed to different sentences, speeding up or slowing down the broadcast so that the digital human's expression of the sign language movements for a sentence is aligned on the time axis with the audio signal of that sentence.
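This per-sentence speed adaptation amounts to scaling the sign animation so it fills the audio segment between the start time and the end time. A hedged sketch, where the nominal duration of one sign language action is an assumed constant rather than a value from the disclosure:

```python
NOMINAL_ACTION_SECONDS = 1.0  # assumed default duration of one sign language action

def speed_factor(num_actions: int, start: float, end: float) -> float:
    """Playback-rate multiplier aligning the sign animation with the audio segment.

    A value above 1.0 means the digital human must sign faster than nominal;
    below 1.0 means it can sign slower and still fill the segment.
    """
    audio_duration = end - start
    nominal_duration = num_actions * NOMINAL_ACTION_SECONDS
    return nominal_duration / audio_duration

# A sentence with 4 sign language actions spoken between t=10.0s and t=15.0s:
print(speed_factor(4, 10.0, 15.0))  # 0.8 -> sign slightly slower to fill the audio
```

Combined with the earlier simplification step, fewer action names lower `num_actions` and thus the required speed factor, which is exactly why the digital human gains time per movement.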
  • Sign language broadcasting converts natural language text into sign language text, drives the digital human to express the sign language movements corresponding to the sign language text through the body, and coordinates the digital human's corresponding facial expressions and mouth shapes during the broadcast.
  • the digital person may be a virtual character with a digital appearance.
  • This embodiment performs semantic simplification processing on the natural language text and the first sign language text respectively, so that the second sign language text contains as few action names as possible, that is, the second sign language text is as concise as possible. By driving the digital human based on the second sign language text for the same sentence, the digital human's sign language movements can be effectively prevented from lagging behind the speed of a normal person's speech, allowing the sign language movement process to stay synchronized with the speaking process and improving information synchronization. In addition, by algorithmically adapting the sign language broadcast speed to different sentences, alignment between the sign language broadcast and the original audio and video content can be achieved.
  • Figure 6 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the specific steps of this method are as follows:
  • the multimedia information obtained by the server 22 may be at least one of text information, real-time audio and video streams, audio files, and audio and video files as shown in FIG. 7 .
  • If the multimedia information is text information, the natural language text can be obtained through text parsing, as shown in Figure 7. If the multimedia information is a real-time audio and video stream, real-time ASR is called to obtain the natural language text. If the multimedia information is an audio file, recording-file ASR is called to obtain the natural language text. If the multimedia information is an audio and video file, the file is first parsed to extract its audio signal, and then recording-file ASR is called to obtain the natural language text.
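The branching just described can be sketched as a small dispatcher. The helpers `real_time_asr`, `recording_file_asr`, and `extract_audio` are hypothetical stand-ins for the corresponding ASR and parsing services:

```python
# Hypothetical service stubs; real ones would call ASR and media-parsing backends.
def real_time_asr(stream: str) -> str:
    return f"text<{stream}>"

def recording_file_asr(audio: str) -> str:
    return f"text<{audio}>"

def extract_audio(av_file: str) -> str:
    return f"audio<{av_file}>"

def to_natural_text(kind: str, payload: str) -> str:
    if kind == "text":          # text information: use directly after parsing
        return payload
    if kind == "av_stream":     # real-time audio/video stream: real-time ASR
        return real_time_asr(payload)
    if kind == "audio_file":    # offline audio file: recording-file ASR
        return recording_file_asr(payload)
    if kind == "av_file":       # offline A/V file: parse out audio, then recording-file ASR
        return recording_file_asr(extract_audio(payload))
    raise ValueError(f"unsupported multimedia kind: {kind}")

print(to_natural_text("av_file", "lecture.mp4"))  # text<audio<lecture.mp4>>
```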
  • the server 22 may send the natural language text corresponding to the multimedia information to the operator's terminal, so that the terminal can display the natural language text. Further, the operator can modify the natural language text displayed on the terminal to achieve original text intervention as shown in Figure 7.
  • the server 22 can receive the modified natural language text from the operator's terminal.
  • The modified natural language text is the post-intervention original text shown in Figure 7. It can be understood that in other embodiments, the operator may not modify the natural language text.
  • the modified natural language text can be translated into a first sign language text, or the original natural language text can be translated into a first sign language text.
  • The process of translating the modified natural language text, or the original natural language text, into the first sign language text can be displayed on the operator's terminal, as shown in Figure 8 or Figure 9.
  • Figure 8 shows the process of translating real-time audio and video into sign language animation
  • Figure 9 shows the process of translating text into sign language animation.
  • the first sign language text can be semantically streamlined to obtain a second sign language text.
  • the second sign language text can be the sign language text result shown in Figure 7.
  • the server 22 can also send the second sign language text to the operator's terminal, so that the operator can modify the second sign language text, thereby realizing translation intervention as shown in FIG. 7 .
  • the server 22 can receive the modified second sign language text from the operator's terminal.
  • The modified second sign language text is the post-intervention translation shown in Figure 7. It can be understood that in other embodiments, the operator may not modify the second sign language text.
  • the server 22 can drive the digital human based on the second sign language text modified by the operator, or based on the unmodified second sign language text.
  • The process of driving the digital human includes processes such as sign language synthesis, expression synthesis, and mouth shape synthesis.
  • sign language synthesis can be to control the digital person to express the sign language movements corresponding to the second sign language text through the body.
  • Expression synthesis can control the expression of a digital person to be consistent with the expression of a normal person speaking natural language.
  • Mouth shape synthesis can control the mouth shape of the digital human to be consistent with the second sign language text.
  • If the multimedia information is a real-time audio stream or audio and video stream, a streaming sign language broadcast video stream of the digital human is generated and sent to the terminal in real time.
  • During the process of driving the digital human, the server 22 can generate a streaming sign language broadcast video stream of the digital human and send it to the hearing-impaired person's terminal in real time. It can be understood that in some embodiments, the server 22 can simultaneously send the real-time audio and video stream and the digital human's streaming sign language broadcast video stream to the hearing-impaired person's terminal, so that the terminal can play both the audio and video that normal people watch and, at the same time, the digital human's sign language broadcast video.
  • generating a streaming sign language broadcast video stream of the digital human includes: generating a streaming sign language broadcast video stream of the digital human according to the configuration information of the digital human.
  • The configuration information of the digital human includes at least one of the following: the background and color of the digital human, and the position and size of the digital human in the user interface.
  • the operator can also configure the synthesis effect.
  • the operator's terminal can display a configuration interface, and the configuration interface can display the configuration options of the digital person.
  • the operator can operate these configuration options to achieve the desired effect.
  • Configuring the digital human generates the digital human's configuration information.
  • the configuration information may include the background, color, position and size of the digital person in the user interface for the hearing-impaired, etc.
  • the lens distance shown in Figure 7 is used to control the size of the digital human in the user interface for the hearing-impaired.
  • the server 22 may generate a streaming sign language broadcast video stream of the digital person according to the configuration information of the digital person.
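The configuration information can be modeled as a plain record. The field names below are illustrative, chosen only to mirror the options this embodiment mentions (background, color, position, lens distance for size, and optional subtitles):

```python
from dataclasses import dataclass

@dataclass
class DigitalHumanConfig:
    background: str = "studio_blue"   # backdrop behind the digital human (assumed name)
    color: str = "#FFFFFF"            # color scheme choice (assumed representation)
    position: tuple = (0.8, 0.8)      # normalized (x, y) position in the viewer's UI
    lens_distance: float = 1.0        # controls the on-screen size of the digital human
    show_subtitles: bool = True       # operators may also toggle subtitle display

# The operator's configuration interface would produce such a record, e.g. a
# closer "lens distance" to render the digital human larger on screen:
cfg = DigitalHumanConfig(lens_distance=0.7)
print(cfg.lens_distance)   # 0.7
print(cfg.show_subtitles)  # True
```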
  • The operator can also configure whether to display subtitles. For example, if subtitles are enabled, hearing-impaired viewers can read the subtitles while watching the digital human's sign language, improving comprehension efficiency.
  • If the multimedia information is a non-real-time audio file, audio and video file, or text file, a sign language broadcast video file of the digital human is generated and sent to the terminal.
  • During the process of driving the digital human, the server 22 can generate a sign language broadcast video file of the digital human and send it to the hearing-impaired person's terminal. It can be understood that in some embodiments, the server 22 can simultaneously send the multimedia information and the digital human's sign language broadcast video file to the hearing-impaired person's terminal, so that the terminal can play the text information, audio files, or audio and video files that normal people watch, as well as the digital human's sign language broadcast video file.
  • Optionally, generating a sign language broadcast video file of the digital human includes: generating the sign language broadcast video file according to the configuration information of the digital human, wherein the configuration information includes at least one of the following: the background and color of the digital human, and the position and size of the digital human in the user interface.
  • the server 22 can generate the sign language broadcast video file of the digital person based on the configuration information of the digital person.
  • the source and content of the configuration information are as mentioned above and will not be described again here.
  • the configuration information of the digital human may be configured by an operator.
  • This embodiment can support multiple input modes (plain text, real-time audio and video, and offline audio and video files) and therefore covers a wider range of application scenarios.
  • The sign language broadcasting provided by this embodiment involves multiple interlocking algorithmic technologies, where the output of each link affects the input of the next. This solution can output independent results for each link of the sign language broadcast, making it easy to quickly trace and locate problems in any link.
  • The presentation of sign language is not only body and hand movements. On top of sign language synthesis, mouth shape synthesis and expression synthesis technologies integrate body posture, expression, and mouth shape, linking multiple channels of information to better convey information to hearing-impaired viewers.
  • FIG 10 is a schematic structural diagram of a digital human sign language broadcasting device provided by an embodiment of the present disclosure.
  • the digital human sign language broadcasting device provided by the embodiment of the present disclosure can execute the processing flow provided by the digital human sign language broadcasting method embodiment.
  • the digital human sign language broadcasting device 100 includes:
  • an acquisition module 101, used to obtain multimedia information;
  • a determining module 102, used to determine the natural language text corresponding to the multimedia information;
  • a translation module 103, used to translate the natural language text into a first sign language text;
  • the processing module 104 is used to perform semantic simplification processing on the first sign language text to obtain a second sign language text;
  • the driving module 105 is used to drive the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • Optionally, when the driving module 105 drives the digital human according to the second sign language text so that the digital human expresses the corresponding sign language movements through the body, it is specifically configured to: drive the digital human according to the second sign language text so that the digital human expresses the sign language movements corresponding to the second sign language text through the body, with the digital human's mouth shape and expression consistent with the second sign language text.
  • the processing module 104 is also configured to perform semantic simplification processing on the natural language text after the determination module 102 determines the natural language text corresponding to the multimedia information, to obtain a simplified natural language text.
  • the translation module 103 is specifically configured to translate the streamlined natural language text into a first sign language text.
  • the driving module 105 includes an acquisition unit 1051 and an adjustment unit 1052, wherein the acquisition unit 1051 is used to acquire the audio file or the audio and video file when the multimedia information is a non-real-time audio file or audio and video file.
  • the digital human sign language broadcasting device 100 also includes: a sending module 106 and a receiving module 107.
	• the sending module 106 is configured to, after the determining module 102 determines the natural language text corresponding to the multimedia information, send the natural language text to the operator's terminal; the receiving module 107 is used to receive the natural language text modified by the operator.
  • the translation module 103 is specifically configured to translate the natural language text modified by the operator into a first sign language text.
	• the sending module 106 is also configured to: after the processing module 104 performs semantic simplification processing on the first sign language text to obtain the second sign language text, send the second sign language text to the operator's terminal;
  • the receiving module 107 is also used to receive the second sign language text modified by the operator.
  • the driving module 105 is specifically used to drive the digital human according to the second sign language text modified by the operator.
	• the digital human sign language broadcasting device 100 also includes: a generating module 108, configured to, after the driving module 105 drives the digital human according to the second sign language text: if the multimedia information is a real-time audio stream or audio and video stream, generate a streaming sign language broadcast video stream of the digital human and send the streaming sign language broadcast video stream to the terminal in real time; and if the multimedia information is a non-real-time audio file, audio and video file, or text file, generate a sign language broadcast video file of the digital human and send the sign language broadcast video file to the terminal.
	• the terminal may be a hearing-impaired person's terminal.
	• when generating the digital human's streaming sign language broadcast video stream, the generation module 108 is specifically configured to generate the streaming sign language broadcast video stream according to the configuration information of the digital human; when generating the sign language broadcast video file of the digital human, it is specifically configured to generate the sign language broadcast video file according to the configuration information of the digital human; the configuration information of the digital human includes at least one of the following: the background, the color, and the position and size of the digital human in the user interface. The configuration information of the digital human may be configured by operating personnel.
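	• The configuration items listed above (background, color, and the digital human's position and size in the user interface) could be held in a simple record; the field names and default values below are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class DigitalHumanConfig:
    """Hypothetical container for the digital human's configuration information."""
    background: str = "plain"   # background of the digital human
    color: str = "#FFFFFF"      # color setting
    x: int = 0                  # position in the user interface
    y: int = 0
    width: int = 480            # size in the user interface
    height: int = 640

# An operator might, for example, configure a small signer in a screen corner:
cfg = DigitalHumanConfig(background="news-studio", x=1440, y=810, width=480, height=270)
print(cfg.x, cfg.y, cfg.width, cfg.height)  # 1440 810 480 270
```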
  • the digital human sign language broadcasting device of the embodiment shown in Figure 10 can be used to implement the technical solution of the above method embodiment. Its implementation principles and technical effects are similar and will not be described again here.
  • FIG. 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure. As shown in FIG. 11 , the electronic device includes a memory 111 and a processor 112 .
  • the memory 111 is used to store programs. In addition to the above-mentioned programs, the memory 111 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
	• The memory 111 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
	• the processor 112 is coupled to the memory 111 and executes the program stored in the memory 111, for: obtaining multimedia information, and determining the natural language text corresponding to the multimedia information; translating the natural language text into a first sign language text; performing semantic simplification processing on the first sign language text to obtain a second sign language text; and driving the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • the electronic device may also include: a communication component 113 , a power supply component 114 , an audio component 115 , a display 116 and other components. Only some components are schematically shown in FIG. 11 , which does not mean that the electronic device only includes the components shown in FIG. 11 .
  • the communication component 113 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 113 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 113 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the power supply component 114 provides power to various components of the electronic device.
  • Power supply components 114 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic devices.
  • Audio component 115 is configured to output and/or input audio signals.
  • the audio component 115 includes a microphone (MIC) configured to receive external audio signals when the electronic device is in operating modes, such as call mode, recording mode, and voice recognition mode.
  • the received audio signal may be further stored in memory 111 or sent via communication component 113 .
  • audio component 115 also includes a speaker for outputting audio signals.
  • Display 116 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the digital human sign language broadcasting method described in the above embodiments.

Abstract

The present disclosure relates to a digital human sign language broadcasting method and apparatus, a device, and a storage medium. According to the present disclosure, a natural language text used by people with normal hearing is translated into a first sign language text, and semantic simplification is performed on the first sign language text to obtain a second sign language text. Furthermore, a digital human is driven according to the second sign language text so that the digital human expresses, by means of the limbs, the sign language actions corresponding to the second sign language text. The second sign language text obtained by performing semantic simplification on the first sign language text can comprise fewer action names; therefore, compared with the first sign language text, the digital human performs fewer sign language actions within the same time, giving it ample time for each sign language action and thereby ensuring that a hearing-impaired person can see each sign language action clearly.

Description

Digital human sign language broadcasting method, device, equipment and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on July 4, 2022, with application number 202210785961.2 and titled "Digital Human Sign Language Broadcasting Method, Device, Equipment and Storage Medium", the entire content of which is incorporated herein by reference.
Technical Field

The present disclosure relates to the field of information technology, and in particular to a digital human sign language broadcasting method, device, equipment and storage medium.
Background

With the continuous development of technology, more and more users can view multimedia information, which usually includes text, audio, video, etc., through terminals. For hearing-impaired people, however, it is sign language that matches their cognitive habits. There is therefore a need to convert natural-language speech and text information into sign language so that hearing-impaired people can understand it.

However, the inventors of this application found that, for the same sentence, a normal person usually speaks faster than a digital human can perform the corresponding sign language movements. If the digital human's signing is required to be aligned in time with the normal person's speech, the digital human's signing must be sped up, or the playback speed of the video of the digital human signing must be increased, with the result that hearing-impaired people cannot see the sign language movements clearly.
Summary

In order to solve the above technical problems, or at least partially solve them, the present disclosure provides a digital human sign language broadcasting method, device, equipment and storage medium, which give the digital human ample time to perform each sign language action, thereby ensuring that hearing-impaired people can see each sign language action clearly.

In a first aspect, an embodiment of the present disclosure provides a digital human sign language broadcasting method, including:

obtaining multimedia information, and determining the natural language text corresponding to the multimedia information;

translating the natural language text into a first sign language text;

performing semantic simplification processing on the first sign language text to obtain a second sign language text;

driving a digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
In a second aspect, an embodiment of the present disclosure provides a digital human sign language broadcasting device, including:

an acquisition module, used to obtain multimedia information;

a determining module, used to determine the natural language text corresponding to the multimedia information;

a translation module, used to translate the natural language text into a first sign language text;

a processing module, used to perform semantic simplification processing on the first sign language text to obtain a second sign language text;

a driving module, used to drive a digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method described in the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method described in the first aspect.
According to the digital human sign language broadcasting method, device, equipment and storage medium provided by the embodiments of the present disclosure, the natural language text used by normal people is translated into a first sign language text, and semantic simplification processing is performed on the first sign language text to obtain a second sign language text. Further, a digital human is driven according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body. Since the second sign language text obtained by semantically simplifying the first sign language text may include fewer action names, the digital human can, compared with the first sign language text, perform fewer sign language actions in the same time, giving it ample time for each sign language action and thereby ensuring that hearing-impaired people can see each sign language action clearly.
Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure;

Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;

Figure 3 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;

Figure 4 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;

Figure 5 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure;

Figure 6 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure;

Figure 7 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure;

Figure 8 is a schematic diagram of an operator's user interface provided by another embodiment of the present disclosure;

Figure 9 is a schematic diagram of an operator's user interface provided by another embodiment of the present disclosure;

Figure 10 is a schematic structural diagram of a digital human sign language broadcasting device provided by an embodiment of the present disclosure;

Figure 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure.
Detailed Description

In order that the above objects, features and advantages of the present disclosure can be understood more clearly, the solutions of the present disclosure are further described below. It should be noted that, as long as there is no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.

Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure can also be implemented in other ways different from those described here; obviously, the embodiments in the description are only some, not all, of the embodiments of the present disclosure.
Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure. The method can be executed by a digital human sign language broadcasting device, which can be implemented in software and/or hardware and can be configured in an electronic device, such as a server or a terminal, where the terminal may specifically be a mobile phone, a computer, a tablet computer, etc. In addition, the digital human sign language broadcasting method described in this embodiment can be applied to the application scenario shown in Figure 2. As shown in Figure 2, the application scenario includes a terminal 21 and a server 22, where the server 22 can obtain multimedia information from other terminals or other servers and generate, based on the multimedia information, an animation of a digital human signing. Further, the server 22 can send the sign language animation to the terminal 21, which may be a hearing-impaired person's terminal, so that the hearing-impaired person can understand the meaning expressed by the multimedia information. The method is described in detail below with reference to Figure 2. As shown in Figure 1, the specific steps of the method are as follows:
S101. Obtain multimedia information, and determine the natural language text corresponding to the multimedia information.
For example, the server 22 may obtain multimedia information from another terminal or another server, and the multimedia information may be text information, audio information, or audio-and-video information. The audio information may be a real-time audio stream or an offline audio file; the audio-and-video information may be a real-time audio and video stream or an offline audio and video file. For example, as shown in Figure 3, the terminal 23 can send a live audio and video stream to the server 22 in real time; the server 22 can not only forward the live audio and video stream to the terminal 21, but also send the terminal 21 a video stream of a digital human signing, in which the digital human expresses in sign language the meaning of the audio signal or subtitles of the live stream, so that hearing-impaired people can watch the online live broadcast. Alternatively, as shown in Figure 4, the server 24 sends a live TV program to the server 22 in real time in the form of streaming media, and the digital human generated by the server 22 can assist hearing-impaired people in watching the live TV program. In some other embodiments, the server 22 may also generate multimedia information, such as film and television information or education and training videos, so that hearing-impaired people can watch such content with the help of the digital human generated by the server 22. In addition, hearing-impaired people and normal people can hold online or offline meetings through their respective terminals. For example, as shown in Figure 3, assume that the terminal 21 is the terminal of a hearing-impaired person and the terminal 23 is the terminal of a normal person, and the two hold a remote online meeting through their respective terminals: the terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22; the server 22 generates a video stream of the digital human signing according to the meaning expressed by the normal person and sends it to the terminal 21 in real time, to help the hearing-impaired person understand what the normal person says. Alternatively, the hearing-impaired person and the normal person hold an offline meeting through their respective terminals, for example in the same conference room: the terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22, and the server 22 translates the normal person's natural language into sign language movements in real time and sends the video stream of the digital human signing down to the terminal 21, so that the hearing-impaired person can understand what the normal person says in real time. It is understandable that the terminal 21 may also be a large screen in a public place such as an airport, a railway station, or a stadium, on which videos of the digital human signing are played, so that hearing-impaired people in public places can keep up with current information anytime and anywhere. It can be understood that the method described in this embodiment is not limited to these scenarios and can also be applied to other application scenarios, which will not be enumerated here.
When the multimedia information is text information, the text information can be used directly as the natural language text corresponding to the multimedia information.

When the multimedia information is audio information, the natural language text corresponding to the multimedia information may be text converted from the audio information using automatic speech recognition (ASR) technology.

When the multimedia information is audio-and-video information, the audio-and-video information can be parsed to extract its audio component, and the audio component can be converted into text using ASR technology; this text can be used as the natural language text corresponding to the multimedia information.
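The three cases above (text used as-is, audio passed through ASR, audio-and-video demuxed and then passed through ASR) amount to a simple dispatch. In the sketch below, `run_asr` and `extract_audio` are hypothetical stand-ins for a real ASR engine and demuxer, not components disclosed here:

```python
def determine_natural_text(media_type, payload, run_asr, extract_audio):
    """Return the natural language text for a piece of multimedia information."""
    if media_type == "text":
        return payload                      # text information is used directly
    if media_type == "audio":
        return run_asr(payload)             # ASR converts the audio to text
    if media_type == "audio_video":
        audio = extract_audio(payload)      # parse out the audio component first
        return run_asr(audio)               # then run ASR on it
    raise ValueError(f"unsupported media type: {media_type}")

# Demonstration with stub hooks:
text = determine_natural_text(
    "audio_video", b"container-bytes",
    run_asr=lambda a: "hello world",
    extract_audio=lambda av: b"pcm-bytes",
)
print(text)  # hello world
```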
S102. Translate the natural language text into a first sign language text.

Sign language expresses meaning through gestured movements, simulating images or syllables through changes of gesture to form certain meanings or words; it is a hand-based language through which hearing-impaired people, or people unable to speak, communicate with each other and exchange ideas. Since sign language is a visual language, it differs greatly from natural language text in wording and grammatical rules. For example, "Exit along the guided route; do not linger in the audience area" is a natural language text, and the corresponding sign language text is "follow/direct/route/go/stay/this/don't". Therefore, the natural language text needs to be translated into a sign language text; here, the sign language text translated from the natural language text is recorded as the first sign language text. For example, "follow/direct/route/go/stay/this/don't" can serve as the first sign language text. The first sign language text consists of multiple action names, with adjacent action names separated by "/". Each action name can correspond to one coherent sign language movement; that is, different action names are used to distinguish different sign language movements.
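Since the first sign language text is a sequence of action names separated by "/", splitting it into individual sign actions is straightforward. The glosses below reuse the example from the text:

```python
# One "/"-separated gloss string; each token names one sign language movement.
first_sign_text = "follow/direct/route/go/stay/this/don't"
actions = first_sign_text.split("/")   # one entry per sign language movement
print(len(actions), actions[0])        # 7 follow
```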
S103. Perform semantic simplification processing on the first sign language text to obtain a second sign language text.

Since "Exit along the guided route; do not linger in the audience area" corresponds to "follow/direct/route/go/stay/this/don't", that is, to 7 action names and 7 sign language movements, when a normal person says this sentence the digital human needs to make 7 sign language movements, and each sign language movement may be a continuous, coherent movement. In other words, the time the digital human needs for each sign language movement is relatively long, while the time a normal person needs to say each word may be relatively short, so that a normal person usually speaks faster than the digital human signs. If the digital human's performance of these 7 sign language movements is required to be aligned on the timeline with the normal person's utterance of the sentence, the digital human's signing must be sped up, or the playback speed of the video of the digital human signing must be increased, with the result that hearing-impaired people cannot see the sign language movements clearly.

To solve this problem, this embodiment proposes a solution: after obtaining the first sign language text, for example "follow/direct/route/go/stay/this/don't", semantic simplification processing is performed on it to obtain a second sign language text, for example "follow/direct/route/go". Suppose the time a normal person needs to say the sentence is denoted t1. Before the semantic simplification of the first sign language text, the digital human would need to make 7 sign language movements within t1; after the simplification, it only needs to make 4 sign language movements within the same duration t1, so that the digital human has ample time for each sign language movement, thereby ensuring that hearing-impaired people can see each sign language movement clearly.
S104. Drive the digital human according to the second sign language text, so that the digital human expresses, through its body, the sign language actions corresponding to the second sign language text.
Specifically, the server 22 may drive the digital human according to each action name in the second sign language text, so that the digital human expresses the sign language action corresponding to each action name through its body, for example its hands.
In this embodiment, driving the digital human according to the second sign language text so that it expresses the corresponding sign language actions through its body includes: driving the digital human according to the second sign language text so that the digital human expresses the sign language actions corresponding to the second sign language text through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
For example, in this embodiment, while driving the digital human according to each action name in the second sign language text, the server 22 may also control the digital human's mouth shape to match that text. For instance, while performing the sign language action for "follow", the digital human's mouth shape matches "follow". The digital human's facial expression can likewise be controlled; for example, it may remain serious and attentive while expressing the sign language actions of the second sign language text.
In the embodiments of the present disclosure, the natural language text used by hearing people is translated into a first sign language text, and semantic simplification is applied to the first sign language text to obtain a second sign language text. The digital human is then driven according to the second sign language text, so that it expresses the corresponding sign language actions through its body. Because the second sign language text obtained by semantically simplifying the first sign language text can contain fewer action names, the digital human can perform fewer sign language actions in the same amount of time than it would for the first sign language text. It therefore has more time for each action, which ensures that hearing-impaired viewers can see every sign language action clearly.
Figure 5 is a flowchart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the method comprises the following steps:
S501. Obtain multimedia information and determine the natural language text corresponding to the multimedia information.
Specifically, S501 follows the same principle and implementation as S101, which is not repeated here.
S502. Perform semantic simplification on the natural language text to obtain a simplified natural language text.
For example, once a natural language text such as "exit along the guided route and do not linger in the audience area" has been determined, this embodiment may also semantically simplify it. Drawing on the behavior of human sign language interpreters during translation, the natural language text is semantically analyzed, its key information is extracted, and invalid or redundant information is filtered out, yielding a simplified natural language text such as "exit along the guided route".
S503. Translate the simplified natural language text into a first sign language text.
Since the simplified natural language text contains less content, translating "exit along the guided route" into a first sign language text yields correspondingly fewer action names; for example, the first sign language text is "follow/command/road/go". In this embodiment, translating the natural language text, or the simplified natural language text, into the first sign language text can be achieved through machine translation. Machine translation, also called automatic translation, is the process of using a computer to convert one language (the source language) into another language (the target language).
S504. Perform semantic simplification on the first sign language text to obtain a second sign language text.
For example, "follow/command/road/go" can be semantically simplified further to reduce the number of action names; the second sign language text obtained after this step is "follow/command/go", making the second sign language text even more concise. In some embodiments, the natural language text may be referred to as the original text, and the first and second sign language texts may each be referred to as the translation.
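One way to picture semantic simplification is as filtering low-information glosses out of a sign language text. The sketch below is an assumption for illustration only, not the disclosed algorithm: the stop-gloss set is invented, and it reproduces the first-stage example from earlier in this section ("按照/指挥/路/走/留/这/不要" reduced to "按照/指挥/路/走").

```python
# Hypothetical gloss filter: drop glosses from an assumed low-information
# set while preserving the order of the remaining glosses.

STOP_GLOSSES = {"留", "这", "不要"}  # assumed low-information glosses

def simplify(sign_text: str, sep: str = "/") -> str:
    """Return the sign text with stop glosses removed, order preserved."""
    glosses = [g for g in sign_text.split(sep) if g not in STOP_GLOSSES]
    return sep.join(glosses)

first = "按照/指挥/路/走/留/这/不要"   # first sign language text
second = simplify(first)               # simplified sign language text
assert second == "按照/指挥/路/走"
```

A production system would need semantic understanding rather than a fixed stop list, since which glosses are redundant depends on the sentence; the stop list only makes the data flow of S504 concrete.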
S505. Drive the digital human according to the second sign language text, so that the digital human expresses the corresponding sign language actions through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
Specifically, S505 follows the same principle and implementation as S104, which is not repeated here.
Optionally, driving the digital human according to the second sign language text includes: if the multimedia information is a non-real-time audio file or audio-video file, obtaining the start time and end time of each audio signal in the audio file or audio-video file; and, according to the start time and end time, adjusting the speed at which the digital human expresses sign language actions, so that the sign language actions expressed by the digital human are aligned with the audio signal on the timeline.
For example, if the multimedia information is a non-real-time audio file or audio-video file, the server 22 can extract each audio signal from the file, where each audio signal may correspond to one sentence of natural language. The server 22 can then compute the start time and end time of each audio signal, which together may be recorded as the signal's start-end timeline. For each audio signal, the server 22 can adjust the speed at which the digital human expresses sign language actions according to that signal's start and end time, that is, algorithmically adapt the sign language broadcast speed for each sentence, speeding it up or slowing it down, so that the digital human's signing of a sentence is aligned on the timeline with that sentence's audio signal. Here, sign language broadcasting means converting natural language text into sign language text, driving the digital human to express the corresponding sign language actions through its body, and broadcasting with matching facial expressions and mouth shapes. In this embodiment, the digital human may be a virtual character with a digital appearance.
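The per-sentence speed adaptation can be sketched as computing a playback-rate multiplier from the audio segment's start and end times. The function name and the example durations are assumptions for illustration, not values from the disclosure.

```python
# Given one sentence's audio interval [audio_start, audio_end] and the time
# the signing animation would take at its natural speed, compute the rate
# multiplier that makes the signing fill exactly the same interval.

def speed_factor(audio_start: float, audio_end: float,
                 natural_sign_duration: float) -> float:
    """Playback-rate multiplier for the signing animation.

    A value above 1.0 speeds the signing up; below 1.0 slows it down.
    """
    audio_duration = audio_end - audio_start
    if audio_duration <= 0:
        raise ValueError("audio segment must have positive duration")
    return natural_sign_duration / audio_duration

# Assumed example: sentence audio runs from 12.0 s to 15.0 s on the timeline,
# and signing at natural speed would take 4.5 s, so play it at 1.5x.
factor = speed_factor(12.0, 15.0, 4.5)
assert abs(factor - 1.5) < 1e-9
```

Combined with semantic simplification, which shortens `natural_sign_duration`, this keeps the multiplier close to 1.0 so the signing stays legible while remaining aligned with the audio.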
In this embodiment, semantic simplification is applied to both the natural language text and the first sign language text, so that the second sign language text contains as few action names as possible, i.e. is as concise as possible. As a result, when the digital human is driven according to the second sign language text, its signing of a given sentence can be effectively prevented from lagging behind the speed of the spoken sentence, keeping the signing process synchronized with the speaking process and improving information synchrony. In addition, by algorithmically adapting the sign language broadcast speed for each sentence, the sign language broadcast can be aligned with the original audio and video content.
Figure 6 is a flowchart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the method comprises the following steps:
S601. Obtain multimedia information and determine the natural language text corresponding to the multimedia information.
For example, the multimedia information obtained by the server 22 may be at least one of text information, a real-time audio-video stream, an audio file, and an audio-video file, as shown in Figure 7.
If the multimedia information is text information, the natural language text can be obtained through text parsing, as shown in Figure 7. If it is a real-time audio-video stream, real-time ASR is invoked to obtain the natural language text. If it is an audio file, recorded-file ASR is invoked to obtain the natural language text. If it is an audio-video file, the file is first parsed to extract its audio signal, and recorded-file ASR is then invoked to obtain the natural language text.
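The four extraction paths above amount to a dispatch on the media type. In the sketch below, the handler functions (`parse_text`, `realtime_asr`, `file_asr`, `extract_audio`) are placeholders standing in for the real text-parsing, ASR, and video-parsing services; their bodies and the media-type labels are assumptions, not APIs from the disclosure.

```python
def parse_text(payload: str) -> str:
    # Placeholder: a real system would strip markup and normalize the text.
    return payload

def realtime_asr(chunks) -> str:
    # Placeholder for streaming speech recognition over audio chunks.
    return " ".join(chunks)

def file_asr(audio) -> str:
    # Placeholder for recorded-file speech recognition.
    return f"transcript of {audio}"

def extract_audio(av_file: dict):
    # Placeholder for video parsing that pulls out the audio track.
    return av_file["audio"]

def to_natural_language_text(media_type: str, payload) -> str:
    """Route the multimedia input to the matching text-extraction path."""
    if media_type == "text":
        return parse_text(payload)            # text parsing
    if media_type == "live_av_stream":
        return realtime_asr(payload)          # real-time ASR
    if media_type == "audio_file":
        return file_asr(payload)              # recorded-file ASR
    if media_type == "av_file":
        return file_asr(extract_audio(payload))  # video parsing, then ASR
    raise ValueError(f"unsupported media type: {media_type}")

assert to_natural_language_text("text", "hello") == "hello"
assert to_natural_language_text("av_file", {"audio": "track1"}) == "transcript of track1"
```

Every path converges on one natural-language string, which is what allows the downstream translation and simplification steps to be shared across modalities.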
S602. Send the natural language text corresponding to the multimedia information to an operator's terminal.
In this embodiment, the server 22 may send the natural language text corresponding to the multimedia information to the operator's terminal, which displays it. The operator can then modify the natural language text displayed on the terminal, thereby realizing the original-text intervention shown in Figure 7.
S603. Receive the natural language text as modified by the operator.
For example, after the operator modifies the natural language text, the server 22 can receive the modified text from the operator's terminal; the modified natural language text is the post-intervention original text shown in Figure 7. It will be appreciated that, in some other embodiments, the operator may leave the natural language text unmodified.
S604. Translate the natural language text modified by the operator into a first sign language text.
For example, as shown in Figure 7, machine translation can be invoked to translate the modified natural language text, or the original natural language text, into a first sign language text. Specifically, this translation process can be displayed on the operator's terminal, as shown in Figure 8 or Figure 9, where Figure 8 shows the translation of real-time audio and video into sign language animation and Figure 9 shows the translation of text into sign language animation.
S605. Perform semantic simplification on the first sign language text to obtain a second sign language text.
For example, as shown in Figure 7, semantic simplification can be invoked to simplify the first sign language text and obtain a second sign language text, which may be the sign language text result shown in Figure 7.
S606. Send the second sign language text to the operator's terminal.
For example, in this embodiment, the server 22 may also send the second sign language text to the operator's terminal so that the operator can modify it, thereby realizing the translation intervention shown in Figure 7.
S607. Receive the second sign language text as modified by the operator.
For example, after the operator modifies the second sign language text, the server 22 can receive the modified text from the operator's terminal; the modified second sign language text is the post-intervention translation shown in Figure 7. It will be appreciated that, in some other embodiments, the operator may leave the second sign language text unmodified.
S608. Drive the digital human according to the second sign language text modified by the operator, so that the digital human expresses the corresponding sign language actions through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
For example, as shown in Figure 7, the server 22 can drive the digital human according to the operator-modified second sign language text, or according to the unmodified second sign language text. Driving the digital human involves processes such as sign language synthesis, expression synthesis, and mouth shape synthesis. Sign language synthesis controls the digital human to express, through its body, the sign language actions corresponding to the second sign language text. Expression synthesis controls the digital human's facial expression to match that of a hearing person speaking the natural language. Mouth shape synthesis controls the digital human's mouth shape to stay consistent with the second sign language text.
S609. If the multimedia information is a real-time audio stream or audio-video stream, generate a streaming sign language broadcast video stream of the digital human and send the streaming sign language broadcast video stream to a terminal in real time.
For example, as shown in Figure 7, if the multimedia information is a real-time audio stream or audio-video stream, the server 22 can generate a streaming sign language broadcast video stream of the digital human while driving it, and send that stream in real time to a hearing-impaired user's terminal. It will be appreciated that, in some embodiments, the server 22 can deliver both the real-time audio-video stream and the digital human's streaming sign language broadcast video stream to the hearing-impaired user's terminal, so that the terminal can play not only the audio and video that hearing users watch but also the digital human's sign language broadcast video.
Optionally, generating the streaming sign language broadcast video stream of the digital human includes: generating the stream according to the digital human's configuration information, where the configuration information includes at least one of the following: the digital human's background, its color, and its position and size in the user interface.
As shown in Figure 7, the operator can also configure the synthesis effect. For example, the operator's terminal may display a configuration interface with configuration options for the digital human; by operating these options the operator configures the digital human, i.e. generates its configuration information, which may include the digital human's background, its color, and its position and size in the hearing-impaired user's interface. The camera distance shown in Figure 7 controls the digital human's size in that interface. Specifically, the server 22 can generate the digital human's streaming sign language broadcast video stream according to this configuration information. In addition, the operator can configure whether subtitles are displayed; when subtitles are enabled, hearing-impaired viewers can read the subtitles while watching the digital human sign, improving comprehension.
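One possible shape for this configuration information is a small record type; the field names, defaults, and normalized-coordinate convention below are assumptions for illustration, not the format used in the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DigitalHumanConfig:
    """Assumed rendering configuration for the sign language broadcast."""
    background: str = "studio"        # background scene name
    color: str = "#FFFFFF"            # e.g. clothing or theme color
    position: tuple = (0.8, 0.8)      # normalized (x, y) in the user interface
    size: float = 0.25                # fraction of screen height ("camera distance")
    show_subtitles: bool = True       # whether to render subtitles alongside

# An operator might enlarge the digital human for readability:
cfg = DigitalHumanConfig(size=0.4)
assert cfg.show_subtitles and cfg.size == 0.4
```

Keeping the configuration in one record means the same object can parameterize both the streaming path (S609) and the offline file path (S610).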
S610. If the multimedia information is a non-real-time audio file, audio-video file, or text file, generate a sign language broadcast video file of the digital human and send the sign language broadcast video file to a terminal.
For example, as shown in Figure 7, if the multimedia information is text information, an audio file, or an audio-video file, the server 22 can generate a sign language broadcast video file of the digital human while driving it, and send that file to a hearing-impaired user's terminal. It will be appreciated that, in some embodiments, the server 22 can deliver both the multimedia information and the digital human's sign language broadcast video file to the hearing-impaired user's terminal, so that the terminal can play not only the text information, audio file, or audio-video file that hearing users consume but also the digital human's sign language broadcast video file.
Optionally, generating the sign language broadcast video file of the digital human includes: generating the file according to the digital human's configuration information, where the configuration information includes at least one of the following: the digital human's background, its color, and its position and size in the user interface.
Specifically, the server 22 can generate the digital human's sign language broadcast video file according to the digital human's configuration information; the source and content of this configuration information are as described above and are not repeated here. In this embodiment, the digital human's configuration information may specifically be configured by an operator.
By integrating technologies such as real-time speech recognition, recorded-file speech recognition, and video parsing, this embodiment supports multiple modalities, namely plain text, real-time audio and video, and offline audio-video files, giving it a wider range of application scenarios. Moreover, the sign language broadcasting provided by this embodiment involves multiple interlocking algorithmic steps, where the output of each step affects the input of the next; this solution can output an independent result for each step of the sign language broadcast, making it easy to trace and locate problems in the pipeline. Furthermore, sign language presentation is not only body and hand movement: on top of sign language synthesis, mouth shape synthesis and expression synthesis fuse body posture with facial expression and mouth shape, linking multiple channels of information to convey it to hearing-impaired viewers more effectively. Because sign language broadcasting involves diverse algorithms, 100% accuracy is difficult to achieve; at the same time, its application scenarios are diverse, and different scenarios impose different requirements on the final presentation of the broadcast. Therefore, by providing a visual interface that empowers operators to intervene in and edit the natural language text and the sign language text, the human-machine collaboration mechanism improves the accuracy of sign language translation and the end-to-end result.
Figure 10 is a schematic structural diagram of a digital human sign language broadcasting apparatus provided by an embodiment of the present disclosure. The apparatus can execute the processing flow provided by the digital human sign language broadcasting method embodiments above. As shown in Figure 10, the digital human sign language broadcasting apparatus 100 includes:
an acquisition module 101, configured to obtain multimedia information;
a determination module 102, configured to determine the natural language text corresponding to the multimedia information;
a translation module 103, configured to translate the natural language text into a first sign language text;
a processing module 104, configured to perform semantic simplification on the first sign language text to obtain a second sign language text;
a driving module 105, configured to drive the digital human according to the second sign language text, so that the digital human expresses the corresponding sign language actions through its body.
Optionally, when the driving module 105 drives the digital human according to the second sign language text so that it expresses the corresponding sign language actions through its body, this specifically includes: driving the digital human according to the second sign language text so that the digital human expresses the sign language actions corresponding to the second sign language text through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
Optionally, the processing module 104 is further configured to, after the determination module 102 determines the natural language text corresponding to the multimedia information, perform semantic simplification on the natural language text to obtain a simplified natural language text. The translation module 103 is then specifically configured to translate the simplified natural language text into a first sign language text.
Optionally, the driving module 105 includes an acquisition unit 1051 and an adjustment unit 1052. The acquisition unit 1051 is configured to, when the multimedia information is a non-real-time audio file or audio-video file, obtain the start time and end time of each audio signal in the file. The adjustment unit 1052 is configured to adjust, according to the start time and end time, the speed at which the digital human expresses sign language actions, so that the sign language actions expressed by the digital human are aligned with the audio signal on the timeline.
Optionally, the digital human sign language broadcasting apparatus 100 further includes a sending module 106 and a receiving module 107. The sending module 106 is configured to, after the determination module 102 determines the natural language text corresponding to the multimedia information, send that natural language text to an operator's terminal; the receiving module 107 is configured to receive the natural language text modified by the operator. The translation module 103 is then specifically configured to translate the operator-modified natural language text into a first sign language text.
Optionally, the sending module 106 is further configured to, after the processing module 104 semantically simplifies the first sign language text to obtain the second sign language text, send the second sign language text to the operator's terminal; the receiving module 107 is further configured to receive the second sign language text modified by the operator. The driving module 105 is then specifically configured to drive the digital human according to the operator-modified second sign language text.
Optionally, the digital human sign language broadcasting apparatus 100 further includes a generation module 108, configured to, after the driving module 105 drives the digital human according to the second sign language text: if the multimedia information is a real-time audio stream or audio-video stream, generate a streaming sign language broadcast video stream of the digital human and send it to a terminal in real time; and if the multimedia information is a non-real-time audio file, audio-video file, or text file, generate a sign language broadcast video file of the digital human and send it to a terminal. Optionally, the terminal may be a hearing-impaired user's terminal.
Optionally, when generating the digital human's streaming sign language broadcast video stream, the generation module 108 is specifically configured to generate the stream according to the digital human's configuration information; likewise, when generating the digital human's sign language broadcast video file, the generation module 108 is specifically configured to generate the file according to the digital human's configuration information. The configuration information includes at least one of the following: the digital human's background, its color, and its position and size in the user interface. The configuration information may be configured by an operator.
The digital human sign language broadcasting apparatus of the embodiment shown in Figure 10 can be used to execute the technical solutions of the method embodiments above; its implementation principles and technical effects are similar and are not repeated here.
以上描述了数字人手语播报装置的内部功能和结构,该装置可实现为一种电子设备。图11为本公开实施例提供的电子设备实施例的结构示意图。如图11所示,该电子设备包括存储器111和处理器112。The above describes the internal functions and structure of the digital human sign language broadcasting device, which can be implemented as an electronic device. FIG. 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure. As shown in FIG. 11 , the electronic device includes a memory 111 and a processor 112 .
存储器111用于存储程序。除上述程序之外,存储器111还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 111 is used to store programs. In addition to the above-mentioned programs, the memory 111 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
存储器111可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 111 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
处理器112与存储器111耦合,执行存储器111所存储的程序,以用于:The processor 112 is coupled to the memory 111 and executes the program stored in the memory 111 for:
获取多媒体信息,并确定所述多媒体信息对应的自然语言文本;Obtain multimedia information and determine the natural language text corresponding to the multimedia information;
将所述自然语言文本翻译为第一手语文本;Translate the natural language text into a first sign language text;
对所述第一手语文本进行语义精简处理,得到第二手语文本;Perform semantic simplification processing on the first sign language text to obtain a second sign language text;
根据所述第二手语文本驱动数字人,使得所述数字人通过肢体将所述第二手语文本对应的手语动作表达出来。The digital human is driven according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
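The four processor steps above form a linear pipeline: obtain text, translate to a first sign language text, simplify to a second sign language text, then drive the digital human. A minimal sketch of that pipeline follows; every function body here is a placeholder standing in for the actual ASR, translation, simplification, and animation components, none of which are disclosed as code in this document:

```python
# Hedged sketch of the four-step pipeline executed by processor 112.
# All function bodies are toy placeholders, not the patented implementations.

def determine_natural_text(multimedia_info: str) -> str:
    """Placeholder for ASR/extraction: determine the natural language text."""
    return multimedia_info  # assume the input is already text in this sketch

def translate_to_first_sign_text(natural_text: str) -> list:
    """Placeholder translation into a first sign language text (gloss sequence)."""
    return natural_text.split()

def semantic_simplify(first_sign_text: list) -> list:
    """Drop filler glosses to obtain the second sign language text."""
    fillers = {"the", "a", "an", "is"}  # assumed filler set for illustration
    return [g for g in first_sign_text if g.lower() not in fillers]

def drive_digital_human(second_sign_text: list) -> list:
    """Placeholder driver: map each gloss to a body-motion identifier."""
    return [f"motion:{g}" for g in second_sign_text]

def broadcast(multimedia_info: str) -> list:
    text = determine_natural_text(multimedia_info)
    first = translate_to_first_sign_text(text)
    second = semantic_simplify(first)
    return drive_digital_human(second)

print(broadcast("the weather is sunny today"))
# → ['motion:weather', 'motion:sunny', 'motion:today']
```

Note how the simplification step shortens the gloss sequence before animation, which is the stated purpose of producing a second sign language text.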
进一步,如图11所示,电子设备还可以包括:通信组件113、电源组件114、音频组件115、显示器116等其它组件。图11中仅示意性给出部分组件,并不意味着电子设备只包括图11所示组件。Further, as shown in FIG. 11 , the electronic device may also include: a communication component 113 , a power supply component 114 , an audio component 115 , a display 116 and other components. Only some components are schematically shown in FIG. 11 , which does not mean that the electronic device only includes the components shown in FIG. 11 .
通信组件113被配置为便于电子设备和其他设备之间有线或无线方式的通信。电子设备可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件113经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件113还包括近场通信(NFC)模块,以促进短程通信。例如,NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。The communication component 113 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 113 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 113 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
电源组件114,为电子设备的各种组件提供电力。电源组件114可以包括电源管理系统,一个或多个电源,及其他与为电子设备生成、管理和分配电力相关联的组件。The power supply component 114 provides power to various components of the electronic device. Power supply components 114 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic devices.
音频组件115被配置为输出和/或输入音频信号。例如,音频组件115包括一个麦克风(MIC),当电子设备处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器111或经由通信组件113发送。在一些实施例中,音频组件115还包括一个扬声器,用于输出音频信号。Audio component 115 is configured to output and/or input audio signals. For example, the audio component 115 includes a microphone (MIC) configured to receive external audio signals when the electronic device is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signal may be further stored in memory 111 or sent via communication component 113 . In some embodiments, audio component 115 also includes a speaker for outputting audio signals.
显示器116包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。Display 116 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
另外,本公开实施例还提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行以实现上述实施例所述的数字人手语播报方法。In addition, embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the digital human sign language broadcasting method described in the above embodiments.
需要说明的是,在本文中,诸如“第一”和“第二”等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements, but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
以上所述仅是本公开的具体实施方式,使本领域技术人员能够理解或实现本公开。对这些实施例的多种修改对本领域的技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本公开的精神或范围的情况下,在其它实施例中实现。因此,本公开将不会被限制于本文所述的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。 The above descriptions are only specific embodiments of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be practiced in other embodiments without departing from the spirit or scope of the disclosure. Therefore, the present disclosure is not to be limited to the embodiments described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. 一种数字人手语播报方法,其中,所述方法包括:A digital human sign language broadcasting method, wherein the method includes:
    获取多媒体信息,并确定所述多媒体信息对应的自然语言文本;Obtain multimedia information and determine the natural language text corresponding to the multimedia information;
    将所述自然语言文本翻译为第一手语文本;Translate the natural language text into a first sign language text;
    对所述第一手语文本进行语义精简处理,得到第二手语文本;Perform semantic simplification processing on the first sign language text to obtain a second sign language text;
    根据所述第二手语文本驱动数字人,使得所述数字人通过肢体将所述第二手语文本对应的手语动作表达出来。The digital human is driven according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  2. 根据权利要求1所述的方法,其中,确定所述多媒体信息对应的自然语言文本之后,所述方法还包括:The method according to claim 1, wherein after determining the natural language text corresponding to the multimedia information, the method further includes:
    对所述自然语言文本进行语义精简处理,得到精简处理后的自然语言文本;Perform semantic streamlining processing on the natural language text to obtain a streamlined natural language text;
    将所述自然语言文本翻译为第一手语文本,包括:Translating the natural language text into a first sign language text includes:
    将所述精简处理后的自然语言文本翻译为第一手语文本。Translate the streamlined natural language text into a first sign language text.
  3. 根据权利要求1所述的方法,其中,根据所述第二手语文本驱动数字人,包括:The method of claim 1, wherein driving the digital person according to the second sign language text includes:
    若所述多媒体信息是非实时的音频文件或音视频文件,则获取所述音频文件或所述音视频文件中每个音频信号的起始时间和终止时间;If the multimedia information is a non-real-time audio file or audio and video file, obtain the start time and end time of each audio signal in the audio file or audio and video file;
    根据所述起始时间和所述终止时间,调整所述数字人表达手语动作的速度,使得所述数字人表达的手语动作和所述音频信号在时间轴上对齐。According to the start time and the end time, the speed at which the digital human expresses sign language movements is adjusted so that the sign language movements expressed by the digital human and the audio signal are aligned on the time axis.
  4. 根据权利要求1所述的方法,其中,根据所述第二手语文本驱动数字人之后,所述方法还包括:The method of claim 1, wherein after driving the digital person according to the second sign language text, the method further includes:
    若所述多媒体信息是实时的音频流或音视频流,则生成所述数字人的流式手语播报视频流,并将所述流式手语播报视频流实时的发送给终端;If the multimedia information is a real-time audio stream or audio and video stream, generate the digital human's streaming sign language broadcast video stream, and send the streaming sign language broadcast video stream to the terminal in real time;
    若所述多媒体信息是非实时的音频文件、音视频文件或文本文件,则生成所述数字人的手语播报视频文件,并将所述手语播报视频文件发送给终端。If the multimedia information is a non-real-time audio file, audio and video file or text file, a sign language broadcast video file of the digital person is generated, and the sign language broadcast video file is sent to the terminal.
  5. 根据权利要求4所述的方法,其中,生成所述数字人的流式手语播报视频流,包括:The method of claim 4, wherein generating the digital human's streaming sign language broadcast video stream includes:
    根据所述数字人的配置信息,生成所述数字人的流式手语播报视频流;Generate a streaming sign language broadcast video stream of the digital person according to the configuration information of the digital person;
    生成所述数字人的手语播报视频文件,包括:Generate the sign language broadcast video file of the digital person, including:
    根据所述数字人的配置信息,生成所述数字人的手语播报视频文件;Generate a sign language broadcast video file of the digital person according to the configuration information of the digital person;
    其中,所述数字人的配置信息包括如下至少一种:Wherein, the configuration information of the digital human includes at least one of the following:
    所述数字人的背景、颜色、所述数字人在用户界面中的位置和尺寸。The background, color, and position and size of the digital human in the user interface.
  6. 一种数字人手语播报装置,其中,包括:A digital human sign language broadcasting device, which includes:
    获取模块,用于获取多媒体信息;Acquisition module, used to obtain multimedia information;
    确定模块,用于确定所述多媒体信息对应的自然语言文本;A determination module, used to determine the natural language text corresponding to the multimedia information;
    翻译模块,用于将所述自然语言文本翻译为第一手语文本;A translation module for translating the natural language text into a first sign language text;
    处理模块,用于对所述第一手语文本进行语义精简处理,得到第二手语文本; A processing module for performing semantic simplification processing on the first sign language text to obtain a second sign language text;
    驱动模块,用于根据所述第二手语文本驱动数字人,使得所述数字人通过肢体将所述第二手语文本对应的手语动作表达出来。A driving module is used to drive the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  7. 根据权利要求6所述的装置,其中,所述处理模块还用于在所述确定模块确定所述多媒体信息对应的自然语言文本之后,对所述自然语言文本进行语义精简处理,得到精简处理后的自然语言文本;The device according to claim 6, wherein the processing module is further configured to, after the determining module determines the natural language text corresponding to the multimedia information, perform semantic simplification processing on the natural language text to obtain the streamlined natural language text;
    相应的,所述翻译模块具体用于:将所述精简处理后的自然语言文本翻译为第一手语文本。Correspondingly, the translation module is specifically configured to translate the streamlined natural language text into a first sign language text.
  8. 根据权利要求6所述的装置,其中,所述驱动模块包括获取单元和调整单元;The device according to claim 6, wherein the driving module includes an acquisition unit and an adjustment unit;
    所述获取单元用于当所述多媒体信息是非实时的音频文件或音视频文件时,获取所述音频文件或所述音视频文件中每个音频信号的起始时间和终止时间;The acquisition unit is configured to acquire the start time and end time of each audio signal in the audio file or the audio and video file when the multimedia information is a non-real-time audio file or audio and video file;
    所述调整单元用于根据所述起始时间和所述终止时间,调整所述数字人表达手语动作的速度,使得所述数字人表达的手语动作和所述音频信号在时间轴上对齐。The adjustment unit is configured to adjust the speed at which the digital human expresses sign language movements according to the start time and the end time, so that the sign language movements expressed by the digital human and the audio signal are aligned on the time axis.
  9. 一种电子设备,其中,包括:An electronic device, including:
    存储器;memory;
    处理器;以及processor; and
    计算机程序;Computer program;
    其中,所述计算机程序存储在所述存储器中,并被配置为由所述处理器执行以实现如权利要求1-5中任一项所述的方法。Wherein, the computer program is stored in the memory and configured to be executed by the processor to implement the method according to any one of claims 1-5.
  10. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求1-5中任一项所述的方法。 A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1-5 is implemented.
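The time-axis alignment recited in claim 3 (adjusting the speed at which the digital human signs so that the signing spans each audio segment's start and end time) can be sketched as a speed-factor computation. The nominal per-sign duration and all names below are assumptions for illustration; the patent specifies no concrete values:

```python
# Illustrative sketch of claim 3's alignment step: given an audio segment's
# start and end time, scale the signing speed so the gloss sequence fits the
# same window. NOMINAL_SIGN_SECONDS is an assumed default, not a patent value.

NOMINAL_SIGN_SECONDS = 0.8  # assumed natural duration of one sign

def speed_factor(start: float, end: float, num_signs: int) -> float:
    """Return the playback-speed multiplier that makes num_signs signs
    (at the nominal duration) span exactly the [start, end] audio window.
    A factor > 1 speeds the avatar up; < 1 slows it down."""
    window = end - start
    if window <= 0 or num_signs == 0:
        return 1.0  # degenerate segment: leave the signing speed unchanged
    natural_duration = num_signs * NOMINAL_SIGN_SECONDS
    return natural_duration / window

# Four signs (3.2 s at natural speed) over a 2-second audio segment
# must be played 1.6x faster to stay aligned on the time axis.
print(speed_factor(10.0, 12.0, 4))  # → 1.6
```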
PCT/CN2023/105575 2022-07-04 2023-07-03 Digital human sign language broadcasting method and apparatus, device, and storage medium WO2024008047A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210785961.2A CN115359796A (en) 2022-07-04 2022-07-04 Digital human voice broadcasting method, device, equipment and storage medium
CN202210785961.2 2022-07-04

Publications (1)

Publication Number Publication Date
WO2024008047A1 true WO2024008047A1 (en) 2024-01-11

Family

ID=84030342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105575 WO2024008047A1 (en) 2022-07-04 2023-07-03 Digital human sign language broadcasting method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN115359796A (en)
WO (1) WO2024008047A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359796A (en) * 2022-07-04 2022-11-18 阿里巴巴(中国)有限公司 Digital human voice broadcasting method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210026006A (en) * 2019-08-29 2021-03-10 조용구 Sign language translation system and method for converting voice of video into avatar and animation
CN113835522A (en) * 2021-09-10 2021-12-24 阿里巴巴达摩院(杭州)科技有限公司 Sign language video generation, translation and customer service method, device and readable medium
CN114157920A (en) * 2021-12-10 2022-03-08 深圳Tcl新技术有限公司 Playing method and device for displaying sign language, smart television and storage medium
CN114546326A (en) * 2022-02-22 2022-05-27 浙江核新同花顺网络信息股份有限公司 Virtual human sign language generation method and system
CN115359796A (en) * 2022-07-04 2022-11-18 阿里巴巴(中国)有限公司 Digital human voice broadcasting method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115359796A (en) 2022-11-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834814

Country of ref document: EP

Kind code of ref document: A1