WO2024008047A1 - Digital human sign language broadcasting method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2024008047A1
WO2024008047A1, PCT/CN2023/105575, CN2023105575W
Authority
WO
WIPO (PCT)
Prior art keywords
sign language
language text
text
digital
digital human
Prior art date
Application number
PCT/CN2023/105575
Other languages
French (fr)
Chinese (zh)
Inventor
韩玉洁
谭启敏
吴淑明
张家硕
张泽旋
周靖坤
祖新星
王琪
Original Assignee
阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Publication of WO2024008047A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • The present disclosure relates to the field of information technology, and in particular to a digital human sign language broadcasting method, apparatus, device and storage medium.
  • Multimedia information usually includes text, audio, video, etc.
  • Hearing-impaired people rely on sign language that is cognitively appropriate to them. Therefore, there is a need to convert natural language speech and text information into sign language so that it can be understood by hearing-impaired people.
  • The inventor of this application found that, for the same sentence, a normal person's speaking rate is usually faster than the rate of a digital human's sign language movements. If the digital human's sign language movements are required to align in time with the speaker's speech, the sign language movements must be sped up, or the playback speed of the video of those movements must be increased, with the result that hearing-impaired people cannot see the sign language movements clearly.
  • The present disclosure provides a digital human sign language broadcasting method, apparatus, device and storage medium, which give the digital human more time to perform each sign language action, thereby ensuring that hearing-impaired people can clearly see every sign language movement.
  • Embodiments of the present disclosure provide a digital human sign language broadcasting method, including: obtaining multimedia information; determining the natural language text corresponding to the multimedia information; translating the natural language text into a first sign language text; performing semantic simplification processing on the first sign language text to obtain a second sign language text; and driving the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • an embodiment of the present disclosure provides a digital human sign language broadcasting device, including:
  • an acquisition module, used to obtain multimedia information;
  • a determination module, used to determine the natural language text corresponding to the multimedia information;
  • a translation module, used to translate the natural language text into a first sign language text;
  • a processing module, used to perform semantic simplification processing on the first sign language text to obtain a second sign language text;
  • a driving module, used to drive the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • an electronic device, including: a processor and a memory, wherein a computer program is stored in the memory and is configured to be executed by the processor to implement the method described in the first aspect.
  • embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the method described in the first aspect.
  • The digital human sign language broadcasting method, apparatus, device and storage medium provided by the embodiments of the present disclosure translate the natural language text used by normal people into a first sign language text, and perform semantic simplification processing on the first sign language text to obtain a second sign language text. The digital human is then driven according to the second sign language text, so that it expresses the corresponding sign language movements through the body. Because the second sign language text obtained by semantically simplifying the first sign language text can contain fewer action names, the digital human performs fewer sign language movements in the same amount of time than it would for the first sign language text. This gives the digital human more time to perform each sign language movement, thereby ensuring that hearing-impaired people can clearly see each one.
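The end-to-end flow summarized above can be sketched as a small pipeline. This is an illustrative sketch only: every function here (`determine_natural_text`, `translate_to_first_sign_text`, and so on) is a hypothetical stub standing in for the ASR, machine translation, semantic simplification, and rendering components, not an API from the disclosure.

```python
# Hypothetical stand-ins for the five steps of the disclosed method.
def determine_natural_text(multimedia: str) -> str:
    # Assume text information; audio would instead go through ASR.
    return multimedia

def translate_to_first_sign_text(natural_text: str) -> str:
    # Stubbed machine-translation output (example gloss from this embodiment).
    return "Follow/guide/route/go/stay/this/don't"

def simplify_to_second_sign_text(first_sign_text: str) -> str:
    # Stubbed semantic simplification: fewer action names remain.
    return "Follow/guide/route/go"

def drive_digital_human(second_sign_text: str) -> str:
    # In practice this drives body, mouth shape, and expression synthesis.
    return f"digital human performs: {second_sign_text}"

info = "Exit the venue according to the guided route and do not stay in the audience area"
natural = determine_natural_text(info)
first = translate_to_first_sign_text(natural)
second = simplify_to_second_sign_text(first)
print(drive_digital_human(second))
```

The point of the sketch is only the ordering of stages: simplification happens on the sign language text, after translation and before the digital human is driven.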
  • Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • Figure 5 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure.
  • Figure 6 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure.
  • Figure 7 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of an operator user interface provided by another embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of an operator user interface provided by another embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a digital human sign language broadcasting device provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure.
  • As noted above, a normal person's speaking rate is usually faster than the rate of a digital human's sign language movements. If the digital human's sign language movements are required to align in time with the speaker's speech, the movements must be sped up, or the playback speed of the video of those movements must be increased, so that hearing-impaired people cannot see the sign language movements clearly. To address this problem, embodiments of the present disclosure provide a digital human sign language broadcasting method, which is introduced below with reference to specific embodiments.
  • Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure.
  • the method can be executed by a digital human sign language broadcasting device, which can be implemented in the form of software and/or hardware.
  • The device can be configured in an electronic device, such as a server or a terminal, where the terminal may specifically be a mobile phone, a computer, a tablet computer, etc.
  • the digital human sign language broadcasting method described in this embodiment can be applied to the application scenario shown in Figure 2.
  • the application scenario includes a terminal 21 and a server 22, where the server 22 can obtain multimedia information from other terminals or other servers, and generate a sign language animation of a digital person signing based on the multimedia information.
  • the server 22 can send the sign language animation of the digital person signing to the terminal 21.
  • the terminal 21 can be a terminal for the hearing-impaired, so that the hearing-impaired can understand the meaning expressed by the multimedia information.
  • The method is described in detail below with reference to Figure 2. As shown in Figure 1, the specific steps of the method are as follows:
  • the server 22 can obtain multimedia information from other terminals or other servers, and the multimedia information can be text information, audio information, or audio and video information.
  • the audio information may be a real-time audio stream or an offline audio file.
  • the audio and video information can be real-time audio and video streams, or can be offline audio and video files.
  • the terminal 23 can send a live audio and video stream to the server 22 in real time.
  • the server 22 can not only forward the live audio and video stream to the terminal 21, but also send a video stream of digital people signing to the terminal 21.
  • The digital human uses sign language to express the meaning of the audio signals or subtitles in the live audio and video stream, so that hearing-impaired people can follow the online live broadcast.
  • the server 24 sends the live TV program to the server 22 in real time.
  • the live TV program is sent to the server 22 in the form of streaming media.
  • the digital person generated by the server 22 can assist the hearing-impaired to watch the live TV program.
  • The server 22 can also generate multimedia information, such as film and television information, education and training videos, etc., so that hearing-impaired people can watch them with the help of the digital human generated by the server 22.
  • hearing-impaired people and normal people can also conduct online or offline meetings through their respective terminals.
  • Assume that terminal 21 is a terminal used by a hearing-impaired person and terminal 23 is a terminal used by a normal person.
  • The hearing-impaired person conducts a remote online meeting with the normal person through their respective terminals.
  • The terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22.
  • The server 22 generates a video stream of the digital human signing based on the meaning expressed by the normal person, and sends this video stream to the terminal 21 in real time, to help the hearing-impaired person understand what the normal person says. Alternatively, the hearing-impaired person and the normal person hold an offline meeting through their respective terminals.
  • For example, the hearing-impaired person and the normal person are in the same conference room; the terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22.
  • The server 22 translates the normal person's natural language into sign language movements in real time and streams the video of the digital human signing to the terminal 21, so that the hearing-impaired person can understand in real time what the normal person says.
  • the terminal 21 can also be a large screen in public places such as airports, train stations, sports venues, etc.
  • The terminal 21 can play videos of the digital human signing, so that hearing-impaired people in public places can access current information anytime and anywhere. It can be understood that the method described in this embodiment is not limited to these scenarios; it can also be applied to other application scenarios, which will not be described again here.
  • the text information can be used as the natural language text corresponding to the multimedia information.
  • The natural language text corresponding to the multimedia information may be text converted from the audio information using Automatic Speech Recognition (ASR) technology.
  • the audio and video information can be parsed to extract the audio components from the audio and video information, and use ASR technology to convert the audio components into text.
  • the text can be used as the natural language text corresponding to the multimedia information.
  • Sign language forms meanings or words through gestures, using changes in gesture to simulate images or syllables. It is the hand language with which people with hearing or speech impairments communicate and exchange ideas. Because sign language is a visual language, it differs greatly from natural language text in word usage and grammatical rules. For example, "Exit the venue according to the guided route and do not stay in the audience area" is a natural language text, and the corresponding sign language text is "Follow/guide/route/go/stay/this/don't". Therefore, the natural language text needs to be translated into a sign language text; here, the sign language text translated from the natural language text is recorded as the first sign language text.
  • "Follow/guide/route/go/stay/this/don't" can thus be used as a first sign language text.
  • the first sign language text consists of multiple action names, and adjacent action names are separated by "/".
  • Each action name can correspond to a coherent sign language action, that is, different action names are used to distinguish different sign language actions.
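Because adjacent action names are delimited by "/", recovering the individual sign language actions from a sign language text is a simple split. A minimal sketch using the example gloss from this embodiment:

```python
first_sign_text = "Follow/guide/route/go/stay/this/don't"

# Each "/"-separated token is one action name, i.e. one coherent sign language action.
action_names = first_sign_text.split("/")

print(len(action_names))  # 7 sign language actions before simplification
print(action_names[0])    # Follow
```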
  • This embodiment proposes a solution: after obtaining a first sign language text such as "Follow/guide/route/go/stay/this/don't", semantic simplification processing is performed on the first sign language text to obtain a second sign language text.
  • For example, the second sign language text is "Follow/guide/route/go". Assume the time required for a normal person to say "Exit the venue according to the guided route and do not stay in the audience area" is recorded as t1. Before the first sign language text is semantically simplified, the digital human needs to perform 7 sign language actions within the duration t1; after simplification, it only needs to perform 4 sign language actions within the same duration t1. The digital human therefore has more time to perform each sign language movement, ensuring that hearing-impaired people can clearly see each one.
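The benefit of simplification can be made concrete with a little arithmetic: the sentence duration t1 is fixed, so fewer action names means more time per action. A quick check with an assumed, purely illustrative t1 of 4.2 seconds:

```python
t1 = 4.2  # assumed duration (seconds) of the spoken sentence; illustrative only

actions_before = 7  # "Follow/guide/route/go/stay/this/don't"
actions_after = 4   # "Follow/guide/route/go"

time_per_action_before = t1 / actions_before
time_per_action_after = t1 / actions_after

print(round(time_per_action_before, 2))  # 0.6 seconds per sign language action
print(round(time_per_action_after, 2))   # 1.05 seconds per sign language action
```

Within the same t1, each action's time budget grows from t1/7 to t1/4, a 75% increase, without speeding up playback.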
  • The server 22 can drive the digital human according to each action name in the second sign language text, so that the digital human expresses the sign language movement corresponding to each action name through its limbs, such as its hands.
  • Optionally, driving the digital human according to the second sign language text so that the digital human expresses the corresponding sign language movements through the body includes: driving the digital human according to the second sign language text so that the digital human expresses the sign language movements corresponding to the second sign language text through the body, with the digital human's mouth shape and expression consistent with the second sign language text.
  • For example, while the server 22 drives the digital human according to each action name in the second sign language text, it can also control the digital human's mouth shape to be consistent with the second sign language text.
  • For example, while the digital human performs the sign language action corresponding to "follow", the digital human's mouth shape is consistent with "follow".
  • The digital human's expression can also be controlled. For example, while the digital human expresses the sign language movements corresponding to the second sign language text, its expression can remain serious.
  • The embodiment of the present disclosure translates the natural language text used by normal people into a first sign language text and performs semantic simplification processing on the first sign language text to obtain a second sign language text. The digital human is then driven according to the second sign language text, so that it expresses the corresponding sign language movements through the body. Because the second sign language text can contain fewer action names than the first sign language text, the digital human performs fewer sign language movements in the same amount of time, giving it more time for each movement and ensuring that hearing-impaired people can clearly see each one.
  • FIG. 5 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the specific steps of this method are as follows:
  • the natural language text can also be semantically simplified.
  • Drawing on the behavior of human sign language translation experts during translation, the natural language text can be semantically understood: key information is extracted and invalid or redundant information is filtered out, yielding a simplified natural language text, for example, "Exit the venue according to the guided route".
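A toy sketch of this key-information extraction, with the redundant clause hard-coded for illustration (a real system would identify it through semantic understanding, which is beyond this sketch):

```python
def simplify_natural_text(text: str, redundant_clauses: list) -> str:
    """Keep key information by deleting clauses marked as redundant."""
    for clause in redundant_clauses:
        text = text.replace(clause, "")
    # Tidy up leftover separators and whitespace.
    return " ".join(text.replace(" ,", ",").split()).strip(" ,")

original = "Exit the venue according to the guided route and do not stay in the audience area"
simplified = simplify_natural_text(original, ["and do not stay in the audience area"])
print(simplified)  # Exit the venue according to the guided route
```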
  • Translating the natural language text, or the simplified natural language text, into the first sign language text can be achieved through machine translation, also called automatic translation, which is the process of using a computer to convert one language (the source language) into another language (the target language).
  • In this case, the second sign language text is "Follow/guide/go", which makes the second sign language text more concise.
  • the natural language text can be recorded as the original text, and the first sign language text and the second sign language text can be recorded as the translation text respectively.
  • Optionally, driving the digital human according to the second sign language text includes: if the multimedia information is a non-real-time audio file or audio and video file, obtaining the start time and end time of each audio signal in the file, and adjusting, according to the start time and end time, the speed at which the digital human expresses sign language movements, so that the sign language movements expressed by the digital human and the audio signal are aligned on the time axis.
  • For example, the server 22 can obtain each audio signal from the audio file or audio and video file; each audio signal can be the audio of one natural language sentence. Further, the server 22 can calculate the start time and end time of each audio signal, recorded as the start and end points on the time axis. For each audio signal, the server 22 can adjust the speed at which the digital human expresses sign language movements according to that signal's start and end time, that is, automatically adapt the sign language broadcast speed to different sentences, speeding up or slowing down the broadcast so that the digital human's expression of the sign language movements for a sentence is aligned on the time axis with the audio signal of that sentence.
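This per-sentence speed adaptation amounts to scaling the sign animation so it fills the audio segment between the start time and the end time. A hedged sketch, where the nominal duration of one sign language action is an assumed constant rather than a value from the disclosure:

```python
NOMINAL_ACTION_SECONDS = 1.0  # assumed default duration of one sign language action

def speed_factor(num_actions: int, start: float, end: float) -> float:
    """Playback-rate multiplier aligning the sign animation with the audio segment.

    A value above 1.0 means the digital human must sign faster than nominal;
    below 1.0 means it can sign slower and still fill the segment.
    """
    audio_duration = end - start
    nominal_duration = num_actions * NOMINAL_ACTION_SECONDS
    return nominal_duration / audio_duration

# A sentence with 4 sign language actions spoken between t=10.0s and t=15.0s:
print(speed_factor(4, 10.0, 15.0))  # 0.8 -> sign slightly slower to fill the audio
```

Combined with the earlier simplification step, fewer action names lower `num_actions` and thus the required speed factor, which is exactly why the digital human gains time per movement.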
  • Sign language broadcasting converts natural language text into sign language text, drives the digital human to express the sign language movements corresponding to the sign language text through the body, and coordinates the digital human's corresponding facial expressions and mouth shapes during the broadcast.
  • the digital person may be a virtual character with a digital appearance.
  • This embodiment performs semantic simplification processing on the natural language text and the first sign language text respectively, so that the second sign language text contains as few action names as possible, that is, the second sign language text is as concise as possible. By driving the digital human based on the second sign language text for the same sentence, the digital human's sign language movements can be effectively prevented from lagging behind the speed of a normal person's speech, allowing the sign language movement process to stay synchronized with the speaking process and improving information synchronization. In addition, by algorithmically adapting the sign language broadcast speed to different sentences, alignment between the sign language broadcast and the original audio and video content can be achieved.
  • Figure 6 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the specific steps of this method are as follows:
  • the multimedia information obtained by the server 22 may be at least one of text information, real-time audio and video streams, audio files, and audio and video files as shown in FIG. 7 .
  • If the multimedia information is text information, the natural language text can be obtained through text parsing, as shown in Figure 7. If the multimedia information is a real-time audio and video stream, real-time ASR is called to obtain the natural language text. If the multimedia information is an audio file, recording-file ASR is called to obtain the natural language text. If the multimedia information is an audio and video file, the file is first parsed to extract its audio signal, and then recording-file ASR is called to obtain the natural language text.
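The branching just described can be sketched as a small dispatcher. The helpers `real_time_asr`, `recording_file_asr`, and `extract_audio` are hypothetical stand-ins for the corresponding ASR and parsing services:

```python
# Hypothetical service stubs; real ones would call ASR and media-parsing backends.
def real_time_asr(stream: str) -> str:
    return f"text<{stream}>"

def recording_file_asr(audio: str) -> str:
    return f"text<{audio}>"

def extract_audio(av_file: str) -> str:
    return f"audio<{av_file}>"

def to_natural_text(kind: str, payload: str) -> str:
    if kind == "text":          # text information: use directly after parsing
        return payload
    if kind == "av_stream":     # real-time audio/video stream: real-time ASR
        return real_time_asr(payload)
    if kind == "audio_file":    # offline audio file: recording-file ASR
        return recording_file_asr(payload)
    if kind == "av_file":       # offline A/V file: parse out audio, then recording-file ASR
        return recording_file_asr(extract_audio(payload))
    raise ValueError(f"unsupported multimedia kind: {kind}")

print(to_natural_text("av_file", "lecture.mp4"))  # text<audio<lecture.mp4>>
```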
  • the server 22 may send the natural language text corresponding to the multimedia information to the operator's terminal, so that the terminal can display the natural language text. Further, the operator can modify the natural language text displayed on the terminal to achieve original text intervention as shown in Figure 7.
  • the server 22 can receive the modified natural language text from the operator's terminal.
  • The modified natural language text is the post-intervention original text shown in Figure 7. It can be understood that in other embodiments, the operator may not modify the natural language text.
  • the modified natural language text can be translated into a first sign language text, or the original natural language text can be translated into a first sign language text.
  • The process of translating the modified natural language text, or the original natural language text, into the first sign language text can be displayed on the operator's terminal, as shown in Figure 8 or Figure 9.
  • Figure 8 shows the process of translating real-time audio and video into sign language animation
  • Figure 9 shows the process of translating text into sign language animation.
  • the first sign language text can be semantically streamlined to obtain a second sign language text.
  • the second sign language text can be the sign language text result shown in Figure 7.
  • the server 22 can also send the second sign language text to the operator's terminal, so that the operator can modify the second sign language text, thereby realizing translation intervention as shown in FIG. 7 .
  • the server 22 can receive the modified second sign language text from the operator's terminal.
  • The modified second sign language text is the post-intervention translation shown in Figure 7. It can be understood that in other embodiments, the operator may not modify the second sign language text.
  • the server 22 can drive the digital human based on the second sign language text modified by the operator, or based on the unmodified second sign language text.
  • The process of driving the digital human includes processes such as sign language synthesis, expression synthesis, and mouth shape synthesis.
  • sign language synthesis can be to control the digital person to express the sign language movements corresponding to the second sign language text through the body.
  • Expression synthesis can control the expression of a digital person to be consistent with the expression of a normal person speaking natural language.
  • Mouth shape synthesis can control the mouth shape of the digital human to be consistent with the second sign language text.
  • If the multimedia information is a real-time audio stream or audio and video stream, a streaming sign language broadcast video stream of the digital human is generated and sent to the terminal in real time.
  • During the process of driving the digital human, the server 22 can generate a streaming sign language broadcast video stream of the digital human and send it to the hearing-impaired person's terminal in real time. It can be understood that in some embodiments, the server 22 can simultaneously send the real-time audio and video stream and the digital human's streaming sign language broadcast video stream to the hearing-impaired person's terminal, so that the terminal can play both the audio and video that normal people watch and, at the same time, the digital human's sign language broadcast video.
  • generating a streaming sign language broadcast video stream of the digital human includes: generating a streaming sign language broadcast video stream of the digital human according to the configuration information of the digital human.
  • The configuration information of the digital human includes at least one of the following: the background and color of the digital human, and the position and size of the digital human in the user interface.
  • the operator can also configure the synthesis effect.
  • the operator's terminal can display a configuration interface, and the configuration interface can display the configuration options of the digital person.
  • the operator can operate these configuration options to achieve the desired effect.
  • Configuring the digital human generates the digital human's configuration information.
  • the configuration information may include the background, color, position and size of the digital person in the user interface for the hearing-impaired, etc.
  • the lens distance shown in Figure 7 is used to control the size of the digital human in the user interface for the hearing-impaired.
  • the server 22 may generate a streaming sign language broadcast video stream of the digital person according to the configuration information of the digital person.
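The configuration information can be modeled as a plain record. The field names below are illustrative, chosen only to mirror the options this embodiment mentions (background, color, position, lens distance for size, and optional subtitles):

```python
from dataclasses import dataclass

@dataclass
class DigitalHumanConfig:
    background: str = "studio_blue"   # backdrop behind the digital human (assumed name)
    color: str = "#FFFFFF"            # color scheme choice (assumed representation)
    position: tuple = (0.8, 0.8)      # normalized (x, y) position in the viewer's UI
    lens_distance: float = 1.0        # controls the on-screen size of the digital human
    show_subtitles: bool = True       # operators may also toggle subtitle display

# The operator's configuration interface would produce such a record, e.g. a
# closer "lens distance" to render the digital human larger on screen:
cfg = DigitalHumanConfig(lens_distance=0.7)
print(cfg.lens_distance)   # 0.7
print(cfg.show_subtitles)  # True
```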
  • The operator can also configure whether to display subtitles. For example, if subtitles are enabled, hearing-impaired viewers can read the subtitles while watching the digital human's sign language, improving comprehension efficiency.
  • If the multimedia information is a non-real-time audio file, audio and video file, or text file, a sign language broadcast video file of the digital human is generated and sent to the terminal.
  • During the process of driving the digital human, the server 22 can generate a sign language broadcast video file of the digital human and send it to the hearing-impaired person's terminal. It can be understood that in some embodiments, the server 22 can simultaneously send the multimedia information and the digital human's sign language broadcast video file to the hearing-impaired person's terminal, so that the terminal can play the text information, audio files, or audio and video files that normal people watch, as well as the digital human's sign language broadcast video file.
  • Optionally, generating a sign language broadcast video file of the digital human includes: generating the sign language broadcast video file according to the configuration information of the digital human, wherein the configuration information includes at least one of the following: the background and color of the digital human, and the position and size of the digital human in the user interface.
  • the server 22 can generate the sign language broadcast video file of the digital person based on the configuration information of the digital person.
  • the source and content of the configuration information are as mentioned above and will not be described again here.
  • the configuration information of the digital human may be configured by an operator.
  • This embodiment can support multiple input modes (plain text, real-time audio and video, and offline audio and video files) and therefore covers a wider range of application scenarios.
  • The sign language broadcasting provided by this embodiment involves multiple interlocking algorithmic technologies, where the output of each link affects the input of the next. This solution can output independent results for each link of the sign language broadcast, making it easy to quickly trace and locate problems in any link.
  • The presentation of sign language is not only body and hand movements. On top of sign language synthesis, mouth shape synthesis and expression synthesis technologies integrate body posture, expression, and mouth shape, linking multiple channels of information to better convey information to hearing-impaired viewers.
  • FIG 10 is a schematic structural diagram of a digital human sign language broadcasting device provided by an embodiment of the present disclosure.
  • the digital human sign language broadcasting device provided by the embodiment of the present disclosure can execute the processing flow provided by the digital human sign language broadcasting method embodiment.
  • the digital human sign language broadcasting device 100 includes:
  • an acquisition module 101, used to obtain multimedia information;
  • a determining module 102, used to determine the natural language text corresponding to the multimedia information;
  • a translation module 103, used to translate the natural language text into a first sign language text;
  • the processing module 104 is used to perform semantic simplification processing on the first sign language text to obtain a second sign language text;
  • the driving module 105 is used to drive the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • Optionally, when the driving module 105 drives the digital human according to the second sign language text so that the digital human expresses the corresponding sign language movements through the body, it is specifically configured to: drive the digital human according to the second sign language text so that the digital human expresses the sign language movements corresponding to the second sign language text through the body, with the digital human's mouth shape and expression consistent with the second sign language text.
  • the processing module 104 is also configured to perform semantic simplification processing on the natural language text after the determination module 102 determines the natural language text corresponding to the multimedia information, to obtain a simplified natural language text.
  • the translation module 103 is specifically configured to translate the streamlined natural language text into a first sign language text.
  • the driving module 105 includes an acquisition unit 1051 and an adjustment unit 1052, wherein the acquisition unit 1051 is used to acquire the audio file or the audio and video file when the multimedia information is a non-real-time audio file or audio and video file.
  • the digital human sign language broadcasting device 100 also includes: a sending module 106 and a receiving module 107.
	• the sending module 106 is configured to, after the determining module 102 determines the natural language text corresponding to the multimedia information, send the natural language text to the operator's terminal; the receiving module 107 is used to receive the natural language text modified by the operator.
  • the translation module 103 is specifically configured to translate the natural language text modified by the operator into a first sign language text.
	• the sending module 106 is also configured to: after the processing module 104 performs semantic simplification processing on the first sign language text to obtain the second sign language text, send the second sign language text to the operator's terminal;
  • the receiving module 107 is also used to receive the second sign language text modified by the operator.
  • the driving module 105 is specifically used to drive the digital human according to the second sign language text modified by the operator.
	• the digital human sign language broadcasting device 100 also includes: a generating module 108, configured to, after the driving module 105 drives the digital human according to the second sign language text: if the multimedia information is a real-time audio stream or audio and video stream, generate a streaming sign language broadcast video stream of the digital human and send the streaming sign language broadcast video stream to the terminal in real time; and if the multimedia information is a non-real-time audio file, audio and video file, or text file, generate a sign language broadcast video file of the digital human and send the sign language broadcast video file to the terminal.
	• the terminal may be a hearing-impaired person's terminal.
	• when generating the digital human's streaming sign language broadcast video stream, the generation module 108 is specifically configured to generate the streaming sign language broadcast video stream according to the configuration information of the digital human; when generating the sign language broadcast video file of the digital human, it is specifically configured to generate the sign language broadcast video file according to the configuration information of the digital human; the configuration information of the digital human includes at least one of the following: the background, the color, and the position and size of the digital human in the user interface. The configuration information of the digital human may be configured by operating personnel.
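	• The configuration items listed above (background, color, and the digital human's position and size in the user interface) could be held in a simple record; the field names and default values below are illustrative assumptions, not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class DigitalHumanConfig:
    """Hypothetical container for the digital human's configuration information."""
    background: str = "plain"   # background of the digital human
    color: str = "#FFFFFF"      # color setting
    x: int = 0                  # position in the user interface
    y: int = 0
    width: int = 480            # size in the user interface
    height: int = 640

# An operator might, for example, configure a small signer in a screen corner:
cfg = DigitalHumanConfig(background="news-studio", x=1440, y=810, width=480, height=270)
print(cfg.x, cfg.y, cfg.width, cfg.height)  # 1440 810 480 270
```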
  • the digital human sign language broadcasting device of the embodiment shown in Figure 10 can be used to implement the technical solution of the above method embodiment. Its implementation principles and technical effects are similar and will not be described again here.
  • FIG. 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure. As shown in FIG. 11 , the electronic device includes a memory 111 and a processor 112 .
  • the memory 111 is used to store programs. In addition to the above-mentioned programs, the memory 111 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
	• The memory 111 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
	• the processor 112 is coupled to the memory 111 and executes the program stored in the memory 111, for: obtaining multimedia information, and determining the natural language text corresponding to the multimedia information; translating the natural language text into a first sign language text; performing semantic simplification processing on the first sign language text to obtain a second sign language text; and driving the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  • the electronic device may also include: a communication component 113 , a power supply component 114 , an audio component 115 , a display 116 and other components. Only some components are schematically shown in FIG. 11 , which does not mean that the electronic device only includes the components shown in FIG. 11 .
  • the communication component 113 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 113 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 113 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the power supply component 114 provides power to various components of the electronic device.
  • Power supply components 114 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic devices.
  • Audio component 115 is configured to output and/or input audio signals.
  • the audio component 115 includes a microphone (MIC) configured to receive external audio signals when the electronic device is in operating modes, such as call mode, recording mode, and voice recognition mode.
  • the received audio signal may be further stored in memory 111 or sent via communication component 113 .
  • audio component 115 also includes a speaker for outputting audio signals.
  • Display 116 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the digital human sign language broadcasting method described in the above embodiments.

Abstract

The present disclosure relates to a digital human sign language broadcasting method and apparatus, a device, and a storage medium. According to the present disclosure, a natural language text used by people with normal hearing is translated into a first sign language text, and semantic simplification is performed on the first sign language text to obtain a second sign language text. Furthermore, a digital human is driven according to the second sign language text so that the digital human expresses, by means of the limbs, the sign language actions corresponding to the second sign language text. The second sign language text obtained by performing semantic simplification on the first sign language text can comprise fewer action names; therefore, compared with the first sign language text, the digital human performs fewer sign language actions within the same time, giving it ample time for each sign language action and thereby ensuring that a hearing-impaired person can see each sign language action clearly.

Description

Digital human sign language broadcasting method, device, equipment and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on July 4, 2022, with application number 202210785961.2 and titled "Digital Human Sign Language Broadcasting Method, Device, Equipment and Storage Medium", the entire content of which is incorporated herein by reference.
Technical Field

The present disclosure relates to the field of information technology, and in particular to a digital human sign language broadcasting method, device, equipment and storage medium.
Background

With the continuous development of technology, more and more users can view multimedia information, which usually includes text, audio, video, etc., through terminals. For hearing-impaired people, however, it is sign language that matches their cognitive habits. There is therefore a need to convert natural-language speech and text information into sign language so that hearing-impaired people can understand it.

However, the inventors of this application found that, for the same sentence, a normal person usually speaks faster than a digital human can perform the corresponding sign language movements. If the digital human's signing is required to be aligned in time with the normal person's speech, the digital human's signing must be sped up, or the playback speed of the video of the digital human signing must be increased, with the result that hearing-impaired people cannot see the sign language movements clearly.
Summary

In order to solve the above technical problems, or at least partially solve them, the present disclosure provides a digital human sign language broadcasting method, device, equipment and storage medium, which give the digital human ample time to perform each sign language action, thereby ensuring that hearing-impaired people can see each sign language action clearly.

In a first aspect, an embodiment of the present disclosure provides a digital human sign language broadcasting method, including:

obtaining multimedia information, and determining the natural language text corresponding to the multimedia information;

translating the natural language text into a first sign language text;

performing semantic simplification processing on the first sign language text to obtain a second sign language text;

driving a digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
In a second aspect, an embodiment of the present disclosure provides a digital human sign language broadcasting device, including:

an acquisition module, used to obtain multimedia information;

a determining module, used to determine the natural language text corresponding to the multimedia information;

a translation module, used to translate the natural language text into a first sign language text;

a processing module, used to perform semantic simplification processing on the first sign language text to obtain a second sign language text;

a driving module, used to drive a digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method described in the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method described in the first aspect.
According to the digital human sign language broadcasting method, device, equipment and storage medium provided by the embodiments of the present disclosure, the natural language text used by normal people is translated into a first sign language text, and semantic simplification processing is performed on the first sign language text to obtain a second sign language text. Further, a digital human is driven according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body. Since the second sign language text obtained by semantically simplifying the first sign language text may include fewer action names, the digital human can, compared with the first sign language text, perform fewer sign language actions in the same time, giving it ample time for each sign language action and thereby ensuring that hearing-impaired people can see each sign language action clearly.
Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure;

Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;

Figure 3 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;

Figure 4 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;

Figure 5 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure;

Figure 6 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure;

Figure 7 is a flow chart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure;

Figure 8 is a schematic diagram of an operator's user interface provided by another embodiment of the present disclosure;

Figure 9 is a schematic diagram of an operator's user interface provided by another embodiment of the present disclosure;

Figure 10 is a schematic structural diagram of a digital human sign language broadcasting device provided by an embodiment of the present disclosure;

Figure 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure.
Detailed Description

In order that the above objects, features and advantages of the present disclosure can be understood more clearly, the solutions of the present disclosure are further described below. It should be noted that, as long as there is no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.

Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure can also be implemented in other ways different from those described here; obviously, the embodiments in the description are only some, not all, of the embodiments of the present disclosure.
Figure 1 is a flow chart of a digital human sign language broadcasting method provided by an embodiment of the present disclosure. The method can be executed by a digital human sign language broadcasting device, which can be implemented in software and/or hardware and can be configured in an electronic device, such as a server or a terminal, where the terminal may specifically be a mobile phone, a computer, a tablet computer, etc. In addition, the digital human sign language broadcasting method described in this embodiment can be applied to the application scenario shown in Figure 2. As shown in Figure 2, the application scenario includes a terminal 21 and a server 22, where the server 22 can obtain multimedia information from other terminals or other servers and generate, based on the multimedia information, an animation of a digital human signing. Further, the server 22 can send the sign language animation to the terminal 21, which may be a hearing-impaired person's terminal, so that the hearing-impaired person can understand the meaning expressed by the multimedia information. The method is described in detail below with reference to Figure 2. As shown in Figure 1, the specific steps of the method are as follows:
S101. Obtain multimedia information, and determine the natural language text corresponding to the multimedia information.
For example, the server 22 may obtain multimedia information from another terminal or another server, and the multimedia information may be text information, audio information, or audio-and-video information. The audio information may be a real-time audio stream or an offline audio file; the audio-and-video information may be a real-time audio and video stream or an offline audio and video file. For example, as shown in Figure 3, the terminal 23 can send a live audio and video stream to the server 22 in real time; the server 22 can not only forward the live audio and video stream to the terminal 21, but also send the terminal 21 a video stream of a digital human signing, in which the digital human expresses in sign language the meaning of the audio signal or subtitles of the live stream, so that hearing-impaired people can watch the online live broadcast. Alternatively, as shown in Figure 4, the server 24 sends a live TV program to the server 22 in real time in the form of streaming media, and the digital human generated by the server 22 can assist hearing-impaired people in watching the live TV program. In some other embodiments, the server 22 may also generate multimedia information, such as film and television information or education and training videos, so that hearing-impaired people can watch such content with the help of the digital human generated by the server 22. In addition, hearing-impaired people and normal people can hold online or offline meetings through their respective terminals. For example, as shown in Figure 3, assume that the terminal 21 is the terminal of a hearing-impaired person and the terminal 23 is the terminal of a normal person, and the two hold a remote online meeting through their respective terminals: the terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22; the server 22 generates a video stream of the digital human signing according to the meaning expressed by the normal person and sends it to the terminal 21 in real time, to help the hearing-impaired person understand what the normal person says. Alternatively, the hearing-impaired person and the normal person hold an offline meeting through their respective terminals, for example in the same conference room: the terminal 23 collects the normal person's audio and video stream in real time and sends it to the server 22, and the server 22 translates the normal person's natural language into sign language movements in real time and sends the video stream of the digital human signing down to the terminal 21, so that the hearing-impaired person can understand what the normal person says in real time. It is understandable that the terminal 21 may also be a large screen in a public place such as an airport, a railway station, or a stadium, on which videos of the digital human signing are played, so that hearing-impaired people in public places can keep up with current information anytime and anywhere. It can be understood that the method described in this embodiment is not limited to these scenarios and can also be applied to other application scenarios, which will not be enumerated here.
When the multimedia information is text information, the text information can be used directly as the natural language text corresponding to the multimedia information.

When the multimedia information is audio information, the natural language text corresponding to the multimedia information may be text converted from the audio information using automatic speech recognition (ASR) technology.

When the multimedia information is audio-and-video information, the audio-and-video information can be parsed to extract its audio component, and the audio component can be converted into text using ASR technology; this text can be used as the natural language text corresponding to the multimedia information.
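The three cases above (text used as-is, audio passed through ASR, audio-and-video demuxed and then passed through ASR) amount to a simple dispatch. In the sketch below, `run_asr` and `extract_audio` are hypothetical stand-ins for a real ASR engine and demuxer, not components disclosed here:

```python
def determine_natural_text(media_type, payload, run_asr, extract_audio):
    """Return the natural language text for a piece of multimedia information."""
    if media_type == "text":
        return payload                      # text information is used directly
    if media_type == "audio":
        return run_asr(payload)             # ASR converts the audio to text
    if media_type == "audio_video":
        audio = extract_audio(payload)      # parse out the audio component first
        return run_asr(audio)               # then run ASR on it
    raise ValueError(f"unsupported media type: {media_type}")

# Demonstration with stub hooks:
text = determine_natural_text(
    "audio_video", b"container-bytes",
    run_asr=lambda a: "hello world",
    extract_audio=lambda av: b"pcm-bytes",
)
print(text)  # hello world
```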
S102. Translate the natural language text into a first sign language text.

Sign language expresses meaning through gestured movements, simulating images or syllables through changes of gesture to form certain meanings or words; it is a hand-based language through which hearing-impaired people, or people unable to speak, communicate with each other and exchange ideas. Since sign language is a visual language, it differs greatly from natural language text in wording and grammatical rules. For example, "Exit along the guided route; do not linger in the audience area" is a natural language text, and the corresponding sign language text is "follow/direct/route/go/stay/this/don't". Therefore, the natural language text needs to be translated into a sign language text; here, the sign language text translated from the natural language text is recorded as the first sign language text. For example, "follow/direct/route/go/stay/this/don't" can serve as the first sign language text. The first sign language text consists of multiple action names, with adjacent action names separated by "/". Each action name can correspond to one coherent sign language movement; that is, different action names are used to distinguish different sign language movements.
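Since the first sign language text is a sequence of action names separated by "/", splitting it into individual sign actions is straightforward. The glosses below reuse the example from the text:

```python
# One "/"-separated gloss string; each token names one sign language movement.
first_sign_text = "follow/direct/route/go/stay/this/don't"
actions = first_sign_text.split("/")   # one entry per sign language movement
print(len(actions), actions[0])        # 7 follow
```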
S103. Perform semantic simplification processing on the first sign language text to obtain a second sign language text.

Since "Exit along the guided route; do not linger in the audience area" corresponds to "follow/direct/route/go/stay/this/don't", that is, to 7 action names and 7 sign language movements, when a normal person says this sentence the digital human needs to make 7 sign language movements, and each sign language movement may be a continuous, coherent movement. In other words, the time the digital human needs for each sign language movement is relatively long, while the time a normal person needs to say each word may be relatively short, so that a normal person usually speaks faster than the digital human signs. If the digital human's performance of these 7 sign language movements is required to be aligned on the timeline with the normal person's utterance of the sentence, the digital human's signing must be sped up, or the playback speed of the video of the digital human signing must be increased, with the result that hearing-impaired people cannot see the sign language movements clearly.

To solve this problem, this embodiment proposes a solution: after obtaining the first sign language text, for example "follow/direct/route/go/stay/this/don't", semantic simplification processing is performed on it to obtain a second sign language text, for example "follow/direct/route/go". Suppose the time a normal person needs to say the sentence is denoted t1. Before the semantic simplification of the first sign language text, the digital human would need to make 7 sign language movements within t1; after the simplification, it only needs to make 4 sign language movements within the same duration t1, so that the digital human has ample time for each sign language movement, thereby ensuring that hearing-impaired people can see each sign language movement clearly.
S104. Drive the digital human according to the second sign language text, so that the digital human expresses, through its body, the sign language actions corresponding to the second sign language text.
Specifically, the server 22 may drive the digital human according to each action name in the second sign language text, so that the digital human expresses the sign language action corresponding to each action name through its body, for example its hands.
In this embodiment, driving the digital human according to the second sign language text so that it expresses the corresponding sign language actions through its body includes: driving the digital human according to the second sign language text so that the digital human expresses the sign language actions corresponding to the second sign language text through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
For example, in this embodiment, while driving the digital human according to each action name in the second sign language text, the server 22 may also control the digital human's mouth shape to match that text. For instance, while performing the sign language action for "follow", the digital human's mouth shape matches "follow". The digital human's facial expression can likewise be controlled; for example, it may remain serious and attentive while expressing the sign language actions of the second sign language text.
In the embodiments of the present disclosure, the natural language text used by hearing people is translated into a first sign language text, and semantic simplification is applied to the first sign language text to obtain a second sign language text. The digital human is then driven according to the second sign language text, so that it expresses the corresponding sign language actions through its body. Because the second sign language text obtained by semantically simplifying the first sign language text can contain fewer action names, the digital human can perform fewer sign language actions in the same amount of time than it would for the first sign language text. It therefore has more time for each action, which ensures that hearing-impaired viewers can see every sign language action clearly.
Figure 5 is a flowchart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the method comprises the following steps:
S501. Obtain multimedia information and determine the natural language text corresponding to the multimedia information.
Specifically, S501 follows the same principle and implementation as S101, which is not repeated here.
S502. Perform semantic simplification on the natural language text to obtain a simplified natural language text.
For example, once a natural language text such as "exit along the guided route and do not linger in the audience area" has been determined, this embodiment may also semantically simplify it. Drawing on the behavior of human sign language interpreters during translation, the natural language text is semantically analyzed, its key information is extracted, and invalid or redundant information is filtered out, yielding a simplified natural language text such as "exit along the guided route".
S503. Translate the simplified natural language text into a first sign language text.
Since the simplified natural language text contains less content, translating "exit along the guided route" into a first sign language text yields correspondingly fewer action names; for example, the first sign language text is "follow/command/road/go". In this embodiment, translating the natural language text, or the simplified natural language text, into the first sign language text can be achieved through machine translation. Machine translation, also called automatic translation, is the process of using a computer to convert one language (the source language) into another language (the target language).
S504. Perform semantic simplification on the first sign language text to obtain a second sign language text.
For example, "follow/command/road/go" can be semantically simplified further to reduce the number of action names; the second sign language text obtained after this step is "follow/command/go", making the second sign language text even more concise. In some embodiments, the natural language text may be referred to as the original text, and the first and second sign language texts may each be referred to as the translation.
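One way to picture semantic simplification is as filtering low-information glosses out of a sign language text. The sketch below is an assumption for illustration only, not the disclosed algorithm: the stop-gloss set is invented, and it reproduces the first-stage example from earlier in this section ("按照/指挥/路/走/留/这/不要" reduced to "按照/指挥/路/走").

```python
# Hypothetical gloss filter: drop glosses from an assumed low-information
# set while preserving the order of the remaining glosses.

STOP_GLOSSES = {"留", "这", "不要"}  # assumed low-information glosses

def simplify(sign_text: str, sep: str = "/") -> str:
    """Return the sign text with stop glosses removed, order preserved."""
    glosses = [g for g in sign_text.split(sep) if g not in STOP_GLOSSES]
    return sep.join(glosses)

first = "按照/指挥/路/走/留/这/不要"   # first sign language text
second = simplify(first)               # simplified sign language text
assert second == "按照/指挥/路/走"
```

A production system would need semantic understanding rather than a fixed stop list, since which glosses are redundant depends on the sentence; the stop list only makes the data flow of S504 concrete.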
S505. Drive the digital human according to the second sign language text, so that the digital human expresses the corresponding sign language actions through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
Specifically, S505 follows the same principle and implementation as S104, which is not repeated here.
Optionally, driving the digital human according to the second sign language text includes: if the multimedia information is a non-real-time audio file or audio-video file, obtaining the start time and end time of each audio signal in the audio file or audio-video file; and, according to the start time and end time, adjusting the speed at which the digital human expresses sign language actions, so that the sign language actions expressed by the digital human are aligned with the audio signal on the timeline.
For example, if the multimedia information is a non-real-time audio file or audio-video file, the server 22 can extract each audio signal from the file, where each audio signal may correspond to one sentence of natural language. The server 22 can then compute the start time and end time of each audio signal, which together may be recorded as the signal's start-end timeline. For each audio signal, the server 22 can adjust the speed at which the digital human expresses sign language actions according to that signal's start and end time, that is, algorithmically adapt the sign language broadcast speed for each sentence, speeding it up or slowing it down, so that the digital human's signing of a sentence is aligned on the timeline with that sentence's audio signal. Here, sign language broadcasting means converting natural language text into sign language text, driving the digital human to express the corresponding sign language actions through its body, and broadcasting with matching facial expressions and mouth shapes. In this embodiment, the digital human may be a virtual character with a digital appearance.
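The per-sentence speed adaptation can be sketched as computing a playback-rate multiplier from the audio segment's start and end times. The function name and the example durations are assumptions for illustration, not values from the disclosure.

```python
# Given one sentence's audio interval [audio_start, audio_end] and the time
# the signing animation would take at its natural speed, compute the rate
# multiplier that makes the signing fill exactly the same interval.

def speed_factor(audio_start: float, audio_end: float,
                 natural_sign_duration: float) -> float:
    """Playback-rate multiplier for the signing animation.

    A value above 1.0 speeds the signing up; below 1.0 slows it down.
    """
    audio_duration = audio_end - audio_start
    if audio_duration <= 0:
        raise ValueError("audio segment must have positive duration")
    return natural_sign_duration / audio_duration

# Assumed example: sentence audio runs from 12.0 s to 15.0 s on the timeline,
# and signing at natural speed would take 4.5 s, so play it at 1.5x.
factor = speed_factor(12.0, 15.0, 4.5)
assert abs(factor - 1.5) < 1e-9
```

Combined with semantic simplification, which shortens `natural_sign_duration`, this keeps the multiplier close to 1.0 so the signing stays legible while remaining aligned with the audio.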
In this embodiment, semantic simplification is applied to both the natural language text and the first sign language text, so that the second sign language text contains as few action names as possible, i.e. is as concise as possible. As a result, when the digital human is driven according to the second sign language text, its signing of a given sentence can be effectively prevented from lagging behind the speed of the spoken sentence, keeping the signing process synchronized with the speaking process and improving information synchrony. In addition, by algorithmically adapting the sign language broadcast speed for each sentence, the sign language broadcast can be aligned with the original audio and video content.
Figure 6 is a flowchart of a digital human sign language broadcasting method provided by another embodiment of the present disclosure. In this embodiment, the method comprises the following steps:
S601. Obtain multimedia information and determine the natural language text corresponding to the multimedia information.
For example, the multimedia information obtained by the server 22 may be at least one of text information, a real-time audio-video stream, an audio file, and an audio-video file, as shown in Figure 7.
If the multimedia information is text information, the natural language text can be obtained through text parsing, as shown in Figure 7. If it is a real-time audio-video stream, real-time ASR is invoked to obtain the natural language text. If it is an audio file, recorded-file ASR is invoked to obtain the natural language text. If it is an audio-video file, the file is first parsed to extract its audio signal, and recorded-file ASR is then invoked to obtain the natural language text.
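The four extraction paths above amount to a dispatch on the media type. In the sketch below, the handler functions (`parse_text`, `realtime_asr`, `file_asr`, `extract_audio`) are placeholders standing in for the real text-parsing, ASR, and video-parsing services; their bodies and the media-type labels are assumptions, not APIs from the disclosure.

```python
def parse_text(payload: str) -> str:
    # Placeholder: a real system would strip markup and normalize the text.
    return payload

def realtime_asr(chunks) -> str:
    # Placeholder for streaming speech recognition over audio chunks.
    return " ".join(chunks)

def file_asr(audio) -> str:
    # Placeholder for recorded-file speech recognition.
    return f"transcript of {audio}"

def extract_audio(av_file: dict):
    # Placeholder for video parsing that pulls out the audio track.
    return av_file["audio"]

def to_natural_language_text(media_type: str, payload) -> str:
    """Route the multimedia input to the matching text-extraction path."""
    if media_type == "text":
        return parse_text(payload)            # text parsing
    if media_type == "live_av_stream":
        return realtime_asr(payload)          # real-time ASR
    if media_type == "audio_file":
        return file_asr(payload)              # recorded-file ASR
    if media_type == "av_file":
        return file_asr(extract_audio(payload))  # video parsing, then ASR
    raise ValueError(f"unsupported media type: {media_type}")

assert to_natural_language_text("text", "hello") == "hello"
assert to_natural_language_text("av_file", {"audio": "track1"}) == "transcript of track1"
```

Every path converges on one natural-language string, which is what allows the downstream translation and simplification steps to be shared across modalities.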
S602. Send the natural language text corresponding to the multimedia information to an operator's terminal.
In this embodiment, the server 22 may send the natural language text corresponding to the multimedia information to the operator's terminal, which displays it. The operator can then modify the natural language text displayed on the terminal, thereby realizing the original-text intervention shown in Figure 7.
S603. Receive the natural language text as modified by the operator.
For example, after the operator modifies the natural language text, the server 22 can receive the modified text from the operator's terminal; the modified natural language text is the post-intervention original text shown in Figure 7. It will be appreciated that, in some other embodiments, the operator may leave the natural language text unmodified.
S604. Translate the natural language text modified by the operator into a first sign language text.
For example, as shown in Figure 7, machine translation can be invoked to translate the modified natural language text, or the original natural language text, into a first sign language text. Specifically, this translation process can be displayed on the operator's terminal, as shown in Figure 8 or Figure 9, where Figure 8 shows the translation of real-time audio and video into sign language animation and Figure 9 shows the translation of text into sign language animation.
S605. Perform semantic simplification on the first sign language text to obtain a second sign language text.
For example, as shown in Figure 7, semantic simplification can be invoked to simplify the first sign language text and obtain a second sign language text, which may be the sign language text result shown in Figure 7.
S606. Send the second sign language text to the operator's terminal.
For example, in this embodiment, the server 22 may also send the second sign language text to the operator's terminal so that the operator can modify it, thereby realizing the translation intervention shown in Figure 7.
S607. Receive the second sign language text as modified by the operator.
For example, after the operator modifies the second sign language text, the server 22 can receive the modified text from the operator's terminal; the modified second sign language text is the post-intervention translation shown in Figure 7. It will be appreciated that, in some other embodiments, the operator may leave the second sign language text unmodified.
S608. Drive the digital human according to the second sign language text modified by the operator, so that the digital human expresses the corresponding sign language actions through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
For example, as shown in Figure 7, the server 22 can drive the digital human according to the operator-modified second sign language text, or according to the unmodified second sign language text. Driving the digital human involves processes such as sign language synthesis, expression synthesis, and mouth shape synthesis. Sign language synthesis controls the digital human to express, through its body, the sign language actions corresponding to the second sign language text. Expression synthesis controls the digital human's facial expression to match that of a hearing person speaking the natural language. Mouth shape synthesis controls the digital human's mouth shape to stay consistent with the second sign language text.
S609. If the multimedia information is a real-time audio stream or audio-video stream, generate a streaming sign language broadcast video stream of the digital human and send the streaming sign language broadcast video stream to a terminal in real time.
For example, as shown in Figure 7, if the multimedia information is a real-time audio stream or audio-video stream, the server 22 can generate a streaming sign language broadcast video stream of the digital human while driving it, and send that stream in real time to a hearing-impaired user's terminal. It will be appreciated that, in some embodiments, the server 22 can deliver both the real-time audio-video stream and the digital human's streaming sign language broadcast video stream to the hearing-impaired user's terminal, so that the terminal can play not only the audio and video that hearing users watch but also the digital human's sign language broadcast video.
Optionally, generating the streaming sign language broadcast video stream of the digital human includes: generating the stream according to the digital human's configuration information, where the configuration information includes at least one of the following: the digital human's background, its color, and its position and size in the user interface.
As shown in Figure 7, the operator can also configure the synthesis effect. For example, the operator's terminal may display a configuration interface with configuration options for the digital human; by operating these options the operator configures the digital human, i.e. generates its configuration information, which may include the digital human's background, its color, and its position and size in the hearing-impaired user's interface. The camera distance shown in Figure 7 controls the digital human's size in that interface. Specifically, the server 22 can generate the digital human's streaming sign language broadcast video stream according to this configuration information. In addition, the operator can configure whether subtitles are displayed; when subtitles are enabled, hearing-impaired viewers can read the subtitles while watching the digital human sign, improving comprehension.
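One possible shape for this configuration information is a small record type; the field names, defaults, and normalized-coordinate convention below are assumptions for illustration, not the format used in the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DigitalHumanConfig:
    """Assumed rendering configuration for the sign language broadcast."""
    background: str = "studio"        # background scene name
    color: str = "#FFFFFF"            # e.g. clothing or theme color
    position: tuple = (0.8, 0.8)      # normalized (x, y) in the user interface
    size: float = 0.25                # fraction of screen height ("camera distance")
    show_subtitles: bool = True       # whether to render subtitles alongside

# An operator might enlarge the digital human for readability:
cfg = DigitalHumanConfig(size=0.4)
assert cfg.show_subtitles and cfg.size == 0.4
```

Keeping the configuration in one record means the same object can parameterize both the streaming path (S609) and the offline file path (S610).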
S610. If the multimedia information is a non-real-time audio file, audio-video file, or text file, generate a sign language broadcast video file of the digital human and send the sign language broadcast video file to a terminal.
For example, as shown in Figure 7, if the multimedia information is text information, an audio file, or an audio-video file, the server 22 can generate a sign language broadcast video file of the digital human while driving it, and send that file to a hearing-impaired user's terminal. It will be appreciated that, in some embodiments, the server 22 can deliver both the multimedia information and the digital human's sign language broadcast video file to the hearing-impaired user's terminal, so that the terminal can play not only the text information, audio file, or audio-video file that hearing users consume but also the digital human's sign language broadcast video file.
Optionally, generating the sign language broadcast video file of the digital human includes: generating the file according to the digital human's configuration information, where the configuration information includes at least one of the following: the digital human's background, its color, and its position and size in the user interface.
Specifically, the server 22 can generate the digital human's sign language broadcast video file according to the digital human's configuration information; the source and content of this configuration information are as described above and are not repeated here. In this embodiment, the digital human's configuration information may specifically be configured by an operator.
By integrating technologies such as real-time speech recognition, recorded-file speech recognition, and video parsing, this embodiment supports multiple modalities, namely plain text, real-time audio and video, and offline audio-video files, giving it a wider range of application scenarios. Moreover, the sign language broadcasting provided by this embodiment involves multiple interlocking algorithmic steps, where the output of each step affects the input of the next; this solution can output an independent result for each step of the sign language broadcast, making it easy to trace and locate problems in the pipeline. Furthermore, sign language presentation is not only body and hand movement: on top of sign language synthesis, mouth shape synthesis and expression synthesis fuse body posture with facial expression and mouth shape, linking multiple channels of information to convey it to hearing-impaired viewers more effectively. Because sign language broadcasting involves diverse algorithms, 100% accuracy is difficult to achieve; at the same time, its application scenarios are diverse, and different scenarios impose different requirements on the final presentation of the broadcast. Therefore, by providing a visual interface that empowers operators to intervene in and edit the natural language text and the sign language text, the human-machine collaboration mechanism improves the accuracy of sign language translation and the end-to-end result.
Figure 10 is a schematic structural diagram of a digital human sign language broadcasting apparatus provided by an embodiment of the present disclosure. The apparatus can execute the processing flow provided by the digital human sign language broadcasting method embodiments above. As shown in Figure 10, the digital human sign language broadcasting apparatus 100 includes:
an acquisition module 101, configured to obtain multimedia information;
a determination module 102, configured to determine the natural language text corresponding to the multimedia information;
a translation module 103, configured to translate the natural language text into a first sign language text;
a processing module 104, configured to perform semantic simplification on the first sign language text to obtain a second sign language text;
a driving module 105, configured to drive the digital human according to the second sign language text, so that the digital human expresses the corresponding sign language actions through its body.
Optionally, when the driving module 105 drives the digital human according to the second sign language text so that it expresses the corresponding sign language actions through its body, this specifically includes: driving the digital human according to the second sign language text so that the digital human expresses the sign language actions corresponding to the second sign language text through its body, with its mouth shape and facial expression each kept consistent with the second sign language text.
Optionally, the processing module 104 is further configured to, after the determination module 102 determines the natural language text corresponding to the multimedia information, perform semantic simplification on the natural language text to obtain a simplified natural language text. The translation module 103 is then specifically configured to translate the simplified natural language text into a first sign language text.
Optionally, the driving module 105 includes an acquisition unit 1051 and an adjustment unit 1052. The acquisition unit 1051 is configured to, when the multimedia information is a non-real-time audio file or audio-video file, obtain the start time and end time of each audio signal in the file. The adjustment unit 1052 is configured to adjust, according to the start time and end time, the speed at which the digital human expresses sign language actions, so that the sign language actions expressed by the digital human are aligned with the audio signal on the timeline.
Optionally, the digital human sign language broadcasting apparatus 100 further includes a sending module 106 and a receiving module 107. The sending module 106 is configured to, after the determination module 102 determines the natural language text corresponding to the multimedia information, send that natural language text to an operator's terminal; the receiving module 107 is configured to receive the natural language text modified by the operator. The translation module 103 is then specifically configured to translate the operator-modified natural language text into a first sign language text.
Optionally, the sending module 106 is further configured to, after the processing module 104 semantically simplifies the first sign language text to obtain the second sign language text, send the second sign language text to the operator's terminal; the receiving module 107 is further configured to receive the second sign language text modified by the operator. The driving module 105 is then specifically configured to drive the digital human according to the operator-modified second sign language text.
Optionally, the digital human sign language broadcasting apparatus 100 further includes a generation module 108, configured to, after the driving module 105 drives the digital human according to the second sign language text: if the multimedia information is a real-time audio stream or audio-video stream, generate a streaming sign language broadcast video stream of the digital human and send it to a terminal in real time; and if the multimedia information is a non-real-time audio file, audio-video file, or text file, generate a sign language broadcast video file of the digital human and send it to a terminal. Optionally, the terminal may be a hearing-impaired user's terminal.
Optionally, when generating the digital human's streaming sign language broadcast video stream, the generation module 108 is specifically configured to generate the stream according to the digital human's configuration information; likewise, when generating the digital human's sign language broadcast video file, the generation module 108 is specifically configured to generate the file according to the digital human's configuration information. The configuration information includes at least one of the following: the digital human's background, its color, and its position and size in the user interface. The configuration information may be configured by an operator.
The digital human sign language broadcasting apparatus of the embodiment shown in Figure 10 can be used to execute the technical solutions of the method embodiments above; its implementation principles and technical effects are similar and are not repeated here.
以上描述了数字人手语播报装置的内部功能和结构,该装置可实现为一种电子设备。图11为本公开实施例提供的电子设备实施例的结构示意图。如图11所示,该电子设备包括存储器111和处理器112。The above describes the internal functions and structure of the digital human sign language broadcasting device, which can be implemented as an electronic device. FIG. 11 is a schematic structural diagram of an electronic device embodiment provided by an embodiment of the present disclosure. As shown in FIG. 11 , the electronic device includes a memory 111 and a processor 112 .
存储器111用于存储程序。除上述程序之外,存储器111还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 111 is used to store programs. In addition to the above-mentioned programs, the memory 111 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, etc.
存储器111可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 111 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
处理器112与存储器111耦合,执行存储器111所存储的程序,以用于:The processor 112 is coupled to the memory 111 and executes the program stored in the memory 111 for:
获取多媒体信息,并确定所述多媒体信息对应的自然语言文本;Obtain multimedia information and determine the natural language text corresponding to the multimedia information;
将所述自然语言文本翻译为第一手语文本;Translate the natural language text into a first sign language text;
对所述第一手语文本进行语义精简处理,得到第二手语文本;Perform semantic simplification processing on the first sign language text to obtain a second sign language text;
根据所述第二手语文本驱动数字人,使得所述数字人通过肢体将所述第二手语文本对应的手语动作表达出来。The digital human is driven according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
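The four processor steps above form a linear pipeline: obtain text, translate to a first sign language text, simplify to a second sign language text, then drive the digital human. A minimal sketch of that pipeline follows; every function body here is a placeholder standing in for the actual ASR, translation, simplification, and animation components, none of which are disclosed as code in this document:

```python
# Hedged sketch of the four-step pipeline executed by processor 112.
# All function bodies are toy placeholders, not the patented implementations.

def determine_natural_text(multimedia_info: str) -> str:
    """Placeholder for ASR/extraction: determine the natural language text."""
    return multimedia_info  # assume the input is already text in this sketch

def translate_to_first_sign_text(natural_text: str) -> list:
    """Placeholder translation into a first sign language text (gloss sequence)."""
    return natural_text.split()

def semantic_simplify(first_sign_text: list) -> list:
    """Drop filler glosses to obtain the second sign language text."""
    fillers = {"the", "a", "an", "is"}  # assumed filler set for illustration
    return [g for g in first_sign_text if g.lower() not in fillers]

def drive_digital_human(second_sign_text: list) -> list:
    """Placeholder driver: map each gloss to a body-motion identifier."""
    return [f"motion:{g}" for g in second_sign_text]

def broadcast(multimedia_info: str) -> list:
    text = determine_natural_text(multimedia_info)
    first = translate_to_first_sign_text(text)
    second = semantic_simplify(first)
    return drive_digital_human(second)

print(broadcast("the weather is sunny today"))
# → ['motion:weather', 'motion:sunny', 'motion:today']
```

Note how the simplification step shortens the gloss sequence before animation, which is the stated purpose of producing a second sign language text.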
进一步,如图11所示,电子设备还可以包括:通信组件113、电源组件114、音频组件115、显示器116等其它组件。图11中仅示意性给出部分组件,并不意味着电子设备只包括图11所示组件。Further, as shown in FIG. 11 , the electronic device may also include: a communication component 113 , a power supply component 114 , an audio component 115 , a display 116 and other components. Only some components are schematically shown in FIG. 11 , which does not mean that the electronic device only includes the components shown in FIG. 11 .
通信组件113被配置为便于电子设备和其他设备之间有线或无线方式的通信。电子设备可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件113经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件113还包括近场通信(NFC)模块,以促进短程通信。例如,NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。The communication component 113 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 113 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 113 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
电源组件114,为电子设备的各种组件提供电力。电源组件114可以包括电源管理系统,一个或多个电源,及其他与为电子设备生成、管理和分配电力相关联的组件。The power supply component 114 provides power to various components of the electronic device. Power supply components 114 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic devices.
音频组件115被配置为输出和/或输入音频信号。例如,音频组件115包括一个麦克风(MIC),当电子设备处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器111或经由通信组件113发送。在一些实施例中,音频组件115还包括一个扬声器,用于输出音频信号。Audio component 115 is configured to output and/or input audio signals. For example, the audio component 115 includes a microphone (MIC) configured to receive external audio signals when the electronic device is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signal may be further stored in memory 111 or sent via communication component 113 . In some embodiments, audio component 115 also includes a speaker for outputting audio signals.
显示器116包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。Display 116 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
另外,本公开实施例还提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行以实现上述实施例所述的数字人手语播报方法。In addition, embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the digital human sign language broadcasting method described in the above embodiments.
需要说明的是,在本文中,诸如“第一”和“第二”等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements, but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
以上所述仅是本公开的具体实施方式,使本领域技术人员能够理解或实现本公开。对这些实施例的多种修改对本领域的技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本公开的精神或范围的情况下,在其它实施例中实现。因此,本公开将不会被限制于本文所述的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。 The above descriptions are only specific embodiments of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be practiced in other embodiments without departing from the spirit or scope of the disclosure. Therefore, the present disclosure is not to be limited to the embodiments described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. 一种数字人手语播报方法,其中,所述方法包括:A digital human sign language broadcasting method, wherein the method includes:
    获取多媒体信息,并确定所述多媒体信息对应的自然语言文本;Obtain multimedia information and determine the natural language text corresponding to the multimedia information;
    将所述自然语言文本翻译为第一手语文本;Translate the natural language text into a first sign language text;
    对所述第一手语文本进行语义精简处理,得到第二手语文本;Perform semantic simplification processing on the first sign language text to obtain a second sign language text;
    根据所述第二手语文本驱动数字人,使得所述数字人通过肢体将所述第二手语文本对应的手语动作表达出来。The digital human is driven according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  2. 根据权利要求1所述的方法,其中,确定所述多媒体信息对应的自然语言文本之后,所述方法还包括:The method according to claim 1, wherein after determining the natural language text corresponding to the multimedia information, the method further includes:
    对所述自然语言文本进行语义精简处理,得到精简处理后的自然语言文本;Perform semantic streamlining processing on the natural language text to obtain a streamlined natural language text;
    将所述自然语言文本翻译为第一手语文本,包括:Translating the natural language text into a first sign language text includes:
    将所述精简处理后的自然语言文本翻译为第一手语文本。Translate the streamlined natural language text into a first sign language text.
  3. 根据权利要求1所述的方法,其中,根据所述第二手语文本驱动数字人,包括:The method of claim 1, wherein driving the digital person according to the second sign language text includes:
    若所述多媒体信息是非实时的音频文件或音视频文件,则获取所述音频文件或所述音视频文件中每个音频信号的起始时间和终止时间;If the multimedia information is a non-real-time audio file or audio and video file, obtain the start time and end time of each audio signal in the audio file or audio and video file;
    根据所述起始时间和所述终止时间,调整所述数字人表达手语动作的速度,使得所述数字人表达的手语动作和所述音频信号在时间轴上对齐。According to the start time and the end time, the speed at which the digital human expresses sign language movements is adjusted so that the sign language movements expressed by the digital human and the audio signal are aligned on the time axis.
  4. 根据权利要求1所述的方法,其中,根据所述第二手语文本驱动数字人之后,所述方法还包括:The method of claim 1, wherein after driving the digital person according to the second sign language text, the method further includes:
    若所述多媒体信息是实时的音频流或音视频流,则生成所述数字人的流式手语播报视频流,并将所述流式手语播报视频流实时的发送给终端;If the multimedia information is a real-time audio stream or audio and video stream, generate the digital human's streaming sign language broadcast video stream, and send the streaming sign language broadcast video stream to the terminal in real time;
    若所述多媒体信息是非实时的音频文件、音视频文件或文本文件,则生成所述数字人的手语播报视频文件,并将所述手语播报视频文件发送给终端。If the multimedia information is a non-real-time audio file, audio and video file or text file, a sign language broadcast video file of the digital person is generated, and the sign language broadcast video file is sent to the terminal.
  5. 根据权利要求4所述的方法,其中,生成所述数字人的流式手语播报视频流,包括:The method of claim 4, wherein generating the digital human's streaming sign language broadcast video stream includes:
    根据所述数字人的配置信息,生成所述数字人的流式手语播报视频流;Generate a streaming sign language broadcast video stream of the digital person according to the configuration information of the digital person;
    生成所述数字人的手语播报视频文件,包括:Generate the sign language broadcast video file of the digital person, including:
    根据所述数字人的配置信息,生成所述数字人的手语播报视频文件;Generate a sign language broadcast video file of the digital person according to the configuration information of the digital person;
    其中,所述数字人的配置信息包括如下至少一种:Wherein, the configuration information of the digital human includes at least one of the following:
    所述数字人的背景、颜色、所述数字人在用户界面中的位置和尺寸。The background, color, and position and size of the digital human in the user interface.
  6. 一种数字人手语播报装置,其中,包括:A digital human sign language broadcasting device, which includes:
    获取模块,用于获取多媒体信息;Acquisition module, used to obtain multimedia information;
    确定模块,用于确定所述多媒体信息对应的自然语言文本;A determination module, used to determine the natural language text corresponding to the multimedia information;
    翻译模块,用于将所述自然语言文本翻译为第一手语文本;A translation module for translating the natural language text into a first sign language text;
    处理模块,用于对所述第一手语文本进行语义精简处理,得到第二手语文本; A processing module for performing semantic simplification processing on the first sign language text to obtain a second sign language text;
    驱动模块,用于根据所述第二手语文本驱动数字人,使得所述数字人通过肢体将所述第二手语文本对应的手语动作表达出来。A driving module is used to drive the digital human according to the second sign language text, so that the digital human expresses the sign language movements corresponding to the second sign language text through the body.
  7. 根据权利要求6所述的装置,其中,所述处理模块还用于在所述确定模块确定所述多媒体信息对应的自然语言文本之后,对所述自然语言文本进行语义精简处理,得到精简处理后的自然语言文本;The device according to claim 6, wherein the processing module is further configured to, after the determining module determines the natural language text corresponding to the multimedia information, perform semantic simplification processing on the natural language text to obtain the streamlined natural language text;
    相应的,所述翻译模块具体用于:将所述精简处理后的自然语言文本翻译为第一手语文本。Correspondingly, the translation module is specifically configured to translate the streamlined natural language text into a first sign language text.
  8. 根据权利要求6所述的装置,其中,所述驱动模块包括获取单元和调整单元;The device according to claim 6, wherein the driving module includes an acquisition unit and an adjustment unit;
    所述获取单元用于当所述多媒体信息是非实时的音频文件或音视频文件时,获取所述音频文件或所述音视频文件中每个音频信号的起始时间和终止时间;The acquisition unit is configured to acquire the start time and end time of each audio signal in the audio file or the audio and video file when the multimedia information is a non-real-time audio file or audio and video file;
    所述调整单元用于根据所述起始时间和所述终止时间,调整所述数字人表达手语动作的速度,使得所述数字人表达的手语动作和所述音频信号在时间轴上对齐。The adjustment unit is configured to adjust the speed at which the digital human expresses sign language movements according to the start time and the end time, so that the sign language movements expressed by the digital human and the audio signal are aligned on the time axis.
  9. 一种电子设备,其中,包括:An electronic device, including:
    存储器;memory;
    处理器;以及processor; and
    计算机程序;Computer program;
    其中,所述计算机程序存储在所述存储器中,并被配置为由所述处理器执行以实现如权利要求1-5中任一项所述的方法。Wherein, the computer program is stored in the memory and configured to be executed by the processor to implement the method according to any one of claims 1-5.
  10. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求1-5中任一项所述的方法。 A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1-5 is implemented.
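The time-axis alignment recited in claim 3 (adjusting the speed at which the digital human signs so that the signing spans each audio segment's start and end time) can be sketched as a speed-factor computation. The nominal per-sign duration and all names below are assumptions for illustration; the patent specifies no concrete values:

```python
# Illustrative sketch of claim 3's alignment step: given an audio segment's
# start and end time, scale the signing speed so the gloss sequence fits the
# same window. NOMINAL_SIGN_SECONDS is an assumed default, not a patent value.

NOMINAL_SIGN_SECONDS = 0.8  # assumed natural duration of one sign

def speed_factor(start: float, end: float, num_signs: int) -> float:
    """Return the playback-speed multiplier that makes num_signs signs
    (at the nominal duration) span exactly the [start, end] audio window.
    A factor > 1 speeds the avatar up; < 1 slows it down."""
    window = end - start
    if window <= 0 or num_signs == 0:
        return 1.0  # degenerate segment: leave the signing speed unchanged
    natural_duration = num_signs * NOMINAL_SIGN_SECONDS
    return natural_duration / window

# Four signs (3.2 s at natural speed) over a 2-second audio segment
# must be played 1.6x faster to stay aligned on the time axis.
print(speed_factor(10.0, 12.0, 4))  # → 1.6
```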
PCT/CN2023/105575 2022-07-04 2023-07-03 Digital human sign language broadcasting method and apparatus, device, and storage medium WO2024008047A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210785961.2A CN115359796A (en) 2022-07-04 2022-07-04 Digital human voice broadcasting method, device, equipment and storage medium
CN202210785961.2 2022-07-04

Publications (1)

Publication Number Publication Date
WO2024008047A1 true WO2024008047A1 (en) 2024-01-11

Family

ID=84030342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105575 WO2024008047A1 (en) 2022-07-04 2023-07-03 Digital human sign language broadcasting method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN115359796A (en)
WO (1) WO2024008047A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359796A (en) * 2022-07-04 2022-11-18 阿里巴巴(中国)有限公司 Digital human voice broadcasting method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210026006A (en) * 2019-08-29 2021-03-10 조용구 Sign language translation system and method for converting voice of video into avatar and animation
CN113835522A (en) * 2021-09-10 2021-12-24 阿里巴巴达摩院(杭州)科技有限公司 Sign language video generation, translation and customer service method, device and readable medium
CN114157920A (en) * 2021-12-10 2022-03-08 深圳Tcl新技术有限公司 Playing method and device for displaying sign language, smart television and storage medium
CN114546326A (en) * 2022-02-22 2022-05-27 浙江核新同花顺网络信息股份有限公司 Virtual human sign language generation method and system
CN115359796A (en) * 2022-07-04 2022-11-18 阿里巴巴(中国)有限公司 Digital human voice broadcasting method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115359796A (en) 2022-11-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834814

Country of ref document: EP

Kind code of ref document: A1