WO2015117373A1 - Method and device for implementing a voice message visualization service - Google Patents

Method and device for implementing a voice message visualization service

Info

Publication number
WO2015117373A1
WO2015117373A1 · PCT/CN2014/088985 · CN2014088985W
Authority
WO
WIPO (PCT)
Prior art keywords
message
module
information
facial expression
dynamic video
Prior art date
Application number
PCT/CN2014/088985
Other languages
English (en)
French (fr)
Inventor
李超
何栩翊
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Priority to US15/328,208 priority Critical patent/US20170270948A1/en
Priority to EP14881955.0A priority patent/EP3174052A1/en
Publication of WO2015117373A1 publication Critical patent/WO2015117373A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals

Definitions

  • the present invention relates to the technical field of implementing voice message visualization services, and in particular to a comprehensive, unified voice message service.
  • the voice message service can divert an incoming call that the user fails to answer in time to voice messaging, so that the caller can leave a message; at some later time, the called party is prompted so that he or she can conveniently listen to the message.
  • the basic architecture of the popular voice messaging system is shown in Figure 2.
  • the core components include an information receiving module, an information storage module, and an information delivery module.
  • the basic working principle is as follows: the user (the voice message sender) sends a message to the voice message system; the information receiving module receives the message and invokes the information storage module to store it; the information delivery module then delivers the voice message to the voice message receiver.
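The receive, store, deliver principle described above can be sketched minimally as follows; the class and method names are illustrative assumptions, not anything specified by the patent:

```python
class VoiceMessageSystem:
    """Minimal sketch of the receive -> store -> deliver principle."""

    def __init__(self):
        # information storage module: recipient -> list of (sender, message)
        self._store = {}

    def receive(self, sender, recipient, message):
        # information receiving module: accept the message, then store it
        self._store.setdefault(recipient, []).append((sender, message))

    def deliver(self, recipient):
        # information delivery module: hand all stored messages to the recipient
        return self._store.pop(recipient, [])


system = VoiceMessageSystem()
system.receive("caller", "callee", "voice-message-1")
assert system.deliver("callee") == [("caller", "voice-message-1")]
assert system.deliver("callee") == []  # nothing left after delivery
```

A real system would of course persist messages and deliver on a prompt to the called party; this only shows the module roles.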
  • the technical problem to be solved by the present invention is to provide a method and device for converting voice messages into video, so that a user can send video information generated based on his or her own facial features.
  • An apparatus for implementing voice message visualization includes an information receiving module and a dynamic video generating module, wherein:
  • the information receiving module is configured to: receive an original message sent by a message sender or locally stored, and a portrait picture, where the original message is a text message or a voice message;
  • the dynamic video generating module is configured to: extract facial features from the portrait picture, generate facial expressions, and synthesize the facial expressions and the original message into dynamic video information, where the generated facial expressions correspond to the content of the original message; the dynamic video information is sent to the message recipient and displayed on the terminal of the message recipient.
  • the dynamic video generation module includes a facial feature extraction sub-module, a facial expression generation sub-module, and an information conversion sub-module, wherein:
  • the facial feature extraction sub-module is configured to: extract a facial feature from the portrait image
  • the facial expression generation sub-module is configured to: generate a facial expression according to the extracted facial features
  • the information conversion sub-module is configured to: split the text or voice message into single words according to a word library, analyze context and emotion from the words, select corresponding facial expression pictures from the generated facial expressions according to the context and emotion, and synthesize the facial expression pictures and the text or voice message into the dynamic video information.
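As a rough sketch of this conversion flow, under the assumption of a toy word list, emotion lexicon, and expression set (none of the names or data below come from the patent; a production system would use a proper tokenizer and sentiment model):

```python
# Hypothetical stand-ins for the word library and the generated expressions.
EMOTION_LEXICON = {"happy": "joy", "great": "joy", "sad": "sorrow", "sorry": "sorrow"}
EXPRESSION_PICTURES = {"joy": "smile.png", "sorrow": "frown.png", "neutral": "neutral.png"}

def split_into_words(text):
    # Stand-in for word-library segmentation.
    return text.lower().split()

def analyze_emotion(words):
    # Return the first emotion found in the lexicon; default to neutral.
    for w in words:
        if w in EMOTION_LEXICON:
            return EMOTION_LEXICON[w]
    return "neutral"

def synthesize_dynamic_video(message_text):
    # One expression frame per word, paired with the word it accompanies.
    frames = []
    for word in split_into_words(message_text):
        emotion = analyze_emotion([word])
        frames.append((word, EXPRESSION_PICTURES[emotion]))
    return frames

assert synthesize_dynamic_video("So happy today") == [
    ("so", "neutral.png"), ("happy", "smile.png"), ("today", "neutral.png")]
```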
  • the device is placed on the side of the voice messaging system.
  • the device further includes an information storage module and an information delivery module, where:
  • the information storage module is configured to: store an original message sent by the message sender and a portrait picture, and store dynamic video information generated by the dynamic video generation module and corresponding receiver information;
  • the information delivery module is configured to: deliver the dynamic video information stored by the information storage module to the message recipient.
  • the message recipient is a mobile terminal user or an interactive network television (IPTV) user.
  • a method for implementing voice message visualization includes: receiving an original message, sent by a message sender or stored locally, together with a portrait picture, where the original message is a text message or a voice message; extracting facial features from the portrait picture, generating facial expressions, and synthesizing the generated facial expressions and the original message into dynamic video information, where the generated facial expressions correspond to the content of the original message;
  • the dynamic video information is sent to the message recipient and displayed on the terminal of the message recipient.
  • the step of synthesizing the generated facial expression and the original message into dynamic video information comprises:
  • the text or voice message is split into single words, context and emotion are analyzed from the words, corresponding facial expression pictures are selected from the generated facial expressions according to the context and emotion, and the facial expression pictures and the text or voice message are synthesized into the dynamic video information.
  • the method further includes:
  • the synthesized dynamic video information is sent to the message recipient.
  • the message recipient is a mobile terminal user or an interactive network television (IPTV) user.
  • the above technical solution converts text messages and voice messages into video messages generated based on the user's facial features, makes the most of available resources, lets users send information more conveniently and enjoyably, and improves market competitiveness, bringing obvious economic and social benefits.
  • FIG. 1 is a schematic diagram of the principle of a current voice message system;
  • FIG. 2 is a schematic structural diagram of a current voice message system;
  • FIG. 3 is a schematic diagram of the principle of an improved voice message system according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of an improved voice message system according to an embodiment of the present invention;
  • FIG. 5 is a flowchart of a user sending a message according to an embodiment of the present invention;
  • FIG. 6 is a flowchart of a user receiving a message according to an embodiment of the present invention;
  • FIG. 7 is a flowchart of application scenario 1 of the present invention;
  • FIG. 8 is a flowchart of converting a voice message into a dynamic video in the embodiment;
  • FIG. 9 is a flowchart of application scenario 2 of the present invention;
  • FIG. 10 is a flowchart of converting a text message into a dynamic video in the embodiment;
  • FIG. 11 is a flowchart of application scenario 3 of the present invention.
  • In the conventional voice message system shown in FIG. 1, the message sender can only simply send information. The inventors of the present application therefore considered that, if a voice message system as shown in FIG. 3 were constructed, a user could upload his or her own photo while sending information, so that the system generates dynamic video information based on the user's facial features and then sends it to the message recipient. In this way, various kinds of data are utilized to a large extent, and user satisfaction and market competitiveness are improved.
  • the embodiment provides a device for implementing voice message visualization.
  • the device includes at least an information receiving module 401 and a dynamic video generating module 402.
  • the information receiving module 401 is configured to: receive the original message sent by the sender of the message, and the portrait picture.
  • the original message is a text message or a voice message.
  • the dynamic video generating module 402 is configured to: extract facial features from the received portrait picture, generate facial expressions, and synthesize the generated facial expressions and the received original message into dynamic video information, where the generated facial expressions correspond to the content of the original message;
  • the dynamic video information is displayed on the terminal of the message recipient or sent to the message recipient.
  • the dynamic video generation module 402 includes a facial feature extraction sub-module 4021, a facial expression generation sub-module 4022, and an information conversion sub-module 4023.
  • the information receiving module 401 is connected to the information storage module 403 and the facial feature extraction sub-module 4021 mentioned below.
  • the information receiving module 401 is configured to: receive an original message (i.e., a text or voice message) sent by a user (message sender), together with a portrait picture.
  • after a message upload request is received, the facial feature extraction sub-module 4021 is first invoked to perform the information conversion processing flow, and the result is finally returned to the message sender.
  • the facial feature extraction sub-module 4021 is connected to the information receiving module 401 and the facial expression generating sub-module 4022.
  • the facial feature extraction sub-module 4021 is configured to extract a facial feature from a picture uploaded by the user, and then invoke the facial expression generating sub-module 4022.
  • the facial expression generation sub-module 4022 is connected to the facial feature extraction sub-module 4021 and the information conversion sub-module 4023.
  • the facial expression generation sub-module 4022 is configured to generate a facial expression based on the facial features and then invoke the information conversion sub-module 4023.
  • the information conversion sub-module 4023 is connected to the facial expression generation sub-module 4022.
  • the information conversion sub-module 4023 is configured to synthesize the original message information and facial expressions of the sender into new dynamic video information. It splits the text or voice message into a single word according to the word library, analyzes the context and emotion according to the word, selects the corresponding facial expression picture from the generated facial expression according to the context and emotion, and synthesizes the facial expression picture with the text or voice message.
  • the dynamic video generated by the information conversion sub-module 4023 can reflect the content of the original text or voice message, so that the user can also obtain the message content through the dynamic picture.
  • the device in this embodiment may be placed on the voice message system side.
  • the device may further include an information storage module 403 and an information delivery module 404, where:
  • the information storage module 403 is connected to the information receiving module 401, the information conversion sub-module 4023, and the information delivery module 404 (the entire device architecture is as shown in FIG. 4).
  • the information storage module 403 is mainly responsible for saving the generated dynamic video information and the corresponding recipient information, so that the information delivery module can query a user's messages.
  • the information storage module 403 is further configured to store the original message and the portrait picture sent by the message sender.
  • the information delivery module 404 can send the dynamic video information stored in the information storage module 403 to the corresponding recipient.
  • after detecting that the user (i.e., the recipient) has powered on, the information delivery module 404 invokes the information storage module 403 to query that user's voice messages (i.e., the dynamic video information in this embodiment), and then delivers them.
  • the flow of sending information by a user in the apparatus provided in this embodiment includes the following steps:
  • Step 501 The user sends a portrait picture along with the information.
  • the portrait picture may be sent at any time within a set period after the original message is sent.
  • Step 502 The device for implementing voice message visualization on the voice message system side receives the message.
  • Step 503 The facial feature extraction sub-module extracts a facial feature of the sender according to the picture.
  • Step 504 The facial expression generation sub-module generates a corresponding facial expression according to the facial features of the sender.
  • Step 505 The information conversion sub-module synthesizes the original information of the sender and the facial expression into new video information.
  • the generated facial expression corresponds to the original message content; that is, the generated dynamic video can reflect the content of the original text or voice message, so that the user can also obtain the message content through the dynamic picture.
  • Step 506 The information storage module stores the information.
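Steps 502 through 506 above amount to a chain of sub-module calls. They might be orchestrated roughly as follows; every class and function name here is an illustrative assumption, with each real sub-module replaced by a stub:

```python
class MessageSendPipeline:
    """Illustrative orchestration of steps 502-506; each stage is injected."""

    def __init__(self, extract, generate, synthesize, store):
        self.extract = extract        # facial feature extraction sub-module
        self.generate = generate      # facial expression generation sub-module
        self.synthesize = synthesize  # information conversion sub-module
        self.store = store            # information storage module

    def handle(self, original_message, portrait_picture):
        features = self.extract(portrait_picture)               # step 503
        expressions = self.generate(features)                   # step 504
        video = self.synthesize(original_message, expressions)  # step 505
        self.store(video)                                       # step 506
        return video


stored = []
pipeline = MessageSendPipeline(
    extract=lambda pic: f"features({pic})",
    generate=lambda feats: f"expressions({feats})",
    synthesize=lambda msg, expr: (msg, expr),
    store=stored.append,
)
result = pipeline.handle("hello", "photo.jpg")
assert result == ("hello", "expressions(features(photo.jpg))")
assert stored == [result]
```

Injecting the stages as callables keeps the orchestration testable without real image or video processing.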
  • the process for receiving information by a user in the apparatus provided in this embodiment includes the following steps:
  • Step 601 The user logs in to the voice message system.
  • Step 602 The device for implementing voice message visualization determines whether the user has messages to receive, and delivers any messages that meet the conditions.
  • Step 603 The user receives the information sent by another person.
  • the apparatus of this embodiment can convert a voice message into a video generated based on the user's facial features, which is then delivered to the user (receiver).
  • the process is as shown in Figure 7, and includes the following operations:
  • Step 701 User A sends a voice message to User B while uploading his own photo.
  • Step 702 The device for visualizing the voice message receives the message and converts the message into a video.
  • that is, the voice message and the portrait picture are synthesized into dynamic video information.
  • the specific process is as shown in FIG. 8 and includes the following operations:
  • Step 800 Convert the voice message into text by audio analysis;
  • Step 802 Split the text into single words according to the word library;
  • Step 804 Analyze context and emotion from the words;
  • Step 806 Select corresponding facial expression pictures according to the context and emotion;
  • the facial expression pictures are generated from the facial features extracted from the portrait picture;
  • Step 808 Synthesize the facial expression pictures into a dynamic video.
  • Step 703 Store the converted message and wait for user B to receive.
  • Step 704 After user B logs in to the voice message system, the converted message is received.
  • In this application scenario, user B cannot listen to voice but can read lips, so the content of the original message, that is, the content of the voice message, can be obtained through the synthesized dynamic video information.
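Compared with the text-message case, the voice-message flow of steps 800-808 only adds a speech-to-text stage in front of the same conversion pipeline. A minimal sketch under that assumption, with stand-in functions for both stages (nothing here names a real speech-recognition API):

```python
def voice_message_to_video(audio_blob, portrait_picture,
                           speech_to_text, text_to_video):
    """Steps 800-808 in one function: transcribe, then reuse the text pipeline."""
    text = speech_to_text(audio_blob)             # step 800: audio analysis -> text
    return text_to_video(text, portrait_picture)  # steps 802-808: text -> dynamic video


# Toy stand-ins: a fake transcriber and a pipeline that records its inputs.
fake_stt = lambda audio: "hello world"
fake_pipeline = lambda text, pic: {"text": text, "portrait": pic}

video = voice_message_to_video(b"raw-audio-bytes", "face.jpg",
                               fake_stt, fake_pipeline)
assert video == {"text": "hello world", "portrait": "face.jpg"}
```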
  • the device of the embodiment can convert the text message into a video generated based on the facial features of the user, and then deliver the video to the user (receiver).
  • the process is as shown in FIG. 9 and includes the following operations:
  • Step 901 User A sends a text message to User B while uploading a portrait picture.
  • Step 902 The voice message visualization implementation device receives the message and converts the message to a video.
  • the synthesis of the dynamic video information from the text message and the portrait picture includes the following operations, as shown in FIG. 10:
  • Step 1000 Split the text message into single words according to the word library;
  • Step 1002 Analyze context and emotion from the words;
  • Step 1004 Select corresponding facial expression pictures according to the context and emotion;
  • the facial expression pictures are generated from the facial features extracted from the portrait picture;
  • Step 1006 Synthesize the facial expression pictures into a dynamic video.
  • Step 903 The device for implementing voice message visualization stores the converted message, waiting for user B to receive it.
  • Step 904 After user B logs in to the voice message system, the converted message is received.
  • the device in this embodiment has an interface for exchanging messages with an IPTV system. Therefore, the device in this embodiment can be combined with the IPTV system, so that IPTV users receive the converted information on the television.
  • the specific implementation process is as shown in FIG. 11 and includes the following operations:
  • Step 1101 User A sends a text message to User B while uploading a portrait picture.
  • Step 1102 The voice message visualization implementation device receives the message and converts the message into a video.
  • the text message and the portrait picture are combined to synthesize the dynamic video information, and the specific operation is as described above.
  • Step 1103 The device for implementing voice message visualization stores the converted message and forwards the message to the IPTV system.
  • Step 1104 After user B logs in to the IPTV system, the converted message is received.
  • This embodiment provides a method for implementing voice message visualization, which can be implemented based on the apparatus in Embodiment 1 above.
  • the method includes the following operations:
  • the dynamic video information is displayed on the terminal of the message recipient or sent to the message recipient.
  • the synthesized dynamic video information and the corresponding recipient information may also be stored, so that the corresponding dynamic video information can be queried when delivering messages to a user.
  • the synthesized dynamic video information is sent to the corresponding recipient.
  • the process of extracting facial features from the portrait picture, generating facial expressions, and synthesizing the generated facial expressions and the original message into dynamic video information is as follows: first, facial features are extracted from the received portrait picture; facial expressions are then generated according to the extracted features; finally, the original message (such as a text or voice message) is split into single words according to a word library, context and emotion are analyzed from the words, corresponding facial expression pictures are selected from the generated facial expressions according to the context and emotion, and the facial expression pictures and the text or voice message are synthesized into a dynamic video.
  • the receiver involved in the foregoing method may be a mobile terminal user or an IPTV user.
  • the above technical solution converts text messages and voice messages into video messages generated based on the user's facial features, makes the most of available resources, lets users send information more conveniently and enjoyably, and improves market competitiveness; it therefore has strong industrial applicability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Data Mining & Analysis (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A method and device for implementing a voice message visualization service. The device includes at least: an information receiving module (401), which receives an original message, sent by a message sender or stored locally, together with a portrait picture, where the original message is a text message or a voice message; and a dynamic video generation module (402), which extracts facial features from the portrait picture, generates facial expressions, and synthesizes the facial expressions and the original message into dynamic video information, where the generated facial expressions correspond to the content of the original message; the dynamic video information is sent to a message recipient and displayed on the recipient's terminal. This technical solution makes the most of available resources and lets users send information more conveniently and enjoyably.

Description

Method and device for implementing a voice message visualization service

Technical Field

The present invention relates to the technical field of implementing voice message visualization services, and in particular to a comprehensive, unified voice message service.

Background

Today, with the rapid development of information communication, people using the telephone are often troubled by situations such as the following: a call goes unanswered for a long time because the other party is away; an important call is missed while out on business; or it is inconvenient to answer the phone during important work or meetings. Hence the voice message service came into being. A voice message service can divert a call that the user fails to answer in time to voice messaging so that the caller can leave a message, and at some later time prompt the called party so that he or she can conveniently listen to the message.

This was the initial stage of the development of the voice message service. However, as 3G and next-generation network technologies have matured and moved into commercial use, service applications based on 3G networks have become increasingly rich. The emergence of smartphones has further enriched the means of interaction between users and the voice message service; its most notable feature is that users can upload location information, pictures, and other kinds of data through their smartphones.

As shown in FIG. 1, when a user uses today's voice message systems, whatever information is sent is exactly what is received. The basic architecture of the currently popular voice message system is shown in FIG. 2; its core modules include an information receiving module, an information storage module, and an information delivery module. The basic working principle is as follows: a user (the voice message sender) sends a message to the voice message system; the information receiving module receives the message and invokes the information storage module to store it; the information delivery module then delivers the voice message to the voice message receiver.
Summary of the Invention

The technical problem to be solved by the present invention is to provide a method and device for converting voice messages into video, so that a user can send video information generated based on his or her own facial features.

To solve the above technical problem, the following technical solution is adopted:

A device for implementing voice message visualization includes an information receiving module and a dynamic video generation module, wherein:

the information receiving module is configured to receive an original message, sent by a message sender or stored locally, together with a portrait picture, where the original message is a text message or a voice message;

the dynamic video generation module is configured to extract facial features from the portrait picture, generate facial expressions, and synthesize the facial expressions and the original message into dynamic video information, where the generated facial expressions correspond to the content of the original message; the dynamic video information is sent to a message recipient and displayed on the terminal of the message recipient.

Optionally, the dynamic video generation module includes a facial feature extraction sub-module, a facial expression generation sub-module, and an information conversion sub-module, wherein:

the facial feature extraction sub-module is configured to extract facial features from the portrait picture;

the facial expression generation sub-module is configured to generate facial expressions according to the extracted facial features;

the information conversion sub-module is configured to split the text or voice message into single words according to a word library, analyze context and emotion from the words, select corresponding facial expression pictures from the generated facial expressions according to the context and emotion, and synthesize the facial expression pictures and the text or voice message into the dynamic video information.

Optionally, the device is placed on the voice message system side.

Optionally, the device further includes an information storage module and an information delivery module, wherein:

the information storage module is configured to store the original message and the portrait picture sent by the message sender, as well as the dynamic video information generated by the dynamic video generation module and the corresponding recipient information;

the information delivery module is configured to deliver the dynamic video information stored by the information storage module to the message recipient.

Optionally, the message recipient is a mobile terminal user or an interactive network television (IPTV) user.

A method for implementing voice message visualization includes:

receiving an original message, sent by a message sender or stored locally, together with a portrait picture, where the original message is a text message or a voice message;

extracting facial features from the portrait picture, generating facial expressions, and synthesizing the generated facial expressions and the original message into dynamic video information, where the generated facial expressions correspond to the content of the original message;

sending the dynamic video information to a message recipient and displaying it on the terminal of the message recipient.

Optionally, the step of synthesizing the generated facial expressions and the original message into dynamic video information includes:

splitting the text or voice message into single words according to a word library, analyzing context and emotion from the words, selecting corresponding facial expression pictures from the generated facial expressions according to the context and emotion, and synthesizing the facial expression pictures and the text or voice message into the dynamic video information.

Optionally, the method further includes:

delivering the synthesized dynamic video information to the message recipient.

Optionally, the message recipient is a mobile terminal user or an interactive network television (IPTV) user.

The above technical solution converts text messages and voice messages into video messages generated based on the user's facial features, makes the most of available resources, lets users send information more conveniently and enjoyably, and improves market competitiveness, bringing obvious economic and social benefits.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the principle of a current voice message system;

FIG. 2 is a schematic structural diagram of a current voice message system;

FIG. 3 is a schematic diagram of the principle of an improved voice message system according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an improved voice message system according to an embodiment of the present invention;

FIG. 5 is a flowchart of a user sending a message in an embodiment of the present invention;

FIG. 6 is a flowchart of a user receiving a message in an embodiment of the present invention;

FIG. 7 is a flowchart of application scenario 1 of the present invention;

FIG. 8 is a flowchart of converting a voice message into a dynamic video in the embodiment;

FIG. 9 is a flowchart of application scenario 2 of the present invention;

FIG. 10 is a flowchart of converting a text message into a dynamic video in the embodiment;

FIG. 11 is a flowchart of application scenario 3 of the present invention.

Preferred Embodiments of the Invention

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings. It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
Embodiment 1

In the conventional voice message system shown in FIG. 1, the message sender can only simply send information. The inventors of the present application therefore considered that, if a voice message system as shown in FIG. 3 were constructed, a user could upload his or her own photo while sending information, so that the system generates dynamic video information based on the user's facial features and then sends it to the message recipient. In this way, various kinds of data are utilized to a large extent, and user satisfaction and market competitiveness are improved.

Based on the above idea, this embodiment provides a device for implementing voice message visualization. As shown in FIG. 4, it includes at least an information receiving module 401 and a dynamic video generation module 402.

The information receiving module 401 is configured to receive the original message sent by the message sender together with a portrait picture; in this embodiment the original message is a text message or a voice message.

The dynamic video generation module 402 is configured to extract facial features from the received portrait picture, generate facial expressions, and synthesize the generated facial expressions and the received original message into dynamic video information, where the generated facial expressions correspond to the content of the original message.

The dynamic video information is displayed on the message recipient's terminal or sent to the message recipient.

Specifically, the dynamic video generation module 402 includes a facial feature extraction sub-module 4021, a facial expression generation sub-module 4022, and an information conversion sub-module 4023.

The information receiving module 401 is connected to the information storage module 403 mentioned below and to the facial feature extraction sub-module 4021. The information receiving module 401 is configured to receive the original message (i.e., a text or voice message) and the portrait picture sent by the user (message sender). After receiving a message upload request, it first invokes the facial feature extraction sub-module 4021 to carry out the information conversion processing flow, and finally returns the result to the message sender.

The facial feature extraction sub-module 4021 is connected to the information receiving module 401 and the facial expression generation sub-module 4022. It is configured to extract facial features from the picture uploaded by the user and then invoke the facial expression generation sub-module 4022.

The facial expression generation sub-module 4022 is connected to the facial feature extraction sub-module 4021 and the information conversion sub-module 4023. It is configured to generate facial expressions according to the facial features and then invoke the information conversion sub-module 4023.

The information conversion sub-module 4023 is connected to the facial expression generation sub-module 4022 and is configured to synthesize the sender's original message and the facial expressions into new dynamic video information. It splits the text or voice message into single words according to a word library, analyzes context and emotion from the words, selects corresponding facial expression pictures from the generated facial expressions according to the context and emotion, and synthesizes the facial expression pictures and the text or voice message into a dynamic video. In other words, the dynamic video generated by the information conversion sub-module 4023 can reflect the content of the original text or voice message, so that a user can also obtain the message content through the dynamic picture.

It should be noted that the device in this embodiment may be placed on the voice message system side. In that case, it may further include an information storage module 403 and an information delivery module 404, with the information storage module 403 connected to the information receiving module 401, the information conversion sub-module 4023, and the information delivery module 404 (the entire device architecture is then as shown in FIG. 4). The information storage module 403 is mainly responsible for saving the generated dynamic video information and the corresponding recipient information, so that the information delivery module can query a user's messages. Preferably, the information storage module 403 is further configured to store the original message and the portrait picture sent by the message sender. The information delivery module 404 then simply delivers the dynamic video information stored by the information storage module 403 to the corresponding recipient.

Specifically, after detecting that a user (i.e., a recipient) has powered on, the information delivery module 404 invokes the information storage module 403 to query that user's voice messages (i.e., the dynamic video information in this embodiment), and then delivers the messages.
The sending of information in the device of this embodiment is described in detail below, taking the user message-sending flow and the user message-receiving flow as examples:

FIG. 5 shows the flow of a user sending information in the device provided by this embodiment, which includes the following steps:

Step 501: The user sends a portrait picture along with the information.

In this step, the user may send the portrait picture at any time within a set period after sending the original message.

Step 502: The device for implementing voice message visualization on the voice message system side receives the message.

Step 503: The facial feature extraction sub-module extracts the sender's facial features from the picture.

Step 504: The facial expression generation sub-module generates corresponding facial expressions according to the sender's facial features.

Step 505: The information conversion sub-module synthesizes the sender's original information and the facial expressions into new video information.

Here, the generated facial expressions correspond to the content of the original message; that is, the generated dynamic video can reflect the content of the original text or voice message, so that a user can also obtain the message content through the dynamic picture.

Step 506: The information storage module stores the information.

FIG. 6 shows the flow of a user receiving information in the device provided by this embodiment, which includes the following steps:

Step 601: The user logs in to the voice message system.

Step 602: The device for implementing voice message visualization determines whether the user has messages to receive; if there are messages that meet the conditions, they are delivered.

Step 603: The user receives the information sent by others.
The working process of the above voice message system is further illustrated below with specific application scenarios.

Application scenario 1:

The device of this embodiment can convert a voice message into a video generated based on the user's facial features and then deliver it to the user (receiver). The process, shown in FIG. 7, includes the following operations:

Step 701: User A sends a voice message to user B while uploading his or her own photo.

Step 702: The device for implementing voice message visualization receives the message and converts it into a video.

That is, the voice message and the portrait picture are synthesized into dynamic video information. The specific process, shown in FIG. 8, includes the following operations:

Step 800: Convert the voice message into text by audio analysis;

Step 802: Split the text into single words according to the word library;

Step 804: Analyze context and emotion from the words;

Step 806: Select corresponding facial expression pictures according to the context and emotion;

Here, the facial expression pictures are generated from the facial features extracted from the portrait picture.

Step 808: Synthesize the facial expression pictures into a dynamic video.

Step 703: Store the converted message until user B receives it.

Step 704: After user B logs in to the voice message system, the converted message is received.

In this application scenario, user B is a person with a hearing disability who cannot listen to voice but can read lips, so the content of the original message, i.e., the content of the voice message, can be obtained through the synthesized dynamic video information.
Application scenario 2:

The device of this embodiment can convert a text message into a video generated based on the user's facial features and then deliver it to the user (receiver). The process, shown in FIG. 9, includes the following operations:

Step 901: User A sends a text message to user B while uploading a portrait picture.

Step 902: The device for implementing voice message visualization receives the message and converts it into a video.

That is, the text message and the portrait picture are synthesized into dynamic video information, which specifically includes the following operations, as shown in FIG. 10:

Step 1000: Split the text message into single words according to the word library;

Step 1002: Analyze context and emotion from the words;

Step 1004: Select corresponding facial expression pictures according to the context and emotion;

Here, the facial expression pictures are generated from the facial features extracted from the portrait picture.

Step 1006: Synthesize the facial expression pictures into a dynamic video.

Step 903: The device for implementing voice message visualization stores the converted message until user B receives it.

Step 904: After user B logs in to the voice message system, the converted message is received.
Application scenario 3:

The device in this embodiment has an interface for exchanging messages with an IPTV system, so the device can be combined with the IPTV system, allowing IPTV users to receive the converted information on the television. The specific implementation process, shown in FIG. 11, includes the following operations:

Step 1101: User A sends a text message to user B while uploading a portrait picture.

Step 1102: The device for implementing voice message visualization receives the message and converts it into a video.

That is, the text message and the portrait picture are synthesized into dynamic video information; the specific operations are as described above.

Step 1103: The device for implementing voice message visualization stores the converted message and forwards the message to the IPTV system.

Step 1104: After user B logs in to the IPTV system, the converted message is received.
Embodiment 2

This embodiment provides a method for implementing voice message visualization, which can be implemented based on the device of Embodiment 1 above. The method includes the following operations:

receiving the original message and the portrait picture sent by the message sender, where the original message is a text message or a voice message;

extracting facial features from the received portrait picture, generating facial expressions, and synthesizing the generated facial expressions and the received original message into dynamic video information, where the generated facial expressions correspond to the content of the original message;

displaying the dynamic video information on the message recipient's terminal or sending it to the message recipient.

On the basis of the above method, the synthesized dynamic video information and the corresponding recipient information may also be stored, so that the corresponding dynamic video information can be queried when delivering messages to a user.

Of course, the original message and the portrait picture sent by the message sender may also be stored.

Finally, the synthesized dynamic video information is simply delivered to the corresponding recipient.

Specifically, in this embodiment, the process of extracting facial features from the portrait picture, generating facial expressions, and synthesizing the generated facial expressions and the original message into dynamic video information is as follows:

first, facial features are extracted from the received portrait picture;

facial expressions are then generated according to the extracted facial features;

finally, the original message (e.g., a text or voice message) is split into single words according to a word library, context and emotion are analyzed from the words, corresponding facial expression pictures are selected from the generated facial expressions according to the context and emotion, and the facial expression pictures and the text or voice message are synthesized into a dynamic video.

It should also be noted that the recipient involved in the above method may be a mobile terminal user or an IPTV user.

For the specific implementation of the above method, reference may also be made to the corresponding content of Embodiment 1 above, which is not repeated here.

Those of ordinary skill in the art will understand that all or some of the steps of the above method may be completed by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules/units in the above embodiments may be implemented in the form of hardware or in the form of software functional modules. The present application is not limited to any particular combination of hardware and software.

The above are merely preferred examples of the present invention and are not intended to limit its scope of protection. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Industrial Applicability

The above technical solution converts text messages and voice messages into video messages generated based on the user's facial features, makes the most of available resources, lets users send information more conveniently and enjoyably, and improves market competitiveness; it therefore has strong industrial applicability.

Claims (11)

  1. A device for implementing voice message visualization, comprising an information receiving module and a dynamic video generation module, wherein:
    the information receiving module is configured to receive an original message, sent by a message sender or stored locally, together with a portrait picture, wherein the original message is a text message or a voice message;
    the dynamic video generation module is configured to extract facial features from the portrait picture, generate facial expressions, and synthesize the facial expressions and the original message into dynamic video information, wherein the generated facial expressions correspond to the content of the original message; the dynamic video information is sent to a message recipient and displayed on the terminal of the message recipient.
  2. The implementation device according to claim 1, wherein the dynamic video generation module comprises a facial feature extraction sub-module, a facial expression generation sub-module, and an information conversion sub-module, wherein:
    the facial feature extraction sub-module is configured to extract facial features from the portrait picture;
    the facial expression generation sub-module is configured to generate facial expressions according to the extracted facial features;
    the information conversion sub-module is configured to split the text or voice message into single words according to a word library, analyze context and emotion from the words, select corresponding facial expression pictures from the generated facial expressions according to the context and emotion, and synthesize the facial expression pictures and the text or voice message into the dynamic video information.
  3. The implementation device according to claim 1 or 2, wherein the device is placed on the voice message system side.
  4. The implementation device according to claim 3, further comprising an information storage module and an information delivery module, wherein:
    the information storage module is configured to store the original message and the portrait picture sent by the message sender, as well as the dynamic video information generated by the dynamic video generation module and the corresponding recipient information;
    the information delivery module is configured to deliver the dynamic video information stored by the information storage module to the message recipient.
  5. The implementation device according to claim 4, wherein the message recipient is a mobile terminal user or an interactive network television (IPTV) user.
  6. A method for implementing voice message visualization, comprising:
    receiving an original message, sent by a message sender or stored locally, together with a portrait picture, wherein the original message is a text message or a voice message;
    extracting facial features from the portrait picture, generating facial expressions, and synthesizing the generated facial expressions and the original message into dynamic video information, wherein the generated facial expressions correspond to the content of the original message;
    sending the dynamic video information to a message recipient and displaying it on the terminal of the message recipient.
  7. The implementation method according to claim 6, wherein the step of synthesizing the generated facial expressions and the original message into dynamic video information comprises:
    splitting the text or voice message into single words according to a word library, analyzing context and emotion from the words, selecting corresponding facial expression pictures from the generated facial expressions according to the context and emotion, and synthesizing the facial expression pictures and the text or voice message into the dynamic video information.
  8. The implementation method according to claim 6 or 7, further comprising:
    delivering the synthesized dynamic video information to the message recipient.
  9. The implementation method according to claim 8, wherein the message recipient is a mobile terminal user or an interactive network television (IPTV) user.
  10. A computer program, comprising program instructions which, when executed by a device for implementing voice message visualization, cause the device to perform the method for implementing voice message visualization according to any one of claims 6-9.
  11. A carrier carrying the computer program according to claim 10.
PCT/CN2014/088985 2014-07-22 2014-10-20 Method and device for implementing a voice message visualization service WO2015117373A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US15/328,208 US20170270948A1 (en) 2014-07-22 2014-10-20 Method and device for realizing voice message visualization service
EP14881955.0A EP3174052A1 (en) 2014-07-22 2014-10-20 Method and device for realizing voice message visualization service

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410351478.9A 2014-07-22 Method and device for implementing a voice message visualization service
CN201410351478.9 2014-07-22

Publications (1)

Publication Number Publication Date
WO2015117373A1 true WO2015117373A1 (zh) 2015-08-13

Family

ID=53777193

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/088985 WO2015117373A1 (zh) 2014-07-22 2014-10-20 一种语音消息可视化服务的实现方法及装置

Country Status (4)

Country Link
US (1) US20170270948A1 (zh)
EP (1) EP3174052A1 (zh)
CN (1) CN105282621A (zh)
WO (1) WO2015117373A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180859A1 (en) * 2016-08-02 2019-06-13 Beyond Verbal Communication Ltd. System and method for creating an electronic database using voice intonation analysis score correlating to human affective states
CN107977928B * 2017-12-21 2022-04-19 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Expression generation method, device, terminal and storage medium
CN109189985B * 2018-08-17 2020-10-09 Beijing Dajia Internet Information Technology Co., Ltd. Text style processing method, device, electronic equipment and storage medium
CN109698962A * 2018-12-10 2019-04-30 Visionvera Information Technology Co., Ltd. Real-time video communication method and system
JP7356005B2 * 2019-09-06 2023-10-04 Nippon Telegraph And Telephone Corporation Voice conversion device, voice conversion learning device, voice conversion method, voice conversion learning method, and computer program
US11996113B2 (en) 2021-10-29 2024-05-28 Snap Inc. Voice notes with changing effects
CN114979054B * 2022-05-13 2024-06-18 Vivo Mobile Communication Co., Ltd. Video generation method and apparatus, electronic device, and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003153159A * 2001-11-09 2003-05-23 Monolith Co Ltd Image processing method, image processing apparatus, and computer program
CN1460232A * 2001-03-29 2003-12-03 Koninklijke Philips Electronics N.V. Text to visual speech system and method incorporating facial emotions
US6675145B1 (en) * 1999-10-26 2004-01-06 Advanced Telecommunications Research Institute International Method and system for integrated audiovisual speech coding at low bitrate
CN1607829A * 2003-05-20 2005-04-20 NTT Docomo, Inc. Portable terminal and image communication program
US20060281064A1 (en) * 2005-05-25 2006-12-14 Oki Electric Industry Co., Ltd. Image communication system for compositing an image according to emotion input
CN1971621A * 2006-11-10 2007-05-30 Institute of Computing Technology, Chinese Academy of Sciences Cartoon face animation generation method jointly driven by voice and text

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6330023B1 (en) * 1994-03-18 2001-12-11 American Telephone And Telegraph Corporation Video signal processing systems and methods utilizing automated speech analysis
US6473628B1 (en) * 1997-10-31 2002-10-29 Sanyo Electric Co., Ltd. Telephone set
AU2002216240A1 (en) * 2000-12-22 2002-07-08 Anthropics Technology Limited Communication system
WO2003021924A1 (en) * 2001-08-29 2003-03-13 Roke Manor Research Limited A method of operating a communication system
US8660247B1 (en) * 2009-04-06 2014-02-25 Wendell Brown Method and apparatus for content presentation in association with a telephone call
US9082400B2 (en) * 2011-05-06 2015-07-14 Seyyer, Inc. Video generation based on text
CN102271241A * 2011-09-02 2011-12-07 Beijing University of Posts and Telecommunications Image communication method and system based on facial expression/action recognition
CN103647922A * 2013-12-20 2014-03-19 Baidu Online Network Technology (Beijing) Co., Ltd. Virtual video call method and terminal

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6675145B1 (en) * 1999-10-26 2004-01-06 Advanced Telecommunications Research Institute International Method and system for integrated audiovisual speech coding at low bitrate
CN1460232A * 2001-03-29 2003-12-03 Koninklijke Philips Electronics N.V. Text to visual speech system and method incorporating facial emotions
JP2003153159A * 2001-11-09 2003-05-23 Monolith Co Ltd Image processing method, image processing apparatus, and computer program
CN1607829A * 2003-05-20 2005-04-20 NTT Docomo, Inc. Portable terminal and image communication program
US20060281064A1 (en) * 2005-05-25 2006-12-14 Oki Electric Industry Co., Ltd. Image communication system for compositing an image according to emotion input
CN1971621A * 2006-11-10 2007-05-30 Institute of Computing Technology, Chinese Academy of Sciences Cartoon face animation generation method jointly driven by voice and text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3174052A4 *

Also Published As

Publication number Publication date
EP3174052A4 (en) 2017-05-31
EP3174052A1 (en) 2017-05-31
CN105282621A (zh) 2016-01-27
US20170270948A1 (en) 2017-09-21

Similar Documents

Publication Publication Date Title
WO2015117373A1 (zh) Method and device for realizing voice message visualization service
US10757050B2 (en) System and method for topic based segregation in instant messaging
JP5021042B2 (ja) Apparatus and method for providing and displaying animated SMS messages
US7792253B2 (en) Communications involving devices having different communication modes
KR101880656B1 (ko) Mobile terminal, display apparatus and control method thereof
US7463723B2 (en) Method to enable instant collaboration via use of pervasive messaging
CN111935443B (zh) Method and device for sharing a real-time video conference live stream to an instant messaging tool
US11482240B2 (en) Presentation of communications
WO2020124725A1 (zh) Audio and video pushing method based on the WebRTC protocol, and stream-pushing client
JP2006528804A (ja) Method, system and computer program for enabling telephone users to participate in an instant-messaging-based conference (access to enhanced conference services using a telechat system)
WO2008079505A2 (en) Method and apparatus for hybrid audio-visual communication
US9172795B1 (en) Phone call context setting
CN107548552B (zh) Processing fax transmissions using a mobile device
US10362173B2 (en) Web real-time communication from an audiovisual file
JP2007201906A (ja) Portable terminal device and image display method
CN102264044A (zh) Video short message sending method, device and system
CN106302083B (zh) Instant messaging method and server
WO2012088789A1 (zh) Method for transmitting information in a videophone call and communication terminal
EP2536176B1 (en) Text-to-speech injection apparatus for telecommunication system
KR100691861B1 (ko) Call control system and method for a mobile terminal using a PC
WO2015196552A1 (zh) Message processing method, device and terminal
KR100872076B1 (ko) Method for providing a substitute video service during a video call, and system and mobile communication terminal therefor
WO2015196547A1 (zh) Message processing method, device and terminal
JP5327881B2 (ja) ボイスメールファイル送受信システム、ボイスメールファイル送受信端末、その方法及びそのプログラム
WO2015161559A1 (zh) Method and system for implementing a mailbox service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14881955

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2014881955

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2014881955

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 15328208

Country of ref document: US