WO2022041192A1 - Voice message processing method, device, and instant messaging client - Google Patents

Voice message processing method, device, and instant messaging client (语音消息处理方法、设备及即时通信客户端) Download PDF

Info

Publication number
WO2022041192A1
WO2022041192A1 (PCT application PCT/CN2020/112463, CN2020112463W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
image data
voice message
streaming media
voice
Prior art date
Application number
PCT/CN2020/112463
Other languages
English (en)
French (fr)
Inventor
马宇尘
Original Assignee
深圳市永兴元科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市永兴元科技股份有限公司
Publication of WO2022041192A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10 Multimedia information

Definitions

  • The present invention relates to the technical field of communication interaction.
  • Instant messaging (IM) is the most popular communication method of the mobile-internet era. Various instant messaging applications support not only the instant transmission of text messages but also the exchange of voice and video messages between users.
  • When interacting via voice messages through an IM tool, the user can activate the terminal's microphone or other voice collection device to record a voice message and then transmit it over the Internet to the target recipient. After entering a play instruction, the recipient can play the voice message and may also reply by voice.
  • At present, to let users choose whether to listen to a voice message depending on the occasion, a text conversion function for voice messages has also been added, so that the converted text content and the recorded audio file can be sent together to the recipient as an instant message.
  • Some communication tools also offer a speech synthesis function that converts text into speech, known as Text To Speech (TTS).
  • TTS solutions fall into two main classes: concatenative systems and parametric systems, both of which require text analysis. The former splices a large inventory of recorded speech fragments, guided by the text analysis results, into synthetic speech; the latter uses the text analysis results to generate speech parameters, such as the fundamental frequency, through a model, and then converts them into a waveform.
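  • Purely as a hypothetical sketch of the parametric idea (the patent does not specify any implementation), the following Python fragment renders a waveform from a per-frame fundamental-frequency contour; a real parametric TTS system would also model the spectral envelope, energy, and duration:

```python
import numpy as np

def synthesize_from_f0(f0_contour_hz, frame_ms=10, sr=16000):
    """Render a crude waveform from per-frame fundamental-frequency values.

    Toy illustration of 'parameters -> waveform' only: each frame is a pure
    tone at the given F0, with phase kept continuous across frames.
    """
    samples_per_frame = int(sr * frame_ms / 1000)
    phase = 0.0
    chunks = []
    for f0 in f0_contour_hz:
        t = np.arange(samples_per_frame) / sr
        chunks.append(np.sin(2 * np.pi * f0 * t + phase))
        phase += 2 * np.pi * f0 * samples_per_frame / sr  # phase continuity
    return np.concatenate(chunks)

# Example: a falling pitch contour, as at the end of a declarative sentence.
waveform = synthesize_from_f0(np.linspace(220, 180, 50))
```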
  • The existing voice message function only incorporates text conversion and does not consider deeper information such as the user's facial expression, emotional state, or tone and intonation at recording time. This makes it difficult to meet user needs; especially for young people who enjoy sticker battles ("doutu") with dynamic images, voice messages lack fun.
  • The purpose of the present invention is to overcome the deficiencies of the prior art and provide a voice message processing method, device, and instant messaging client.
  • With the invention, streaming media data can be obtained for multiple parts of the semantic content of speech, improving the convenience, intelligence, and fun of voice message interaction and enhancing the user experience.
  • To achieve the above objective, the present invention provides the following technical solutions:
  • A voice message processing method, comprising the steps of: collecting a voice message; recognizing the voice message to obtain its semantic content; acquiring corresponding streaming media data for multiple parts of the semantic content; and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
  • Further, the user's selection operation on the aforementioned dynamic image or composite image is collected, and the dynamic image or composite image is sent to the target client.
  • Further, the aforementioned dynamic image or composite image is sent to the target client together with the aforementioned voice message, or the corresponding content in the voice message is replaced with the dynamic image or composite image before sending.
  • Preferably, the image receiving operation of the target client is acquired, and when an editing operation on the image by the recipient is collected, the edited image data is fed back to the sender.
  • Further, image information is captured while the sender records the voice, an emotional image of the sender is acquired, and the emotional image or some of its elements are filled into the acquired streaming media data to make the dynamic image or composite image.
  • Further, the text content of the voice message is acquired, and the dynamic image or composite image is displayed moving across different positions of the text content.
  • Further, the streaming media data is image data, and all or part of the multiple image data is made into a dynamic image in one of the following ways:
  • according to the speech timeline of the semantic content, the image data corresponding to the multiple parts is made into an animated image along that timeline; or the multiple image data are made into a dynamic image based on a preset dynamic description file; or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
  • Further, the streaming media data is image data, and the multiple image data are formed into a composite image in one of the following ways: the multiple image data are merged into one image by overlapping layers; or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures; or the multiple image data are displayed in different areas of the same background picture.
  • Further, the streaming media data is image data, and the step of acquiring corresponding image data for multiple parts of the semantic content includes: obtaining multiple keywords from the semantic content; searching local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data; when a keyword corresponds to one piece of image data, taking that image data as the keyword's matching image; and when a keyword corresponds to multiple pieces of image data, taking the top-ranked image data as the keyword's matching image.
  • Preferably, when a keyword corresponds to multiple pieces of image data, the images are ranked in one of the following ways: the user's interaction records are acquired and the image data are ranked by frequency of use in those records; or the ranking rule recommended by the communication tool is adopted; or the update time of the image data is acquired and the image data are ranked by update time, newest first.
  • Preferably, audio analysis is performed on the voice message to obtain intonation, speech rate, and/or volume features, and keywords in the semantic content are adjusted based on those features;
  • or audio analysis is performed on the voice message to obtain the user's emotional state features, and keywords in the semantic content are adjusted based on the emotional state features.
  • The present invention also provides a voice message processing device, including the following structure:
  • an audio acquisition module for acquiring the voice message input by the user;
  • a speech recognition module for recognizing the voice message to obtain its semantic content;
  • an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
  • The present invention also provides an instant messaging client for instant messaging interaction, which includes the following structure:
  • a voice message trigger module for collecting the user's voice trigger operation;
  • a speech recognition module for recognizing the user's voice message to obtain its semantic content;
  • an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output;
  • a message sending module configured to send the output image to the target client according to the user's selection operation, or to send the output image combined with the voice message to the target client.
  • By adopting the above technical solutions, the present invention has, by way of example, the following advantages and positive effects compared with the prior art: during voice interaction, streaming media data is obtained and output for multiple parts of the semantic content of the speech, improving the convenience, intelligence, and fun of voice message interaction and enhancing the user experience.
  • FIG. 1 is a flowchart of a voice message processing method provided by an embodiment of the present invention.
  • FIG. 2 is a module structure diagram of an instant messaging client provided by an embodiment of the present invention.
  • FIG. 3 to FIG. 7 are diagrams illustrating operation examples of instant messaging interaction provided by an embodiment of the present invention.
  • FIG. 8 to FIG. 11 are exemplary diagrams of receiving a voice message that includes image data, according to an embodiment of the present invention.
  • Reference numerals: instant messaging client 100, voice message triggering module 110, voice recognition module 120, image output module 130, message sending module 140; user terminal 200, desktop 210, instant messaging tool icon 211, contact 220, microphone 230; communication interaction interface 300.
  • Referring to FIG. 1, a method for processing voice messages is disclosed, including the following steps:
  • S100: collect a voice message.
  • When the user needs to send a voice message, the audio capture device can be activated to record the voice.
  • Taking the instant messaging tool (IM tool) WeChat as an example, the message here is an instant messaging message.
  • After the user enters WeChat, he can trigger the voice recording button to start the audio collection device of his terminal; once the pickup is activated, the user's voice information can be collected.
  • The terminal may be, by way of example and not limitation, any commonly used mobile terminal such as a mobile phone, palmtop computer, or tablet computer, or any smart wearable electronic device such as smart glasses or a smart watch.
  • In this embodiment, a mobile phone is used as the mobile terminal; the phone has an audio collection structure, an image collection structure, and a display structure.
  • S200: the aforementioned voice message is recognized based on speech recognition technology, and the semantic content of the voice message is acquired.
  • Speech recognition technology is mainly based on the analysis of three basic properties of speech: physical, physiological, and social.
  • The physical properties of speech mainly comprise four elements: pitch, duration, intensity, and timbre.
  • Pitch refers to how high or low a sound is and is mainly determined by how fast the sounding body vibrates;
  • duration refers to how long a sound lasts and is mainly determined by how long the sounding body vibrates;
  • intensity refers to how strong a sound is and is mainly determined by the amplitude of the sounding body's vibration;
  • timbre refers to the character of a sound and is mainly determined by the different shapes of the sound waves formed by the vibration of the sounding body.
  • The physiological properties of speech mainly refer to the influence of the vocal organs, including the lungs and trachea, the larynx and vocal cords, and the oral, nasal, and pharyngeal cavities.
  • The social properties of speech are mainly reflected in three aspects: first, there is no necessary connection between sound and meaning, their correspondence being established by convention among members of society; second, every language or dialect has its own phonetic system; third, speech sounds serve to distinguish meaning.
  • the basic process of speech recognition may include three steps: preprocessing of speech signals, feature extraction, and pattern matching.
  • Preprocessing usually includes speech signal sampling, anti-aliasing bandpass filtering, removal of individual pronunciation differences and noise effects caused by equipment and environment, etc., and involves the selection of speech recognition primitives and endpoint detection.
  • Feature extraction is used to extract acoustic parameters that reflect essential features in speech, such as average energy, average zero-crossing rate, formants, etc.
  • The extracted feature parameters must meet the following requirements: they should effectively represent the speech and discriminate well between sounds; the parameters of each order should be largely independent of one another; and they should be easy to compute, ideally with efficient algorithms, so that recognition can run in real time.
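  • As a toy illustration of two of these parameters (hypothetical code, not from the patent), average energy and zero-crossing rate can be computed per frame as follows:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Per-frame average energy and zero-crossing rate for a mono signal.

    These are two of the classic acoustic parameters named above; formants
    would additionally require LPC or cepstral analysis.
    """
    x = np.asarray(signal, dtype=float)
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2)                        # average energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
        feats.append((energy, zcr))
    return np.array(feats)
```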
  • In the training phase, the feature parameters are processed and a model is established for each vocabulary entry; the models are saved as a template library.
  • In the recognition phase, the speech signal passes through the same channel to obtain its feature parameters, a test template is generated and matched against the reference templates, and the reference template with the highest matching score is taken as the recognition result.
  • Recognition accuracy can also be improved with the help of extensive prior knowledge.
  • Pattern matching is the core of the entire speech recognition system. According to certain rules (such as a distance measure) and expert knowledge (such as word formation rules, grammar rules, and semantic rules), it computes the similarity between the input features and the stored patterns (such as a matching distance or likelihood probability) to determine the semantic information of the input speech.
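  • By way of illustration only, dynamic time warping is one classic distance measure for this kind of template matching; the sketch below (an assumption, since the patent does not prescribe an algorithm) picks the template with the smallest warped distance:

```python
import numpy as np

def dtw_distance(test, template):
    """Dynamic-time-warping distance between two feature sequences
    (arrays of shape frames x dims); smaller means a better match."""
    n, m = len(test), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(test, template_library):
    """Return the vocabulary entry whose reference template matches best."""
    return min(template_library, key=lambda w: dtw_distance(test, template_library[w]))
```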
  • The semantic content of the voice message may be obtained by performing semantic analysis and/or situational analysis on the output of speech recognition.
  • S300: acquire corresponding streaming media data for multiple parts of the semantic content, and make all or part of the multiple pieces of streaming media data into a dynamic image for output; or form the multiple pieces of streaming media data into a composite image for output.
  • The streaming media data may include, by way of example and not limitation, still images, dynamic images, audio information, or other multimedia information; in this embodiment, image data is preferred.
  • As an example and not a limitation, if both "Yangcheng Lake" and "hairy crab" in the semantic content have matching images, the matching images can be made into a dynamic image ("hairy crabs crawling on the surface of Yangcheng Lake") or into a composite image ("several hairy crabs in Yangcheng Lake").
  • In this embodiment, the step of acquiring corresponding image data for multiple parts of the semantic content may specifically include the following steps:
  • obtain multiple keywords from the semantic content; search local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data; when a keyword corresponds to one piece of image data, take that image data as the keyword's matching image; when a keyword corresponds to multiple pieces of image data, take the top-ranked image data as the keyword's matching image.
  • Preferably, when a keyword corresponds to multiple pieces of image data, the images are ranked in one of the following ways, as sketched in the example below:
  • the user's interaction records are acquired and the image data are ranked by frequency of use in those records; or the ranking rule recommended by the communication tool is adopted; or the update time of the image data is acquired and the image data are ranked by update time, newest first.
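  • A minimal sketch of this matching-and-ranking step, assuming hypothetical `search_local`, `search_network`, and `sort_key` callables (the patent does not define these interfaces):

```python
def match_images(keywords, search_local, search_network, sort_key):
    """For each keyword, collect candidate images from local and/or network
    resources and keep the top-ranked candidate as its matching image."""
    matches = {}
    for kw in keywords:
        candidates = list(search_local(kw)) + list(search_network(kw))
        if not candidates:
            continue  # no matching image for this keyword
        if len(candidates) == 1:
            matches[kw] = candidates[0]
        else:
            # Rank by the chosen rule and take the first-ranked image.
            matches[kw] = sorted(candidates, key=sort_key, reverse=True)[0]
    return matches

# Example ranking rule: newest update time first (illustrative field name).
newest_first = lambda image: image["updated_at"]
```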
  • The keywords may be, for example, words expressing emotions, moods, preferences, intentions, plans, and the like.
  • The keywords in the voice message may be extracted in the following ways:
  • The first way is to perform semantic analysis on the text of the voice message and obtain keyword features based on that analysis.
  • The second way, preferably, is to perform audio analysis on the voice message to obtain intonation, speech rate, and/or volume features, and to adjust the keywords in the semantic content based on those features.
  • The third way is to perform audio analysis on the voice message to obtain the user's emotional state features and adjust the keywords in the semantic content based on them.
  • Voice can reflect a person's emotions to some extent: generally speaking, agitated, loud speech often means the speaker is angry, while cheerful, soft speech often means the speaker is happy. Accordingly, the important content the user wants to express can be obtained by analyzing the emotional information in the user's voice.
  • Preferably, the emotional information in the voice is identified in one or more of the following ways:
  • The first way is to analyze changes in the user's volume and infer the emotional state features from those changes.
  • The second way is to analyze pitch changes in the speech and infer the emotional state features from them.
  • The third way is to analyze the speech rate and infer the emotional state features from it.
  • The fourth way is to analyze rhythm changes in the speech and infer the emotional state features from them.
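  • The sketch below illustrates the volume-based cue only, with assumed thresholds and field names (none of which come from the patent); pitch, rate, and rhythm analysis would follow the same pattern:

```python
import numpy as np

def volume_cues(signal, frame_len=400, hop=160):
    """Rough prosodic cues: frame energies stand in for volume, and the
    spread of the energies stands in for how much the volume varies."""
    x = np.asarray(signal, dtype=float)
    energies = np.array([np.mean(x[i:i + frame_len] ** 2)
                         for i in range(0, len(x) - frame_len + 1, hop)])
    return {"mean_volume": float(energies.mean()),
            "volume_variation": float(energies.std())}

def emotion_hint(cues, loud=0.1, jumpy=0.05):
    """Map the cues to a coarse label; thresholds are illustrative."""
    if cues["mean_volume"] > loud and cues["volume_variation"] > jumpy:
        return "agitated"
    return "calm"
```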
  • By way of example and not limitation, suppose the collected voice message is "This product is much cheaper than the one I bought before, I'm really happy."
  • After recognizing this message, the obtained keyword feature can be "really happy".
  • Alternatively, if the user does not express an emotion explicitly but the voice message carries an emotional tendency, the implied emotion may be taken as the keyword feature based on situational analysis, as in the following example.
  • Suppose the collected voice message is "This bun is much smaller than before"; the emotional tendency contained in this text is "dissatisfied and unhappy", so "dissatisfied and unhappy" is used as the keyword.
  • In this embodiment, all or part of the multiple image data can be made into a dynamic image in the following ways: according to the speech timeline of the semantic content, the image data corresponding to the multiple parts are made into an animated image along that timeline; or the multiple image data are made into a dynamic image based on a preset dynamic description file; or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
  • The multiple image data may be formed into a composite image in the following ways: the multiple image data are merged into one image by overlapping layers; or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures; or the multiple image data are displayed in different areas of the same background picture.
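  • As a minimal sketch of these two outputs, assuming the matched images are ordinary files on disk, the Pillow library can sequence them into an animated GIF (the timeline variant) or paste them onto a base picture (the layer-overlap variant):

```python
from PIL import Image

def make_dynamic_image(image_paths, out_path="message.gif", ms_per_frame=500):
    """Sequence the matched images along a timeline into an animated GIF."""
    frames = [Image.open(p).convert("RGB") for p in image_paths]
    frames[0].save(out_path, save_all=True, append_images=frames[1:],
                   duration=ms_per_frame, loop=0)

def make_composite_image(base_path, overlay_paths, out_path="message.png"):
    """Paste the other images onto one base picture as smaller overlays."""
    base = Image.open(base_path).convert("RGBA")
    for i, p in enumerate(overlay_paths):
        overlay = Image.open(p).convert("RGBA")
        overlay.thumbnail((base.width // 3, base.height // 3))  # shrink in place
        base.paste(overlay, (10 + i * (overlay.width + 10), 10), overlay)
    base.save(out_path)
```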
  • After step S300, a further step may be performed: collecting the user's selection operation on the aforementioned dynamic image or composite image and sending the dynamic image or composite image to the target client.
  • By way of example and not limitation, if the user clicks one of the images with the mouse and selects a contact, the image data can be sent to that contact's client.
  • Preferably, the image receiving operation of the target client can further be acquired, and when an editing operation on the image by the recipient is collected, the edited image data is fed back to the sender.
  • This makes it convenient for the sender to obtain richer image information; in particular, users who enjoy sticker battles can collect more interesting image resources.
  • In another implementation of this embodiment, image information can also be captured while the sender records the voice, an emotional image of the sender can be obtained, and the emotional image or some of its elements can be filled into the acquired streaming media data to make the dynamic image or composite image.
  • In this way, a virtual image containing the user's own emotion or expression is generated while protecting the user's privacy.
  • By way of example and not limitation, the user's facial expression, mouth shape, head contour, and the like may be extracted as elements to create the dynamic image or composite image.
  • The resulting image data, blending the virtual and the real, increases the fun and expressiveness of the message.
  • In another implementation of this embodiment, the text content of the voice message may also be acquired, and the dynamic image or composite image may be displayed moving across different positions of the text content.
  • In this way, rich and colorful image information can be displayed along with the text content of the speech, making the message interesting and vivid; a minimal sketch follows.
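  • Purely illustratively, the moving display can be sketched as cycling the image through the on-screen positions of the matched keywords; `draw_image` stands in for whatever rendering callback the client actually exposes:

```python
import itertools
import time

def moving_display(keyword_positions, draw_image, interval=0.8, cycles=3):
    """Move the generated image across the keyword positions in the text.

    `keyword_positions` is a list of (x, y) coordinates; `draw_image` is a
    placeholder for the client's rendering call.
    """
    total = cycles * len(keyword_positions)
    for pos in itertools.islice(itertools.cycle(keyword_positions), total):
        draw_image(pos)       # redraw the image anchored at this keyword
        time.sleep(interval)  # dwell before hopping to the next position
```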
  • Referring to FIG. 2, the present invention also provides an instant messaging client 100 for instant messaging interaction, which includes the following structure:
  • a voice message triggering module 110 for collecting the user's voice triggering operation;
  • a speech recognition module 120 for recognizing the user's voice message to obtain its semantic content;
  • an image output module 130 for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output;
  • a message sending module 140 configured to send the aforementioned output image to the target client according to the user's selection operation, or to send the aforementioned output image combined with the voice message to the target client.
  • When the user enters the instant messaging tool and needs to send a voice message, the audio collection device is activated to record the voice. Specifically, the voice recording button can be triggered to start the audio collection device of the terminal; once the pickup is activated, the user's voice information can be collected.
  • The terminal may be, by way of example and not limitation, any commonly used mobile terminal such as a mobile phone, palmtop computer, or tablet computer, or any smart wearable electronic device such as smart glasses or a smart watch.
  • In this embodiment, a mobile phone is used as the mobile terminal; the phone has an audio collection structure, an image collection structure, and a display structure.
  • Then, the aforementioned voice message is recognized based on speech recognition technology, and the semantic content of the voice message is acquired.
  • The semantic content of the voice message may be obtained by performing semantic analysis and/or situational analysis on the output of speech recognition.
  • The streaming media data may include, by way of example and not limitation, still images, dynamic images, audio information, or other multimedia information; in this embodiment, image data is preferred.
  • In this embodiment, the step of acquiring corresponding image data for multiple parts of the semantic content may specifically include the following steps:
  • obtain multiple keywords from the semantic content; search local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data; when a keyword corresponds to one piece of image data, take that image data as the keyword's matching image; when a keyword corresponds to multiple pieces of image data, take the top-ranked image data as the keyword's matching image.
  • The keywords may be, for example, words expressing emotions, moods, preferences, intentions, plans, and the like.
  • The keywords in the voice message may be extracted in the following ways:
  • The first way is to perform semantic analysis on the text of the voice message and obtain keyword features based on that analysis.
  • The second way, preferably, is to perform audio analysis on the voice message to obtain intonation, speech rate, and/or volume features, and to adjust the keywords in the semantic content based on those features.
  • The third way is to perform audio analysis on the voice message to obtain the user's emotional state features and adjust the keywords in the semantic content based on them.
  • Voice can reflect a person's emotions to some extent: generally speaking, agitated, loud speech often means the speaker is angry, while cheerful, soft speech often means the speaker is happy. Accordingly, the important content the user wants to express can be obtained by analyzing the emotional information in the user's voice.
  • In this embodiment, all or part of the multiple image data can be made into a dynamic image in the following ways: according to the speech timeline of the semantic content, the image data corresponding to the multiple parts are made into an animated image along that timeline; or the multiple image data are made into a dynamic image based on a preset dynamic description file; or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
  • The multiple image data may be formed into a composite image in the following ways: the multiple image data are merged into one image by overlapping layers; or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures; or the multiple image data are displayed in different areas of the same background picture.
  • Preferably, a message synthesis unit may also be included, which recognizes the text content of the speech and integrates the text content and the audio file of the speech into one multimedia message.
  • The text content is displayed in the message box of the multimedia message; an audio file play button may be set on the message box, and triggering the play button plays the audio file.
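  • A minimal sketch of such a multimedia message, assuming a JSON payload with illustrative field names (the patent does not define a wire format):

```python
import base64
import json

def build_multimedia_message(text, audio_bytes, image_bytes=None):
    """Bundle the recognized text, the recorded audio file, and optional
    image data into one message payload."""
    msg = {
        "type": "multimedia",
        "text": text,  # shown in the message box
        "audio": base64.b64encode(audio_bytes).decode("ascii"),  # behind the play button
    }
    if image_bytes is not None:
        msg["image"] = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps(msg)
```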
  • Referring to FIG. 3, the user enters the instant messaging tool "Quick Message" through the user terminal 200 that he carries.
  • In this embodiment, the user terminal 200 is preferably a mobile phone.
  • Referring to FIG. 4, the desktop 210 of the user terminal 200 presents a user interface that displays all communication messages, showing the contacts 220, the latest interactive messages, and a virtual microphone 230 (a voice trigger control).
  • By way of example, when chatting with the contact leo, the user can trigger the virtual microphone 230 corresponding to leo to directly start the voice message collection function.
  • Referring to FIG. 5, a voice message input box is displayed in the user interface; the input box shows the voice being recorded, the text content corresponding to the voice, and related operation keys.
  • The text content also marks the parts for which streaming media data is available, such as "Forest Park", "UAV", and "Xiaoduoduo".
  • Referring to FIG. 6, after the corresponding image data has been acquired for multiple parts of the semantic content, all of the image data is made into a dynamic image and output to the user.
  • The voice message input box can be displayed directly on the current user interface as shown in FIG. 5, or displayed after a separate voice message interface is generated for the contact leo.
  • Referring to FIG. 7, the voice message interface displays the contact information, the voice message input box, the virtual microphone, and the correspondingly generated image information.
  • While recording, the user can operate the virtual microphone 230 to send or pause: as a preferred example, pressing the microphone and sliding up is the send operation, and pressing the microphone and sliding right is the pause operation.
  • In this embodiment, the generated dynamic image or composite image may be sent together with the aforementioned voice message in the following ways:
  • Referring to FIG. 8, the voice message and the generated image are sent together as two separate messages.
  • Alternatively, referring to FIG. 9 and FIG. 10, the image is inserted into the voice message as streaming media to form one multimedia message, which is sent to the contact.
  • A play button can also be set in the message box for the image data; after triggering it, the user can play the image data in a floating window or within the current area.
  • Alternatively, a floating window is set for the voice message, and the image data is displayed directly through the floating window when the message is output.
  • Alternatively, referring to FIG. 11, the text content of the voice message may be acquired, and the corresponding matching images may be displayed moving across different positions of the text content.
  • In this way, rich and colorful image information can be displayed along with the text content of the speech, making the message interesting and vivid.
  • FIG. 8 to FIG. 11 also show the audio file of the voice message output within the multimedia message: the text content is displayed in the message box, an audio file play button may be set on the message box, and triggering the play button plays the audio file.
  • The instant messaging client may also be provided with other functional modules as required; for their specific functions, refer to the previous embodiments, which are not repeated here.
  • Another embodiment of the present invention further provides a voice message processing device.
  • The voice message processing device includes the following structure:
  • an audio acquisition module for acquiring the voice message input by the user;
  • a speech recognition module for recognizing the voice message to obtain its semantic content;
  • an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
  • The voice message processing device may also be provided with other functional modules as required; for details, refer to the foregoing embodiments, which are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice message processing method, device, and instant messaging client, relating to the technical field of communication interaction. The voice message processing method includes the following steps: collecting a voice message (S100); recognizing the voice message to obtain its semantic content (S200); acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output (S300). With this method, streaming media data can be obtained for multiple parts of the semantic content of speech during voice interaction, improving the convenience, intelligence, and fun of voice message interaction and enhancing the user experience.

Description

Voice message processing method, device, and instant messaging client
Technical Field
The present invention relates to the technical field of communication interaction.
Background Art
Instant messaging (IM) is the most popular communication method of the mobile-internet era. Various instant messaging applications support not only the instant transmission of text messages but also the exchange of voice and video messages between users.
When interacting via voice messages through an IM tool, the user can activate the terminal's microphone or other voice collection device to record a voice message and then transmit it over the Internet to the target recipient. After entering a play instruction, the recipient can play the voice message and may also reply by voice.
At present, to let users choose whether to listen to a voice message depending on the occasion, a text conversion function for voice messages has been added, and the converted text content can be sent to the recipient together with the recorded audio file as an instant message. Some communication tools also provide a speech synthesis function that converts text into speech, known as Text To Speech (TTS). TTS solutions fall into two main classes: concatenative systems and parametric systems, both of which require text analysis. The former splices a large inventory of recorded speech fragments, guided by the text analysis results, into synthetic speech; the latter uses the text analysis results to generate speech parameters, such as the fundamental frequency, through a model, and then converts them into a waveform.
The existing voice message function only incorporates text conversion and does not consider deeper information such as the user's facial expression, emotional state, or tone and intonation at recording time. This makes it difficult to meet user needs; especially for young people who enjoy sticker battles (doutu) with dynamic images, voice messages lack fun.
How to combine the above prior art to provide users with a smarter and more convenient way of communicating is a problem to be solved urgently.
Technical Problem
The purpose of the present invention is to overcome the deficiencies of the prior art and provide a voice message processing method, device, and instant messaging client. With the present invention, streaming media data can be obtained for multiple parts of the semantic content of speech during voice interaction, improving the convenience, intelligence, and fun of voice message interaction and enhancing the user experience.
Technical Solution
To achieve the above objective, the present invention provides the following technical solutions:
A voice message processing method, including the following steps: collecting a voice message; recognizing the voice message to obtain its semantic content; acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
Further, the user's selection operation on the aforementioned dynamic image or composite image is collected, and the dynamic image or composite image is sent to the target client.
Further, the aforementioned dynamic image or composite image is sent to the target client together with the aforementioned voice message, or the corresponding content in the voice message is replaced with the dynamic image or composite image before sending.
Preferably, the image receiving operation of the target client is acquired, and when an editing operation on the image by the recipient is collected, the edited image data is fed back to the sender.
Further, image information is captured while the sender records the voice, an emotional image of the sender is acquired, and the emotional image or some of its elements are filled into the acquired streaming media data to make the dynamic image or composite image.
Further, the text content of the voice message is acquired, and the dynamic image or composite image is displayed moving across different positions of the text content.
Further, the streaming media data is image data, and all or part of the multiple image data is made into a dynamic image in one of the following ways:
according to the speech timeline of the semantic content, the image data corresponding to the multiple parts is made into an animated image along that timeline;
or the multiple image data are made into a dynamic image based on a preset dynamic description file;
or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
Further, the streaming media data is image data, and the multiple image data are formed into a composite image in one of the following ways:
the multiple image data are merged into one image by overlapping layers;
or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures;
or the multiple image data are displayed in different areas of the same background picture.
Further, the streaming media data is image data, and the step of acquiring corresponding image data for multiple parts of the semantic content includes:
obtaining multiple keywords from the semantic content;
searching local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data;
when a keyword corresponds to one piece of image data, taking that image data as the keyword's matching image; when a keyword corresponds to multiple pieces of image data, taking the top-ranked image data as the keyword's matching image.
Preferably, when a keyword corresponds to multiple pieces of image data, the images are ranked in one of the following ways:
the user's interaction records are acquired, and the image data are ranked by their frequency of use in those records;
or the ranking rule recommended by the communication tool is adopted;
or the update time of the image data is acquired, and the image data are ranked by update time, newest first.
Preferably, audio analysis is performed on the voice message to obtain intonation, speech rate, and/or volume features, and keywords in the semantic content are adjusted based on those features;
or audio analysis is performed on the voice message to obtain the user's emotional state features, and keywords in the semantic content are adjusted based on the emotional state features.
The present invention also provides a voice message processing device, including the following structure:
an audio acquisition module for acquiring the voice message input by the user;
a speech recognition module for recognizing the voice message to obtain its semantic content;
an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
The present invention also provides an instant messaging client for instant messaging interaction, which includes the following structure:
a voice message trigger module for collecting the user's voice trigger operation;
a speech recognition module for recognizing the user's voice message to obtain its semantic content;
an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output;
a message sending module for sending the aforementioned output image to the target client according to the user's selection operation, or sending the aforementioned output image combined with the voice message to the target client.
Beneficial Effects
By adopting the above technical solutions, the present invention has, by way of example, the following advantages and positive effects compared with the prior art: during voice interaction, streaming media data is obtained and output for multiple parts of the semantic content of the speech, improving the convenience, intelligence, and fun of voice message interaction and enhancing the user experience.
Description of the Drawings
FIG. 1 is a flowchart of a voice message processing method provided by an embodiment of the present invention.
FIG. 2 is a module structure diagram of an instant messaging client provided by an embodiment of the present invention.
FIG. 3 to FIG. 7 are diagrams illustrating operation examples of instant messaging interaction provided by an embodiment of the present invention.
FIG. 8 to FIG. 11 are exemplary diagrams of receiving a voice message that includes image data, provided by an embodiment of the present invention.
Description of reference numerals:
instant messaging client 100, voice message triggering module 110, voice recognition module 120, image output module 130, message sending module 140; user terminal 200, desktop 210, instant messaging tool icon 211, contact 220, microphone 230; communication interaction interface 300.
Embodiments of the Present Invention
The voice message processing method, device, and instant messaging client provided by the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the technical features, or combinations of technical features, described in the following embodiments should not be considered in isolation; they may be combined with one another to achieve better technical effects. In the drawings of the embodiments below, identical reference numerals appearing in different figures denote identical features or components and may be applied in different embodiments; therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.
It should be noted that the structures, proportions, and sizes depicted in the accompanying drawings are intended only to accompany the content disclosed in the specification, for the understanding and reading of those familiar with this technology, and are not intended to limit the conditions under which the invention can be implemented. Any structural modification, change of proportion, or adjustment of size that does not affect the effects and objectives achievable by the invention shall fall within the scope covered by the technical content disclosed by the invention. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order described or discussed, including in a substantially concurrent manner or in reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the invention belong.
Techniques, methods, and devices known to those of ordinary skill in the relevant fields may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the granted specification. In all examples shown and discussed here, any specific value should be interpreted as merely exemplary rather than limiting; other examples of the exemplary embodiments may therefore have different values.
Embodiments
Referring to FIG. 1, a voice message processing method is disclosed, including the following steps:
S100: collect a voice message.
When the user needs to send a voice message, an audio capture device can be activated to record the voice. Taking the instant messaging tool (IM tool) WeChat as an example, the message here is an instant messaging message. After entering WeChat, the user can trigger the voice recording button to start the audio collection device of his terminal; once the pickup is activated, the user's voice information can be collected.
The terminal may be, by way of example and not limitation, any commonly used mobile terminal such as a mobile phone, palmtop computer, or tablet computer, or any smart wearable electronic device such as smart glasses or a smart watch. In this embodiment, a mobile phone is used as the mobile terminal; the phone has an audio collection structure, an image collection structure, and a display structure.
S200: recognize the voice message to obtain its semantic content.
The aforementioned voice message is recognized based on speech recognition technology, and the semantic content of the voice message is acquired.
Speech recognition technology is mainly based on the analysis of three basic properties of speech: physical, physiological, and social. The physical properties of speech mainly comprise four elements: pitch, duration, intensity, and timbre. Pitch refers to how high or low a sound is, determined mainly by how fast the sounding body vibrates; duration refers to how long a sound lasts, determined mainly by how long the sounding body vibrates; intensity refers to how strong a sound is, determined mainly by the amplitude of the sounding body's vibration; timbre refers to the character of a sound, determined mainly by the different shapes of the sound waves formed by the vibration of the sounding body. The physiological properties of speech mainly refer to the influence of the vocal organs, including the lungs and trachea, the larynx and vocal cords, and the oral, nasal, and pharyngeal cavities. The social properties of speech are mainly reflected in three aspects: first, there is no necessary connection between sound and meaning, their correspondence being established by convention among members of society; second, every language or dialect has its own phonetic system; third, speech sounds serve to distinguish meaning.
Generally speaking, the basic process of speech recognition may include three steps: preprocessing of the speech signal, feature extraction, and pattern matching.
Preprocessing usually includes sampling the speech signal, anti-aliasing bandpass filtering, and removing individual pronunciation differences and noise introduced by the equipment and environment; it also involves the selection of speech recognition primitives and endpoint detection.
Feature extraction extracts acoustic parameters that reflect the essential features of the speech, such as average energy, average zero-crossing rate, and formants. The extracted feature parameters must meet the following requirements: they should effectively represent the speech and discriminate well between sounds; the parameters of each order should be largely independent; and they should be easy to compute, ideally with efficient algorithms, so that recognition can run in real time. In the training phase, the feature parameters are processed and a model is established for each vocabulary entry; the models are saved as a template library. In the recognition phase, the speech signal passes through the same channel to obtain its feature parameters, a test template is generated and matched against the reference templates, and the reference template with the highest matching score is taken as the recognition result. Recognition accuracy can also be improved with the help of extensive prior knowledge.
Pattern matching is the core of the entire speech recognition system. According to certain rules (such as a distance measure) and expert knowledge (such as word formation rules, grammar rules, and semantic rules), it computes the similarity between the input features and the stored patterns (such as a matching distance or likelihood probability) to determine the semantic information of the input speech.
The semantic content of the voice message may be obtained by performing semantic analysis and/or situational analysis on the output of speech recognition.
S300: acquire corresponding streaming media data for multiple parts of the semantic content, and make all or part of the multiple pieces of streaming media data into a dynamic image for output; or form the multiple pieces of streaming media data into a composite image for output.
The streaming media data may include, by way of example and not limitation, still images, dynamic images, audio information, or other multimedia information; in this embodiment, image data is preferred. As an example and not a limitation, if both "Yangcheng Lake" and "hairy crab" in the semantic content have matching images, the matching images can be made into a dynamic image ("hairy crabs crawling on the surface of Yangcheng Lake") or into a composite image ("several hairy crabs in Yangcheng Lake").
In this embodiment, the step of acquiring corresponding image data for multiple parts of the semantic content may specifically include the following steps:
obtain multiple keywords from the semantic content;
search local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data;
when a keyword corresponds to one piece of image data, take that image data as the keyword's matching image; when a keyword corresponds to multiple pieces of image data, take the top-ranked image data as the keyword's matching image.
Preferably, when a keyword corresponds to multiple pieces of image data, the images are ranked in one of the following ways:
the user's interaction records are acquired, and the image data are ranked by their frequency of use in those records;
or the ranking rule recommended by the communication tool is adopted;
or the update time of the image data is acquired, and the image data are ranked by update time, newest first.
The keywords may be, by way of example and not limitation, words expressing emotions, moods, preferences, intentions, plans, and the like. The keywords in the voice message may be extracted in the following ways:
The first way is to perform semantic analysis on the text of the voice message and obtain keyword features based on that analysis.
The second way, preferably, is to perform audio analysis on the voice message to obtain intonation, speech rate, and/or volume features, and to adjust the keywords in the semantic content based on those features.
Intonation, speech rate, and volume vary as people speak; for example, when stating key information, a user usually raises the volume, stresses the intonation, and slows down. From these changes, the key content the user expresses can be identified and used as keywords.
The third way is to perform audio analysis on the voice message to obtain the user's emotional state features and adjust the keywords in the semantic content based on them.
Voice can reflect a person's emotions to some extent: generally speaking, agitated, loud speech often means the speaker is angry, while cheerful, soft speech often means the speaker is happy. Accordingly, the important content the user wants to express can be obtained by analyzing the emotional information in the user's voice.
Preferably, the emotional information in the voice is identified in one or more of the following ways:
The first way is to analyze changes in the user's volume and infer the emotional state features from those changes.
The second way is to analyze pitch changes in the speech and infer the emotional state features from them.
The third way is to analyze the speech rate and infer the emotional state features from it.
The fourth way is to analyze rhythm changes in the speech and infer the emotional state features from them.
By way of example and not limitation, if the collected voice message is "This product is much cheaper than the one I bought before, I'm really happy", then after recognition the obtained keyword feature can be "really happy".
Alternatively, if the user does not express an emotion explicitly but the voice message carries an emotional tendency, the implied emotion may be taken as the keyword feature based on situational analysis. By way of example and not limitation, if the collected voice message is "This bun is much smaller than before", the emotional tendency contained in the text is "dissatisfied and unhappy", so "dissatisfied and unhappy" is used as the keyword.
In this embodiment, all or part of the multiple image data can be made into a dynamic image in the following ways: according to the speech timeline of the semantic content, the image data corresponding to the multiple parts are made into an animated image along that timeline; or the multiple image data are made into a dynamic image based on a preset dynamic description file; or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
The multiple image data may be formed into a composite image in the following ways: the multiple image data are merged into one image by overlapping layers; or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures; or the multiple image data are displayed in different areas of the same background picture.
After step S300, a further step may be performed: collecting the user's selection operation on the aforementioned dynamic image or composite image and sending the dynamic image or composite image to the target client. By way of example and not limitation, if the user clicks one of the images with the mouse and selects a contact, the image data can be sent to that contact's client.
Alternatively: the aforementioned dynamic image or composite image is sent to the target client together with the voice message, or the corresponding content in the voice message is replaced with the dynamic image or composite image before sending.
Preferably, the image receiving operation of the target client can further be acquired, and when an editing operation on the image by the recipient is collected, the edited image data is fed back to the sender. This makes it convenient for the sender to obtain richer image information; in particular, users who enjoy sticker battles can collect more interesting image resources.
In another implementation of this embodiment, image information can also be captured while the sender records the voice, an emotional image of the sender can be obtained, and the emotional image or some of its elements can be filled into the acquired streaming media data to make the dynamic image or composite image. In this way, a virtual image containing the user's own emotion or expression is generated while protecting the user's privacy.
By way of example and not limitation, the user's facial expression, mouth shape, head contour, and the like may be extracted as elements to create the dynamic image or composite image, forming image data that blends the virtual and the real and increasing the fun and expressiveness of the message.
In another implementation of this embodiment, the text content of the voice message may also be acquired, and the dynamic image or composite image may be displayed moving across different positions of the text content. In this way, rich and colorful image information can be displayed along with the text content of the speech, making the message interesting and vivid.
Referring to FIG. 2, the present invention also provides an instant messaging client 100 for instant messaging interaction, which includes the following structure:
a voice message triggering module 110 for collecting the user's voice triggering operation;
a speech recognition module 120 for recognizing the user's voice message to obtain its semantic content;
an image output module 130 for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output;
a message sending module 140 for sending the aforementioned output image to the target client according to the user's selection operation, or sending the aforementioned output image combined with the voice message to the target client.
When the user enters the instant messaging tool and needs to send a voice message, the audio collection device is activated to record the voice. Specifically, the voice recording button can be triggered to start the audio collection device of the terminal; once the pickup is activated, the user's voice information can be collected. The terminal may be, by way of example and not limitation, any commonly used mobile terminal such as a mobile phone, palmtop computer, or tablet computer, or any smart wearable electronic device such as smart glasses or a smart watch. In this embodiment, a mobile phone is used as the mobile terminal; the phone has an audio collection structure, an image collection structure, and a display structure.
Then, the aforementioned voice message is recognized based on speech recognition technology, and the semantic content of the voice message is acquired.
The semantic content of the voice message may be obtained by performing semantic analysis and/or situational analysis on the output of speech recognition.
The streaming media data may include, by way of example and not limitation, still images, dynamic images, audio information, or other multimedia information; in this embodiment, image data is preferred.
In this embodiment, the step of acquiring corresponding image data for multiple parts of the semantic content may specifically include the following steps:
obtain multiple keywords from the semantic content;
search local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data;
when a keyword corresponds to one piece of image data, take that image data as the keyword's matching image; when a keyword corresponds to multiple pieces of image data, take the top-ranked image data as the keyword's matching image.
The keywords may be, by way of example and not limitation, words expressing emotions, moods, preferences, intentions, plans, and the like. The keywords in the voice message may be extracted in the following ways:
The first way is to perform semantic analysis on the text of the voice message and obtain keyword features based on that analysis.
The second way, preferably, is to perform audio analysis on the voice message to obtain intonation, speech rate, and/or volume features, and to adjust the keywords in the semantic content based on those features.
Intonation, speech rate, and volume vary as people speak; for example, when stating key information, a user usually raises the volume, stresses the intonation, and slows down. From these changes, the key content the user expresses can be identified and used as keywords.
The third way is to perform audio analysis on the voice message to obtain the user's emotional state features and adjust the keywords in the semantic content based on them.
Voice can reflect a person's emotions to some extent: generally speaking, agitated, loud speech often means the speaker is angry, while cheerful, soft speech often means the speaker is happy. Accordingly, the important content the user wants to express can be obtained by analyzing the emotional information in the user's voice.
In this embodiment, all or part of the multiple image data can be made into a dynamic image in the following ways: according to the speech timeline of the semantic content, the image data corresponding to the multiple parts are made into an animated image along that timeline; or the multiple image data are made into a dynamic image based on a preset dynamic description file; or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
The multiple image data may be formed into a composite image in the following ways: the multiple image data are merged into one image by overlapping layers; or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures; or the multiple image data are displayed in different areas of the same background picture.
Preferably, a message synthesis unit may also be included, which recognizes the text content of the speech and integrates the text content and the audio file of the speech into one multimedia message. The text content is displayed in the message box of the multimedia message; an audio file play button may be set on the message box, and triggering the play button plays the audio file.
The implementation of this embodiment is described in detail below with reference to FIG. 3 to FIG. 7.
Referring to FIG. 3, the user enters the instant messaging tool "快信" (Quick Message) through the user terminal 200 that he carries. In this embodiment, the user terminal 200 is preferably a mobile phone.
Referring to FIG. 4, the desktop 210 of the user terminal 200 presents a user interface that displays all communication messages, showing the contacts 220, the latest interactive messages, and a virtual microphone 230 (a voice trigger control).
By way of example, referring to FIG. 4, when chatting with the contact leo, the user can trigger the virtual microphone 230 corresponding to leo to directly start the voice message collection function.
Referring to FIG. 5, a voice message input box is displayed in the user interface; the input box shows the voice being recorded, the text content corresponding to the voice, and related operation keys. Parts of the text for which streaming media data is available are marked, such as "森林公园" (Forest Park), "无人机" (UAV), and "小多多" (Xiaoduoduo).
Referring to FIG. 6, after the corresponding image data has been acquired for multiple parts of the semantic content, all of the image data is made into a dynamic image and output to the user.
The voice message input box can be displayed directly on the current user interface as shown in FIG. 5, or displayed after a separate voice message interface is generated for the contact leo; referring to FIG. 7, the voice message interface displays the contact information, the voice message input box, the virtual microphone, and the correspondingly generated image information.
While recording, the user can operate the virtual microphone 230 to send or pause. As a preferred example, pressing the microphone and sliding up is the send operation, and pressing the microphone and sliding right is the pause operation.
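A minimal, hypothetical sketch of this gesture mapping (the pixel threshold is an assumption, not from the patent); screen coordinates grow downward, so an upward slide has a negative vertical delta:

```python
def classify_mic_gesture(dx, dy, threshold=40):
    """Map a drag on the virtual microphone to an action, following the
    preferred example above: slide up = send, slide right = pause."""
    if dy < -threshold and abs(dy) > abs(dx):
        return "send"
    if dx > threshold and abs(dx) > abs(dy):
        return "pause"
    return "keep_recording"
```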
In this embodiment, the generated dynamic image or composite image may be sent together with the aforementioned voice message in the following ways:
Referring to FIG. 8, the voice message and the generated image are sent together as two separate messages.
Alternatively, referring to FIG. 9 and FIG. 10, the image is inserted into the voice message as streaming media to form one multimedia message, which is sent to the contact. A play button can also be set in the message box for the image data; after triggering it, the user can play the image data in a floating window or within the current area.
Alternatively, a floating window is set for the voice message, and the image data is displayed directly through the floating window when the message is output.
Alternatively, referring to FIG. 11, the text content of the voice message may be acquired, and the corresponding matching images may be displayed moving across different positions of the text content. In this way, rich and colorful image information can be displayed along with the text content of the speech, making the message interesting and vivid.
FIG. 8 to FIG. 11 also show the audio file of the voice message output within the multimedia message.
The text content is displayed in the message box of the multimedia message; an audio file play button may also be set on the message box, and triggering the play button plays the audio file.
The instant messaging client may also be provided with other functional modules as required; for their specific functions, refer to the foregoing embodiments, which are not repeated here.
Another embodiment of the present invention further provides a voice message processing device.
The voice message processing device includes the following structure:
an audio acquisition module for acquiring the voice message input by the user;
a speech recognition module for recognizing the voice message to obtain its semantic content;
an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
The voice message processing device may also be provided with other functional modules as required; for details, refer to the foregoing embodiments, which are not repeated here.
In the above description, although all components of the aspects of the present disclosure may be construed as assembled or operatively connected as one circuit, the present disclosure is not intended to limit itself to these aspects. Rather, within the target scope of protection of the present disclosure, the components may be selectively and operatively combined in any number. Each of these components may itself be implemented in hardware, while the components may be partly or selectively combined and implemented as a computer program having program modules that perform the functions of the hardware equivalents. The code or code segments for building such a program can easily be derived by those skilled in the art. Such a computer program can be stored in a computer-readable medium, which may be executed to implement aspects of the present disclosure; computer-readable media include magnetic recording media, optical recording media, and carrier-wave media.
In addition, terms such as "comprise", "include", and "have" should by default be interpreted as inclusive or open rather than exclusive or closed, unless expressly defined to the contrary. All technical, scientific, and other terms have the meanings understood by those skilled in the art unless defined to the contrary. Common terms found in dictionaries should not be interpreted too ideally or too impractically in the context of the related technical documents unless the present disclosure expressly defines them so.
Although example aspects of the present disclosure have been described for illustrative purposes, those skilled in the art should appreciate that the foregoing is merely a description of the preferred embodiments of the present invention and in no way limits its scope; the scope of the preferred embodiments includes additional implementations in which functions may be performed out of the order described or discussed. Any change or modification made by a person of ordinary skill in this field based on the above disclosure falls within the protection scope of the claims.

Claims (12)

  1. A voice message processing method, characterized by including the following steps:
    collecting a voice message;
    recognizing the voice message to obtain its semantic content;
    acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
  2. The method according to claim 1, characterized in that: the user's selection operation on the aforementioned dynamic image or composite image is collected, and the dynamic image or composite image is sent to the target client.
  3. The method according to claim 1, characterized in that: the aforementioned dynamic image or composite image is sent to the target client together with the aforementioned voice message, or the corresponding content in the voice message is replaced with the dynamic image or composite image before being sent to the target client.
  4. The method according to claim 1, characterized in that: image information is captured while the sender records the voice, an emotional image of the sender is acquired, and the emotional image or some of its elements are filled into the acquired streaming media data to make the dynamic image or composite image.
  5. The method according to claim 1, characterized in that: the text content of the voice message is acquired, and the dynamic image or composite image is displayed moving across different positions of the text content.
  6. The method according to claim 1, characterized in that: the streaming media data is image data, and all or part of the multiple image data is made into a dynamic image in one of the following ways:
    according to the speech timeline of the semantic content, the image data corresponding to the multiple parts is made into an animated image along that timeline;
    or the multiple image data are made into a dynamic image based on a preset dynamic description file;
    or one of the multiple image data is selected as a base layer, and the other image data are displayed on the base layer with dynamic effects.
  7. The method according to claim 1, characterized in that: the streaming media data is image data, and the multiple image data are formed into a composite image in one of the following ways:
    the multiple image data are merged into one image by overlapping layers;
    or one of the multiple image data is selected as a base picture, and the other image data are displayed on the base picture as marks or pictures;
    or the multiple image data are displayed in different areas of the same background picture.
  8. The method according to claim 1, characterized in that: the streaming media data is image data, and the step of acquiring corresponding image data for multiple parts of the semantic content includes:
    obtaining multiple keywords from the semantic content;
    searching local and/or network resources for image data matching the keywords, each keyword corresponding to one or more pieces of image data;
    when a keyword corresponds to one piece of image data, taking that image data as the keyword's matching image; when a keyword corresponds to multiple pieces of image data, taking the top-ranked image data as the keyword's matching image.
  9. The method according to claim 8, characterized in that: when a keyword corresponds to multiple pieces of image data, the images are ranked in one of the following ways:
    the user's interaction records are acquired, and the image data are ranked by their frequency of use in those records;
    or the ranking rule recommended by the communication tool is adopted;
    or the update time of the image data is acquired, and the image data are ranked by update time, newest first.
  10. The method according to claim 8, characterized in that: audio analysis is performed on the voice message to obtain intonation, speech rate, and/or volume features, and keywords in the semantic content are adjusted based on those features;
    or audio analysis is performed on the voice message to obtain the user's emotional state features, and keywords in the semantic content are adjusted based on the emotional state features.
  11. A voice message processing device, characterized by including the following structure:
    an audio acquisition module for acquiring the voice message input by the user;
    a speech recognition module for recognizing the voice message to obtain its semantic content;
    an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output.
  12. An instant messaging client for instant messaging interaction, characterized by including:
    a voice message trigger module for collecting the user's voice trigger operation;
    a speech recognition module for recognizing the user's voice message to obtain its semantic content;
    an image output module for acquiring corresponding streaming media data for multiple parts of the semantic content, and making all or part of the multiple pieces of streaming media data into a dynamic image for output, or forming the multiple pieces of streaming media data into a composite image for output;
    a message sending module for sending the aforementioned output image to the target client according to the user's selection operation, or sending the aforementioned output image combined with the voice message to the target client.
PCT/CN2020/112463 2020-08-29 2020-08-31 Voice message processing method, device, and instant messaging client WO2022041192A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010891002.XA CN112235180A (zh) 2020-08-29 2020-08-29 Voice message processing method, device, and instant messaging client
CN202010891002.X 2020-08-29

Publications (1)

Publication Number Publication Date
WO2022041192A1 true WO2022041192A1 (zh) 2022-03-03

Family

ID=74116604

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112463 WO2022041192A1 (zh) 2020-08-29 2020-08-31 Voice message processing method, device, and instant messaging client

Country Status (2)

Country Link
CN (1) CN112235180A (zh)
WO (1) WO2022041192A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438212A (zh) * 2022-08-22 2022-12-06 蒋耘晨 Image projection system, method, and device
CN115497489A (zh) * 2022-09-02 2022-12-20 深圳传音通讯有限公司 Voice interaction method, intelligent terminal, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104022942A (zh) * 2014-06-26 2014-09-03 北京奇虎科技有限公司 Method, client, electronic device, and system for processing interactive messages
US20150269928A1 (en) * 2012-12-04 2015-09-24 Tencent Technology (Shenzhen) Company Limited Instant messaging method and system, communication information processing method, terminal, and storage medium
CN106020504A (zh) * 2016-05-17 2016-10-12 百度在线网络技术(北京)有限公司 Information output method and apparatus
CN106027485A (zh) * 2016-04-28 2016-10-12 乐视控股(北京)有限公司 Rich media display method and system based on voice interaction
CN106888158A (zh) * 2017-02-28 2017-06-23 努比亚技术有限公司 Instant messaging method and apparatus
CN110096701A (zh) * 2019-04-16 2019-08-06 珠海格力电器股份有限公司 Message conversion processing method and apparatus, storage medium, and electronic device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101268436B1 (ko) * 2011-01-31 2013-06-05 (주)티아이스퀘어 Method and system for providing a video chat service that synthesizes multimedia content using speech recognition on a personal mobile terminal
CN104780093B (zh) * 2014-01-15 2018-05-01 阿里巴巴集团控股有限公司 Method and apparatus for processing expression information in instant messaging
WO2016000219A1 (zh) * 2014-07-02 2016-01-07 华为技术有限公司 Information transmission method and transmission apparatus
CN106373569B (zh) * 2016-09-06 2019-12-20 北京地平线机器人技术研发有限公司 Voice interaction apparatus and method
CN106531149B (zh) * 2016-12-07 2018-02-23 腾讯科技(深圳)有限公司 Information processing method and apparatus
CN106910514A (zh) * 2017-04-30 2017-06-30 上海爱优威软件开发有限公司 Voice processing method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269928A1 (en) * 2012-12-04 2015-09-24 Tencent Technology (Shenzhen) Company Limited Instant messaging method and system, communication information processing method, terminal, and storage medium
CN104022942A (zh) * 2014-06-26 2014-09-03 北京奇虎科技有限公司 Method, client, electronic device, and system for processing interactive messages
CN106027485A (zh) * 2016-04-28 2016-10-12 乐视控股(北京)有限公司 Rich media display method and system based on voice interaction
CN106020504A (zh) * 2016-05-17 2016-10-12 百度在线网络技术(北京)有限公司 Information output method and apparatus
CN106888158A (zh) * 2017-02-28 2017-06-23 努比亚技术有限公司 Instant messaging method and apparatus
CN110096701A (zh) * 2019-04-16 2019-08-06 珠海格力电器股份有限公司 Message conversion processing method and apparatus, storage medium, and electronic device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438212A (zh) * 2022-08-22 2022-12-06 蒋耘晨 Image projection system, method, and device
CN115438212B (zh) * 2022-08-22 2023-03-31 蒋耘晨 Image projection system, method, and device
CN115497489A (zh) * 2022-09-02 2022-12-20 深圳传音通讯有限公司 Voice interaction method, intelligent terminal, and storage medium

Also Published As

Publication number Publication date
CN112235180A (zh) 2021-01-15

Similar Documents

Publication Publication Date Title
US10977299B2 (en) Systems and methods for consolidating recorded content
US11475897B2 (en) Method and apparatus for response using voice matching user category
US5884267A (en) Automated speech alignment for image synthesis
CN113454708A Linguistic style matching agent
CN110517689A Voice data processing method, apparatus, and storage medium
CN112650831A Virtual avatar generation method and apparatus, storage medium, and electronic device
CN108242238B Audio file generation method and apparatus, and terminal device
CN111145777A Virtual avatar display method and apparatus, electronic device, and storage medium
WO2005069171A1 Document correlation device and document correlation method
CN110097890A Voice processing method and apparatus, and apparatus for voice processing
WO2022170848A1 Human-computer interaction method, apparatus, and system, electronic device, and computer medium
Mitra Introduction to multimedia systems
WO2019114015A1 Robot performance control method and robot
WO2022242706A1 Multimodal-based reactive response generation
CN114121006A Image output method, apparatus, and device for a virtual character, and storage medium
WO2022041192A1 (zh) Voice message processing method, device, and instant messaging client
CN110148406A Data processing method and apparatus, and apparatus for data processing
CN113538628A Sticker (meme) generation method and apparatus, electronic device, and computer-readable storage medium
CN114125506B Voice moderation method and apparatus
CN110910898B Voice information processing method and apparatus
CN112235183B Communication message processing method, device, and instant messaging client
CN112492400B Interaction method, apparatus, and device, and communication method and shooting method
CN107123420A Speech recognition system and interaction method thereof
CN110795581B Image search method and apparatus, terminal device, and storage medium
TWI377559B (en) Singing system with situation sound effect and method thereof

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/07/2023)

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20950861

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20950861

Country of ref document: EP

Kind code of ref document: A1