WO2022041177A1

WO2022041177A1 - Communication message processing method, device, and instant messaging client

Info

Publication number: WO2022041177A1
Application number: PCT/CN2020/112407
Authority: WO
Inventors: 马宇尘
Original assignee: 深圳市永兴元科技股份有限公司
Priority date: 2020-08-29
Filing date: 2020-08-31
Publication date: 2022-03-03
Also published as: CN112235183A; CN112235183B

Abstract

The present invention provides a communication message processing method, a device and an instant messaging client, relating to the technical field of communication interaction. A communication message processing method, comprising the following steps: acquiring a speech message acquired by an audio acquisition device; extracting a keyword feature in the speech message; and determining image data matching the keyword, and sending same together with the speech message, or replacing the keyword in the speech message with the image data, and then sending same. By means of the present invention, relevant image data can be intelligently loaded during the speech interaction process of users, improving the convenience, intelligence and interestingness of message interaction, and improving the user experience.

Description

Communication message processing method, device and instant messaging client

Technical field

The present invention relates to the technical field of communication interaction.

Background art

Instant Messaging (IM) is the most popular communication method in the mobile Internet era. Various instant messaging software not only supports the instant transmission of text messages, but also enables the transmission of voice messages and video messages between users.

When exchanging voice messages through an IM tool, the user activates a voice collection device such as the terminal's microphone to record the voice message, which is then transmitted to the target receiving-end user over the Internet. After entering a play instruction, the receiving-end user can play the voice message and can also reply by voice.

At present, so that users can choose whether to listen to a voice message depending on the occasion, a text conversion function for voice messages has also been added, and the converted text content and the recorded audio file can be sent to the receiving-end user together as an instant communication message. Some communication tools also have a speech synthesis function that converts text into speech, known as Text To Speech (TTS). There are two main types of speech synthesis solution: concatenative (splicing) systems and parametric generation systems. Both require text analysis. The former uses a large number of recorded voice fragments and, guided by the text analysis results, splices them to obtain synthetic speech; the latter uses the text analysis results to generate speech parameters, such as the fundamental frequency, through a model, and then converts them into a waveform.

The existing voice message function only incorporates text conversion and does not consider further information such as the user's expression, emotional state or tone of voice when the voice is recorded. It therefore struggles to meet users' needs; in particular, for young people who like to use animated images for doutu (exchanging meme images), voice messages lack fun.

With the continuous development of artificial intelligence technology and the continuous improvement of people's requirements for interactive experience, intelligent interaction methods have gradually begun to replace some traditional human-computer interaction methods, and have become a research hotspot. At present, it is possible to analyze user emotions based on user interaction content, and to analyze the deep-level emotional needs that user messages actually want to express according to the user's emotional state. How to provide users with a more intelligent and convenient communication method in combination with the above-mentioned existing technologies is an urgent problem to be solved.

Technical problem

The purpose of the present invention is to overcome the deficiencies of the prior art and provide a communication message processing method, device and instant messaging client. With the present invention, relevant image data can be loaded intelligently in the process of user voice interaction, the convenience, intelligence and interest of message interaction can be improved, and user experience can be improved.

Technical solution

To achieve the above-mentioned goals, the present invention provides the following technical solutions:

A communication message processing method, comprising the steps of: acquiring a voice message collected by an audio collection device; extracting a keyword feature in the voice message; determining image data matching the keyword, and sending it together with the aforementioned voice message, or sending it after replacing the keyword in the voice message with the image data.

Further, the user's own image data when recording the voice, or image data on a preset associated path, is collected, and the collected image data is identified and used as the matching image data; or, a composite image is generated as the matching image data by adding elements to or removing elements from the collected image data; or, a virtual image is mapped as the matching image data based on the collected image data.

Further, the volume information of the voice message is acquired, and the size when outputting the matching image data is adjusted according to the volume.

Further, semantic analysis is performed on the voice message, and when the semantic content obtained by the analysis includes two or more matching image data, a plurality of matched image data is obtained to produce a dynamic image output, or a plurality of images are formed into a composite image output.

Further, it also includes steps:

Analyzing the aforementioned voice message,

extracting sound clips corresponding to the aforementioned image data from the voice message;

The extracted sound clips are played corresponding to the image data, or the aforementioned sound clips are played after a triggering operation of the image data by the user is collected.

Further, the manner of sending together with the aforementioned voice message is:

sending the voice message and the image data together as two separate messages;

Or, insert the image data into the keyword position or adjacent positions and send it together;

Alternatively, a floating window is set corresponding to the voice message, and the image data is displayed through the floating window.

Further, the image data is pictures, videos, animations and/or other multimedia information.

Further, the text content of the voice message is acquired, and the text content and the audio file of the voice message are integrated into a multimedia message for output display.

Preferably, the text content is displayed in a message box of the multimedia message, an audio file play button is set corresponding to the message box, and triggering the play button can trigger the audio file to play.

Further, the method of extracting the keyword features in the voice message is:

Perform semantic analysis on voice messages, and obtain keyword features based on semantic analysis;

Or, perform audio analysis on the voice message to obtain intonation feature, speech rate feature and/or volume feature, and obtain keyword features in the voice message based on the intonation feature, speech speed feature and/or volume feature;

Alternatively, audio analysis is performed on the voice message to obtain the user's emotional state feature, and the emotional state feature is used as a keyword feature of the voice message.

Further, the method of determining the image data matching the keyword is,

Search for image data in the local resource file based on the keyword, and obtain image data matching the keyword;

And/or, searching for image data in a network resource file based on the keyword, to obtain image data matching the keyword;

And/or, based on the keyword, the historical image data sent and received by the user is searched, and image data matching the keyword is acquired.

Further, the communication message is an instant communication message.

The present invention also provides a communication message processing device, including the following structure:

an audio acquisition module for acquiring the voice message input by the user;

an information extraction module for extracting the keyword features in the voice message;

An information processing module, configured to determine the image data matching the keyword, and send it together with the aforementioned voice message, or replace the keyword in the voice message with image data and send it.

The present invention also provides an instant messaging client for performing instant messaging interaction, including the following structure:

The voice message trigger module is used to collect the user's voice trigger operation;

an information extraction module for extracting keyword features in the voice according to the voice input by the user;

An information processing module, configured to determine the image data matching the keyword, and send it together with the aforementioned voice, or replace the keyword in the voice with image data and send it as an instant communication message.

Beneficial effects

Compared with the prior art, the present invention, by adopting the above technical solutions, has the following advantages and positive effects, given by way of example: relevant image data can be loaded intelligently in the process of user voice interaction, improving the convenience, intelligence and fun of message interaction. It is especially suitable for users who like doutu (exchanging meme images), and improves the user experience.

Description of drawings

FIG. 1 is a flowchart of a communication message processing method provided by an embodiment of the present invention.

FIG. 2 is a module structure diagram of an instant messaging client provided by an embodiment of the present invention.

FIG. 3 to FIG. 7 are diagrams illustrating operation examples of instant messaging interaction provided by an embodiment of the present invention.

FIG. 8 to FIG. 10 are exemplary diagrams when a voice message including image data is received according to an embodiment of the present invention.

Description of reference numbers:

Instant messaging client 100, voice message triggering module 110, information extraction module 120, information processing module 130; user terminal 200, desktop 210, instant messaging tool icon 211, contact 220, microphone 230.

Embodiments of the present invention

The communication message processing method, device and instant messaging client provided by the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the technical features or combinations of technical features described in the following embodiments should not be considered isolated, and they can be combined with each other to achieve better technical effects. In the drawings of the following embodiments, the same reference numerals appearing in the various drawings represent the same features or components, which may be used in different embodiments. Therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.

It should be noted that the structures, proportions, sizes, etc. shown in the accompanying drawings of this specification are only used to accompany the content disclosed in the specification, so that those familiar with the technology can understand and read it, and are not intended to limit the conditions under which the invention can be implemented. Any structural modification, change in proportional relationship or adjustment of size that does not affect the efficacy and purpose of the invention shall fall within the scope covered by the technical content disclosed in the invention. The scope of the preferred embodiments of the present invention includes additional implementations in which the functions may be performed out of the order described or discussed, including in a substantially simultaneous manner or in the reverse order depending upon the functions involved, as will be understood by those skilled in the art to which the embodiments of the invention pertain.

Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the granted specification. In all examples shown and discussed herein, any specific value should be construed as illustrative only and not as limiting. Accordingly, other examples of exemplary embodiments may have different values.

Example

Referring to FIG. 1, a communication message processing method is disclosed, including the following steps:

S100: Acquire a voice message collected by an audio collection device.

When the user needs to send a voice message, start the audio capture device to record the voice. Taking instant messaging tool (IM tool) WeChat as an example for illustration, the message at this time is an instant messaging message. After the user enters WeChat, he can trigger the voice recording button to start the audio collection device of the terminal where he is located. After the pickup is activated, the user's voice information can be collected.

The terminal, by way of example and not limitation, may be various commonly used mobile terminals such as mobile phones, palmtop computers, and tablet computers, and various smart wearable electronic devices, such as smart glasses, smart watches, and the like. In this embodiment, a mobile phone is used as the mobile terminal, and the mobile phone has an audio collection structure, an image collection structure and a display structure.

S200: Extract keyword features in the voice message.

The aforementioned voice message is recognized using speech recognition technology, and the keyword features in the voice message are extracted.

Speech recognition technology is mainly based on the analysis of three basic properties of speech: physical, physiological and social. The physical properties of speech mainly include four elements: pitch, length, intensity and timbre. Pitch refers to how high or low the sound is, and is mainly determined by how fast the sounding body vibrates; length refers to how long the sound lasts, and is mainly determined by the duration of the vibration; intensity refers to the strength of the sound, and is mainly determined by the amplitude of the vibration; timbre refers to the characteristic quality of the sound, and is mainly determined by the shape of the sound waveform produced by the vibrating body. The physiological properties of speech mainly refer to the influence of the vocal organs on speech, including the lungs and trachea, the larynx and vocal cords, and the oral cavity, nasal cavity and pharynx. The social properties of speech are mainly reflected in three aspects: first, there is no necessary connection between sound and meaning, and their correspondence is established by convention among the members of a society; second, each language or dialect has its own phonetic system; third, speech sounds have the function of distinguishing meaning.

Generally speaking, the basic process of speech recognition may include three steps: preprocessing of speech signals, feature extraction, and pattern matching.

Preprocessing usually includes speech signal sampling, anti-aliasing bandpass filtering, removal of individual pronunciation differences and noise effects caused by equipment and environment, etc., and involves the selection of speech recognition primitives and endpoint detection.
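As a rough illustration of the endpoint detection mentioned above, the following minimal sketch marks where speech starts and ends by thresholding short-time energy. The function name `detect_endpoints`, the frame length and the threshold value are assumptions made for this example; the specification does not prescribe a particular detection algorithm.

```python
from typing import List, Optional, Tuple


def detect_endpoints(
    samples: List[float], frame_len: int = 4, threshold: float = 0.01
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the speech region, or None.

    A frame counts as speech when its average energy exceeds `threshold`.
    Both `frame_len` and `threshold` are illustrative values only.
    """
    active_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    # The region runs from the first active frame to the end of the last one.
    return active_starts[0], active_starts[-1] + frame_len
```

In practice the threshold would be tuned to the noise level of the device and environment.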

Feature extraction is used to extract acoustic parameters that reflect essential features in speech, such as average energy, average zero-crossing rate, formants, etc. The extracted feature parameters must meet the following requirements: they can effectively represent the speech features and discriminate well; the parameters of each order are largely independent of one another; and they should be easy to calculate, preferably with efficient algorithms, to ensure real-time speech recognition. In the training phase, after the feature parameters are processed to a certain extent, a model is established for each entry and saved in a template library. In the recognition stage, the speech signal passes through the same channel to obtain speech feature parameters, a test template is generated and matched against the reference templates, and the reference template with the highest matching score is taken as the recognition result. At the same time, with the help of a large amount of prior knowledge, the accuracy of recognition can be improved.
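Two of the acoustic parameters named above, average energy and average zero-crossing rate, can be computed per frame as in the following minimal sketch. The function name `frame_features` and the default frame length are assumptions for illustration; formant extraction is omitted.

```python
from typing import List, Tuple


def frame_features(
    samples: List[float], frame_len: int = 4
) -> List[Tuple[float, float]]:
    """Return (average energy, zero-crossing rate) for each full frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Average energy: mean of the squared sample values.
        energy = sum(x * x for x in frame) / frame_len
        # Zero-crossing rate: fraction of adjacent pairs that change sign.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        features.append((energy, crossings / (frame_len - 1)))
    return features
```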

Pattern matching is the core of the entire speech recognition system. It calculates the similarity (such as matching distance or likelihood probability) between the input features and the stored patterns according to certain rules (such as a certain distance measure) and expert knowledge (such as word formation rules, grammar rules, semantic rules, etc.) to determine the semantic information of the input speech.
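The template matching described above can be sketched, under simplifying assumptions, as a nearest-template search. Here each template is assumed to be a fixed-length feature vector and the distance measure is assumed to be Euclidean; the name `best_template` is hypothetical. Practical recognizers typically need a measure that tolerates differences in speaking rate, which this sketch does not attempt.

```python
import math
from typing import Dict, List


def best_template(test: List[float], templates: Dict[str, List[float]]) -> str:
    """Return the label of the reference template closest to `test`.

    The smallest Euclidean distance is treated as the highest matching score.
    """
    def distance(a: List[float], b: List[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(templates, key=lambda label: distance(test, templates[label]))
```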

Extracting the keyword features in the voice message means obtaining the key content from the speech recognition result. By way of example and not limitation, the keyword features may be words expressing emotions, words expressing preferences, words expressing intentions, or words expressing plans, and the like.

In this embodiment, the method for extracting the keyword features in the voice message may be as follows:

Method 1: Perform semantic analysis on the voice message, and obtain keyword features based on the semantic analysis.

Method 2: Perform audio analysis on the voice message to obtain intonation features, speech rate features and/or volume features, and obtain keyword features in the voice message based on the intonation features, speech rate features and/or volume features.

A speaker's pitch, speech rate and volume change during expression. For example, when reaching key information, users usually raise the volume, stress the intonation, and slow down. Based on these changes, the key content expressed by the user can be identified and used as a keyword feature.
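One possible reading of Method 2 is sketched below: words spoken both louder and more slowly than the speaker's average are treated as key content. The `WordSegment` type, the per-word volume and duration inputs, and the `ratio` threshold are all assumptions for this example; the specification does not state how the audio features are measured or combined, and intonation is not modeled here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class WordSegment:
    text: str
    volume: float    # hypothetical loudness measure, e.g. RMS amplitude
    duration: float  # seconds spent speaking the word


def emphasized_words(segments: List[WordSegment], ratio: float = 1.2) -> List[str]:
    """Return words that are louder and slower than average by `ratio`."""
    if not segments:
        return []
    avg_volume = mean(s.volume for s in segments)
    avg_duration = mean(s.duration for s in segments)
    return [
        s.text
        for s in segments
        if s.volume >= ratio * avg_volume and s.duration >= ratio * avg_duration
    ]
```

Raw word duration is only a crude stand-in for slowed speech, since long words naturally take longer to say.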

Method 3: Perform audio analysis on the voice message to obtain the user's emotional state feature, and use the emotional state feature as a keyword feature of the voice message.

Voices can reflect people's emotions to a certain extent. For example, generally speaking, irritable and loud speech often means that the speaker is angry, while cheerful and soft speech often means that the speaker is happy. Accordingly, the important content that the user needs to express can be obtained by analyzing the emotional information in the user's voice information.

Preferably, the way of identifying the emotional information in the voice information is one or more of the following ways:

The first way is to analyze the user's volume change in the voice information, and analyze the emotional state feature according to the volume change.

The second way is to analyze the pitch change in the voice information, and analyze the emotional state feature according to the pitch change.

The third way is to analyze the speech rate in the voice information, and analyze the emotional state feature according to the speech rate.

The fourth way is to analyze the rhythm changes in the voice information, and analyze the emotional state feature according to the rhythm changes.

By way of example and not limitation, suppose the collected voice message is "This product is much cheaper than the one I bought before, I'm really happy." After the voice message is recognized, the obtained keyword feature can be "really happy".

Alternatively, when the user does not express emotions explicitly but the voice message carries an emotional tendency, the implied emotion may be used as the keyword feature based on situational analysis.

By way of example and not limitation, suppose the collected voice message is "This bun is much smaller than before". The emotional tendency contained in this message is "dissatisfied and unhappy". Therefore, based on the emotional tendency, "dissatisfied and unhappy" is used as the keyword feature.

S300: Determine the image data matching the keyword, and send it together with the aforementioned voice message, or replace the keyword in the voice message with image data and send it.

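Specifically, the image data matching the keyword may be determined by searching, based on the keyword, a local resource file, a network resource file, and/or the historical image data sent and received by the user.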
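The keyword-based image search over the sources named in the technical solution above (a local resource file, a network resource file, and the user's sent and received history) can be sketched as a lookup that tries each source in turn. The names `find_image` and `make_dict_source`, the dictionary-backed sources, and the priority order are assumptions for illustration; the specification allows the sources to be used in any combination.

```python
from typing import Callable, Dict, List, Optional

# A source maps a keyword to an image reference, or None when nothing matches.
ImageSource = Callable[[str], Optional[str]]


def make_dict_source(index: Dict[str, str]) -> ImageSource:
    """Wrap a keyword-to-image dictionary as a searchable source."""
    return lambda keyword: index.get(keyword)


def find_image(keyword: str, sources: List[ImageSource]) -> Optional[str]:
    """Return the first match found, searching the sources in list order."""
    for source in sources:
        image = source(keyword)
        if image is not None:
            return image
    return None
```

A network-backed source would follow the same interface, returning `None` when the remote search finds nothing.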

In another implementation of this embodiment, the user's own image data when recording voice or image data on a preset associated path can be collected, and the collected image data can be identified and used as matching image data.

Alternatively, collect the user's own image data when recording voice or collect image data on a preset associated path, identify the collected image data, and then add or subtract elements from the collected image data to generate a composite image as matching image data. In this way, a composite image including real elements and virtual elements can be formed, which improves the interest.

Alternatively, the user's own image data when recording voice or image data on a preset associated path is collected, and a virtual image is mapped as matching image data based on the aforementioned collected image data. In this way, a virtual image containing the user's own emotions or expressions is generated on the basis of protecting the user's privacy, such as a cartoon shape, which improves the fun.

In another implementation of this embodiment, the volume information of the voice message may also be acquired, and the size of the matching image data when output is adjusted according to the volume.

In this manner, the correspondence between volume and image size can be established in advance. By way of example and not limitation, the sound is divided into five levels by volume, from low to high: low, medium-low, medium, medium-high and high. The image sizes corresponding to these five levels increase in sequence.

After identifying which volume level the user's volume in the voice information belongs to, the image size corresponding to the volume level can be obtained based on the correspondence between the volume level and the image size.
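The correspondence between volume level and image size can be sketched as a simple table lookup. The decibel thresholds and pixel sizes below are invented for illustration, as is the name `image_size_for_volume`; the specification only requires five levels whose image sizes increase with volume.

```python
from bisect import bisect_right

# Hypothetical boundaries (in dB) separating the five volume levels.
LEVEL_BOUNDS_DB = [40.0, 50.0, 60.0, 70.0]
# Hypothetical image edge lengths (in pixels), one per level, increasing.
IMAGE_SIZE_PX = [48, 64, 96, 128, 160]


def image_size_for_volume(volume_db: float) -> int:
    """Map a measured volume to the image size of its volume level."""
    level = bisect_right(LEVEL_BOUNDS_DB, volume_db)  # 0 (lowest) to 4 (highest)
    return IMAGE_SIZE_PX[level]
```

With these example values, a message measured at 55 dB falls in the third level and is shown at 96 pixels.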

In another implementation of this embodiment, semantic analysis may also be performed on the voice message, and when the semantic content obtained by the analysis matches two or more pieces of image data, the multiple pieces of matching image data are obtained and made into a dynamic image for output, or combined into a composite image for output.

By way of example and not limitation, if both "Yangcheng Lake" and "hairy crab" in the semantic content have matching images, the matching images can be made into a dynamic image of hairy crabs crawling on Yangcheng Lake, or into a composite image of several hairy crabs in Yangcheng Lake.
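The choice between a dynamic image and a composite image can be sketched as follows. The function `build_image_output`, the dictionary-based result, and the 500 ms frame interval are assumptions for this example; the sketch only describes what should be rendered and leaves the actual animation or compositing to an image library.

```python
from typing import Dict, List


def build_image_output(
    keywords: List[str], image_index: Dict[str, str], mode: str = "dynamic"
) -> dict:
    """Describe the image output for the keywords that have matching images."""
    matched = [image_index[k] for k in keywords if k in image_index]
    if not matched:
        return {"kind": "none"}
    if len(matched) == 1:
        return {"kind": "single", "image": matched[0]}
    if mode == "dynamic":
        # Frames are shown one after another, as in an animated image.
        return {"kind": "dynamic", "frames": matched, "frame_ms": 500}
    # Otherwise the images are layered into one still picture.
    return {"kind": "composite", "layers": matched}
```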

In another implementation manner of this embodiment, the following steps are also included:

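Analyzing the aforementioned voice message and extracting from it the sound clips corresponding to the aforementioned image data; the extracted sound clips are then played together with the image data, or played after a triggering operation on the image data by the user is detected.

That is, sound information is attached to the output image data. The sound information can be played automatically when the receiving-end user receives the message, or played when the receiving-end user triggers the image data, for example by clicking on the area where the image data is located.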
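Extracting the sound clip that corresponds to a piece of image data can be sketched as slicing the recorded samples by the time span of the matching keyword. The function `clip_for_keyword` is hypothetical, and the start and end times are assumed to come from the speech recognizer's word timing, which the specification does not detail.

```python
from typing import List


def clip_for_keyword(
    samples: List[float], sample_rate: int, start_s: float, end_s: float
) -> List[float]:
    """Return the samples spoken between `start_s` and `end_s` (in seconds)."""
    start = max(0, int(start_s * sample_rate))
    end = min(len(samples), int(end_s * sample_rate))
    return samples[start:end]
```

The resulting clip can then be attached to the image and played on receipt or when the recipient taps the image.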

In this embodiment, the manner of sending together with the aforementioned voice message may be as follows:

The voice message and the image data are sent together as two separate messages. Alternatively, the image data is inserted into the keyword position or adjacent positions and then sent together. Alternatively, a floating window is set corresponding to the voice message, and the image data is displayed through the floating window.

The image data may be pictures, videos, animations and/or other multimedia information.

In another implementation manner of this embodiment, further, the text content of the voice message may be obtained, and the text content and the audio file of the voice message may be integrated into a multimedia message for output display.

Referring to FIG. 2, the present invention also provides an instant messaging client for performing instant messaging interaction. The instant messaging client 100 includes the following structure:

The voice message triggering module 110 is used for collecting user's voice triggering operation.

The information extraction module 120 is used for extracting the keyword features in the speech according to the speech input by the user.

The information processing module 130 is configured to determine the image data matching the keyword, and send it together with the aforementioned voice, or replace the keyword in the voice with image data and send it as an instant communication message.

When the user enters the instant communication tool and needs to send a voice message, the audio collection device is activated to record the voice. Specifically, the voice recording button can be triggered to activate the audio collection device of the terminal where it is located, and the user's voice information can be collected after the audio pickup is activated. The terminal, by way of example and not limitation, may be various commonly used mobile terminals such as mobile phones, palmtop computers, and tablet computers, and various smart wearable electronic devices, such as smart glasses, smart watches, and the like. In this embodiment, a mobile phone is used as the mobile terminal, and the mobile phone has an audio collection structure, an image collection structure and a display structure.

Then, the aforementioned voice message is recognized using speech recognition technology, and the keyword features in the voice message are extracted.

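As an example, the keyword features in the voice message may be extracted in any of the ways described above for step S200: semantic analysis of the voice message; audio analysis of intonation, speech rate and/or volume; or audio analysis of the user's emotional state.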

Preferably, the information processing module 130 may include a message synthesis unit, which is used for recognizing the text content of the voice, and integrating the text content and the audio file of the voice into a multimedia message.

Further, the text content is displayed in a message box of the multimedia message, an audio file play button is set corresponding to the message box, and triggering the play button can trigger the audio file to play.

Preferably, the information extraction module 120 may include an emotion recognition unit. The emotion recognition unit is used for recognizing emotion information in the voice message. Preferably, the emotion recognition unit includes a voice volume analysis sub-circuit, a voice pitch analysis sub-circuit, a voice speech rate analysis sub-circuit and/or a voice rhythm analysis sub-circuit.

The implementation of this embodiment will be described in detail with reference to FIGS. 3 to 7.

Referring to FIG. 3, the user enters the instant messaging tool "Quick Message" through the user terminal 200 carried by the user. The user terminal 200 is preferably a mobile phone in this embodiment.

Referring to FIG. 4, the desktop 210 of the user terminal 200 outputs a user interface to the user, on which all communication messages are displayed; each communication message shows the contact 220, the latest interactive message, and a virtual microphone 230 (a voice trigger control).

As an example, as shown in FIG. 4, when a user chats with a contact leo, the virtual microphone 230 corresponding to leo can be triggered, which directly starts the voice message collection function.

Referring to FIG. 5, a voice message input box is displayed in the user interface, and the input box displays the user's voice being entered, the text content corresponding to the voice, and related operation keys.

The voice message input box can be displayed directly on the current user interface, or can be displayed in a separate voice message interface generated for the contact leo. As shown in FIG. 6, the voice message interface displays the contact information, the voice message input box, the virtual microphone and the current recording quality information.

Referring to FIG. 7, when recording a voice, the user can perform sending and pausing operations by operating the virtual microphone 230. As an example of a preferred manner, pressing the microphone and sliding up is a send operation, and pressing the microphone and sliding to the right is a pause operation.

In this embodiment, the manner in which the image data is sent together with the aforementioned voice message may be as follows:

Referring to Figure 8, the voice message and the image data are sent together as two separate messages.

Alternatively, as shown in FIG. 9, the image data is inserted at the keyword position or an adjacent position and then sent together. The inserted image data can be played directly, or played after the user triggers the keyword position.

Alternatively, as shown in FIG. 10, the keywords in the voice are replaced with image data and then sent as an instant communication message. In this case, the message sent to the receiving end includes text content, an audio file and image data.
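The three sending manners illustrated in FIGS. 8 to 10 can be sketched as one function that builds the outgoing messages. The `Message` type, the mode names `separate`, `inline` and `replace`, and the bracketed image placeholder inside the text are assumptions made for this example and are not part of the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    kind: str
    audio: Optional[str] = None
    text: Optional[str] = None
    image: Optional[str] = None


def compose(text: str, audio: str, keyword: str, image: str, mode: str) -> List[Message]:
    """Build the outgoing message(s) for one of the three sending manners."""
    if mode == "separate":
        # FIG. 8: the voice message and the image go out as two messages.
        return [Message("voice", audio=audio, text=text),
                Message("image", image=image)]
    if mode == "inline":
        # FIG. 9: the image is inserted next to the first keyword occurrence.
        marked = text.replace(keyword, f"{keyword}[{image}]", 1)
        return [Message("multimedia", audio=audio, text=marked)]
    if mode == "replace":
        # FIG. 10: the first keyword occurrence is replaced by the image.
        marked = text.replace(keyword, f"[{image}]", 1)
        return [Message("multimedia", audio=audio, text=marked)]
    raise ValueError(f"unknown mode: {mode}")
```

A real client would carry the image as structured content positioned by character offset rather than as a bracketed marker in the text.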

In this embodiment, referring to FIG. 8 to FIG. 10 , the text content of the voice message is also obtained, and the text content and the audio file of the voice message are integrated into a multimedia message for output display.

The text content is displayed in the message box of the multimedia message, and an audio file play button may also be set corresponding to the message box, and triggering the play button can trigger the audio file to play.

The instant messaging client may also be set with other functional modules as required, and the specific functions can be found in the previous embodiments, which will not be repeated here.

Another embodiment of the present invention also provides a communication message processing device.

The message processing device includes the following structure:

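an audio acquisition module for acquiring the voice message input by the user;

an information extraction module for extracting the keyword features in the voice message;

an information processing module, configured to determine the image data matching the keyword, and send it together with the aforementioned voice message, or replace the keyword in the voice message with image data and send it.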

The message processing device may also be provided with other functional modules as required. For details, refer to the foregoing embodiments, which will not be repeated here.

In the above description, although all components of various aspects of the present disclosure may be explained as being assembled or operatively connected as a circuit, the present disclosure is not intended to limit itself to these aspects. Rather, the various components may be selectively and operatively combined in any number within the intended scope of this disclosure. Each of these components may also itself be implemented in hardware, while the individual components may be combined in part or selectively collectively and implemented as a computer program having program modules for performing the functions of the hardware equivalent. Code or code segments to construct such programs can be readily derived by those skilled in the art. Such a computer program can be stored in a computer-readable medium, which can be executed to implement various aspects of the present disclosure. The computer-readable media may include magnetic recording media, optical recording media, and carrier wave media.

Additionally, terms like "includes," "comprises," and "has" should by default be construed as inclusive or open, rather than exclusive or closed, unless explicitly defined to the contrary. All technical, scientific or other terms have the meaning understood by those skilled in the art unless they are defined to the contrary. Common terms found in dictionaries should not be interpreted too ideally or too practically in the context of related technical documents, unless this disclosure explicitly defines them as such.

While exemplary aspects of the present disclosure have been described for illustrative purposes, those skilled in the art will appreciate that the foregoing description is merely a description of preferred embodiments of the present invention and is not intended to limit the scope of the present invention. The scope of the preferred embodiments includes additional implementations in which the functions may be performed out of the order described or discussed. Any changes and modifications made by those of ordinary skill in the field of the present invention according to the above disclosure fall within the protection scope of the claims.

Claims

A communication message processing method, characterized in that it comprises the following steps:

acquiring a voice message collected by an audio acquisition device;

extracting keyword features in the voice message;

determining image data matching the keyword, and sending the image data together with the voice message, or replacing the keyword in the voice message with the image data and then sending it.
The method according to claim 1, wherein: self-image data of the user is collected while the voice is recorded, or image data is collected from a preset associated path, and the collected image data is identified as the matching image data; or a composite image is generated as the matching image data by adding or removing elements of the collected image data; or a virtual image is mapped from the collected image data and used as the matching image data.
The method according to claim 1, wherein the volume information of the voice message is obtained, and the size of the output image data is adjusted according to the volume.
The method according to claim 1, wherein: semantic analysis is performed on the voice message, and when the semantic content obtained by the analysis includes more than two matching image data, a plurality of matching image data are obtained and made into a dynamic image for output, or the multiple images are combined into a composite image for output.
The method of claim 1, further comprising the steps of:

analyzing the voice message;

extracting sound clips corresponding to the image data from the voice message;

playing the extracted sound clips in correspondence with the image data, or playing the sound clips after a triggering operation by the user on the image data is detected.
The method according to claim 1, wherein the method of sending together with the aforementioned voice message is:

sending the voice message and the image data together as two separate messages;

or inserting the image data at the keyword position or an adjacent position and sending it together with the voice message;

or setting a floating window corresponding to the voice message and displaying the image data through the floating window.
The method according to claim 1, wherein the image data is a picture, a video, an animation and/or other multimedia image information, and the communication message is an instant communication message.
The method according to claim 1, wherein the text content of the voice message is acquired, and the text content and the audio file of the voice message are integrated into a multimedia message for output display.
The method according to claim 8, wherein the text content is displayed in a message box of the multimedia message, an audio file play button is provided for the message box, and triggering the play button plays the audio file.
The method according to claim 1, wherein the method of extracting the keyword features in the voice message is:

performing semantic analysis on the voice message, and obtaining the keyword features based on the semantic analysis;

or performing audio analysis on the voice message to obtain an intonation feature, a speech rate feature and/or a volume feature, and obtaining the keyword features in the voice message based on the intonation feature, the speech rate feature and/or the volume feature;

or performing audio analysis on the voice message to obtain an emotional state feature of the user, and using the emotional state feature as a keyword feature of the voice message.
The method according to claim 1, wherein the method of determining the image data matching the keyword is:

searching for image data in a local resource file based on the keyword, to obtain image data matching the keyword;

and/or searching for image data in a network resource file based on the keyword, to obtain image data matching the keyword;

and/or searching the historical image data sent and received by the user based on the keyword, to obtain image data matching the keyword.
A communication message processing device, characterized in that it includes:

an audio acquisition module for acquiring the voice message input by the user;

an information extraction module for extracting the keyword features in the voice message;

an information processing module for determining the image data matching the keyword, and sending it together with the voice message, or replacing the keyword in the voice message with the image data and then sending it.
An instant messaging client for instant messaging interaction, characterized by comprising:

a voice message trigger module for collecting the user's voice trigger operation;

an information extraction module for extracting keyword features in the voice according to the voice input by the user;

an information processing module for determining the image data matching the keyword, and sending it together with the voice, or replacing the keyword in the voice with the image data and then sending it as an instant communication message.