CN111835621A - Session message processing method and device, computer equipment and readable storage medium - Google Patents


Info

Publication number
CN111835621A
Authority
CN
China
Prior art keywords
voice message
input voice
expression picture
expression
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010662684.7A
Other languages
Chinese (zh)
Inventor
程龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010662684.7A
Publication of CN111835621A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, characterised by the inclusion of specific contents
    • H04L 51/10 Multimedia information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a session message processing method, a session message processing apparatus, a computer device and a readable storage medium, and belongs to the technical field of computers. According to the method and the device, when an input voice message of any participant user of a target session is received, the input voice message is recognized to obtain a text label capable of expressing the semantic tendency of the input voice message, and at least one expression picture capable of visually expressing that semantic tendency is obtained as the expression picture to be recommended to the user. The obtained at least one expression picture is sent to the participant user's terminal, so that expression recommendation is performed based on the voice message input by the user. After inputting a voice message, the participant user can see the recommended expression pictures on the terminal without having to type any text, which simplifies the operation process, improves the human-computer interaction efficiency, and improves the user experience.

Description

Session message processing method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a session message, a computer device, and a readable storage medium.
Background
With the continuous development of computers and networks, various types of social software, such as communication software, have emerged. People can communicate through communication software, which removes the influence of distance on communication, and people have become increasingly accustomed to communicating this way. When using communication software, people can communicate by sending session messages in various forms, such as voice messages and text messages. During communication, a user can also reinforce the emotion to be expressed and make the conversation more engaging by sending expressions.
At present, when sending an expression, a terminal can recommend expressions matching a text message to the user based on the text message input by the user; the user selects the expression to be sent from the recommended expressions, and the terminal sends the selected expression in response to the user's selection operation.
In this implementation, expression recommendation can only be performed based on a text message input by the user, so the user has to type text before seeing the recommended expressions. The operation is cumbersome, the human-computer interaction efficiency is low, and the user experience is poor.
Disclosure of Invention
The application provides a session message processing method, a session message processing device, a computer device and a readable storage medium, which can improve the human-computer interaction efficiency in the session process, thereby improving the user experience. The technical scheme is as follows:
in one aspect, a method for processing a session message is provided, where the method includes:
responding to the received input voice message of any one participant user of the target conversation, and identifying the input voice message to obtain a text label corresponding to the input voice message;
determining, according to the text label corresponding to the input voice message, at least one emoticon matched with the text label as the at least one emoticon matched with the input voice message;
and sending the at least one expression picture to the terminal of the participating user.
In a possible implementation manner, the determining, according to a text tag corresponding to the input voice message, at least one emoticon matching the text tag includes:
determining at least one expression label, the similarity of which to the text label corresponding to the input voice information meets a preset condition, in an expression label library;
and determining at least one emoticon corresponding to the at least one expression label as at least one emoticon matched with the text label.
In one possible implementation, the method further includes:
and based on the playing information, coding the input voice message and any expression picture through a target coder to obtain a target message for sending, wherein the target coder is used for coding the input voice message and the expression picture together, and the target message comprises the coded input voice message and any expression picture.
In one aspect, a method for processing a session message is provided, where the method includes:
acquiring an input voice message of a target session;
acquiring at least one expression picture matched with the input voice message;
displaying the at least one expression picture to be selected;
and responding to the selection operation of any expression picture in the at least one expression picture, and sending the any expression picture and the input voice message to the target conversation.
In a possible implementation manner, after the displaying the at least one emoticon to be selected in the form of a thumbnail in the sub-area of the voice input area, the method further includes:
and responding to the selection operation of any thumbnail in the thumbnails of the at least one expression picture, and acquiring the expression picture corresponding to any thumbnail.
In a possible implementation manner, the expression picture is a dynamic expression picture.
In one aspect, a session message processing apparatus is provided, where the apparatus includes:
the recognition module is used for responding to the received input voice message of any participant user of the target conversation, recognizing the input voice message and obtaining a text label corresponding to the input voice message;
the picture determining module is used for determining at least one expression picture matched with the text label according to the text label corresponding to the input voice message, and the at least one expression picture is used as the at least one expression picture matched with the input voice message;
and the picture sending module is used for sending the at least one expression picture to the terminal of the participating user.
In a possible implementation manner, the recognition module is configured to perform voice recognition on the input voice message to obtain a text content corresponding to the input voice message, and perform semantic recognition on the text content to obtain a text tag corresponding to the text content, where the text tag is used as the text tag corresponding to the input voice message.
In a possible implementation manner, the image determining module is configured to determine, in an expression tag library, at least one expression tag whose similarity to a text tag corresponding to the input voice information satisfies a preset condition, and determine at least one expression image corresponding to the at least one expression tag as at least one expression image matched with the text tag.
In one possible implementation, the apparatus further includes:
the receiving module is used for receiving any expression picture sent by the terminal of the participating user based on the selection operation of any expression picture in the at least one expression picture;
and the information determining module is used for determining the playing information of any expression picture according to the duration of the input voice message, wherein the playing information is used for indicating the playing times and the playing speed of any expression picture.
In one possible implementation, the apparatus further includes:
the information determining module is further configured to determine playing information of the at least one emoticon according to the duration of the input voice message, where the playing information is used to indicate the playing times and the playing speed of the at least one emoticon;
and the information sending module is used for sending the at least one expression picture and the playing information of the at least one expression picture to the terminal of the participating user.
In one possible implementation, the information determination module includes a number determination module and a speed determination module;
the number determining module is used for determining the playing number of the expression picture in the duration of the input voice message according to the duration of the input voice message and the playing duration required by playing the expression picture once, and the speed determining module is used for determining the playing speed of the expression picture based on the playing number.
In a possible implementation manner, the number-of-times determining module is configured to determine a ratio between a duration of the input voice message and a playing duration required for playing the expression picture once, determine the ratio as the playing number of times of the expression picture in the duration of the input voice message if the ratio is an integer value, perform an integer on the ratio if the ratio is not an integer value, and determine the ratio after the integer is the playing number of times of the expression picture in the duration of the input voice message.
In one possible implementation, the apparatus further includes:
and the coding module is used for coding the input voice message and any expression picture through a target coder based on the playing information to obtain a target message for sending, the target coder is used for coding the input voice message and the expression picture together, and the target message comprises the coded input voice message and any expression picture.
In one aspect, a session message processing apparatus is provided, the apparatus including:
the message acquisition module is used for acquiring the input voice message of the target session;
the first picture acquisition module is used for acquiring at least one expression picture matched with the input voice message;
the first display module is used for displaying the at least one expression picture to be selected;
and the sending module is used for responding to the selection operation of any expression picture in the at least one expression picture and sending the any expression picture and the input voice message to the target conversation.
In a possible implementation manner, the first picture obtaining module is configured to send the input voice message to a server, and receive at least one emoticon sent by the server and matched with a text tag of the input voice message; or, recognizing the input voice message to obtain at least one expression picture matched with the input voice message.
In one possible implementation, the interface corresponding to the target session includes a voice input area;
the first display module is used for displaying the at least one expression picture to be selected in a thumbnail mode in the sub-area of the voice input area.
In one possible implementation, the apparatus further includes:
and the second picture acquisition module is used for responding to the selection operation of any thumbnail in the thumbnails of the at least one expression picture and acquiring the expression picture corresponding to the thumbnail.
In one possible implementation, the interface corresponding to the target session includes a session display area;
the device also includes:
and the second display module is used for displaying the any expression picture and the input voice message in the conversation display area.
In a possible implementation manner, the second display module is configured to display the input voice message at a target position of the conversation display area, and display a target animation based on the target position corresponding to the input voice message, where the target animation is used to represent a display effect of any emoticon from being occluded to being revealed by the input voice message and from being revealed to being occluded again by the input voice message.
In a possible implementation manner, the expression picture is a dynamic expression picture.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the conversational message processing method.
In an aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, causing the computer device to perform the operations performed by the conversation message processing method.
According to the scheme, when an input voice message of any participant user of the target session is received, the input voice message is recognized to obtain a text label capable of expressing the semantic tendency of the input voice message, and at least one expression picture capable of visually expressing that semantic tendency is obtained as the expression picture to be recommended to the user. The obtained at least one expression picture is sent to the participant user's terminal, so that expression recommendation is performed based on the voice message input by the user. After inputting a voice message, the participant user can see the recommended expression pictures on the terminal without having to type any text, which simplifies the operation process, improves the human-computer interaction efficiency, and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a session message processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a session message processing method according to an embodiment of the present application;
fig. 3 is a flowchart of a session message processing method according to an embodiment of the present application;
fig. 4 is a flowchart of a session message processing method according to an embodiment of the present application;
FIG. 5 is a schematic interface diagram of a target session according to an embodiment of the present disclosure;
fig. 6 is a schematic interface diagram of a target session in a recording process according to an embodiment of the present application;
fig. 7 is a flowchart of a method for recommending an emoticon according to an embodiment of the present application;
fig. 8 is a schematic interface diagram for displaying an emoticon according to an embodiment of the present application;
fig. 9 is a schematic diagram illustrating a method for determining expression picture playing information according to an embodiment of the present application;
FIG. 10 is a diagram illustrating encoding value ranges of various types of messages according to an embodiment of the present application;
FIG. 11 is a diagram illustrating a range of encoded values of a target encoder according to an embodiment of the present application;
fig. 12 is a schematic diagram of a processing procedure for inputting a voice message and a dynamic emoticon according to an embodiment of the present application;
fig. 13 is a schematic diagram of a dynamic expression display method according to an embodiment of the present application;
fig. 14 is a schematic diagram of a dynamic expression display method according to an embodiment of the present application;
fig. 15 is a schematic diagram of a dynamic expression display method according to an embodiment of the present application;
fig. 16 is a schematic diagram of a dynamic expression display method according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a session message processing apparatus according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a session message processing apparatus according to an embodiment of the present application;
fig. 19 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a session message processing method provided in an embodiment of the present application, and referring to fig. 1, the implementation environment includes: a terminal 101 and a server 102.
The terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart television, and the like. The terminal 101 and the server 102 can be directly or indirectly connected by wired or wireless communication, and the present application is not limited thereto. Various social software, such as instant messaging software, forum software, and the like, may be installed on the terminal 101. The user can communicate with other participating users through the social software: the terminal 101 can display a conversation interface, and the user can input text, voice, or the like in the conversation interface to communicate with other participating users. The user can trigger a voice input request by triggering a recording button of the session interface, and the terminal 101 can record the voice of the user through the microphone assembly in response to the voice input request, generate a corresponding input voice message, and then send the input voice message to the server 102, so that the server 102 sends the input voice message to the terminals of other participating users. Optionally, the microphone assembly may be either built into or externally connected to the terminal 101, which is not limited in the present application. The user can also input text to be sent in an input box of the session interface and trigger the send button after the input is completed; the terminal 101 responds to the user's triggering operation, obtains the text message input by the user, and sends the text message to the server 102, and the server 102 sends the text message to the terminals of other participating users. The terminal 101 may be associated with a plurality of expression pictures, such as expression pictures used or downloaded by the user in the past. Optionally, after acquiring the text message input by the user, the terminal 101 may acquire at least one expression picture matched with the text message, such as a dynamic Graphics Interchange Format (GIF) expression picture, a static GIF expression picture, or an Emoji expression, and display the at least one expression picture; the user may select one expression picture from the displayed at least one expression picture, the terminal 101, in response to the user's selection operation, replaces the acquired text message with the selected expression picture and sends the selected expression picture to the server 102, and the server 102 sends the expression picture to the terminals of other participating users. In addition, the user can also trigger an expression display button of the session interface; the terminal 101 can respond to the user's trigger operation to acquire expression pictures and display the acquired expression pictures on the session interface, the user selects one expression picture from the displayed expression pictures, and the terminal 101 can respond to the user's selection operation and send the selected expression picture to the server 102, and the server 102 sends the expression picture to the terminals of other participating users. Optionally, the user may also trigger an expression download request by triggering an expression download button of the session interface, and acquire expression pictures stored in the server 102 through the expression download request.
The terminal 101 may be generally referred to as one of a plurality of terminals, and the embodiment is illustrated with the terminal 101. Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals is only one, or the number of the terminals is several tens or several hundreds, or more, and the number of the terminals and the type of the device are not limited in the embodiment of the present application.
The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 102 and the terminal 101 can be directly or indirectly connected by wired or wireless communication, and the present application is not limited thereto. The server 102 can receive voice messages, text messages, expression picture messages, and the like transmitted from the terminal 101, and then transmit the received messages to the terminals of the respective participating users. The server 102 can be associated with an expression picture database for storing various types of expression pictures, such as dynamic GIF expression pictures, static GIF expression pictures, Emoji expressions, and the like. The server 102 can respond to an expression download request sent by the terminal 101, acquire the expression picture corresponding to the expression identifier carried by the download request, and then send the expression picture to the terminal 101. Optionally, the number of servers may be more or fewer, which is not limited in the embodiment of the present application. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
Fig. 2 is a flowchart of a session message processing method provided in an embodiment of the present application, and referring to fig. 2, the method includes:
201. the terminal acquires an input voice message of a target session.
It should be noted that, optionally, the target session is a session between any two users, or the target session is a session between any multiple users, and the any two or multiple users can input a voice message through their respective terminals. The user can trigger a voice input request by triggering a recording button in an interface corresponding to the target session, and the terminal can record the voice of the user through the microphone assembly in response to the voice input request so as to acquire the input voice message of the user based on the target session.
Optionally, the microphone assembly is externally disposed on the terminal, or the microphone assembly is internally connected to the terminal, which is not limited in this embodiment of the application.
202. And the terminal acquires at least one expression picture matched with the input voice message.
It should be noted that the expression picture is a dynamic expression picture, such as a GIF dynamic expression picture. Optionally, the expression picture may also be of other types, which is not limited in this embodiment of the application. After the terminal acquires the input voice message, the terminal can send the input voice message to the server, the server acquires at least one expression picture matched with the input voice message, and the terminal receives the at least one expression picture sent by the server, namely, the terminal can acquire the at least one expression picture matched with the input voice message.
203. And the terminal displays the at least one expression picture to be selected.
The at least one expression picture to be selected is displayed so that the user can choose, from the displayed at least one expression picture, the expression picture that the user wants to send.
204. And the terminal responds to the selection operation of any expression picture in the at least one expression picture and sends the any expression picture and the input voice message to the target conversation.
It should be noted that the user can trigger the selection operation by clicking any one of the at least one expression picture to be selected, and the terminal can send the selected expression picture and the input voice message in response to the selection operation. Sending the expression picture and the input voice message to the target session means displaying the expression picture and the input voice message in the interface corresponding to the target session.
According to the scheme provided by the embodiment of the application, the input voice message of the target session is acquired, at least one expression picture matched with the input voice message is acquired, and the at least one expression picture to be selected is displayed, so that expression pictures are recommended based on the voice message input by the user and the user can select the expression picture to be sent from the at least one expression picture. In response to a selection operation on any one of the at least one expression picture, that expression picture and the input voice message are sent to the target session, so that the input voice message and the expression picture are displayed. Expression picture recommendation is thus achieved without the user having to type text, which improves the human-computer interaction efficiency and the user experience.
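The terminal-side flow of fig. 2 can be summarized by the following minimal sketch. The four callables passed in (record_voice, request_matching_pictures, show_candidates and send_to_session) are hypothetical stand-ins for the terminal's recording, networking and display layers; they are not defined in the application.

```python
def on_voice_input(target_session, record_voice, request_matching_pictures,
                   show_candidates, send_to_session):
    """Sketch of steps 201-204: record a voice message, fetch matching
    expression pictures, let the user pick one, and send both to the session."""
    voice_message = record_voice()                          # step 201
    candidates = request_matching_pictures(voice_message)   # step 202 (e.g. via the server)
    chosen = show_candidates(candidates)                    # step 203: display candidates, user selects
    if chosen is not None:                                  # step 204: send picture + voice together
        send_to_session(target_session, voice_message, chosen)
```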
Fig. 3 is a flowchart of a session message processing method provided in an embodiment of the present application, and referring to fig. 3, the method includes:
301. the server responds to the received input voice message of any one participant user of the target conversation, identifies the input voice message, and obtains a text label corresponding to the input voice message, wherein the text label is used for expressing the semantics of the input voice message.
It should be noted that, when the input voice message is recognized, the input voice message may be first subjected to voice recognition to obtain text content of the input voice message, and then subjected to semantic recognition to obtain a text tag corresponding to the input voice message. Optionally, other ways may also be adopted to recognize the input voice message, which is not limited in this embodiment of the application.
The text label corresponding to the input voice message can be obtained by recognizing the input voice message, so that expression picture recommendation can be performed based on the text label, thereby realizing expression picture recommendation based on the input voice message.
302. And the server determines at least one emoticon matched with the text label according to the text label corresponding to the input voice message, and the emoticon is used as the at least one emoticon matched with the input voice message.
303. And the server sends the at least one expression picture to the terminal of the participating user.
According to the scheme provided by the embodiment of the application, when an input voice message of any participant user of a target session is received, the input voice message is recognized to obtain a text label capable of expressing the semantic tendency of the input voice message, and at least one expression picture capable of visually expressing that semantic tendency is obtained as the expression picture to be recommended to the user. The obtained at least one expression picture is sent to the participant user's terminal, so that expression recommendation is performed based on the voice message input by the user. After inputting a voice message, the participant user can see the recommended expression pictures on the terminal without having to type any text, which simplifies the operation process, improves the human-computer interaction efficiency, and improves the user experience. For example, through this scheme, a user who has difficulty typing can quickly input expression pictures that express semantics or emotions, so that the messages the user sends to friends carry emotion rather than being plain, impersonal text, and the friends can intuitively understand, through the expression pictures, what the user wants to express or the mood of the user when sending the messages.
Fig. 4 is a flowchart of a session message processing method provided in an embodiment of the present application, and referring to fig. 4, the method includes:
401. the terminal acquires an input voice message of a target session.
In one possible implementation, the interface for the target conversation includes a conversation display area, a text entry area, a function selection area, and a voice entry area. The voice input area is provided with a recording button, a user can trigger a voice input request by triggering the recording button, the terminal can respond to the voice input request, record the voice of the user through the microphone assembly, and encode the recorded voice to obtain an input voice message. For example, the interface of the target session may refer to fig. 5, and fig. 5 is an interface schematic diagram of a target session provided in an embodiment of the present application, where the interface of the target session includes a session display area 501, a text input area 502, a function selection area 503, and a voice input area 504, and the voice input area 504 is provided with a record button, and a user can enable the terminal to obtain an input voice message of the user by triggering the record button. A user can enter a voice for a certain duration by pressing the recording button for a long time, an interface schematic diagram in the recording process can be as shown in fig. 6, fig. 6 is an interface schematic diagram of a target session in the recording process provided in the embodiment of the present application, an interface of the target session includes a session display area 601, a text input area 602, a function selection area 603, and a voice input area 604, the user can record a voice by pressing a recording button 605 set in the voice input area 604 for a long time, after the recording is completed, the recording button 605 can be released, and after detecting that the user releases the button, the terminal can acquire an input voice message corresponding to the voice recorded by the user.
It should be noted that, optionally, the interface of the target session further includes other areas, which is not limited in this embodiment of the application.
402. The terminal transmits the input voice message to the server.
It should be noted that, after sending the input voice message to the server, the terminal does not need to display the input voice message in the session display area immediately. After the processing in steps 403 to 413 is completed, the terminal can display the input voice message and the target message containing the expression picture when the server delivers them, so that the input voice message and the expression picture are displayed in an integrated manner.
403. The server receives the input voice message, responds to the received input voice message, and carries out voice recognition on the input voice message to obtain text content corresponding to the input voice message.
In a possible implementation manner, after receiving the input voice message, the server may obtain the spectral features of the input voice message and input the spectral features into the voice recognition model, obtain higher-level features from the spectral features through the convolutional layers of the voice recognition model, determine the phoneme information of the input voice message based on these features, then determine, through the language model, the words constituting the text content corresponding to the input voice message based on the correspondence between phoneme information and pronunciations in a dictionary, and sort the words based on the sequence corresponding to the phoneme information, thereby obtaining the text content corresponding to the input voice message.
The speech recognition model is a Convolutional Neural Network (CNN) model, and optionally, the speech recognition model may be of another type, which is not limited in this embodiment of the present application.
When obtaining the spectral features of the input voice message, the server may perform a fast Fourier transform on each voice frame of the input voice message to obtain the spectrum of the input voice message, take the squared magnitude of the spectrum of each voice frame to obtain the power spectrum of the input voice message, take the logarithm of the power spectrum of each voice frame and apply an inverse transform to obtain the cepstral coefficients of the input voice message, and use the cepstral coefficients as the spectral features of the input voice message.
Optionally, the cepstral coefficients are Linear Prediction Cepstral Coefficients (LPCC) or Mel-Frequency Cepstral Coefficients (MFCC), which is not limited in this embodiment.
It should be noted that the server can also perform preprocessing, such as pre-filtering, sampling and quantizing, windowing, endpoint detection, pre-emphasis, etc., on the input voice message before acquiring the spectral features of the input voice message. By preprocessing the input voice information, the quality of the input voice information can be improved, the accuracy of the acquired frequency spectrum characteristics is ensured, and the accuracy of voice recognition is improved.
The above process is provided as an exemplary speech recognition method, and in more possible implementation manners, the server may also perform speech recognition on the input speech message in other manners, and the embodiment of the present application does not limit which manner is specifically adopted.
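As an illustration of the frame-level feature extraction described above, the following is a minimal sketch using numpy and scipy. The frame length, hop size, number of coefficients and Hamming window are illustrative assumptions (roughly 25 ms frames with a 10 ms hop at 16 kHz), not values specified in the application.

```python
import numpy as np
from scipy.fft import dct

def cepstral_features(signal, frame_len=400, hop=160, n_coeffs=13):
    """Per-frame cepstral features: window -> FFT -> power spectrum -> log -> DCT."""
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)                           # windowing (preprocessing step)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)                        # fast Fourier transform
        power = np.abs(spectrum) ** 2                        # power spectrum of the frame
        log_power = np.log(power + 1e-10)                    # log power spectrum
        cepstrum = dct(log_power, norm='ortho')[:n_coeffs]   # cepstral coefficients
        features.append(cepstrum)
    return np.stack(features)                                # shape: (num_frames, n_coeffs)
```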
404. And the server carries out semantic recognition on the text content to obtain a text label corresponding to the text content, and the text label is used as a text label corresponding to the input voice message.
It should be noted that, when performing semantic recognition on text content, a semantic recognition model may be used. The semantic recognition model can be used for determining the semantics or emotion of the text content, and further obtaining a text label used for representing the semantics or emotion of the input voice message.
In a possible implementation manner, the server inputs the text content into a semantic recognition model, extracts text features of the text content through the semantic recognition model, and determines, based on the text features and their context features, a text label corresponding to the text content as the text label of the input voice message. The semantic recognition model is a CNN model; optionally, the semantic recognition model may be of another type, which is not limited in the embodiment of the present application. For example, through the semantic recognition model, the server may determine the emotion of the user when speaking the voice message, such as whether the user was happy or angry, based on the context features and tone keywords, and then obtain a text label representing the emotion of the input voice. The server may also determine the semantics of the input voice message through the semantic recognition model. For example, for an input voice message whose text content is "Happy Birthday", the server performs semantic recognition on the text content to obtain the text label "Happy Birthday", and then performs expression picture matching based on the text label "Happy Birthday".
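For illustration only, the following toy stand-in shows the input/output shape of this step: recognized text goes in, a text label describing its semantics or emotion comes out. The keyword lexicon is a hypothetical placeholder for the CNN-based semantic recognition model described above.

```python
# Hypothetical keyword lexicon standing in for the semantic recognition model.
LEXICON = {
    "happy birthday": "happy birthday",
    "haha": "laugh",
    "so sad": "sad",
}

def text_label(text_content: str) -> str:
    """Map recognized text content to a text label for its semantics/emotion."""
    lowered = text_content.lower()
    for keyword, label in LEXICON.items():
        if keyword in lowered:
            return label
    return "neutral"   # fallback label; an assumption, not from the application
```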
405. And the server determines at least one expression label, the similarity of which with the text label corresponding to the input voice information meets the preset condition, in the expression label library.
It should be noted that the server can be associated with an emoticon tag library, where the emoticon tag library stores a plurality of emoticon tags, such as happy, sad, emotional, and the like, and optionally, the emoticon tags may further include other contents, which is not limited in this embodiment of the application.
In a possible implementation manner, the server can calculate similarity between the text label and the emoji labels stored in the emoji label library, and determine at least one emoji label whose similarity to the text label satisfies a preset condition. The preset condition is that the similarity is greater than a preset threshold, and optionally, the preset threshold may be any value, which is not limited in the embodiment of the present application.
In another possible implementation manner, after the similarity between the text label and the expression label is calculated, the server may sort the similarities in an order from large to small, and further obtain at least one expression label whose similarity with the text label meets a preset condition. Wherein the preset condition is that the similarity is ranked before the target position.
It should be noted that, after determining at least one emoji tag, the server can also determine the recommendation index of the at least one emoji tag. In a possible implementation manner, the server can determine the recommendation order of each emoji tag according to the similarity between each emoji tag and the text tag, that is, sort the at least one emoji tag according to the sequence of the similarity from high to low, and set the recommendation index of each emoji tag based on the sorting result to obtain the recommendation index of the at least one emoji tag.
For a plurality of expression labels with the same similarity, the server can sort these expression labels according to the semantic range or emotion range indicated by each expression label and the semantic range or emotion range of the text label, so as to determine the recommendation index of each expression label. For example, the server sets the recommendation order of the expression labels whose indicated semantic range or emotion range is the same as that of the text label to the smallest value, sets the recommendation order of the expression labels whose indicated semantic range or emotion range is larger than that of the text label to a larger value, and sets the recommendation order of the expression labels whose indicated semantic range or emotion range is smaller than that of the text label to the largest value, thereby obtaining the ranking order of each expression label; the server then sets the recommendation index of labels ranked earlier to larger values and the recommendation index of labels ranked later to smaller values, thereby obtaining the recommendation index of each expression label. For example, if the semantic range or emotion range indicated by the text label is denoted by A, and the semantic range or emotion range indicated by an expression label is denoted by B, then the recommendation index of an expression label with A = B is the largest, the recommendation index of an expression label with A < B is the second largest, and the recommendation index of an expression label with A > B is the smallest. For example, if the text label is "laugh", after the similarity between the text label and the expression labels is calculated, it is determined that the expression label "haha" and the expression label "happy" have the same similarity; the semantic range indicated by the expression label "happy" is greater than the semantic range indicated by the text label "laugh", while the semantic range indicated by the expression label "haha" is smaller than the semantic range indicated by the text label "laugh", so the recommendation index of the expression label "happy" can be set to a larger value and the recommendation index of the expression label "haha" can be set to a smaller value.
It should be noted that the processes of steps 403 to 405 may refer to the flowchart shown in fig. 7. Fig. 7 is a flowchart of a method for recommending expression pictures provided in this embodiment of the present application. The server recognizes the acquired input voice message 701 through step 702 to obtain a text label 703, where A denotes the semantic range or emotion range indicated by the text label 703. Through step 705, the text label A is matched with a plurality of expression labels in the expression label library 704, where B denotes the semantic range or emotion range indicated by an expression label, and the expressions to be recommended are determined according to the relationship between A and B: the No. 1 expression picture 707 and the No. 2 expression picture 709 corresponding to the A = B relationship in 706 and 708 are recommended first, the No. 3 expression picture 711 corresponding to the A < B relationship in 710 is recommended next, and the No. 4 expression picture 713 corresponding to the A > B relationship in 712 is recommended after that. By analogy, the first 30 expression pictures 714 to be recommended are obtained, and the 30 expression pictures are then recommended through step 715. Optionally, more or fewer expression pictures may be recommended, which is not limited in the embodiment of the present application.
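A minimal sketch of the matching and ranking logic of step 405 and fig. 7 follows. The similarity function, the per-label semantic/emotion range ("scope"), and the threshold are illustrative assumptions (for example, an embedding cosine similarity); the application does not specify them.

```python
def rank_expression_labels(text_label, label_library, similarity, scope, threshold=0.6):
    """Keep expression labels whose similarity to the text label passes the
    threshold, order them by similarity (descending) and, for ties, by the
    relation between the text label's range A and the expression label's
    range B (A = B first, A < B next, A > B last), then assign recommendation
    indexes so that earlier-ranked labels get larger values."""
    a = scope(text_label)

    def tie_break(label):
        b = scope(label)
        return 0 if a == b else (1 if a < b else 2)

    candidates = [lbl for lbl in label_library if similarity(text_label, lbl) >= threshold]
    ranked = sorted(candidates,
                    key=lambda lbl: (-similarity(text_label, lbl), tie_break(lbl)))
    return {lbl: len(ranked) - i for i, lbl in enumerate(ranked)}
```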
406. And the server determines at least one emoticon corresponding to the at least one expression label as at least one emoticon matched with the text label.
It should be noted that the server may further be associated with an expression picture database, where the expression picture database stores a plurality of expression pictures, and the expression pictures are stored according to the expression labels corresponding to them. For example, a plurality of expression pictures are stored under the "happy" label and under the "sad" label respectively, so that the expression pictures are stored by category. By storing the plurality of expression pictures by category, the server can directly acquire the corresponding expression pictures based on an expression label without searching the expression label of each expression picture one by one, which reduces the processing pressure on the server and increases the processing speed of the server.
In a possible implementation manner, the server obtains, according to the at least one expression label determined in step 405, at least one expression picture corresponding to the at least one expression label from the expression picture database, and uses the at least one expression picture as the at least one expression picture matched with the text label.
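The label-keyed storage described above can be pictured with the following in-memory sketch; the dictionary and file names are hypothetical stand-ins for the expression picture database, not structures defined in the application.

```python
from collections import defaultdict

# Hypothetical expression picture database, keyed by expression label.
PICTURE_DB = defaultdict(list)
PICTURE_DB["happy"] = ["happy_1.gif", "happy_2.gif"]
PICTURE_DB["laugh"] = ["laugh_1.gif"]

def pictures_for_labels(expression_labels):
    """Collect the expression pictures stored under each matched label,
    without scanning the label of every picture one by one."""
    matched = []
    for label in expression_labels:
        matched.extend(PICTURE_DB[label])
    return matched
```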
407. And the server sends the at least one expression picture to the terminal.
It should be noted that, the foregoing steps 402 to 407 are described by taking an example of recognizing an input voice message through interaction between a terminal and a server, and in more possible implementation manners, the terminal may also directly recognize the input voice message to obtain a text tag corresponding to the input voice message, and further determine, according to the text tag corresponding to the input voice message, at least one expression picture matched with the text tag from expression pictures (for example, expression pictures downloaded by a user or having access rights) associated with the terminal, as at least one expression picture matched with the input voice message, so as to implement expression recommendation based on the input voice message, where a specific process is the same as the foregoing steps 402 to 407, and is not described here again. The terminal recommends the emoticons by itself without interaction with the server, so that the time required by recommending the emoticons can be reduced, and the conversation processing speed is increased.
Optionally, the terminal may also send the text label to the server after recognizing it, and the server recommends expression pictures based on the received text label. Because the expression picture database associated with the server contains more expression pictures, having the server perform the recommendation can provide the user with more numerous and more diverse expression pictures, improving the recommendation effect and further improving the user experience.
408. And the terminal receives at least one expression picture which is sent by the server and matched with the text label of the input voice message, and displays the at least one expression picture to be selected.
In one possible implementation manner, the terminal displays the at least one emoticon to be selected in a thumbnail form in a sub-area of the voice input area.
When the expression pictures are displayed in the form of thumbnails, the terminal can compress the received at least one expression picture to obtain the thumbnails of the at least one expression picture, so that the thumbnails of the at least one expression picture are displayed, and the expression pictures are displayed in the form of thumbnails.
When the at least one expression picture is displayed in the form of the thumbnail, a sliding function can be provided in a sub-area of the voice input area, and then the thumbnails of the target number can be displayed first, a user can perform a sliding operation to the left or the right in the sub-area, and the terminal can respond to the sliding operation of the user and then display other thumbnails of the target number. For example, the terminal may display the thumbnails corresponding to the expression pictures with the higher recommendation indexes of the target number first, the user may perform a rightward operation in the sub-area, and the terminal may respond to a sliding operation of the user and then display the thumbnails corresponding to the expression pictures with the next recommendation indexes of the target number, and so on, that is, the thumbnail corresponding to the at least one expression picture may be displayed. Referring to fig. 8, fig. 8 is an interface schematic diagram for displaying expression pictures provided in an embodiment of the present application, after acquiring a thumbnail of at least one expression picture, a terminal can display, in a sub-area below a voice input area 801, thumbnails corresponding to expression pictures with higher recommendation indexes, that is, a thumbnail 802 of a dynamic expression picture 1, a thumbnail 803 of a dynamic expression picture 2, a thumbnail 804 of a dynamic expression picture 3, a thumbnail 805 of a dynamic expression picture 4, and a thumbnail 806 of a dynamic expression picture 5, first, a user performs a right slide operation in the sub-area, and the terminal can respond to a slide operation of the user and then display thumbnails corresponding to 5 expression pictures with the next recommendation index. Optionally, the terminal may also display the thumbnail of the at least one emoticon in other manners, which is not limited in this embodiment of the application.
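The sliding display described above amounts to simple paging over the recommended thumbnails; a minimal sketch follows, where the page size of 5 mirrors the five thumbnails shown in fig. 8 and is an illustrative assumption.

```python
def thumbnail_page(thumbnails, page, page_size=5):
    """Return the thumbnails to show after `page` swipes; thumbnails are
    assumed to be ordered by recommendation index (highest first)."""
    start = page * page_size
    return thumbnails[start:start + page_size]
```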
409. And the terminal responds to the selection operation of any expression picture in the at least one expression picture and acquires the any expression picture corresponding to the selection operation.
In a possible implementation manner, a user can trigger a selection operation on any thumbnail by clicking any thumbnail in the thumbnails of the at least one expression picture, and the terminal can respond to the selection operation on any thumbnail in the thumbnails of the at least one expression picture to acquire an expression picture corresponding to any thumbnail, that is, any expression picture corresponding to the selection operation.
410. And the terminal sends an expression publishing request to the server, wherein the expression publishing request carries the expression identifier of any expression picture.
In a possible implementation manner, after the terminal acquires the expression picture corresponding to any thumbnail, the terminal may acquire the expression identifier of the expression picture, and further generate an expression publishing request based on the expression identifier, and send the expression publishing request to the server, so that the server determines the selected expression picture according to the expression identifier carried by the received expression publishing request.
411. And the server receives the expression publishing request, and determines the playing information of the expression picture corresponding to the expression identifier carried by the expression publishing request according to the duration of the input voice message, wherein the playing information is used for indicating the playing times and the playing speed of the expression picture.
It should be noted that the server can identify (or record) the playing time length required for playing the received emoticon once, so as to determine the playing information of the emoticon based on the playing time length required for playing the emoticon once.
In a possible implementation manner, the server determines, according to the expression identifier carried by the expression publishing request, the expression picture corresponding to the expression identifier in the expression picture database, then determines, according to the duration of the input voice message and the playing duration required for playing the expression picture once, the number of times the expression picture is played within the duration of the input voice message, and then determines the playing speed of the expression picture based on the number of plays.
It should be noted that, when determining the number of times the expression picture is played within the duration of the input voice message, the server can determine the number of plays by computing the ratio of the duration of the input voice message to the playing duration required for playing the expression picture once. For example, the server can determine the number of plays by the following formula (1):
Z = X / Y    (1)
where X represents the duration of the input voice message, Y represents the playing duration required for playing the expression picture once, and Z represents the number of times the expression picture is played within the duration of the input voice message.
It should be noted that, after the ratio between the duration of the input voice message and the playing duration required for playing the expression picture once is obtained, if the ratio is an integer value, the expression picture can be played an integer number of times within the duration of the input voice message, and the ratio can therefore be determined as the number of times the expression picture is played within the duration of the input voice message. If the ratio is not an integer value, the ratio can be rounded, and the rounded ratio is determined as the number of times the expression picture is played within the duration of the input voice message. That is, if the digit after the decimal point of the ratio is less than 5, the ratio can be rounded down to obtain the number of plays of the expression picture within the duration of the input voice message; if the digit after the decimal point of the ratio is greater than 5, the ratio can be rounded up to obtain the number of plays of the expression picture within the duration of the input voice message.
It should be noted that, after the number of times the expression picture is played within the duration of the input voice message is determined, if the ratio is an integer value, the server directly uses the playing speed corresponding to the playing duration required for playing the expression picture once as the playing speed of the expression picture. If the ratio is not an integer value, the server may derive from the above formula (1) a new playing duration Y' required for playing the expression picture once, and determine the playing speed corresponding to Y' as the playing speed of the expression picture, so that, by increasing or decreasing the playing speed of the expression picture, the actual duration of one playback becomes Y' instead of Y. Referring to fig. 9, fig. 9 is a schematic diagram of a method for determining expression picture playing information according to an embodiment of the present application. For an input voice message 901, the expression picture can be completely played 2 times within the duration of the input voice message, but the 2 playbacks would finish before the input voice message finishes playing; the server can therefore adjust the playing speed of the expression picture through the above process, so that the 2 playbacks of the expression picture end exactly when the input voice message finishes playing.
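For illustration only, the play-count and play-speed computation described in step 411 and formula (1) can be sketched in Python as follows. The function and field names are assumptions made for this sketch, durations are assumed to be in seconds, exact halves are assumed to round up, and the expression picture is assumed to be played at least once:

import math
from dataclasses import dataclass


@dataclass
class PlayInfo:
    play_count: int       # Z: number of times the expression picture is played
    loop_duration: float  # Y': adjusted duration of one playback, in seconds
    speed_factor: float   # Y / Y': > 1 means play faster, < 1 means play slower


def compute_play_info(voice_duration: float, picture_duration: float) -> PlayInfo:
    """voice_duration is X and picture_duration is Y in formula (1)."""
    ratio = voice_duration / picture_duration        # Z = X / Y before rounding
    play_count = max(1, math.floor(ratio + 0.5))     # round to nearest, halves up
    adjusted = voice_duration / play_count           # Y', so that Z * Y' == X exactly
    return PlayInfo(play_count, adjusted, picture_duration / adjusted)


# A 7-second voice message with a 3-second picture: 7 / 3 ≈ 2.33, so the picture
# is played twice and each playback is slowed to 3.5 s (speed factor ≈ 0.86).
print(compute_play_info(7.0, 3.0))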
By determining the playing times and the playing speed of any expression picture, an expression picture matched with the duration of the input voice message can always be obtained, so that when the input voice message and the expression picture are subsequently played, the expression picture can be completely played multiple times within the duration of the input voice message, which improves the playing effect and further improves the user experience.
It should be noted that, in the above steps 406 to 411, an alternative flow is also possible: after determining the at least one expression picture matched with the text tag, the server can directly determine the playing information of the at least one expression picture according to the duration of the input voice message, and send the at least one expression picture together with its playing information to the terminal of the participating user. The terminal displays the at least one expression picture to be selected, the user selects an expression picture to be sent from the at least one expression picture, and the terminal, in response to the selection operation on any expression picture, sends that expression picture and its playing information to the server, which then directly performs further processing through the following step 412. For the specific process of the server determining the playing information of the at least one expression picture and the specific process of the terminal displaying the at least one expression picture to be selected, reference may be made to steps 406 to 411, which are not described herein again. Because the server directly determines the playing information of the at least one expression picture and then sends the expression pictures and the playing information to the terminal together, the number of interactions between the terminal and the server can be reduced, and the processing speed of the session message is improved.
412. The server encodes the input voice message and the any expression picture through a target encoder based on the playing information to obtain a target message for sending, wherein the target encoder is used for encoding the input voice message and the expression picture together, and the target message comprises the encoded input voice message and the encoded expression picture.
It should be noted that, during encoding, the encoder assigns a different encoding value to each character, and different types of messages contain different characters, so that during decoding the message type to which a character belongs can be determined from the numerical range of its encoding value. Because the numerical range of the encoding values of the dynamic expression picture differs greatly from those of English, Chinese, Emoji expressions and voice messages, a conventional encoder cannot encode them together, and a target encoder is therefore needed to encode them jointly. Fig. 10 is a schematic diagram of the encoding value ranges of various messages according to an embodiment of the present application. Referring to fig. 10, the encoding value range corresponding to English 1001 is 0 to 1000, the range corresponding to Chinese 1002 is 1000 to 2000, the range corresponding to Emoji expressions 1003 is 2000 to 3000, the range corresponding to the voice message is 3000 to 5000, and the range corresponding to the dynamic expression picture is 5000 to 10000. Referring to fig. 11, fig. 11 is a schematic diagram of the encoding value range of a target encoder according to an embodiment of the present application; the target encoder can simultaneously encode English 1101, Chinese 1102, Emoji expressions 1103, the voice message 1104 and the dynamic expression picture 1105, so as to obtain a target message that simultaneously includes the voice message and the dynamic expression picture. By sending the target message, integrated sending of the voice message and the dynamic expression picture can be implemented.
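As an illustrative sketch only, the range-based encoding idea of fig. 10 and fig. 11 can be expressed as follows in Python; the range boundaries are taken from the description above, but the mapping of a type-local symbol to a value inside its range is an assumption made purely for this example:

RANGES = {
    "english": (0, 1000),
    "chinese": (1000, 2000),
    "emoji": (2000, 3000),
    "voice": (3000, 5000),
    "dynamic_picture": (5000, 10000),
}


def encode(msg_type: str, symbol_id: int) -> int:
    """Map a type-local symbol id into the disjoint code range of its message type."""
    lo, hi = RANGES[msg_type]
    if not 0 <= symbol_id < hi - lo:
        raise ValueError(f"symbol_id {symbol_id} does not fit the {msg_type} range")
    return lo + symbol_id


# One voice symbol and one dynamic-picture symbol end up in disjoint ranges,
# so they can be carried in the same target message.
print([encode("voice", 42), encode("dynamic_picture", 7)])  # [3042, 5007]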
Through the target encoder, the voice message and the dynamic expression picture can be sent as a single message, and after decoding the terminal can display them simultaneously, which improves the display effect and the user experience.
413. The server sends the target message to the terminal corresponding to each participating user of the target session.
414. The terminal displays the target message in an interface corresponding to the target session.
It should be noted that, after receiving the target message, the terminal can decode the target message through a target decoder corresponding to the target encoder, so as to display the input voice message and the dynamic expression picture together. The process of the above steps 412 to 414 is shown in fig. 12, which is a schematic diagram of the processing of an input voice message and a dynamic expression picture provided in this embodiment of the present application. When the terminal sends the input voice message 1201 and the dynamic expression picture 1202, it sends them to the server 1203; the server 1203 encodes the input voice message 1201 and the dynamic expression picture 1202 together in step 1204 to obtain a target message including both, and then sends the target message to the terminal 1205; the terminal 1205 decodes the target message in step 1206 to obtain the input voice message and the dynamic expression picture in step 1207.
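As the counterpart sketch for the terminal side, a decoder can recover the message type of each code value purely from its numeric range (the same assumed ranges as in the encoder sketch above; all names are illustrative):

RANGES = {
    "english": (0, 1000),
    "chinese": (1000, 2000),
    "emoji": (2000, 3000),
    "voice": (3000, 5000),
    "dynamic_picture": (5000, 10000),
}


def classify(code: int) -> str:
    """Return the message type whose range contains the code value."""
    for msg_type, (lo, hi) in RANGES.items():
        if lo <= code < hi:
            return msg_type
    raise ValueError(f"code {code} is outside every known range")


def split_target_message(codes: list[int]) -> dict[str, list[int]]:
    """Group the code values of a target message by message type for display."""
    parts: dict[str, list[int]] = {}
    for code in codes:
        parts.setdefault(classify(code), []).append(code)
    return parts


# The voice part and the dynamic-picture part are separated again on the terminal.
print(split_target_message([3042, 3050, 5007]))
# {'voice': [3042, 3050], 'dynamic_picture': [5007]}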
In a possible implementation manner, after decoding a target message, the terminal sends the target message to a target session, so that any emoticon and the input voice message are displayed in a session display area of an interface corresponding to the target session.
It should be noted that, when displaying any expression picture together with the input voice message, the terminal displays the input voice message in the session display area. The user triggers a play instruction for the input voice message by tapping the input voice message, and in response to the play instruction the terminal, based on the display position of the input voice message, gradually reveals the expression picture from the top of the message frame corresponding to the input voice message in bottom-to-top order, until the expression picture is completely displayed above the message frame. The terminal can also set the display speed of the expression picture, that is, the time required to completely reveal the expression picture in bottom-to-top order. Further, based on that reveal time, at a preset time point before the input voice message finishes playing, the terminal gradually hides the expression picture at the top of the message frame in top-to-bottom order, until only the input voice message is displayed in the session display area. Because the display speed of the expression picture is set, the terminal can determine, from the display speed, the time at which to start hiding the expression picture, so that the expression picture is just completely hidden when the voice finishes playing, which improves the display effect and the user experience.
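One way to read the timing rule above is that the display speed determines how long the bottom-to-top reveal (and the top-to-bottom hide) takes, and the hide should start exactly that long before the voice playback ends. A minimal sketch, assuming a linear reveal and illustrative names:

def reveal_duration(picture_height_px: float, display_speed_px_per_s: float) -> float:
    """Time needed to fully reveal (or fully hide) the expression picture, in seconds."""
    return picture_height_px / display_speed_px_per_s


def hide_start_time(voice_duration_s: float, picture_height_px: float,
                    display_speed_px_per_s: float) -> float:
    """Moment, relative to playback start, at which hiding should begin."""
    return voice_duration_s - reveal_duration(picture_height_px, display_speed_px_per_s)


# A 7-second voice message with a 120 px tall picture revealed at 240 px/s:
# the reveal takes 0.5 s, so hiding starts at 6.5 s and ends exactly at 7.0 s.
print(hide_start_time(7.0, 120.0, 240.0))  # 6.5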
Taking the effect of simultaneously displaying an input voice message and a dynamic expression picture in instant messaging software as an example, referring to fig. 13 to fig. 16: fig. 13 is a schematic diagram of a dynamic expression picture display method provided in the embodiment of the present application, in which the terminal gradually reveals the dynamic expression picture 1302 from the top of the message frame 1301 corresponding to the input voice message in the session display area, in bottom-to-top order, until the display effect of fig. 14 is reached. Fig. 14 is a schematic diagram of a dynamic expression picture display method provided in the embodiment of the present application; referring to fig. 14, the terminal completely displays the dynamic expression picture 1402 on top of the message frame 1401. The terminal then determines, from the display speed of the dynamic expression picture, the time point at which to start hiding it, and at that time point starts to hide the expression picture in top-to-bottom order; see fig. 15, which is a schematic diagram of a dynamic expression picture display method provided in the embodiment of the present application, where the terminal starts to hide the dynamic expression picture 1502 at the top of the message frame 1501 in top-to-bottom order until the effect in fig. 16 is achieved. Fig. 16 is a schematic diagram of a dynamic expression picture display method provided in the embodiment of the present application; referring to fig. 16, the terminal finally displays only the message frame 1601 corresponding to the input voice message.
Optionally, the terminal may also display the expression picture in other manners, which is not limited in this embodiment of the application. For example, the terminal may choose not to hide the expression picture: after the expression picture is completely revealed in bottom-to-top order, it remains fully displayed on top of the message frame corresponding to the input voice message. By not hiding the expression picture, when the user plays the input voice message again, the terminal can directly play the expression picture that is already completely displayed on top of the message frame, without revealing it again from bottom to top, which reduces the processing pressure of the terminal, improves the display effect and improves the user experience.
It should be noted that the foregoing steps 412 to 414 are only optional implementations. In other possible implementations, after the server determines the playing information of the expression picture through step 411, the server may encode only the expression picture based on the playing information and send the encoded expression picture to the terminal, and the terminal displays it together with the input voice message it has recorded. Optionally, the server may also directly send the playing information to the terminal, and the terminal encodes the selected expression picture according to the playing information and then displays it together with the recorded input voice message. By encoding only the expression picture, the voice message does not need to be encoded and sent again, which reduces the processing pressure of the server and the terminal and improves the processing speed of the session message.
It should be noted that the scheme provided by the embodiment of the present application can be applied to the chat windows of various instant messaging tools. When a user taps to record voice and finishes inputting the voice message, the server can perform intelligent recognition on the input voice message to recommend related expression pictures; after the user selects any expression picture through the terminal, the server can send the input voice message and the expression picture in the same message through the target encoder, so that the terminal can display the input voice message and the expression picture simultaneously. This improves the user experience and the user's enthusiasm for sending messages, and further increases user stickiness.
According to the scheme provided by the embodiment of the application, when the input voice message of any participating user of a target session is received, the input voice message is recognized to obtain a text tag that expresses the semantic tendency of the input voice message, at least one expression picture capable of visually expressing that semantic tendency is obtained as the expression picture recommended to the user, and the obtained at least one expression picture is sent to the terminal of the participating user. Expression recommendation based on the voice message input by the user is thereby achieved: the participating user can see the recommended expression pictures through the terminal after inputting the voice message, without having to type text, which simplifies the operation process, improves human-computer interaction efficiency and improves the user experience. For example, through the scheme, a user who finds it difficult to type can quickly input expression pictures that express semantics or emotions, so that the messages sent to friends carry more feeling rather than cold, impersonal text, and through the expression pictures the friends can intuitively understand what the user wants to express or the user's mood when sending the message. In addition, the embodiment of the application determines the playing information of the expression picture according to the duration of the input voice message, so that the expression picture can be completely displayed multiple times during the playback of the input voice message, which improves the display effect. Moreover, by optimizing the encoder, a target encoder that jointly encodes the input voice message and the expression picture is provided, so that the terminal can display the input voice message and the expression picture simultaneously. This provides a new interactive chat experience, allows users to express themselves more conveniently, improves the user experience and achieves the purpose of improving user activity.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 17 is a schematic structural diagram of a session message processing apparatus according to an embodiment of the present application, and referring to fig. 17, the apparatus includes:
a message acquiring module 1701 for acquiring an input voice message of a target session;
a first picture obtaining module 1702, configured to obtain at least one emoticon matched with the input voice message;
a first display module 1703, configured to display the at least one emoticon to be selected;
a sending module 1704, configured to send any emoticon and the input voice message to the target conversation in response to a selection operation of any emoticon of the at least one emoticon.
According to the device provided by the embodiment of the application, the input voice message of the target session is obtained, at least one expression picture matched with the input voice message is obtained, and the at least one expression picture to be selected is displayed, so that expression pictures are recommended based on the voice message input by the user and the user can select the expression picture to be sent from the at least one expression picture. In response to the selection operation on any expression picture of the at least one expression picture, that expression picture and the input voice message are sent to the target session, so that the input voice message and the expression picture are displayed. Expression pictures can thus be recommended without the user inputting text, which improves human-computer interaction efficiency and improves the user experience.
In a possible implementation manner, the first picture obtaining module 1702 is configured to send the input voice message to a server, and receive at least one emoticon sent by the server and matched with a text tag of the input voice message; or, recognizing the input voice message to obtain at least one expression picture matched with the input voice message.
In one possible implementation, the interface corresponding to the target session includes a voice input area;
the first display module 1703 is configured to display the at least one emoticon to be selected in a thumbnail form in a sub-area of the voice input area.
In one possible implementation, the apparatus further includes:
and the second picture acquisition module is used for responding to the selection operation of any thumbnail in the thumbnails of the at least one expression picture and acquiring the expression picture corresponding to the thumbnail.
In one possible implementation, the interface corresponding to the target session includes a session display area;
the device also includes:
and the second display module is used for displaying the any expression picture and the input voice message in the conversation display area.
In a possible implementation manner, the second display module is configured to display the input voice message at a target position of the session display area, and to display the any expression picture, based on the target position corresponding to the input voice message, in the order of the expression picture being blocked by the input voice message, then being fully revealed, and then being blocked again.
In a possible implementation manner, the expression picture is a dynamic expression picture.
Fig. 18 is a schematic structural diagram of a session message processing apparatus provided in an embodiment of the present application, and referring to fig. 18, the apparatus includes:
an identifying module 1801, configured to identify, in response to receiving an input voice message of any one of participating users of a target session, the input voice message to obtain a text tag corresponding to the input voice message;
a picture determining module 1802, configured to determine, according to a text tag corresponding to the input voice message, at least one emoticon matched with the text tag as the at least one emoticon matched with the input voice message;
a picture sending module 1803, configured to send the at least one emoticon to the terminal of the participating user.
In a possible implementation manner, the recognition module is configured to perform voice recognition on the input voice message to obtain a text content corresponding to the input voice message, and perform semantic recognition on the text content to obtain a text tag corresponding to the text content, where the text tag is used as the text tag corresponding to the input voice message.
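As a minimal illustration of this two-stage recognition, the sketch below first obtains text from the voice message and then maps the text to a text tag; the speech_to_text stub and the keyword rules are placeholders for this example only, not the recognition models actually used by the embodiment:

def speech_to_text(voice_message: bytes) -> str:
    # Placeholder: a real system would run a speech-recognition engine on the audio.
    return "happy birthday to you"


TAG_KEYWORDS = {
    "celebration": ["birthday", "congratulations", "cheers"],
    "farewell": ["good night", "bye", "see you"],
}


def semantic_tag(text: str) -> str:
    """Return the first tag whose keywords appear in the recognized text."""
    lowered = text.lower()
    for tag, keywords in TAG_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tag
    return "neutral"


text = speech_to_text(b"<audio bytes>")
print(text, "->", semantic_tag(text))  # happy birthday to you -> celebration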
According to the device provided by the embodiment of the application, when the input voice message of any participating user of a target session is received, the input voice message is recognized to obtain a text tag that expresses the semantic tendency of the input voice message, at least one expression picture capable of visually expressing that semantic tendency is obtained as the expression picture recommended to the user, and the obtained at least one expression picture is sent to the terminal of the participating user. Expression recommendation based on the voice message input by the user is thereby achieved: the participating user can see the recommended expression pictures through the terminal after inputting the voice message, without having to type text, which simplifies the operation process, improves human-computer interaction efficiency and improves the user experience. For example, through the scheme, a user who finds it difficult to type can quickly input expression pictures that express semantics or emotions, so that the messages sent to friends carry more feeling rather than cold, impersonal text, and through the expression pictures the friends can intuitively understand what the user wants to express or the user's mood when sending the message.
In a possible implementation manner, the picture determining module 1802 is configured to determine, in an expression tag library, at least one expression tag whose similarity with the text tag corresponding to the input voice message satisfies a preset condition, and to determine at least one expression picture corresponding to the at least one expression tag as the at least one expression picture matched with the text tag.
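The description does not fix the similarity measure or the preset condition; as one illustrative sketch, a character-level Jaccard similarity with a fixed threshold could be used (all names and values here are assumptions made for the example):

def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity between two tags."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def match_expression_tags(text_tag: str, tag_library: dict[str, list[str]],
                          threshold: float = 0.5) -> list[str]:
    """Return the picture ids of every expression tag similar enough to the text tag."""
    pictures: list[str] = []
    for tag, picture_ids in tag_library.items():
        if jaccard(text_tag, tag) >= threshold:
            pictures.extend(picture_ids)
    return pictures


library = {"happy birthday": ["pic_001", "pic_002"], "good night": ["pic_003"]}
print(match_expression_tags("birthday", library))  # ['pic_001', 'pic_002']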
In one possible implementation, the apparatus further includes:
the receiving module is used for receiving any expression picture sent by the terminal of the participating user based on the selection operation of any expression picture in the at least one expression picture;
and the information determining module is used for determining the playing information of any expression picture according to the duration of the input voice message, wherein the playing information is used for indicating the playing times and the playing speed of any expression picture.
In one possible implementation, the apparatus further includes:
the information determining module is further configured to determine playing information of the at least one emoticon according to the duration of the input voice message, where the playing information is used to indicate the playing times and the playing speed of the at least one emoticon;
and the information sending module is used for sending the at least one expression picture and the playing information of the at least one expression picture to the terminal of the participating user.
In one possible implementation, the information determination module includes a number determination module and a speed determination module;
the number determining module is used for determining the playing number of the expression picture in the duration of the input voice message according to the duration of the input voice message and the playing duration required by playing the expression picture once, and the speed determining module is used for determining the playing speed of the expression picture based on the playing number.
In a possible implementation manner, the number determining module is configured to determine the ratio between the duration of the input voice message and the playing duration required for playing the expression picture once; if the ratio is an integer value, determine the ratio as the number of times the expression picture is played within the duration of the input voice message; and if the ratio is not an integer value, round the ratio and determine the rounded ratio as the number of times the expression picture is played within the duration of the input voice message.
In one possible implementation, the apparatus further includes:
and the coding module is used for coding the input voice message and any expression picture through a target coder based on the playing information to obtain a target message for sending, the target coder is used for coding the input voice message and the expression picture together, and the target message comprises the coded input voice message and any expression picture.
It should be noted that: in the session message processing apparatus provided in the foregoing embodiment, when processing a session message, only the division of the above functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the terminal/server is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the session message processing apparatus and the session message processing method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
In an exemplary embodiment, a computer device is provided, which may include a terminal and a server, and the specific structure of the terminal and the server is as follows:
fig. 19 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 1900 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. Terminal 1900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so on.
Generally, terminal 1900 includes: one or more processors 1901 and one or more memories 1902.
The processor 1901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1901 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 1901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 1902 may include one or more computer-readable storage media, which may be non-transitory. The memory 1902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1902 is used to store at least one program code for execution by the processor 1901 to implement the conversational message processing methods provided by the method embodiments herein.
In some embodiments, terminal 1900 may further optionally include: a peripheral interface 1903 and at least one peripheral. The processor 1901, memory 1902, and peripheral interface 1903 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 1903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1904, a display screen 1905, a camera assembly 1906, an audio circuit 1907, a positioning assembly 1908, and a power supply 1909.
The peripheral interface 1903 may be used to connect at least one peripheral associated with an I/O (Input/Output) to the processor 1901 and the memory 1902. In some embodiments, the processor 1901, memory 1902, and peripherals interface 1903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1901, the memory 1902, and the peripheral interface 1903 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1904 communicates with a communication network and other communication devices via electromagnetic signals. The rf circuit 1904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1904 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1904 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1905 is a touch display screen, the display screen 1905 also has the ability to capture touch signals on or above the surface of the display screen 1905. The touch signal may be input to the processor 1901 as a control signal for processing. At this point, the display 1905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1905 may be one, disposed on a front panel of terminal 1900; in other embodiments, the displays 1905 can be at least two, each disposed on a different surface of the terminal 1900 or in a folded design; in other embodiments, display 1905 can be a flexible display disposed on a curved surface or on a folded surface of terminal 1900. Even more, the display 1905 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display 1905 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-emitting diode), or the like.
The camera assembly 1906 is used to capture images or video. Optionally, camera assembly 1906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera head assembly 1906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1901 for processing, or inputting the electric signals into the radio frequency circuit 1904 for realizing voice communication. The microphones may be provided in a plurality, respectively, at different locations of the terminal 1900 for stereo sound capture or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1901 or the radio frequency circuitry 1904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1907 may also include a headphone jack.
The positioning component 1908 is configured to locate the current geographic location of the terminal 1900 for navigation or LBS (Location Based Service). The positioning component 1908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 1909 is used to provide power to the various components in terminal 1900. The power source 1909 can be alternating current, direct current, disposable batteries, or rechargeable batteries. When power supply 1909 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1900 also includes one or more sensors 1910. The one or more sensors 1910 include, but are not limited to: acceleration sensor 1911, gyro sensor 1912, pressure sensor 1913, fingerprint sensor 1914, optical sensor 1915, and proximity sensor 1916.
Acceleration sensor 1911 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with terminal 1900. For example, the acceleration sensor 1911 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1901 may control the display screen 1905 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1911. The acceleration sensor 1911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1912 may detect a body direction and a rotation angle of the terminal 1900, and the gyro sensor 1912 may collect a 3D motion of the user on the terminal 1900 in cooperation with the acceleration sensor 1911. From the data collected by the gyro sensor 1912, the processor 1901 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1913 may be disposed on a side bezel of terminal 1900 and/or underneath the display 1905. When the pressure sensor 1913 is disposed on the side frame of the terminal 1900, it can detect a holding signal of the user on the terminal 1900, and the processor 1901 can perform left/right-hand recognition or shortcut operations based on the holding signal collected by the pressure sensor 1913. When the pressure sensor 1913 is disposed at the lower layer of the display 1905, the processor 1901 controls the operability controls on the UI interface according to the user's pressure operation on the display 1905. The operability controls comprise at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1914 is configured to collect a fingerprint of the user, and the processor 1901 identifies the user according to the fingerprint collected by the fingerprint sensor 1914, or the fingerprint sensor 1914 identifies the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 1901 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying for, and changing settings, etc. Fingerprint sensor 1914 may be disposed on a front, back, or side of terminal 1900. When a physical button or vendor Logo is provided on terminal 1900, fingerprint sensor 1914 may be integrated with the physical button or vendor Logo.
The optical sensor 1915 is used to collect the ambient light intensity. In one embodiment, the processor 1901 may control the display brightness of the display screen 1905 based on the ambient light intensity collected by the optical sensor 1915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1905 is increased; when the ambient light intensity is low, the display brightness of the display screen 1905 is adjusted down. In another embodiment, the processor 1901 may also dynamically adjust the shooting parameters of the camera assembly 1906 according to the intensity of the ambient light collected by the optical sensor 1915.
Proximity sensor 1916, also referred to as a distance sensor, is typically disposed on the front panel of terminal 1900. Proximity sensor 1916 is used to collect the distance between the user and the front face of terminal 1900. In one embodiment, when proximity sensor 1916 detects that the distance between the user and the front surface of terminal 1900 gradually decreases, processor 1901 controls display 1905 to switch from the screen-on state to the screen-off state; when proximity sensor 1916 detects that the distance between the user and the front surface of terminal 1900 gradually increases, processor 1901 controls display 1905 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in FIG. 19 is not intended to be limiting of terminal 1900 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 20 is a schematic structural diagram of a server according to an embodiment of the present application. The server 2000 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 2001 and one or more memories 2002, where the one or more memories 2002 store at least one program code, and the at least one program code is loaded and executed by the one or more processors 2001 to implement the methods provided by the foregoing method embodiments. Of course, the server 2000 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 2000 may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, including program code, which is executable by a processor to perform the session message processing method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which includes computer program code stored in a computer-readable storage medium, which is read by a processor of a terminal/server from the computer-readable storage medium, and which is executed by the processor, so that the terminal/server executes the method steps of the session message processing method provided in the above-mentioned respective method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by hardware associated with program code, and the program may be stored in a computer readable storage medium, where the above mentioned storage medium may be a read-only memory, a magnetic or optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method for processing a session message, the method comprising:
responding to the received input voice message of any participant user of the target conversation, and identifying the input voice message to obtain a text label corresponding to the input voice message;
determining at least one emoticon matched with the text label according to the text label corresponding to the input voice message, wherein the emoticon is used as the at least one emoticon matched with the input voice message;
and sending the at least one expression picture to the terminal of the participating user.
2. The method of claim 1, wherein the recognizing the input voice message and obtaining the text label corresponding to the input voice message comprises:
performing voice recognition on the input voice message to obtain text content corresponding to the input voice message;
and performing semantic recognition on the text content to obtain a text label corresponding to the text content, wherein the text label is used as a text label corresponding to the input voice message.
3. The method of claim 1, wherein after sending the at least one emoticon to the participant user's terminal, the method further comprises:
receiving any expression picture sent by the terminal of the participating user based on the selection operation of any expression picture in the at least one expression picture;
and determining the playing information of any expression picture according to the duration of the input voice message, wherein the playing information is used for representing the playing times and the playing speed of any expression picture.
4. The method of claim 1, wherein after determining at least one emoticon matching the text label according to the text label corresponding to the input voice message, the method further comprises:
determining the playing information of the at least one expression picture according to the duration of the input voice message, wherein the playing information is used for representing the playing times and the playing speed of the at least one expression picture;
and sending the at least one expression picture and the playing information of the at least one expression picture to the terminals of the participating users.
5. The method according to claim 3 or 4, wherein the determining process of the playing information comprises:
determining the playing times of the expression picture in the duration of the input voice message according to the duration of the input voice message and the playing duration required by playing the expression picture once;
and determining the playing speed of the expression picture based on the playing times.
6. The method of claim 5, wherein the determining, according to the duration of the input voice message and in combination with the playing duration required for the emoticon to be played once, the number of times that the emoticon is played within the duration of the input voice message comprises:
determining the ratio of the duration of the input voice message to the playing duration required by playing the expression picture once;
if the ratio is an integer value, determining the ratio as the playing times of the expression picture in the duration of the input voice message;
and if the ratio is not an integer value, rounding the ratio, and determining the rounded ratio as the playing times of the expression picture in the duration of the input voice message.
7. A method for processing a session message, the method comprising:
acquiring an input voice message of a target session;
acquiring at least one expression picture matched with the input voice message;
displaying the at least one expression picture to be selected;
and responding to the selection operation of any expression picture in the at least one expression picture, and sending the any expression picture and the input voice message to the target conversation.
8. The method of claim 7, wherein the obtaining at least one emoticon matching the input voice message comprises:
sending the input voice message to a server, and receiving at least one expression picture which is sent by the server and matched with the input voice message;
or,
and identifying the input voice message to obtain at least one expression picture matched with the input voice message.
9. The method of claim 7, wherein the interface corresponding to the target session comprises a voice input area;
the displaying the at least one expression picture to be selected includes:
and displaying the at least one expression picture to be selected in a thumbnail mode in a sub-area of the voice input area.
10. The method of claim 7, wherein the interface corresponding to the target session comprises a session display area;
after the sending of the any emoticon and the input voice message to the target session, the method further includes:
and displaying any expression picture and the input voice message in the conversation display area.
11. The method of claim 10, wherein the displaying, in the conversation display area, the any emoticon and the input voice message comprises:
displaying the input voice message at a target position of the conversation display area;
and displaying a target animation based on the target position corresponding to the input voice message, wherein the target animation is used for representing a display effect in which the any expression picture goes from being blocked by the input voice message, to being revealed, to being blocked by the input voice message again.
12. A session message processing apparatus, characterized in that the apparatus comprises:
the identification module is used for responding to the received input voice message of any participant user of the target conversation, identifying the input voice message and obtaining a text label corresponding to the input voice message;
the determining module is used for determining at least one expression picture matched with the text label according to the text label corresponding to the input voice message, and the at least one expression picture is used as the at least one expression picture matched with the input voice message;
and the sending module is used for sending the at least one expression picture to the terminals of the participating users.
13. A session message processing apparatus, characterized in that the apparatus comprises:
the message acquisition module is used for acquiring the input voice message of the target session;
the first picture acquisition module is used for acquiring at least one expression picture matched with the input voice message;
the first display module is used for displaying the at least one expression picture to be selected;
and the sending module is used for responding to the selection operation of any expression picture in the at least one expression picture and sending the any expression picture and the input voice message to the target conversation.
14. A computer device comprising one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the session message processing method of any one of claims 1 to 6; or the operations performed by the session message processing method of any one of claims 7 to 11.
15. A computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to implement the operations performed by the session message processing method of any one of claims 1 to 6; or the operations performed by the session message processing method of any one of claims 7 to 11.
CN202010662684.7A 2020-07-10 2020-07-10 Session message processing method and device, computer equipment and readable storage medium Pending CN111835621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662684.7A CN111835621A (en) 2020-07-10 2020-07-10 Session message processing method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010662684.7A CN111835621A (en) 2020-07-10 2020-07-10 Session message processing method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111835621A true CN111835621A (en) 2020-10-27

Family

ID=72900810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662684.7A Pending CN111835621A (en) 2020-07-10 2020-07-10 Session message processing method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111835621A (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020161582A1 (en) * 2001-04-27 2002-10-31 International Business Machines Corporation Method and apparatus for presenting images representative of an utterance with corresponding decoded speech
CN101883339A (en) * 2010-06-22 2010-11-10 宇龙计算机通信科技(深圳)有限公司 SMS communication method, terminal and mobile terminal
US20150200881A1 (en) * 2014-01-15 2015-07-16 Alibaba Group Holding Limited Method and apparatus of processing expression information in instant communication
CN104538027A (en) * 2014-12-12 2015-04-22 复旦大学 Method and system for calculating emotion spreading of voice social contact media
WO2016197767A2 (en) * 2016-02-16 2016-12-15 中兴通讯股份有限公司 Method and device for inputting expression, terminal, and computer readable storage medium
US20180161683A1 (en) * 2016-12-09 2018-06-14 Microsoft Technology Licensing, Llc Session speech-to-text conversion
CN106888158A (en) * 2017-02-28 2017-06-23 努比亚技术有限公司 A kind of instant communicating method and device
CN109524027A (en) * 2018-12-11 2019-03-26 平安科技(深圳)有限公司 Method of speech processing, device, computer equipment and storage medium
CN109831572A (en) * 2018-12-14 2019-05-31 深圳壹账通智能科技有限公司 Chat picture control method, device, computer equipment and storage medium
CN110187862A (en) * 2019-05-29 2019-08-30 北京达佳互联信息技术有限公司 Speech message display methods, device, terminal and storage medium
CN110379430A (en) * 2019-07-26 2019-10-25 腾讯科技(深圳)有限公司 Voice-based cartoon display method, device, computer equipment and storage medium
CN110798327A (en) * 2019-09-04 2020-02-14 腾讯科技(深圳)有限公司 Message processing method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王洵, 张道义, 董兰芳, 万寿红: "Design and Implementation of a 3D Voice Animation Chat Room" (三维语音动画聊天室的设计与实现), 计算机工程与应用 (Computer Engineering and Applications), no. 01 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022127689A1 (en) * 2020-12-16 2022-06-23 花瓣云科技有限公司 Chat interaction method, electronic device, and server
WO2023138539A1 (en) * 2022-01-19 2023-07-27 北京字跳网络技术有限公司 Message sending method and apparatus, electronic device, storage medium and program product
CN115460166A (en) * 2022-09-06 2022-12-09 网易(杭州)网络有限公司 Instant voice communication method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN111652678A (en) Article information display method, device, terminal, server and readable storage medium
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN110209784B (en) Message interaction method, computer device and storage medium
CN110572716B (en) Multimedia data playing method, device and storage medium
CN108270794B (en) Content distribution method, device and readable medium
CN110263131B (en) Reply information generation method, device and storage medium
CN111835621A (en) Session message processing method and device, computer equipment and readable storage medium
CN111739517B (en) Speech recognition method, device, computer equipment and medium
WO2022057435A1 (en) Search-based question answering method, and storage medium
CN111368127B (en) Image processing method, image processing device, computer equipment and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN110929159B (en) Resource release method, device, equipment and medium
CN113987326B (en) Resource recommendation method and device, computer equipment and medium
CN112764600B (en) Resource processing method, device, storage medium and computer equipment
CN111341317B (en) Method, device, electronic equipment and medium for evaluating wake-up audio data
CN110837557B (en) Abstract generation method, device, equipment and medium
CN114691860A (en) Training method and device of text classification model, electronic equipment and storage medium
CN112069350A (en) Song recommendation method, device, equipment and computer storage medium
CN115658857A (en) Intelligent dialogue method, device, equipment and storage medium
CN113593521B (en) Speech synthesis method, device, equipment and readable storage medium
CN113486260A (en) Interactive information generation method and device, computer equipment and storage medium
CN114329001B (en) Display method and device of dynamic picture, electronic equipment and storage medium
CN114489559B (en) Audio playing method, audio playing processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030641

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination