CN111883131A - Voice data processing method and device - Google Patents

Voice data processing method and device

Info

Publication number
CN111883131A
Authority
CN
China
Prior art keywords
emotion
content
voice
processed
pushed
Prior art date
Legal status
Granted
Application number
CN202010855563.4A
Other languages
Chinese (zh)
Other versions
CN111883131B (en)
Inventor
汪辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010855563.4A priority Critical patent/CN111883131B/en
Publication of CN111883131A publication Critical patent/CN111883131A/en
Application granted granted Critical
Publication of CN111883131B publication Critical patent/CN111883131B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application provides a voice data processing method and device. The method includes: performing intention analysis on a voice to be processed and determining intention information corresponding to the voice to be processed; performing emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed; calculating the emotion matching degree between content to be pushed and the voice to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed; and determining feedback information for the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed. With this technical scheme, a response can be made according to the user's current emotional state, providing the user with a more humanized interactive experience.

Description

Voice data processing method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for processing voice data.
Background
With the development of artificial intelligence and people's continually rising expectations for interaction experience, traditional human-computer interaction is gradually being replaced by intelligent interaction. However, existing intelligent interaction schemes can only roughly analyze the semantic content of a user's voice and make a corresponding response; they cannot analyze the user's emotional needs from the user's current emotion. How to respond according to the user's current emotion, and thereby provide a more humanized interactive experience, has therefore become an urgent technical problem.
Disclosure of Invention
The embodiment of the application provides a voice data processing method and device that can respond, at least to some extent, to a user's current emotion, thereby providing the user with a more humanized interactive experience.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for processing voice data, the method including:
performing intention analysis on a voice to be processed, and determining intention information corresponding to the voice to be processed;
performing emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
calculating the emotion matching degree between the content to be pushed and the voice to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed;
and determining feedback information aiming at the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
According to an aspect of an embodiment of the present application, there is provided an apparatus for processing voice data, the apparatus including:
the intention analysis module is used for carrying out intention analysis on the voice to be processed and determining intention information corresponding to the voice to be processed;
the emotion analysis module is used for carrying out emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
the matching degree calculation module is used for calculating the emotion matching degree between the content to be pushed and the voice to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed;
and the information determining module is used for determining feedback information for the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
In some embodiments of the present application, based on the foregoing, the emotion analysis module is configured to: performing emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to all preset emotion types; and generating emotion analysis vectors of the voice to be processed according to the emotion matching values of the voice to be processed corresponding to the preset emotion types, and taking the emotion analysis vectors as emotion analysis results.
In some embodiments of the present application, based on the foregoing, the emotion analysis module is configured to: and multiplying the emotion feature vector of the content to be pushed with the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the voice to be processed.
In some embodiments of the present application, based on the foregoing, the information determination module is configured to: determining response information to the voice to be processed according to the intention information; calculating the content matching degree between the content to be pushed and the sender of the voice to be processed according to the interest characteristic vector of the sender of the voice to be processed and the content characteristic vector of the content to be pushed; determining target push content in the content to be pushed according to the emotion matching degree and the content matching degree; and combining the response information and the target push content to generate feedback information aiming at the voice to be processed.
In some embodiments of the present application, based on the foregoing, the information determination module is configured to: acquiring importance weight corresponding to the emotion matching degree and importance weight corresponding to the content matching degree; calculating a recommendation value of the content to be pushed according to the emotion matching degree and the corresponding importance weight thereof, and the content matching degree and the corresponding importance weight thereof; and selecting the content to be pushed with the maximum recommendation value from the content to be pushed as target pushing content.
In some embodiments of the present application, based on the foregoing, the intent analysis module is configured to: carrying out voice recognition on the voice to be processed to obtain text information corresponding to the voice to be processed; performing word segmentation on the text information to obtain keywords contained in the text information; matching the keywords with keyword templates preset in each field, and determining the intention matching degree of the text information and each field; and determining intention information corresponding to the voice to be processed according to the intention matching degree.
In some embodiments of the present application, based on the foregoing, the intent analysis module is configured to: comparing the keywords with keyword templates preset in each field, and determining target keyword templates containing the keywords in each field; acquiring the relevance weight of the keywords contained in each target keyword template in the corresponding field; and calculating the sum of the relevance weights of the keywords contained in each target keyword template to obtain the intention matching degree between the text information and each field.
In some embodiments of the present application, based on the foregoing, the emotion analysis module is further configured to: performing emotion matching on the content to be pushed by adopting a pre-trained neural network model to obtain emotion matching values of the content to be pushed, which correspond to all preset emotion types; generating emotion feature vectors corresponding to the content to be pushed according to emotion matching values of the content to be pushed corresponding to all preset emotion types, and associating the emotion feature vectors with the content to be pushed respectively.
In some embodiments of the application, based on the foregoing scheme, after performing emotion matching on the content to be pushed by using a pre-trained neural network model to obtain emotion matching values of the content to be pushed corresponding to respective preset emotion types, the emotion analysis module is further configured to: displaying an emotion matching value correction interface according to the correction request of the emotion matching value; and correcting the emotion matching values of the content to be pushed corresponding to the preset emotion types according to the correction information for the emotion matching values acquired by the emotion matching value correction interface, so as to obtain the corrected emotion matching values of the content to be pushed corresponding to the preset emotion types.
According to an aspect of embodiments of the present application, there is provided a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing a method of processing speech data as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the processing method of voice data as described in the above embodiments.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the processing method of voice data provided in the above-described embodiments.
In the technical solutions provided in some embodiments of the present application, intention analysis is performed on a to-be-processed voice to determine intention information corresponding to the to-be-processed voice, emotion analysis is performed on the to-be-processed voice to obtain an emotion analysis result of the to-be-processed voice, an emotion matching degree between a to-be-pushed content and the to-be-processed voice is calculated according to the emotion analysis result and an emotion feature vector of the to-be-pushed content, and then feedback information for the to-be-processed voice is determined according to the intention information, the emotion matching degree, and the to-be-pushed content. Therefore, the emotion matching degree of each content to be pushed can be calculated according to the emotion analysis result of the voice to be processed, so that the feedback information aiming at the voice to be processed is determined, the emotion of the sender of the voice to be processed is responded, and more humanized interactive experience is provided for a user.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 2 shows a flow diagram of a method of processing voice data according to an embodiment of the present application;
FIG. 3 shows a flowchart of step S220 of the method for processing voice data of FIG. 2 according to one embodiment of the present application;
FIG. 4 shows a flowchart of step S240 of the method of processing speech data of FIG. 2 according to one embodiment of the present application;
FIG. 5 shows a flowchart of step S430 of the method for processing speech data of FIG. 4 according to one embodiment of the present application;
FIG. 6 shows a flowchart of step S210 of the method for processing voice data of FIG. 2 according to one embodiment of the present application;
fig. 7 shows a flowchart of step S630 in the processing method of voice data of fig. 6 according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of determining an emotion feature vector of content to be pushed, further included in a processing method of voice data according to an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating a process of modifying an emotion matching value of content to be pushed, further included in the method for processing voice data according to an embodiment of the present application;
FIG. 10 is a diagram illustrating an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 11 shows a flow diagram of a method of processing voice data according to an embodiment of the present application;
FIG. 12 shows a block diagram of a processing device of speech data according to an embodiment of the present application;
FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present application can be applied.
As shown in fig. 1, the system architecture may include a terminal device 101 having a voice signal collection module, a network 102, and a server 103. The terminal device 101 with the voice signal collection module may be a mobile phone, a portable computer, a tablet computer, a headset, a microphone, or other terminal devices; network 102 is the medium used to provide communication links between terminal devices 101 and server 103, and network 102 may include various connection types, such as wired communication links, wireless communication links, and so forth. In an embodiment of the present disclosure, the network 102 between the terminal device 101 and the server 103 may be a wireless communication link, in particular a mobile network.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired.
It should be noted that the server 103 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform.
In an embodiment of the disclosure, a user sends a to-be-processed voice to a terminal device 101 having a voice signal collection module, a server 103 may perform intent analysis on the to-be-processed voice, determine intent information corresponding to the to-be-processed voice, perform emotion analysis on the to-be-processed voice, obtain an emotion analysis result of the to-be-processed voice, and calculate an emotion matching degree between the to-be-pushed content and the to-be-processed voice according to the emotion analysis result and an emotion feature vector of the to-be-pushed content, thereby determining feedback information for the to-be-processed voice according to the intent information, the emotion matching degree, and the to-be-pushed content. The server 103 may transmit the feedback information to the terminal device 101, so that the terminal device 101 feeds the feedback information back to the user.
It should be noted that the processing method of the voice data provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the processing device of the voice data is generally disposed in the server 103. However, in other embodiments of the present application, the terminal device may also have a similar function as the server, so as to execute the scheme of the processing method of voice data provided in the embodiments of the present application.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 shows a flow diagram of a method of processing speech data according to an embodiment of the application. Referring to fig. 2, the processing method of the voice data at least includes steps S210 to S240, and the following details are described as follows:
in step S210, performing intent analysis on the voice to be processed, and determining intent information corresponding to the voice to be processed.
The intention analysis may be a process for analyzing an access intention corresponding to the voice to be processed, so as to know a request purpose of a sender, for example, the sender of the voice to be processed wants to chat, inquire weather, watch video or listen to music.
In one embodiment of the present application, a sender of pending speech may send speech toward a terminal device having a speech collection module by which the pending speech is collected. After the voice to be processed is acquired, voice recognition can be performed on the voice to be processed, and intention analysis is performed according to a voice recognition result, so that intention information corresponding to the voice to be processed is determined.
In step S220, emotion analysis is performed on the speech to be processed to obtain an emotion analysis result of the speech to be processed.
The emotion analysis may be a processing procedure for analyzing the emotion type corresponding to the voice to be processed, so as to learn the current emotion of the sender of the voice to be processed, such as anger, joy, sadness or fear.
In an embodiment of the present application, the voice to be processed may be analyzed to obtain the sound features it contains, where the sound features may include, but are not limited to, speech rate, pitch, and the like. It should be understood that a person's speech rate and pitch change with their emotional state; for example, when a person is angry the speech rate may slow down and the pitch may deepen, while a happy person tends to speak faster and at a higher pitch. The obtained sound features are then combined with the voice recognition result corresponding to the voice to be processed to perform emotion analysis and determine the emotional state corresponding to the voice to be processed, thereby obtaining the emotion analysis result of the voice to be processed.
In step S230, an emotion matching degree between the content to be pushed and the speech to be processed is calculated according to the emotion analysis result and the emotion feature vector of the content to be pushed.
The content to be pushed may be various resources for the user to obtain, and may be resources in various forms, for example, the content to be pushed may include but is not limited to an audio resource, a video resource, or a text resource, and the like. In an example, the content to be pushed may be downloaded in advance and stored locally for subsequent acquisition; in other examples, the content to be pushed may also be obtained from the network in real time, so as to save storage resources.
The emotion feature vector can be vector information used for representing the matching degree between the content to be pushed and each emotion type. The emotion feature vector may include a matching degree between a corresponding content to be pushed and each emotion type, for example, a matching degree between a certain content to be pushed and an angry emotion type is 0.2, a matching degree with a happy emotion type is 0.8, and the like.
In an embodiment of the application, emotion recognition can be performed on the content to be pushed in advance, so that matching degrees of the content to be pushed and each emotion type are obtained, and corresponding emotion feature vectors are generated according to the matching degrees of the content to be pushed and each emotion type. And associating the emotional feature vector with the corresponding content to be pushed for subsequent query. In an example, a corresponding relationship table may be established according to identification information (e.g., a number, etc.) of the content to be pushed and the emotional feature vector, and during the obtaining, the emotional feature vector of the content to be pushed may be obtained by querying the corresponding relationship table according to the identification information of the content to be pushed.
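For illustration only, the following minimal Python sketch shows one way such a correspondence table might look; the content identifiers and vector values are assumptions and are not taken from the patent:

```python
# Illustrative sketch only: a correspondence table keyed by the identification
# information (e.g. a content number) of each content to be pushed, storing its
# precomputed emotion feature vector. Identifiers and values are assumptions.
emotion_table: dict[str, list[float]] = {
    "content_001": [0.2, 0.8, 0.1, 0.3, 0.0, 0.1, 0.0],
    "content_002": [0.0, 0.1, 0.2, 0.4, 0.7, 0.3, 0.1],
}

def lookup_emotion_vector(content_id: str) -> list[float] | None:
    """Query the correspondence table by content identification information."""
    return emotion_table.get(content_id)

print(lookup_emotion_vector("content_001"))
```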
In an embodiment of the application, after obtaining the emotion feature vector of each content to be pushed, the emotion feature vector can be respectively matched with the emotion analysis result of the speech to be processed, so that the emotion matching degree between the speech to be processed and each content to be pushed is calculated. It should be understood that a higher emotional matching degree indicates a higher degree of engagement between the two, and a lower emotional matching degree indicates a lower degree of engagement between the two.
In step S240, feedback information for the speech to be processed is determined according to the intention information, the emotion matching degree, and the content to be pushed.
In one embodiment of the present application, a domain that a sender of the voice to be processed wants to access, such as chatting, listening to music, watching a video, or listening to a novel, etc., can be determined according to the intention information corresponding to the voice to be processed. According to the field which the sender wants to access, the content to be pushed with the highest emotion matching degree with the voice to be processed can be selected from the content to be pushed corresponding to the field to serve as target pushing content, and feedback information aiming at the voice to be processed is generated according to the target pushing content.
For example, suppose the voice to be processed is "I don't feel happy today and want to listen to a song". Intention analysis of this voice shows that the sender wants to listen to music, and emotion analysis combining the recognized text with the sound features contained in the voice shows that the sender's mood is fluctuating considerably and the sender may be sad. Therefore, in the music field, the content to be pushed that best matches this emotion, for example the song "Warm", can be selected as the target push content, and feedback information for the voice to be processed can be generated from it, such as "Don't be upset all day, how about listening to 'Warm' by Liang Jingru?". The server may send the feedback information to the terminal device, and the terminal device may deliver it to the sender by playing voice or displaying video. The sender can then continue to operate according to the feedback information; for example, if the sender says "OK", the server obtains the song "Warm" and the terminal device plays it, and so on.
In the embodiment shown in fig. 2, the intention information and the emotion analysis result corresponding to the voice to be processed are determined by performing intention analysis and emotion analysis on the voice to be processed. The emotion matching degree between each content to be pushed and the voice to be processed is calculated according to the emotion analysis result and the emotion feature vector of each content to be pushed, and feedback information for the voice to be processed is determined according to the intention information, the emotion matching degree and the content to be pushed. Therefore, during intelligent interaction, not only the text information of the voice to be processed but also the sender's emotion type is taken into account, and information matching that emotion is fed back to the sender, so that the sender's current emotion is responded to and a more humanized interactive experience is provided.
Based on the embodiment shown in fig. 2, fig. 3 shows a flowchart of step S220 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 3, step S220 at least includes steps S310 to S320, which are described in detail as follows:
in step S310, emotion recognition is performed according to the speech to be processed, so as to obtain emotion matching values of the speech to be processed corresponding to each preset emotion type.
In one embodiment of the present application, a plurality of emotion types can be preset by one skilled in the art; for example, the preset emotion types can include but are not limited to happy, like, surprised, neutral, sad, fear, and anger. When emotion recognition is performed on the voice to be processed, the emotion matching value between the voice to be processed and each preset emotion type can be output. It should be understood that the same voice to be processed may correspond to a plurality of preset emotion types, and the emotion matching values may be the same or different; for example, the emotion matching values of a certain voice to be processed for the preset emotion types (happy, like, surprised, neutral, sad, fear, and anger) may be 0.2, 0.3, 0.6, 0.3, 0.1, 0.2, and 0.1, respectively.
In step S320, generating an emotion analysis vector of the to-be-processed speech according to the emotion matching value of the to-be-processed speech corresponding to each preset emotion type, and using the emotion analysis vector as an emotion analysis result.
In an embodiment of the present application, the emotion matching values of the voice to be processed for the preset emotion types may be arranged in a predetermined format to generate the emotion analysis vector corresponding to the voice to be processed. For example, the emotion analysis vector corresponding to a voice A to be processed may be: [happy, like, surprised, neutral, sad, fear, anger] = [0.2, 0.3, 0.6, 0.3, 0.1, 0.2, 0.1], and the like.
In other examples, the encoding may also be performed according to emotion matching values of the to-be-processed speech corresponding to the preset emotion types, for example, the emotion matching values are converted into binary or decimal values, and the converted values are arranged according to a predetermined format to obtain emotion analysis vectors corresponding to the to-be-processed speech, so that the emotion analysis vectors are used as emotion analysis results of the to-be-processed speech.
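As an illustration of this step (not part of the patent text), the following minimal sketch arranges per-type emotion matching values into an emotion analysis vector, assuming a fixed ordering of the preset emotion types; the function name and the example values follow the description above:

```python
# Minimal sketch (assumed names): arrange per-type emotion matching values into an
# emotion analysis vector using a fixed ordering of the preset emotion types.
EMOTION_TYPES = ["happy", "like", "surprised", "neutral", "sad", "fear", "anger"]

def emotion_analysis_vector(match_values: dict[str, float]) -> list[float]:
    """Return the matching values arranged in the fixed EMOTION_TYPES order."""
    return [match_values.get(emotion, 0.0) for emotion in EMOTION_TYPES]

# Illustrative values taken from the example above.
speech_scores = {"happy": 0.2, "like": 0.3, "surprised": 0.6, "neutral": 0.3,
                 "sad": 0.1, "fear": 0.2, "anger": 0.1}
print(emotion_analysis_vector(speech_scores))
# -> [0.2, 0.3, 0.6, 0.3, 0.1, 0.2, 0.1]
```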
In the embodiment shown in fig. 3, emotion analysis of the voice to be processed yields an emotion matching value for each preset emotion type, and the corresponding emotion analysis vector is generated from these matching values. In this way the emotion analysis covers every possible emotion of the voice to be processed, avoiding misjudgments that could arise from outputting only a single possibility.
Based on the embodiments shown in fig. 2 and fig. 3, in an embodiment of the present application, calculating an emotion matching degree between the content to be pushed and the speech to be processed according to the emotion analysis result and an emotion feature vector of the content to be pushed includes:
and multiplying the emotion feature vector of the content to be pushed with the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the voice to be processed.
In this embodiment, the emotion feature vector of each content to be pushed is multiplied by the emotion analysis vector of the voice to be processed to calculate the emotion matching degree between that content to be pushed and the voice to be processed. For example, if the emotion feature vector of a content to be pushed is [1, 0, 1, 0, 1, 1, 1] and the emotion analysis vector corresponding to the voice to be processed is [0, 1, 1, 1, 0, 1, 0], the values at corresponding positions are multiplied and the products are added to obtain the emotion matching degree between the two: 1×0 + 0×1 + 1×1 + 0×1 + 1×0 + 1×1 + 1×0 = 2.
It should be understood that the higher the emotion matching values of the two are at corresponding positions, the higher the calculated emotion matching degree. The emotion matching degree can therefore serve as a criterion for evaluating whether a content to be pushed corresponds to emotion types similar or identical to those of the voice to be processed, and setting a plurality of preset emotion types improves the accuracy of the emotion matching degree calculation.
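The multiplication described in this embodiment amounts to a dot product of the two vectors. A minimal sketch, using the illustrative vectors from the example above (the function and variable names are assumptions):

```python
def emotion_matching_degree(content_vec: list[float], speech_vec: list[float]) -> float:
    """Multiply the values at corresponding positions and add the products."""
    return sum(c * s for c, s in zip(content_vec, speech_vec))

content_vec = [1, 0, 1, 0, 1, 1, 1]  # emotion feature vector of a content to be pushed
speech_vec = [0, 1, 1, 1, 0, 1, 0]   # emotion analysis vector of the voice to be processed
print(emotion_matching_degree(content_vec, speech_vec))  # -> 2
```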
Based on the embodiment shown in fig. 2, fig. 4 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 4, step S240 at least includes steps S410 to S440, which are described in detail as follows:
in step S410, response information to the speech to be processed is determined according to the intention information.
In one embodiment of the present application, a plurality of response templates may be preset for each field so as to respond to the voice to be processed and avoid mechanical responses. For example, in the music field the templates may include "Don't be upset, how about listening to XXX?" and "Let's listen to XXX together?"; in the video field they may include "XXX is showing today, want to watch it together?" and "The new XXX premiered today, want to see it?", and the like. In this way, according to the intention information of the voice to be processed, a corresponding response template can be selected in the corresponding field as the response information, which avoids directly recommending a certain content to be pushed to the user in a way that feels too stiff, and thus ensures the user's interaction experience.
In step S420, a content matching degree between the content to be pushed and the sender of the voice to be processed is calculated according to the interest feature vector of the sender of the voice to be processed and the content feature vector of the content to be pushed.
The interest feature vector may be vector information describing a user's degree of interest in a certain type of content. It should be understood that even within the same field, different contents to be pushed may be of different types; for example, music classified by emotion in the music field may be divided into sad music, nostalgic music, happy music, healing music, relaxing music, and the like, while videos in the video field may be divided into suspense, action, romance, thriller, science fiction, and the like. Different users have different degrees of interest in different types of content, and content that a user is interested in should be recommended preferentially to improve the user experience.
In an embodiment of the present application, a history access record of a sender of a voice to be processed may be obtained in advance, and an interest feature vector of the sender may be generated according to the history access record. In an example, the ratio of the number of times that the sender accesses different types of content to the total number of times of access in the historical access record may be counted to obtain the interestingness of the sender for the different types of content, and an interest feature vector of the sender may be generated according to the interestingness.
In addition, content identification can be performed on each content to be pushed in advance to obtain matching degrees of the content to be pushed corresponding to different types, so that a content feature vector of the content to be pushed is generated. And associating the content feature vector with the corresponding content to be pushed so as to obtain the content feature vector in the following.
And multiplying the acquired interest characteristic vector and the content characteristic vector of the content to be pushed, thereby calculating the content matching degree of each content to be pushed and the sender of the voice to be processed. It should be understood that the higher the content matching degree is, the more the content to be pushed meets the interest requirement of the sender of the voice to be processed.
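A hedged sketch of these two steps, assuming the interest feature vector is built from the ratio of historical accesses per content type as described above; the content types, access record and content feature vector are illustrative assumptions only:

```python
from collections import Counter

CONTENT_TYPES = ["healing", "nostalgic", "relaxing", "sad", "happy"]  # assumed types

def interest_vector(access_history: list[str]) -> list[float]:
    """Ratio of accesses per content type over the total number of accesses."""
    counts = Counter(access_history)
    total = len(access_history) or 1
    return [counts.get(t, 0) / total for t in CONTENT_TYPES]

def content_matching_degree(interest_vec: list[float], content_vec: list[float]) -> float:
    """Multiply the interest feature vector with the content feature vector."""
    return sum(i * c for i, c in zip(interest_vec, content_vec))

access_history = ["healing", "relaxing", "healing", "happy"]  # assumed access record
content_feature = [0.9, 0.0, 0.6, 0.1, 0.2]                   # assumed content feature vector
print(content_matching_degree(interest_vector(access_history), content_feature))
```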
In step S430, a target push content in the content to be pushed is determined according to the emotion matching degree and the content matching degree.
In an embodiment of the present application, the emotion matching degree and the content matching degree of each content to be pushed may be added to obtain a recommendation value of each content to be pushed, and the content to be pushed with the largest recommendation value is selected from the content to be pushed as a target pushed content. It should be understood that the target push content is the content which best meets the emotional requirements and interest requirements of the sender of the voice to be processed, and the accuracy of the pushed content is guaranteed.
In step S440, the response information and the target push content are combined, so as to generate feedback information for the to-be-processed speech.
In an embodiment of the present application, the response information may contain a slot to be filled with the target push content, and the target push content is filled into the corresponding slot to form the feedback information for the voice to be processed. For example, if the selected response information is "Don't be upset all day, how about listening to XXX?" and the target push content is the song "Warm", the feedback information obtained by combining the two is: "Don't be upset all day, how about listening to 'Warm' by Liang Jingru?", and so on.
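A minimal sketch of this slot-filling step, assuming the slot is marked with the placeholder "XXX" used in the template examples above; the template text and names are assumptions:

```python
def build_feedback(response_template: str, target_title: str) -> str:
    """Fill the target push content into the slot reserved in the response template."""
    return response_template.replace("XXX", target_title)

template = "Don't be upset all day, how about listening to XXX?"  # assumed template
print(build_feedback(template, "'Warm' by Liang Jingru"))
```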
Therefore, the target push content is selected from the contents to be pushed by calculating the content matching degree between the contents to be pushed and the sender of the voice to be processed and comprehensively considering the content matching degree and the emotion matching degree, so that the target push content can meet the emotion requirement of the sender and the interest requirement of the sender, the accuracy of the target push content is ensured, and the user experience is improved.
Based on the embodiments shown in fig. 2 and fig. 4, fig. 5 shows a flowchart of step S430 in the processing method of voice data of fig. 4 according to an embodiment of the present application. Referring to fig. 5, step S430 at least includes steps S510 to S530, which are described in detail as follows:
in step S510, an importance weight corresponding to the emotion matching degree and an importance weight corresponding to the content matching degree are obtained.
In an embodiment of the present application, corresponding importance weights may be set for the emotion matching degree and the content matching degree in advance to reflect the importance of the emotion matching degree and the content matching degree. In actual needs, if the emotion requirement of the sender of the to-be-processed speech is considered, the importance weight of the emotion matching degree can be set to be greater than the importance weight of the content matching degree, and if the interest requirement of the sender is considered, the importance weight of the content matching degree can be set to be greater than the importance weight of the emotion matching degree, and the like. Those skilled in the art can set the corresponding importance weight according to actual needs, and the present application is not limited to this.
In step S520, a recommendation value of the content to be pushed is calculated according to the emotion matching degree and the importance weight corresponding thereto, and the content matching degree and the importance weight corresponding thereto.
In an embodiment of the application, the recommendation value of a content to be pushed is calculated as a weighted sum of its emotion matching degree and content matching degree with their corresponding importance weights. For example, if the emotion matching degree is Se with importance weight Ie, and the content matching degree is Si with importance weight Ii, then the recommendation value is Sr = Se*Ie + Si*Ii.
In step S530, the content to be pushed with the largest recommendation value is selected from the contents to be pushed as the target push content.
In an embodiment of the application, according to the recommendation value of the to-be-pushed content obtained through calculation, the to-be-pushed content with the highest recommendation value in the to-be-pushed content is selected as the target push content, so that the accuracy of the target push content is ensured.
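Putting steps S510 to S530 together, a minimal sketch that computes the weighted recommendation value and selects the content to be pushed with the largest value; the importance weights and the candidate scores are assumptions for illustration:

```python
def recommendation_value(emotion_match: float, content_match: float,
                         w_emotion: float = 0.6, w_content: float = 0.4) -> float:
    """Weighted sum Sr = Se*Ie + Si*Ii; the example weights are assumptions."""
    return emotion_match * w_emotion + content_match * w_content

# content id -> (emotion matching degree, content matching degree); values assumed
candidates = {"song_warm": (2.0, 0.8), "song_rain": (1.0, 1.5)}
target = max(candidates, key=lambda cid: recommendation_value(*candidates[cid]))
print(target)  # the content to be pushed with the largest recommendation value
```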
In other examples, a plurality of contents to be pushed may also be selected as target push contents, for example, contents to be pushed whose recommended value is top two or top three, so that the plurality of target push contents may be selected by a sender of the voice to be processed, so as to meet the actual requirement of the sender.
Based on the embodiment shown in fig. 2, fig. 6 shows a flowchart of step S210 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 6, step S210 at least includes steps S610 to S640, which are described in detail as follows:
in step S610, speech recognition is performed on the speech to be processed, so as to obtain text information corresponding to the speech to be processed.
In this embodiment, according to the acquired to-be-processed speech, speech recognition is performed on the to-be-processed speech, and an audio signal corresponding to the to-be-processed speech may be converted into corresponding text information.
In step S620, performing word segmentation on the text information to obtain keywords included in the text information.
In one embodiment of the present application, based on the recognized text information, the text information is segmented and meaningless words, such as subjects and structural auxiliary words, are removed to obtain the keywords contained in the text information. For example, if the text information corresponding to the voice to be processed is "I want to know the recently shown movies", it is segmented into "I", "want", "to", "know", "the", "recently", "shown", "movies"; after the meaningless words are removed, the keywords contained in the text information are "want", "know", "recently", "shown" and "movies".
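For illustration only, a minimal sketch of the segmentation and stop-word removal on the English example above; a real system would use a proper (e.g. Chinese) word segmenter, and the stop-word list here is an assumption:

```python
STOP_WORDS = {"i", "to", "the", "a", "about"}  # assumed list of meaningless words

def extract_keywords(text: str) -> list[str]:
    """Segment the text and drop meaningless words, keeping the keywords."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(extract_keywords("I want to know the recently shown movies"))
# -> ['want', 'know', 'recently', 'shown', 'movies']
```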
In step S630, the keywords are matched with keyword templates preset in each field, and the matching degree of the text information and the intention of each field is determined.
The keyword template may be a template for analyzing a request purpose of a user, and a person skilled in the art may preset a corresponding keyword matching template according to different fields, for example, in a music field, the keyword template may be preset to "i want to listen to a song of XXX (singer)", "i want to listen to music of XXX (emotion type)", and the like; in the video domain, the keyword template may be preset to "i want to see movies by XXX (actor)," what are the movies shown on XXX (time), "etc.
In one embodiment of the application, keywords contained in text information of the speech to be processed are matched with keyword templates in various fields to determine the keyword templates matched with the keywords, so that the intention matching degree between the text information and the various fields is obtained.
In step S640, according to the intention matching degree, intention information corresponding to the voice to be processed is determined.
In one embodiment of the present application, a domain that a sender of speech to be processed wants to access may be determined according to an intention matching degree between text information and each domain, for example, a higher intention matching degree between text information and a music domain indicates that the sender wants to listen to music, a higher intention matching degree between text information and a video domain indicates that the sender wants to watch video, and so on.
In the embodiment shown in fig. 6, the text information corresponding to the speech to be processed is obtained by performing speech recognition on the speech to be processed, the keywords included in the text information are obtained by performing word segmentation according to the text information, and then the keywords are matched with the keyword templates in each field to obtain the intention matching degree between the text information and each field, so that the intention information corresponding to the speech to be processed is determined according to the intention matching degree, the user requirements can be fully understood, and the accuracy of the intention information determination is ensured.
Based on the embodiments shown in fig. 2 and fig. 6, fig. 7 shows a flowchart of step S630 in the processing method of voice data of fig. 6 according to an embodiment of the present application. Referring to fig. 7, step S630 at least includes steps S710 to S730, which are described in detail as follows:
in step S710, the keyword is compared with a keyword template preset in each field, and a target keyword template containing the keyword in each field is determined.
In one embodiment of the present application, keywords included in text information are compared with keyword templates preset in each field, a keyword template including the keyword in each field is determined, and the keyword template is identified as a target keyword template.
In step S720, the relevance weight of the keyword included in each target keyword template in the corresponding domain is acquired.
The relevance weight may be information indicating the degree of importance of the keyword included in the keyword template in the corresponding field.
It should be understood that the same keywords may have different relevance weights in different domains, e.g., in the video domain, the relevance weights of the keywords such as "video", "movie" and "scenario" should be greater than their relevance weights in the music domain, the relevance weights of the keywords such as "listen", "song" and "singer" should be greater than their relevance weights in the video domain, etc.
In an embodiment of the present application, relevance weights may be set in different fields for each keyword in advance, and a correspondence table of the relevance weights of the keywords in the different fields may be established. Therefore, during subsequent acquisition, the correlation weight of the keyword in the corresponding field can be inquired by inquiring the corresponding relation table.
In one embodiment of the present application, corpora that may appear in each field during actual use may be collected for a plurality of fields, such as "i want to listen to music", "i want to listen to a song of XXX", "i want to see a video", "i want to see a recently shown movie", and so on. And performing word segmentation on the obtained linguistic data to obtain keywords contained in the linguistic data corresponding to each field. And counting the occurrence frequency of each keyword in the corpus of the corresponding field, and further determining the proportion of the occurrence frequency of each keyword in the corpus of the corresponding field to the corpus number of the field, so as to obtain the relevance weight of each keyword in the field.
In step S730, the sum of the relevance weights of the keywords included in each target keyword template is calculated to obtain the intention matching degree between the text information and each of the fields.
In one embodiment of the application, according to the target keyword templates determined in the respective fields and the relevance weights of the keywords contained in the respective target keyword templates, in each field, the sum of the relevance weights of the keywords contained in the target keyword templates in the field is calculated, so as to obtain the intention matching degree of the text information corresponding to the speech to be processed in the field.
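A simplified sketch of steps S710 to S730 that skips the explicit template-matching step and directly sums assumed per-domain relevance weights of the matched keywords; all weights, domains and keywords here are illustrative assumptions:

```python
# Assumed per-domain relevance weights, e.g. derived from how often each keyword
# appears in that domain's corpora, as described above.
RELEVANCE_WEIGHTS = {
    "music": {"listen": 0.4, "song": 0.35, "singer": 0.25},
    "video": {"movies": 0.4, "see": 0.3, "shown": 0.2, "recently": 0.1},
}

def intention_matching_degree(keywords: list[str], domain: str) -> float:
    """Sum the relevance weights, in the given domain, of the matched keywords."""
    weights = RELEVANCE_WEIGHTS.get(domain, {})
    return sum(weights.get(keyword, 0.0) for keyword in keywords)

keywords = ["want", "know", "recently", "shown", "movies"]
scores = {domain: intention_matching_degree(keywords, domain) for domain in RELEVANCE_WEIGHTS}
print(max(scores, key=scores.get))  # the field with the highest intention matching degree
```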
Therefore, the comparison is carried out according to the intention matching degree, the field which the sender of the voice to be processed wants to access can be determined, the intention information corresponding to the voice to be processed is further determined, and the accuracy of determining the intention information is guaranteed.
Based on the embodiment shown in fig. 2, fig. 8 is a schematic flowchart illustrating a process of determining an emotion feature vector of content to be pushed, further included in the method for processing voice data according to an embodiment of the present application. Referring to fig. 8, determining the emotional characteristic vector of the content to be pushed at least includes steps S810 to S820, which are described in detail as follows:
in step S810, performing emotion matching on the content to be pushed by using a pre-trained neural network model, so as to obtain emotion matching values of the content to be pushed, which correspond to each preset emotion type.
In an embodiment of the application, emotion matching can be performed on each content to be pushed according to a pre-trained neural network model, so that the neural network model outputs emotion matching values of the content to be pushed corresponding to each preset emotion type. It should be noted that the neural network model may be an existing emotion recognition model, and details of the neural network model are not repeated herein.
An example is shown in Table 1, which lists the emotion matching value of each content to be pushed for every preset emotion type (reproduced as image data, Figure BDA0002642892470000161 and Figure BDA0002642892470000171, in the original publication).
In step S820, according to the emotion matching values of the content to be pushed corresponding to the preset emotion types, generating emotion feature vectors corresponding to the content to be pushed, and associating the emotion feature vectors with the content to be pushed respectively.
In an embodiment of the present application, the emotion matching values of the content to be pushed corresponding to the respective preset emotion types may be arranged according to a predetermined format, so as to generate an emotion feature vector corresponding to the content to be pushed, and the generation method may be as described above, which is not described herein again.
In the embodiment shown in fig. 8, the neural network model is used to perform emotion matching on the content to be pushed, so that the emotion matching efficiency can be greatly improved, and the emotion matching accuracy can be ensured.
Based on the embodiments shown in fig. 2 and fig. 8, fig. 9 is a schematic flow chart illustrating a process of modifying an emotion matching value of content to be pushed, further included in the method for processing voice data according to an embodiment of the present application. Referring to fig. 9, modifying the emotion matching value of the content to be pushed at least includes steps S910 to S920, which are described in detail as follows:
in step S910, an emotion matching value modification interface is displayed according to the modification request for the emotion matching value.
In one embodiment of the present application, the request for modifying the emotion matching value may be information for requesting modification of the emotion matching value of the content to be pushed. In one example, a person skilled in the art can generate and send a correction request for the emotion matching value by clicking a specific area (e.g., a "correct emotion matching value" button, etc.) on the display interface of the terminal device.
When the server receives the correction request, an emotion matching value correction interface can be displayed on the display interface of the terminal device; the correction interface can contain the correspondence between each content to be pushed and its emotion matching values for the preset emotion types. A person skilled in the art can select one of these entries to determine which content to be pushed is to be corrected, and then input the correct emotion matching value through an input device configured on the terminal device (such as a keyboard or a touch screen), for example changing the emotion matching value of content A to be pushed for the fear type from 0.2 to 0.1, and the like.
In step S920, according to the modification information for the emotion matching value obtained by the emotion matching value modification interface, modifying the emotion matching value of the content to be pushed corresponding to each preset emotion type, so as to obtain a modified emotion matching value of the content to be pushed corresponding to each preset emotion type.
In an embodiment of the application, according to the correction information for emotion matching acquired by the correction interface, the emotion matching value of the content to be pushed corresponding to each preset emotion type is updated. And generating an emotion feature vector corresponding to the content to be pushed according to the emotion matching value.
In the embodiment shown in fig. 9, by setting the emotion matching value correction interface, a person skilled in the art can easily check and correct the emotion matching value of the content to be pushed, and thus the accuracy of the emotion matching value corresponding to the content to be pushed is ensured, so that accurate recommendation can be performed subsequently.
Based on the technical solution of the above embodiment, a specific application scenario of the embodiment of the present application is introduced as follows:
referring to fig. 10 and 11, fig. 10 is a schematic diagram illustrating an exemplary system architecture to which the technical solution of the embodiment of the present application can be applied. FIG. 11 shows a flow diagram of a method of processing voice data according to one embodiment of the present application.
Referring to fig. 10, the system architecture may include a terminal device, an AI access layer, an emotion analysis system, a skill center control layer, a domain layer, and a content recommendation system.
Referring to fig. 10 and 11, in an embodiment of the present application, the terminal device may send the voice to be processed, acquired by its voice collecting module, to the AI access layer, and the AI access layer sends the voice to be processed to the emotion analysis system. The emotion analysis system may perform emotion analysis on the voice to be processed by using a pre-trained neural network model to obtain an emotion analysis result corresponding to the voice to be processed, and feed the emotion analysis result back to the AI access layer.
Meanwhile, the AI access layer may also perform intention analysis on the voice to be processed to obtain intention information corresponding to the voice to be processed, and send the intention information and the emotion analysis result to the skill center control layer.
The skill center control layer may determine, from the domain layer and according to the intention information corresponding to the voice to be processed, the target domain (such as music, video, chat, or another domain) to be accessed by the sender of the voice to be processed, and acquire corresponding response information from the domain layer. The skill center control layer then sends the intention information and the emotion analysis result to the content recommendation system, so that the content recommendation system can select target push content from the contents to be pushed according to the intention information and the emotion analysis result, and generate feedback information for the voice to be processed according to the response information and the target push content. The content recommendation system feeds the feedback information back to the skill center control layer, which finally sends it to the terminal device for feedback to the sender of the voice to be processed.
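Purely as an illustration of this control flow, the sketch below walks one voice request through stand-ins for each layer of fig. 10; every function name, body, and return value here is an assumption made for readability, not an interface defined by this application.

    # Stand-ins for the subsystems of fig. 10; all return values are invented.
    def emotion_analysis_system(voice):
        return [0.1, 0.0, 0.0, 0.2, 0.7]           # assumed emotion analysis vector

    def intent_analysis(voice):                     # performed at the AI access layer
        return {"domain": "music", "keywords": ["play", "song"]}

    def skill_center_control(intention_info):       # selects the target domain, fetches response info
        return intention_info["domain"], "Here is something you may like."

    def content_recommendation_system(intention_info, emotion_result):
        return "a calm piano playlist"              # assumed target push content

    def handle_voice_request(voice_to_be_processed):
        emotion_result = emotion_analysis_system(voice_to_be_processed)
        intention_info = intent_analysis(voice_to_be_processed)
        domain, response_info = skill_center_control(intention_info)
        target_push_content = content_recommendation_system(intention_info, emotion_result)
        return f"[{domain}] {response_info} Recommended: {target_push_content}"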
The following describes embodiments of the apparatus of the present application, which may be used to perform the processing method of voice data in the above embodiments of the present application. For details that are not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method for processing voice data described above in the present application.
Fig. 12 shows a block diagram of a processing device of speech data according to an embodiment of the application.
Referring to fig. 12, a speech data processing apparatus according to an embodiment of the present application includes:
the intention analysis module 1210 is configured to perform intention analysis on a to-be-processed voice and determine intention information corresponding to the to-be-processed voice;
the emotion analysis module 1220 is configured to perform emotion analysis on the speech to be processed to obtain an emotion analysis result of the speech to be processed;
the matching degree calculating module 1230 is configured to calculate an emotion matching degree between the content to be pushed and the speech to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed;
and an information determining module 1240, configured to determine feedback information for the speech to be processed according to the intention information, the emotion matching degree, and the content to be pushed.
In some embodiments of the present application, based on the foregoing, emotion analysis module 1220 is configured to: performing emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to all preset emotion types; and generating emotion analysis vectors of the voice to be processed according to the emotion matching values of the voice to be processed corresponding to the preset emotion types, and taking the emotion analysis vectors as emotion analysis results.
In some embodiments of the present application, based on the foregoing scheme, the matching degree calculating module 1230 is configured to: multiply the emotion feature vector of the content to be pushed with the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the speech to be processed.
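Reading this multiplication as an element-wise product summed over the preset emotion types (that is, a dot product) — an interpretation assumed here rather than stated in the application — the calculation can be sketched as follows.

    def emotion_matching_degree(content_emotion_vector, speech_emotion_vector):
        """Dot product of the content's emotion feature vector and the
        speech's emotion analysis vector."""
        return sum(c * s for c, s in zip(content_emotion_vector, speech_emotion_vector))

    # Example: a melancholy song scored against speech recognized as mostly sad.
    degree = emotion_matching_degree([0.1, 0.8, 0.0, 0.0, 0.1],
                                     [0.05, 0.7, 0.05, 0.1, 0.1])   # -> 0.575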
In some embodiments of the present application, based on the foregoing scheme, the information determination module 1240 is configured to: determining response information to the voice to be processed according to the intention information; calculating the content matching degree between the content to be pushed and the sender of the voice to be processed according to the interest characteristic vector of the sender of the voice to be processed and the content characteristic vector of the content to be pushed; determining target push content in the content to be pushed according to the emotion matching degree and the content matching degree; and combining the response information and the target push content to generate feedback information aiming at the voice to be processed.
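The application does not fix the similarity measure used for this content matching degree; as one possible choice, a cosine similarity between the sender's interest feature vector and the content feature vector is sketched below, with made-up example vectors.

    import math

    def content_matching_degree(interest_vector, content_vector):
        """Cosine similarity between the sender's interest feature vector and
        the content feature vector (one possible measure, assumed here)."""
        dot = sum(i * c for i, c in zip(interest_vector, content_vector))
        norm = math.sqrt(sum(i * i for i in interest_vector)) * \
               math.sqrt(sum(c * c for c in content_vector))
        return dot / norm if norm else 0.0

    match = content_matching_degree([0.9, 0.1, 0.3], [0.8, 0.0, 0.4])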
In some embodiments of the present application, based on the foregoing scheme, the information determination module 1240 is configured to: acquiring importance weight corresponding to the emotion matching degree and importance weight corresponding to the content matching degree; calculating a recommendation value of the content to be pushed according to the emotion matching degree and the corresponding importance weight thereof, and the content matching degree and the corresponding importance weight thereof; and selecting the content to be pushed with the maximum recommendation value from the content to be pushed as target pushing content.
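A minimal sketch of this weighted selection follows; the importance weights and the candidate scores are invented for illustration only.

    # Importance weights (assumed values; the application leaves them configurable).
    W_EMOTION, W_CONTENT = 0.4, 0.6

    def recommendation_value(emotion_match, content_match):
        return W_EMOTION * emotion_match + W_CONTENT * content_match

    candidates = {
        "content_a": {"emotion_match": 0.62, "content_match": 0.40},
        "content_b": {"emotion_match": 0.35, "content_match": 0.75},
    }
    target_push_content = max(candidates, key=lambda c: recommendation_value(**candidates[c]))
    # content_a scores 0.488, content_b scores 0.59, so "content_b" is selected.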
In some embodiments of the present application, based on the foregoing, the intent analysis module 1210 is configured to: carrying out voice recognition on the voice to be processed to obtain text information corresponding to the voice to be processed; performing word segmentation on the text information to obtain keywords contained in the text information; matching the keywords with keyword templates preset in each field, and determining the intention matching degree of the text information and each field; and determining intention information corresponding to the voice to be processed according to the intention matching degree.
In some embodiments of the present application, based on the foregoing, the intent analysis module 1210 is configured to: comparing the keywords with keyword templates preset in each field, and determining target keyword templates containing the keywords in each field; acquiring the relevance weight of the keywords contained in each target keyword template in the corresponding field; and calculating the sum of the relevance weights of the keywords contained in each target keyword template to obtain the intention matching degree between the text information and each field.
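As an illustration only — each field is reduced to a single keyword template here, and all templates and relevance weights are assumptions — the summation of relevance weights can be sketched as follows.

    # Assumed keyword templates per field, mapping keyword -> relevance weight.
    FIELD_TEMPLATES = {
        "music": {"play": 0.5, "song": 0.4, "singer": 0.3},
        "video": {"play": 0.3, "movie": 0.6, "episode": 0.4},
    }

    def intention_matching_degrees(keywords):
        """For each field, sum the relevance weights of the keywords that the
        field's keyword template contains."""
        return {field: sum(w for kw, w in template.items() if kw in keywords)
                for field, template in FIELD_TEMPLATES.items()}

    degrees = intention_matching_degrees({"play", "song"})   # {"music": 0.9, "video": 0.3}
    target_field = max(degrees, key=degrees.get)             # "music"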
In some embodiments of the present application, based on the foregoing, the emotion analysis module 1220 is further configured to: performing emotion matching on the content to be pushed by adopting a pre-trained neural network model to obtain emotion matching values of the content to be pushed, which correspond to all preset emotion types; generating emotion feature vectors corresponding to the content to be pushed according to emotion matching values of the content to be pushed corresponding to all preset emotion types, and associating the emotion feature vectors with the content to be pushed respectively.
In some embodiments of the present application, based on the foregoing solution, after performing emotion matching on the content to be pushed by using a pre-trained neural network model to obtain emotion matching values of the content to be pushed corresponding to each preset emotion type, the emotion analysis module 1220 is further configured to: displaying an emotion matching value correction interface according to the correction request of the emotion matching value; and correcting the emotion matching values of the content to be pushed corresponding to the preset emotion types according to the correction information for the emotion matching values acquired by the emotion matching value correction interface, so as to obtain the corrected emotion matching values of the content to be pushed corresponding to the preset emotion types.
FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system of the electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 13, the computer system includes a Central Processing Unit (CPU)1301, which can perform various appropriate actions and processes, such as performing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for system operation are also stored. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An Input/Output (I/O) interface 1305 is also connected to bus 1304.
The following components are connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output section 1307 including a display such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1309 performs communication processing via a network such as the internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read therefrom can be installed into the storage portion 1308.
In particular, according to embodiments of the application, the processes described above with reference to the flow charts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1309 and/or installed from the removable medium 1311. When executed by the Central Processing Unit (CPU) 1301, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of these units do not, in any case, constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for processing voice data, comprising:
performing intention analysis on a voice to be processed, and determining intention information corresponding to the voice to be processed;
performing emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
calculating the emotion matching degree between the content to be pushed and the voice to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed;
and determining feedback information aiming at the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
2. The method of claim 1, wherein performing emotion analysis on the speech to be processed to obtain an emotion analysis result of the speech to be processed comprises:
performing emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to all preset emotion types;
and generating emotion analysis vectors of the voice to be processed according to the emotion matching values of the voice to be processed corresponding to the preset emotion types, and taking the emotion analysis vectors as emotion analysis results.
3. The method of claim 2, wherein calculating the emotion matching degree between the content to be pushed and the speech to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed comprises:
and multiplying the emotion feature vector of the content to be pushed with the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the voice to be processed.
4. The method of claim 1, wherein determining feedback information for the speech to be processed according to the intention information, the emotion matching degree and the content to be pushed comprises:
determining response information to the voice to be processed according to the intention information;
calculating the content matching degree between the content to be pushed and the sender of the voice to be processed according to the interest characteristic vector of the sender of the voice to be processed and the content characteristic vector of the content to be pushed;
determining target push content in the content to be pushed according to the emotion matching degree and the content matching degree;
and combining the response information and the target push content to generate feedback information aiming at the voice to be processed.
5. The method of claim 4, wherein determining the target push content in the content to be pushed according to the emotion matching degree and the content matching degree comprises:
acquiring importance weight corresponding to the emotion matching degree and importance weight corresponding to the content matching degree;
calculating a recommendation value of the content to be pushed according to the emotion matching degree and the corresponding importance weight thereof, and the content matching degree and the corresponding importance weight thereof;
and selecting the content to be pushed with the maximum recommendation value from the content to be pushed as target pushing content.
6. The method of claim 1, wherein performing intent analysis on the voice to be processed to determine intent information corresponding to the voice to be processed comprises:
carrying out voice recognition on the voice to be processed to obtain text information corresponding to the voice to be processed;
performing word segmentation on the text information to obtain keywords contained in the text information;
matching the keywords with keyword templates preset in each field, and determining the intention matching degree of the text information and each field;
and determining intention information corresponding to the voice to be processed according to the intention matching degree.
7. The method according to claim 6, wherein matching the keyword with a keyword template preset in each field and determining the matching degree of the text information with the intention of each field comprises:
comparing the keywords with keyword templates preset in each field, and determining target keyword templates containing the keywords in each field;
acquiring the relevance weight of the keywords contained in each target keyword template in the corresponding field;
and calculating the sum of the relevance weights of the keywords contained in each target keyword template to obtain the intention matching degree between the text information and each field.
8. The method of claim 1, further comprising:
performing emotion matching on the content to be pushed by adopting a pre-trained neural network model to obtain emotion matching values of the content to be pushed, which correspond to all preset emotion types;
generating emotion feature vectors corresponding to the content to be pushed according to emotion matching values of the content to be pushed corresponding to all preset emotion types, and associating the emotion feature vectors with the content to be pushed respectively.
9. The method of claim 8, wherein after performing emotion matching on the content to be pushed by using a pre-trained neural network model to obtain emotion matching values of the content to be pushed corresponding to each preset emotion type, the method further comprises:
displaying an emotion matching value correction interface according to the correction request of the emotion matching value;
and correcting the emotion matching values of the content to be pushed corresponding to the preset emotion types according to the correction information for the emotion matching values acquired by the emotion matching value correction interface, so as to obtain the corrected emotion matching values of the content to be pushed corresponding to the preset emotion types.
10. An apparatus for processing voice data, comprising:
the intention analysis module is used for carrying out intention analysis on the voice to be processed and determining intention information corresponding to the voice to be processed;
the emotion analysis module is used for carrying out emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
the matching degree calculation module is used for calculating the emotion matching degree between the content to be pushed and the voice to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed;
and the information determining module is used for determining feedback information aiming at the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
CN202010855563.4A 2020-08-20 2020-08-20 Voice data processing method and device Active CN111883131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010855563.4A CN111883131B (en) 2020-08-20 2020-08-20 Voice data processing method and device

Publications (2)

Publication Number Publication Date
CN111883131A true CN111883131A (en) 2020-11-03
CN111883131B CN111883131B (en) 2023-10-27

Family

ID=73203606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010855563.4A Active CN111883131B (en) 2020-08-20 2020-08-20 Voice data processing method and device

Country Status (1)

Country Link
CN (1) CN111883131B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007148039A (en) * 2005-11-28 2007-06-14 Matsushita Electric Ind Co Ltd Speech translation device and speech translation method
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
TW201937344A (en) * 2018-03-01 2019-09-16 鴻海精密工業股份有限公司 Smart robot and man-machine interaction method
CN108959243A (en) * 2018-05-17 2018-12-07 中国电子科技集团公司第二十八研究所 A kind of general public opinion information emotion identification method of user oriented role
CN109829117A (en) * 2019-02-27 2019-05-31 北京字节跳动网络技术有限公司 Method and apparatus for pushed information
CN110705308A (en) * 2019-09-18 2020-01-17 平安科技(深圳)有限公司 Method and device for recognizing field of voice information, storage medium and electronic equipment
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN111475714A (en) * 2020-03-17 2020-07-31 北京声智科技有限公司 Information recommendation method, device, equipment and medium
CN111309965A (en) * 2020-03-20 2020-06-19 腾讯科技(深圳)有限公司 Audio matching method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763947A (en) * 2021-01-15 2021-12-07 北京沃东天骏信息技术有限公司 Voice intention recognition method and device, electronic equipment and storage medium
CN113763947B (en) * 2021-01-15 2024-04-05 北京沃东天骏信息技术有限公司 Voice intention recognition method and device, electronic equipment and storage medium
CN113158052A (en) * 2021-04-23 2021-07-23 平安银行股份有限公司 Chat content recommendation method and device, computer equipment and storage medium
CN113470644A (en) * 2021-06-29 2021-10-01 读书郎教育科技有限公司 Intelligent voice learning method and device based on voice recognition
CN113470644B (en) * 2021-06-29 2023-09-26 读书郎教育科技有限公司 Intelligent voice learning method and device based on voice recognition

Also Published As

Publication number Publication date
CN111883131B (en) 2023-10-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant