CN111883131B - Voice data processing method and device

Info

Publication number
CN111883131B
CN111883131B (application CN202010855563.4A)
Authority
CN
China
Prior art keywords
emotion
content
voice
processed
pushed
Prior art date
Legal status
Active
Application number
CN202010855563.4A
Other languages
Chinese (zh)
Other versions
CN111883131A (en)
Inventor
汪辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010855563.4A
Publication of CN111883131A
Application granted
Publication of CN111883131B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the application provide a voice data processing method and device. The method includes the following steps: performing intention analysis on a voice to be processed and determining intention information corresponding to the voice to be processed; performing emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed; calculating, according to the emotion analysis result and the emotion feature vector of content to be pushed, an emotion matching degree between the content to be pushed and the voice to be processed; and determining feedback information for the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed. The technical solution of the embodiments of the application can respond according to the user's current emotional state, thereby providing the user with a more humanized interaction experience.

Description

Voice data processing method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for processing voice data.
Background
With the development of artificial intelligence and users' rising expectations for interactive experience, intelligent interaction has gradually replaced traditional man-machine interaction. However, existing intelligent interaction schemes can only roughly analyze the semantic content of the user's voice and respond accordingly; they cannot analyze the user's emotional needs according to the user's current emotion. Therefore, how to respond according to the user's current emotion, and thereby provide a more humanized interaction experience, has become an urgent technical problem.
Disclosure of Invention
The embodiments of the application provide a voice data processing method and apparatus, which can, at least to a certain extent, respond according to the user's current emotion and thereby provide the user with a more humanized interaction experience.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for processing voice data, including:
performing intention analysis on a voice to be processed, and determining intention information corresponding to the voice to be processed;
performing emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
calculating, according to the emotion analysis result and an emotion feature vector of content to be pushed, an emotion matching degree between the content to be pushed and the voice to be processed;
and determining feedback information for the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
According to an aspect of an embodiment of the present application, there is provided a processing apparatus for voice data, the apparatus including:
the intention analysis module is used for carrying out intention analysis on the voice to be processed and determining intention information corresponding to the voice to be processed;
The emotion analysis module is used for carrying out emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
the matching degree calculation module is used for calculating the emotion matching degree between the content to be pushed and the voice to be processed according to the emotion analysis result and the emotion feature vector of the content to be pushed;
and the information determining module is used for determining feedback information for the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
In some embodiments of the application, based on the foregoing, the emotion analysis module is configured to: carrying out emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to each preset emotion type; generating emotion analysis vectors of the voice to be processed according to emotion matching values of the voice to be processed corresponding to each preset emotion type, and taking the emotion analysis vectors as emotion analysis results.
In some embodiments of the application, based on the foregoing, the emotion analysis module is configured to: multiplying the emotion feature vector of the content to be pushed by the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the voice to be processed.
In some embodiments of the application, based on the foregoing, the information determination module is configured to: determining response information to the voice to be processed according to the intention information; calculating the content matching degree between the content to be pushed and the sender of the voice to be processed according to the interest feature vector of the sender of the voice to be processed and the content feature vector of the content to be pushed; determining target push content in the content to be pushed according to the emotion matching degree and the content matching degree; and combining the response information with the target push content to generate feedback information aiming at the voice to be processed.
In some embodiments of the application, based on the foregoing, the information determination module is configured to: acquiring importance weights corresponding to the emotion matching degrees and importance weights corresponding to the content matching degrees; calculating a recommendation value of the content to be pushed according to the emotion matching degree and the importance weight corresponding to the emotion matching degree and the content matching degree and the importance weight corresponding to the content matching degree; and selecting the content to be pushed with the maximum recommended value from the content to be pushed as target push content.
In some embodiments of the application, based on the foregoing, the intent analysis module is configured to: performing voice recognition on voice to be processed to obtain text information corresponding to the voice to be processed; word segmentation is carried out on the text information, and keywords contained in the text information are obtained; matching the keywords with preset keyword templates in each field, and determining the intention matching degree of the text information and each field; and determining intention information corresponding to the voice to be processed according to the intention matching degree.
In some embodiments of the application, based on the foregoing, the intent analysis module is configured to: comparing the keywords with preset keyword templates in each field to determine target keyword templates containing the keywords in each field; acquiring the correlation weight of keywords contained in each target keyword template in the corresponding field; and calculating the sum of the correlation weights of the keywords contained in each target keyword template to obtain the intention matching degree between the text information and each field.
In some embodiments of the application, based on the foregoing, the emotion analysis module is further configured to: performing emotion matching on the content to be pushed by adopting a pre-trained neural network model to obtain emotion matching values of the content to be pushed corresponding to preset emotion types; and generating emotion feature vectors corresponding to the content to be pushed according to emotion matching values of the content to be pushed corresponding to each preset emotion type, and respectively associating the emotion feature vectors with the content to be pushed.
In some embodiments of the present application, based on the foregoing solution, after performing emotion matching on the content to be pushed by using a pre-trained neural network model, to obtain emotion matching values of the content to be pushed corresponding to each preset emotion type, the emotion analysis module is further configured to: displaying an emotion matching value correction interface according to the correction request of the emotion matching value; and correcting the emotion matching value of the content to be pushed corresponding to each preset emotion type according to the correction information for the emotion matching value, which is obtained by the emotion matching value correction interface, so as to obtain corrected emotion matching values of the content to be pushed corresponding to each preset emotion type.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a method of processing speech data as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of processing speech data as described in the above embodiments.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the processing method of voice data provided in the above-described embodiment.
According to the technical solution provided by the embodiments of the application, intention analysis is performed on the voice to be processed to determine the corresponding intention information, and emotion analysis is performed on the voice to be processed to obtain its emotion analysis result. The emotion matching degree between the content to be pushed and the voice to be processed is then calculated according to the emotion analysis result and the emotion feature vector of the content to be pushed, and feedback information for the voice to be processed is determined according to the intention information, the emotion matching degree and the content to be pushed. In this way, the emotion matching degree between the voice to be processed and each content to be pushed can be calculated from the emotion analysis result, the feedback information for the voice to be processed can be determined accordingly, the emotion of the sender of the voice to be processed can be responded to, and a more humanized interaction experience can be provided for the user.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the application may be applied;
FIG. 2 shows a flow diagram of a method of processing speech data according to one embodiment of the application;
FIG. 3 is a flow chart of step S220 in the method for processing the voice data of FIG. 2 according to one embodiment of the present application;
FIG. 4 is a flow chart of step S240 in the method for processing the voice data of FIG. 2 according to one embodiment of the present application;
FIG. 5 is a flow chart illustrating step S430 in the method for processing the voice data of FIG. 4 according to one embodiment of the present application;
FIG. 6 is a flowchart illustrating step S210 in the voice data processing method of FIG. 2 according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating step S630 in the voice data processing method of FIG. 6 according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of determining emotion feature vectors of content to be pushed, which is further included in a method for processing voice data according to an embodiment of the present application;
FIG. 9 is a schematic flow chart of modifying emotion matching values of content to be pushed, which is further included in a method for processing voice data according to an embodiment of the present application;
FIG. 10 shows a schematic diagram of an exemplary system architecture to which embodiments of the present application may be applied;
FIG. 11 shows a flow diagram of a method of processing voice data according to one embodiment of the application;
FIG. 12 shows a block diagram of a processing device for voice data according to one embodiment of the application;
FIG. 13 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the present application may be applied.
As shown in fig. 1, the system architecture may include a terminal device 101 having a voice signal collection module, a network 102, and a server 103. The terminal device 101 with the voice signal collecting module may be a mobile phone, a portable computer, a tablet computer, a headset, a microphone, and other terminal devices; network 102 is the medium used to provide a communication link between terminal device 101 and server 103. Network 102 may include various connection types, such as wired communication links, wireless communication links, and the like. In an embodiment of the present disclosure, the network 102 between the terminal device 101 and the server 103 may be a wireless communication link, in particular a mobile network.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks and servers as practical.
The server 103 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms.
In one embodiment of the present disclosure, a user sends a voice to be processed to a terminal device 101 having a voice signal collecting module, a server 103 may perform intent analysis on the voice to be processed, determine intent information corresponding to the voice to be processed, perform emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed, and calculate, according to the emotion analysis result and an emotion feature vector of content to be pushed, an emotion matching degree between the content to be pushed and the voice to be processed, thereby determining feedback information for the voice to be processed according to the intent information, the emotion matching degree and the content to be pushed. The server 103 may transmit the feedback information to the terminal device 101 so that the terminal device 101 feeds back the feedback information to the user.
It should be noted that, the method for processing voice data provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the device for processing voice data is generally disposed in the server 103. However, in other embodiments of the present application, the terminal device may also have a similar function to the server, so as to execute the scheme of the processing method of voice data provided by the embodiment of the present application.
The implementation details of the technical scheme of the embodiment of the application are described in detail below:
fig. 2 shows a flow diagram of a method of processing speech data according to an embodiment of the application. Referring to fig. 2, the processing method of voice data at least includes steps S210 to S240, and is described in detail as follows:
in step S210, intent analysis is performed on the voice to be processed, and intent information corresponding to the voice to be processed is determined.
The intent analysis may be a process for resolving an access intent corresponding to a voice to be processed, so as to know a request purpose of a sender, for example, the sender of the voice to be processed wants to chat, inquire about weather, watch video, listen to music, or the like.
In one embodiment of the application, the sender of the speech to be processed may send out speech towards a terminal device having a speech collection module by means of which the speech to be processed is collected. After the voice to be processed is obtained, voice recognition can be performed on the voice to be processed, and intention analysis is performed according to the voice recognition result, so that intention information corresponding to the voice to be processed is determined.
In step S220, emotion analysis is performed on the voice to be processed, so as to obtain an emotion analysis result of the voice to be processed.
The emotion analysis may be a process for analyzing the emotion type corresponding to the voice to be processed, so as to learn the current emotion of the sender of the voice to be processed, such as happiness, sadness or fear.
In one embodiment of the present application, the voice to be processed may be parsed to obtain the sound features it contains, where the sound features may include, but are not limited to, speech speed, pitch, and the like. It will be appreciated that a person's speech speed and pitch change with emotional state; for example, when a person is sad the speech may slow down and the pitch may drop, and when a person is happy the speech may speed up and the pitch may rise. The obtained sound features are then combined with the speech recognition result corresponding to the voice to be processed for emotion analysis, and the emotional state corresponding to the voice to be processed is determined, so as to obtain the emotion analysis result corresponding to the voice to be processed.
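As an illustration of this kind of acoustic feature extraction, the following minimal sketch estimates an average pitch and a rough speech-rate proxy from an audio file. It assumes the librosa library and a local file path; the patent does not prescribe any particular toolkit, so the function names and thresholds here are illustrative only.

```python
import librosa
import numpy as np

def extract_sound_features(wav_path: str):
    """Rough sketch: estimate average pitch and a speech-rate proxy for emotion analysis."""
    y, sr = librosa.load(wav_path, sr=16000)   # assumed 16 kHz mono input
    f0, voiced_flag, _ = librosa.pyin(         # fundamental frequency per frame
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    avg_pitch = float(np.nanmean(f0))          # tends to rise when the speaker is happy
    # crude speech-rate proxy: number of non-silent segments per second of audio
    segments = librosa.effects.split(y, top_db=30)
    speech_rate = len(segments) / (len(y) / sr)
    return {"pitch": avg_pitch, "speech_rate": speech_rate}

# features = extract_sound_features("to_be_processed.wav")  # hypothetical file name
```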
In step S230, according to the emotion analysis result and the emotion feature vector of the content to be pushed, an emotion matching degree between the content to be pushed and the voice to be processed is calculated.
The content to be pushed may be various resources for the user to obtain, and may be various types of resources, for example, the content to be pushed may include, but is not limited to, audio resources, video resources, text resources, and so on. In an example, the content to be pushed may be downloaded in advance and stored locally for later retrieval; in other examples, the content to be pushed may also be obtained from the network in real time, thereby saving storage resources.
The emotion feature vector may be vector information representing the degree of matching between the content to be pushed and the respective emotion types. The emotion feature vector may include the matching degree between the corresponding content to be pushed and each emotion type; for example, the matching degree between a certain content to be pushed and the anger emotion type is 0.2, the matching degree between that content and the happy emotion type is 0.8, and so on.
In one embodiment of the application, emotion recognition can be performed on the content to be pushed in advance, so that the matching degree of the content to be pushed corresponding to each emotion type is obtained, and corresponding emotion feature vectors are generated according to the matching degree of the content to be pushed and each emotion type. And associating the emotion feature vector with the corresponding content to be pushed for subsequent inquiry. In an example, a corresponding relation table may be established according to identification information (such as a number) of the content to be pushed and the emotion feature vector, and when the emotion feature vector of the content to be pushed is obtained, the corresponding relation table may be queried according to the identification information of the content to be pushed, so as to obtain the emotion feature vector of the content to be pushed.
In one embodiment of the present application, after the emotion feature vector of each content to be pushed is obtained, the emotion feature vector can be respectively matched with the emotion analysis result of the voice to be processed, so as to calculate and obtain the emotion matching degree between the voice to be processed and each content to be pushed. It should be understood that the higher the emotion matching degree, the higher the matching degree of the two, and the lower the emotion matching degree, the lower the matching degree of the two.
In step S240, feedback information for the voice to be processed is determined according to the intention information, the emotion matching degree and the content to be pushed.
In one embodiment of the present application, the domain that the sender of the voice to be processed wants to access, such as chatting, listening to music, watching videos or listening to novels, may be determined according to the intention information corresponding to the voice to be processed. According to the domain the sender wants to access, the content to be pushed with the highest emotion matching degree with the voice to be processed can be selected from the content to be pushed corresponding to that domain as the target push content, and feedback information for the voice to be processed is generated according to the target push content.
For example, the voice to be processed is "I am not happy today, I want to listen to songs". Intention analysis on this voice shows that the sender wants to listen to music, and emotion analysis combining the voice to be processed with the sound features it contains indicates that the sender's current emotion is probably sad. Therefore, in the music domain, content to be pushed that matches this emotion, namely the song "Warm", can be selected as the target push content, and feedback information for the voice to be processed is generated according to the target push content, for example: "You should be happy every day. Would you like to listen to Liang Jingru's 'Warm'?". The server can send the feedback information to the terminal device, and the terminal device can deliver it to the sender by voice playback, video display or other means. The sender may then continue to operate according to the feedback information; for example, if the sender says "OK", the server obtains the song "Warm" and plays it through the terminal device, and so on.
In the embodiment shown in FIG. 2, intention analysis and emotion analysis are performed on the voice to be processed to determine the corresponding intention information and emotion analysis result. The emotion matching degree between each content to be pushed and the voice to be processed is then calculated according to the emotion analysis result and the emotion feature vector of each content to be pushed, and feedback information for the voice to be processed is determined according to the intention information, the emotion matching degree and the content to be pushed. Therefore, the intelligent interaction is not limited to the text information of the voice to be processed: information matching the sender's emotion type can be fed back, so that the response reflects the sender's current emotion and a more humanized interaction experience can be provided.
Fig. 3 is a flowchart illustrating step S220 in the voice data processing method of fig. 2 according to an embodiment of the present application, based on the embodiment shown in fig. 2. Referring to fig. 3, step S220 includes at least steps S310 to S320, and is described in detail as follows:
in step S310, emotion recognition is performed according to the to-be-processed voice, so as to obtain emotion matching values of the to-be-processed voice corresponding to each preset emotion type.
In one embodiment of the present application, a plurality of emotion types may be preset by those skilled in the art; for example, the preset emotion types may include, but are not limited to, happiness, love, surprise, neutrality, sadness, fear, anger, and the like. When emotion recognition is performed on the voice to be processed, the emotion matching value of the voice to be processed for each preset emotion type can be output. It should be understood that the same voice to be processed may correspond to a plurality of preset emotion types, and the emotion matching values may be the same or different; for example, the emotion matching values of a certain voice to be processed for the preset emotion types (happy, love, surprise, neutral, sad, fear and anger) are 0.2, 0.3, 0.6, 0.3, 0.1, 0.2 and 0.1, respectively.
In step S320, according to the emotion matching values of the to-be-processed voice corresponding to the preset emotion types, an emotion analysis vector of the to-be-processed voice is generated, and the emotion analysis vector is used as an emotion analysis result.
In one embodiment of the present application, the emotion matching values of the voice to be processed for the preset emotion types may be arranged in a predetermined format to generate the emotion analysis vector corresponding to the voice to be processed. For example, the emotion analysis vector corresponding to voice A to be processed is: [happy, love, surprise, neutral, sad, fear, anger] = [0.2, 0.3, 0.6, 0.3, 0.1, 0.2, 0.1], and so on.
In other examples, the encoding may be performed according to emotion matching values of the to-be-processed voice corresponding to each preset emotion type, for example, the emotion matching values are converted into binary values or decimal values, and the converted values are arranged according to a predetermined format to obtain emotion analysis vectors corresponding to the to-be-processed voice, so that the emotion analysis vectors are used as emotion analysis results of the to-be-processed voice.
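A minimal sketch of this vector construction, using the fixed emotion order and the example values from the text; the emotion labels and function name are illustrative.

```python
import numpy as np

# preset emotion types in a fixed, predetermined order
EMOTION_ORDER = ["happy", "love", "surprise", "neutral", "sad", "fear", "anger"]

def build_emotion_analysis_vector(match_values: dict) -> np.ndarray:
    """Arrange per-emotion matching values in the predetermined order."""
    return np.array([match_values.get(emotion, 0.0) for emotion in EMOTION_ORDER])

# the example values for voice A from the text
voice_a = {"happy": 0.2, "love": 0.3, "surprise": 0.6, "neutral": 0.3,
           "sad": 0.1, "fear": 0.2, "anger": 0.1}
print(build_emotion_analysis_vector(voice_a))  # [0.2 0.3 0.6 0.3 0.1 0.2 0.1]
```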
In the embodiment shown in FIG. 3, emotion analysis is performed on the voice to be processed to obtain its emotion matching value for each preset emotion type, and the corresponding emotion analysis vector is generated from these values. In this way, the emotion analysis covers all possible emotions of the voice to be processed, avoiding emotion misjudgments that could result from outputting only a single possibility.
Based on the embodiments shown in fig. 2 and fig. 3, in one embodiment of the present application, calculating, according to the emotion analysis result and the emotion feature vector of the content to be pushed, an emotion matching degree between the content to be pushed and the voice to be processed includes:
multiplying the emotion feature vector of the content to be pushed by the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the voice to be processed.
In this embodiment, the emotion feature vector of each content to be pushed is multiplied by the emotion analysis vector of the voice to be processed to calculate the emotion matching degree between the content to be pushed and the voice to be processed. For example, if the emotion feature vector of the content to be pushed is [1, 0, 1, 0, 1, 1, 1] and the emotion analysis vector corresponding to the voice to be processed is [0, 1, 1, 1, 0, 1, 0], the two vectors are multiplied element by element and the products are summed to obtain the emotion matching degree between them, namely 1×0 + 0×1 + 1×1 + 0×1 + 1×0 + 1×1 + 1×0 = 2.
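A minimal sketch of this calculation, using the vectors from the example above; the function name is illustrative.

```python
import numpy as np

def emotion_matching_degree(content_emotion_vec, speech_emotion_vec) -> float:
    """Multiply corresponding positions and sum the products, i.e. a dot product."""
    return float(np.dot(np.asarray(content_emotion_vec), np.asarray(speech_emotion_vec)))

# worked example from the text: result is 2
print(emotion_matching_degree([1, 0, 1, 0, 1, 1, 1], [0, 1, 1, 1, 0, 1, 0]))
```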
It should be appreciated that the higher the emotion matching values of the two vectors for the same emotion type, the higher the calculated emotion matching degree. Therefore, the emotion matching degree can serve as a criterion for evaluating whether the emotion types of the content to be pushed and the voice to be processed are similar or identical, and setting a plurality of preset emotion types improves the accuracy of the emotion matching degree calculation.
Fig. 4 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to an embodiment of the present application, based on the embodiment shown in fig. 2. Referring to fig. 4, step S240 includes at least steps S410 to S440, and is described in detail as follows:
In step S410, response information to the speech to be processed is determined according to the intention information.
In one embodiment of the present application, a plurality of response templates may be preset for each domain to respond to the voice to be processed and avoid mechanical responses. For example, in the music domain: "You should be happy every day. Would you like to listen to XXX?" and "Shall we listen to XXX together?"; in the video domain: "XXX has been played XXX times today. Would you like to watch it?" and "XXX was newly released today. Would you like to watch it?", and so on. Therefore, according to the intention information of the voice to be processed, a corresponding response template can be selected in the corresponding domain as the response information. This avoids directly recommending a certain content to be pushed to the user in a way that feels too rigid, and thus preserves the user's interactive experience.
In step S420, a content matching degree between the content to be pushed and the sender of the voice to be processed is calculated according to the interest feature vector of the sender of the voice to be processed and the content feature vector of the content to be pushed.
The interest feature vector may be vector information describing the user's degree of interest in a certain type of content. It should be appreciated that even within the same domain, different content to be pushed may be of different types; for example, in the music domain, music may be classified by emotion into sad music, nostalgic music, happy music, healing music, relaxing music, and so on, while in the video domain, videos may be classified into suspense, action, romance, thriller, science fiction, and so on. Different users are interested in different types of content to different degrees, and the content a user is interested in should be recommended preferentially to improve the user experience.
In one embodiment of the present application, a history access record of a sender of a voice to be processed may be obtained in advance, and an interest feature vector of the sender may be generated according to the history access record. In an example, the ratio of the number of times the sender accesses the different types of content to the total number of times of accesses in the historical access record may be counted, so as to obtain the interest degree of the sender on the different types of content, and an interest feature vector of the sender is generated according to the interest degree.
In addition, content identification can be performed on each content to be pushed in advance, so that matching degrees of the content to be pushed corresponding to different types are obtained, and a content feature vector of the content to be pushed is generated. And associating the content feature vector with the corresponding content to be pushed so as to obtain the content to be pushed later.
The obtained interest feature vector is multiplied by the content feature vector of each content to be pushed to calculate the content matching degree between each content to be pushed and the sender of the voice to be processed. It should be understood that the higher the content matching degree, the better the content to be pushed meets the interest requirements of the sender of the voice to be processed.
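A minimal sketch of how the interest feature vector might be derived from the historical access record and multiplied with a content feature vector; the content types and numbers are illustrative, not taken from the patent.

```python
from collections import Counter
import numpy as np

# example content types within the music domain
CONTENT_TYPES = ["sad", "nostalgic", "happy", "healing", "relaxing"]

def interest_feature_vector(access_history):
    """Ratio of accesses per content type to the total number of accesses."""
    counts = Counter(access_history)
    total = max(len(access_history), 1)
    return np.array([counts[t] / total for t in CONTENT_TYPES])

def content_matching_degree(interest_vec, content_feature_vec) -> float:
    return float(np.dot(interest_vec, np.asarray(content_feature_vec)))

history = ["happy", "happy", "healing", "relaxing", "happy"]   # sender's past accesses
song_features = [0.1, 0.0, 0.9, 0.2, 0.3]                      # content feature vector of one song
print(content_matching_degree(interest_feature_vector(history), song_features))
```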
In step S430, a target push content in the content to be pushed is determined according to the emotion matching degree and the content matching degree.
In one embodiment of the present application, the emotion matching degree and the content matching degree of each content to be pushed may be added to obtain a recommended value of each content to be pushed, and the content to be pushed with the largest recommended value is selected from the content to be pushed as the target push content. It should be appreciated that the target push content is the content that best meets the emotion requirements and interest requirements of the sender of the voice to be processed, ensuring the accuracy of the pushed content.
In step S440, the response information and the target push content are combined, so as to generate feedback information for the voice to be processed.
In one embodiment of the present application, the response information may include a slot to be filled with the target push content, and the target push content may be filled into the corresponding slot to form the feedback information for the voice to be processed. For example, if the selected response information is "You should be happy every day. Would you like to listen to XXX?" and the target push content is the song "Warm", the feedback information obtained by combining them is: "You should be happy every day. Would you like to listen to Liang Jingru's 'Warm'?", and so on.
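A minimal sketch of the slot-filling step; the template strings echo the examples above and the dictionary keys are illustrative.

```python
# response templates per domain, each with a slot for the target push content
RESPONSE_TEMPLATES = {
    "music": "You should be happy every day. Would you like to listen to {content}?",
    "video": "{content} was newly released today. Would you like to watch it?",
}

def build_feedback(domain: str, target_push_content: str) -> str:
    """Fill the target push content into the slot of the selected response template."""
    return RESPONSE_TEMPLATES[domain].format(content=target_push_content)

print(build_feedback("music", "Liang Jingru's 'Warm'"))
```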
Therefore, the target push content is selected from the content to be pushed by calculating the content matching degree between the content to be pushed and the sender of the voice to be processed and comprehensively considering the content matching degree and the emotion matching degree, so that the target push content can meet the emotion requirement of the sender and the interest requirement of the sender, the accuracy of the target push content is ensured, and the user experience is improved.
Fig. 5 shows a flow chart of step S430 in the processing method of voice data of fig. 4 according to an embodiment of the present application, based on the embodiments shown in fig. 2 and 4. Referring to fig. 5, step S430 includes at least steps S510 to S530, and is described in detail as follows:
in step S510, an importance weight corresponding to the emotion matching degree and an importance weight corresponding to the content matching degree are obtained.
In one embodiment of the present application, the importance weights corresponding to the emotion matching degree and the content matching degree may be set in advance to reflect their relative importance. In practice, if the emotion requirement of the sender of the voice to be processed is emphasized, the importance weight of the emotion matching degree may be set larger than that of the content matching degree; if the interest requirement of the sender is emphasized, the importance weight of the content matching degree may be set larger than that of the emotion matching degree, and so on. Those skilled in the art can set the corresponding importance weights according to actual needs, and the application is not particularly limited in this respect.
In step S520, the recommended value of the content to be pushed is calculated according to the emotion matching degree and its corresponding importance weight, and the content matching degree and its corresponding importance weight.
In one embodiment of the application, a weighted sum is computed from the emotion matching degree and its corresponding importance weight together with the content matching degree and its corresponding importance weight, so as to obtain the recommended value of the content to be pushed. For example, if the emotion matching degree Se has importance weight Ie and the content matching degree Si has importance weight Ii, the recommended value is Sr = Se*Ie + Si*Ii.
In step S530, the content to be pushed with the largest recommendation value is selected from the content to be pushed as the target push content.
In one embodiment of the present application, according to the calculated recommended value of the content to be pushed, the content to be pushed with the highest recommended value in the content to be pushed is selected as the target push content, so as to ensure the accuracy of the target push content.
In other examples, a plurality of contents to be pushed may also be selected as target push contents, for example the contents whose recommended values rank in the top two or top three, so that the sender of the voice to be processed can choose among several target push contents to meet his or her actual needs.
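A minimal sketch of the weighted ranking described in steps S510 to S530; the candidate tuples and weight values are illustrative.

```python
def select_target_push_content(candidates, w_emotion, w_content, top_k=1):
    """candidates: list of (content_id, emotion_matching_degree, content_matching_degree).
    Recommended value = emotion match * its weight + content match * its weight."""
    scored = [(cid, e * w_emotion + c * w_content) for cid, e, c in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]  # the top_k contents with the largest recommended values

candidates = [("song_warm", 2.0, 0.8), ("song_other", 1.0, 0.9)]
print(select_target_push_content(candidates, w_emotion=0.6, w_content=0.4, top_k=2))
```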
Fig. 6 is a flowchart illustrating step S210 in the voice data processing method of fig. 2 according to an embodiment of the present application, based on the embodiment shown in fig. 2. Referring to fig. 6, step S210 includes at least steps S610 to S640, and is described in detail as follows:
in step S610, speech recognition is performed on the speech to be processed, so as to obtain text information corresponding to the speech to be processed.
In this embodiment, according to the acquired voice to be processed, voice recognition is performed on the voice to be processed, and an audio signal corresponding to the voice to be processed may be converted into corresponding text information.
In step S620, the text information is segmented to obtain keywords included in the text information.
In one embodiment of the application, the recognized text information is segmented into words, and meaningless words such as pronouns and structural particles are removed to obtain the keywords contained in the text information. For example, when the text information corresponding to the voice to be processed is "I want to know about recently released movies", the text can be segmented into "I", "want", "know", "recent", "play" and "movie"; after the meaningless words are removed, the keywords contained in the text information, i.e. "want", "know", "recent", "play" and "movie", are obtained.
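A minimal sketch of the segmentation and stop-word filtering, assuming the jieba segmenter for Chinese text; the patent does not name a segmenter, and the stop-word list here is illustrative.

```python
import jieba  # assumed Chinese word-segmentation library; any segmenter could be substituted

STOP_WORDS = {"我", "的", "了", "吗", "啊"}  # illustrative pronouns/particles, not from the patent

def extract_keywords(text: str):
    """Segment the recognized text and drop meaningless words, keeping the keywords."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

print(extract_keywords("我想知道最近上映的电影"))  # roughly: want / know / recently / released / movie
```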
In step S630, the keyword is matched with a keyword template preset in each field, and the degree of matching between the text information and the intention of each field is determined.
The keyword template may be a template for resolving the request purpose of the user, and those skilled in the art may preset corresponding keyword matching templates for different domains. For example, in the music domain, keyword templates such as "I want to listen to a song by XXX (singer)" and "I want to listen to some XXX (emotion type) music" may be preset; in the video domain, keyword templates such as "I want to see a movie starring XXX (actor)" and "What movies are showing on XXX (time)?" may be preset, and so on.
In one embodiment of the application, keywords contained in text information of the voice to be processed are matched with keyword templates in all the fields to determine keyword templates matched with the keywords, and the intention matching degree between the text information and each field is obtained.
In step S640, according to the intent matching degree, intent information corresponding to the speech to be processed is determined.
In one embodiment of the present application, the domain that the sender of the voice to be processed wants to access may be determined according to the degree of intention matching between the text information and each domain; for example, a high intention matching degree with the music domain indicates that the sender wants to listen to music, a high intention matching degree with the video domain indicates that the sender wants to watch videos, and so on.
In the embodiment shown in fig. 6, text information corresponding to the voice to be processed is obtained by performing voice recognition on the voice to be processed, keywords contained in the text information are obtained by performing word segmentation according to the text information, and then the keywords are matched with keyword templates in all fields to obtain the intention matching degree between the text information and all fields, so that intention information corresponding to the voice to be processed is determined according to the intention matching degree, the user requirements can be fully understood, and the accuracy of determining the intention information is ensured.
Fig. 7 shows a flowchart of step S630 in the processing method of voice data of fig. 6 according to an embodiment of the present application, based on the embodiments shown in fig. 2 and 6. Referring to fig. 7, step S630 includes at least steps S710 to S730, and is described in detail as follows:
in step S710, the keywords are compared with preset keyword templates in each field, and a target keyword template containing the keywords in each field is determined.
In one embodiment of the present application, keywords included in text information are compared with keyword templates preset in each field, and keyword templates including the keywords in each field are determined and identified as target keyword templates.
In step S720, the relevance weights of the keywords included in each target keyword template in the corresponding domain are acquired.
The relevance weight may be information indicating the degree of importance of the keywords included in the keyword templates in the corresponding fields.
It should be appreciated that the same keywords may have different relevance weights in different domains, e.g., in the video domain, the relevance weights of keywords such as "video," "movie," and "scenario" should be greater than their relevance weights in the music domain, the relevance weights of keywords such as "listen," "song," and "singer" should be greater than their relevance weights in the video domain, and so on.
In one embodiment of the present application, the correlation weights may be set in advance for the respective keywords in different fields, and a correspondence table of the correlation weights of the keywords in different fields may be established. Therefore, in the subsequent acquisition, the correlation weight of the keyword in the corresponding field can be queried through querying the corresponding relation table.
In one embodiment of the present application, corpora that may occur during actual use may be collected for multiple domains, such as "I want to listen to music", "I want to listen to songs by XXX", "I want to watch a video", "I want to watch recently released movies", and so on. The collected corpus is segmented to obtain the keywords contained in the corpus corresponding to each domain. The number of occurrences of each keyword in the corpus of the corresponding domain is counted, and the ratio of this number to the amount of corpus in that domain is determined, so as to obtain the relevance weight of each keyword in that domain.
In step S730, a sum of correlation weights of keywords included in each target keyword template is calculated, so as to obtain an intention matching degree between the text information and each field.
In one embodiment of the present application, according to the target keyword templates determined in each domain and the correlation weights of the keywords included in each target keyword template, in each domain, the sum of the correlation weights of the keywords included in the target keyword templates in the domain is calculated, so as to obtain the degree of intended matching of the domain with the text information corresponding to the voice to be processed.
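A minimal sketch of the corpus-based relevance weights and the intention matching degree described above; the corpus and keyword sets are illustrative.

```python
from collections import Counter

def relevance_weights(domain_corpus):
    """Weight of a keyword in a domain = its occurrence count / number of corpus sentences in that domain."""
    n_sentences = len(domain_corpus)
    counts = Counter(kw for sentence in domain_corpus for kw in set(sentence))
    return {kw: c / n_sentences for kw, c in counts.items()}

def intention_matching_degree(matched_template_keywords, weights) -> float:
    """Sum of relevance weights of the keywords contained in the matched target keyword templates."""
    return sum(weights.get(kw, 0.0) for kw in matched_template_keywords)

music_corpus = [["listen", "music"], ["listen", "song", "singer"], ["listen", "music", "relaxing"]]
weights = relevance_weights(music_corpus)
print(intention_matching_degree({"listen", "song"}, weights))  # 1.0 + 1/3
```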
Therefore, the domain which is wanted to be accessed by the sender of the voice to be processed can be determined by comparing according to the intention matching degree, so that the intention information corresponding to the voice to be processed is determined, and the accuracy of determining the intention information is ensured.
Based on the embodiment shown in fig. 2, fig. 8 is a schematic flow chart of determining emotion feature vectors of content to be pushed, which is further included in the method for processing voice data according to an embodiment of the present application. Referring to fig. 8, determining an emotion feature vector of a content to be pushed includes at least steps S810 to S820, which are described in detail below:
In step S810, performing emotion matching on the content to be pushed by using a pre-trained neural network model, so as to obtain emotion matching values of the content to be pushed corresponding to each preset emotion type.
In one embodiment of the present application, emotion matching may be performed on each content to be pushed according to a neural network model trained in advance, so that the neural network model outputs emotion matching values of the content to be pushed corresponding to each preset emotion type. It should be noted that the neural network model may be an existing emotion recognition model, which is not described herein.
An example is shown in Table 1 (emotion matching values of the content to be pushed for each preset emotion type).
In step S820, according to the emotion matching values of the content to be pushed corresponding to the preset emotion types, emotion feature vectors corresponding to the content to be pushed are generated, and the emotion feature vectors are associated with the content to be pushed respectively.
In an embodiment of the present application, emotion matching values of content to be pushed corresponding to each preset emotion type may be arranged according to a predetermined format, so as to generate an emotion feature vector corresponding to the content to be pushed, and the generation method may be as described above, which is not described herein in detail.
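A minimal sketch of building the correspondence table between content and emotion feature vectors; emotion_model here is a stand-in for the pre-trained neural network, which the patent does not specify, so its interface is assumed.

```python
import numpy as np

EMOTION_ORDER = ["happy", "love", "surprise", "neutral", "sad", "fear", "anger"]

def build_emotion_feature_table(contents: dict, emotion_model) -> dict:
    """contents: content id -> text/description. emotion_model is assumed to return a dict of
    per-emotion matching values for a piece of content, e.g. {"happy": 0.8, "sad": 0.1, ...}."""
    table = {}
    for content_id, text in contents.items():
        scores = emotion_model(text)
        table[content_id] = np.array([scores.get(e, 0.0) for e in EMOTION_ORDER])
    return table  # correspondence table: content id -> emotion feature vector
```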
In the embodiment shown in fig. 8, the neural network model is adopted to perform emotion matching on the content to be pushed, so that the efficiency of emotion matching can be greatly improved, and the accuracy of emotion matching can be ensured.
Fig. 9 is a schematic flow chart of modifying emotion matching values of content to be pushed, which is further included in the method for processing voice data according to an embodiment of the present application, based on the embodiments shown in fig. 2 and 8. Referring to fig. 9, the modification of the emotion matching value of the content to be pushed at least includes steps S910 to S920, which are described in detail as follows:
in step S910, an emotion matching value correction interface is displayed according to the request for correcting the emotion matching value.
In one embodiment of the present application, the correction request for the emotion matching value may be information for requesting correction of the emotion matching value of the content to be pushed. In one example, one skilled in the art may generate and send a request for correction of an emotion match value by clicking on a particular area on the display interface of the terminal device (e.g., a "correct emotion match value" key, etc.).
When the server receives the correction request, an emotion matching value correction interface can be displayed on the display interface of the terminal device. The correction interface can include the correspondence between the content to be pushed and its emotion matching values for each preset emotion type. A person skilled in the art may select one of the contents to be pushed to determine which content to modify, and input the correct emotion matching value through an input device configured on the terminal device (such as a keyboard or a touch screen); for example, the emotion matching value of content A to be pushed corresponding to fear may be modified from 0.2 to 0.1, and so on.
In step S920, according to the correction information for emotion matching values obtained by the emotion matching value correction interface, the emotion matching values of the content to be pushed corresponding to each preset emotion type are corrected, so as to obtain corrected emotion matching values of the content to be pushed corresponding to each preset emotion type.
In one embodiment of the application, the emotion matching values of the content to be pushed for each preset emotion type are updated according to the correction information for the emotion matching values obtained through the correction interface, and the emotion feature vector corresponding to the content to be pushed is regenerated from the corrected emotion matching values.
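A minimal sketch of applying a correction and regenerating the emotion feature vector; the table layout and names are illustrative, since the patent describes only the correction flow, not an API.

```python
import numpy as np

EMOTION_ORDER = ["happy", "love", "surprise", "neutral", "sad", "fear", "anger"]

def apply_correction(emotion_table: dict, content_id: str, corrections: dict) -> np.ndarray:
    """Overwrite the corrected per-emotion matching values and rebuild the feature vector."""
    vec = emotion_table[content_id].copy()
    for emotion, new_value in corrections.items():
        vec[EMOTION_ORDER.index(emotion)] = new_value
    emotion_table[content_id] = vec
    return vec

# e.g. correct content A's "fear" value from 0.2 to 0.1, as in the example above
table = {"content_A": np.array([0.2, 0.3, 0.6, 0.3, 0.1, 0.2, 0.1])}
print(apply_correction(table, "content_A", {"fear": 0.1}))
```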
In the embodiment shown in fig. 9, by setting the emotion matching value correction interface, it is convenient for those skilled in the art to audit and correct the emotion matching value of the content to be pushed, so as to ensure the accuracy of the emotion matching value corresponding to the content to be pushed, and facilitate the follow-up accurate recommendation.
Based on the technical solutions of the above embodiments, a specific application scenario of the embodiments of the present application is described below:
referring to fig. 10 and 11, fig. 10 is a schematic diagram illustrating an exemplary system architecture to which the technical solution of the embodiment of the present application may be applied. Fig. 11 shows a flow diagram of a method of processing voice data according to an embodiment of the application.
Referring to fig. 10, the system architecture may include a terminal device, an AI access layer, an emotion analysis system, a skill center control layer, a domain layer, and a content recommendation system.
Referring to fig. 10 and 11, in an embodiment of the present application, a terminal device may send a voice to be processed obtained by a voice collecting module to an AI access layer, where the AI access layer sends the voice to be processed to an emotion analysis system, and the emotion analysis system may perform emotion analysis on the voice to be processed by using a neural network model trained in advance to obtain an emotion analysis result corresponding to the voice to be processed, and feed back the emotion analysis result to the AI access layer.
Meanwhile, the AI access layer can also perform intention analysis on the voice to be processed to obtain intention information corresponding to the voice to be processed, and the AI access layer sends the intention information corresponding to the voice to be processed and the emotion analysis result to the skill center control layer.
The skill center control layer can determine, according to the intention information corresponding to the voice to be processed, a target domain (such as music, video, chat, or another domain) to be accessed by the sender of the voice to be processed from the domain layer, and acquire corresponding response information from the domain layer. The skill center control layer then sends the intention information and the emotion analysis result to the content recommendation system, so that the content recommendation system can select target push content from the content to be pushed according to the intention information and the emotion analysis result, and generate feedback information for the voice to be processed according to the response information and the target push content. The content recommendation system feeds the feedback information back to the skill center control layer, which finally sends it to the terminal device to be fed back to the sender of the voice to be processed.
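To make the data flow between the layers easier to follow, a hypothetical, self-contained sketch of the pipeline is given below; each stub stands in for one subsystem of fig. 10 (AI access layer, emotion analysis system, skill center control layer, domain layer, content recommendation system), and none of the function names or return values come from the patent.

```python
# Hypothetical end-to-end sketch of the request flow in figs. 10 and 11.
def emotion_analysis(voice: bytes) -> list:
    # Stand-in for the emotion analysis system's pre-trained neural network.
    return [0.1, 0.05, 0.05, 0.1, 0.7]

def intention_analysis(voice: bytes) -> dict:
    # Stand-in for intention analysis performed via the AI access layer.
    return {"domain": "music", "keywords": ["play", "song"]}

def domain_response(intention: dict) -> str:
    # Stand-in for the response information acquired from the target domain in the domain layer.
    return "Playing a song for you."

def recommend_content(intention: dict, emotion_vector: list) -> str:
    # Stand-in for the content recommendation system's selection of target push content.
    return "a relaxing playlist"

def handle_voice(voice: bytes) -> str:
    emotion_vector = emotion_analysis(voice)                    # emotion analysis system
    intention = intention_analysis(voice)                       # AI access layer
    response = domain_response(intention)                       # skill center control layer -> domain layer
    target_push = recommend_content(intention, emotion_vector)  # content recommendation system
    return f"{response} You may also like {target_push}."       # feedback to the terminal device

print(handle_voice(b"\x00"))
```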
The following describes an embodiment of the apparatus of the present application, which may be used to perform the processing method of voice data in the above embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method for processing voice data described above.
Fig. 12 shows a block diagram of a processing device for speech data according to an embodiment of the application.
Referring to fig. 12, a processing apparatus for voice data according to an embodiment of the present application includes:
the intention analysis module 1210 is configured to perform intention analysis on a voice to be processed, and determine intention information corresponding to the voice to be processed;
the emotion analysis module 1220 is configured to perform emotion analysis on the voice to be processed to obtain an emotion analysis result of the voice to be processed;
the matching degree calculating module 1230 is configured to calculate, according to the emotion analysis result and an emotion feature vector of the content to be pushed, an emotion matching degree between the content to be pushed and the voice to be processed;
an information determining module 1240, configured to determine feedback information for the voice to be processed according to the intention information, the emotion matching degree, and the content to be pushed.
In some embodiments of the application, based on the foregoing, emotion analysis module 1220 is configured to: carrying out emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to each preset emotion type; generating emotion analysis vectors of the voice to be processed according to emotion matching values of the voice to be processed corresponding to each preset emotion type, and taking the emotion analysis vectors as emotion analysis results.
In some embodiments of the application, based on the foregoing, the matching degree calculating module 1230 is configured to: multiply the emotion feature vector of the content to be pushed by the emotion analysis vector to obtain the emotion matching degree between the content to be pushed and the voice to be processed.
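This calculation corresponds to multiplying data at corresponding positions of the two vectors and summing the products, as recited in claim 1; a minimal, self-contained sketch follows, with the vector values assumed for illustration.

```python
# Minimal sketch of the emotion matching degree as the position-wise product sum
# (dot product) of the content's emotion feature vector and the voice's emotion
# analysis vector. The vectors below are illustrative.
def emotion_matching_degree(content_vector: list, analysis_vector: list) -> float:
    """Multiply values at corresponding positions and sum the products."""
    return sum(c * a for c, a in zip(content_vector, analysis_vector))

voice_vector = [0.6, 0.1, 0.05, 0.2, 0.05]     # emotion analysis vector of the voice to be processed
content_vector = [0.7, 0.05, 0.03, 0.2, 0.02]  # emotion feature vector of content A to be pushed
print(emotion_matching_degree(content_vector, voice_vector))  # ~0.4675
```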
In some embodiments of the present application, based on the foregoing scheme, the information determining module 1240 is configured to: determining response information to the voice to be processed according to the intention information; calculating the content matching degree between the content to be pushed and the sender of the voice to be processed according to the interest feature vector of the sender of the voice to be processed and the content feature vector of the content to be pushed; determining target push content in the content to be pushed according to the emotion matching degree and the content matching degree; and combining the response information with the target push content to generate feedback information aiming at the voice to be processed.
In some embodiments of the present application, based on the foregoing scheme, the information determining module 1240 is configured to: acquiring importance weights corresponding to the emotion matching degrees and importance weights corresponding to the content matching degrees; calculating a recommendation value of the content to be pushed according to the emotion matching degree and the importance weight corresponding to the emotion matching degree and the content matching degree and the importance weight corresponding to the content matching degree; and selecting the content to be pushed with the maximum recommended value from the content to be pushed as target push content.
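As a hedged illustration of this weighting scheme, the sketch below combines the two matching degrees with importance weights and selects the candidate with the largest recommendation value; the weights and candidate scores are assumed for the example.

```python
# Illustrative sketch: recommendation value as a weighted sum of the emotion
# matching degree and the content matching degree; the candidate with the
# largest value becomes the target push content.
def recommendation_value(emotion_match: float, content_match: float,
                         w_emotion: float = 0.4, w_content: float = 0.6) -> float:
    return w_emotion * emotion_match + w_content * content_match

candidates = {
    "content_a": (0.47, 0.80),   # (emotion matching degree, content matching degree)
    "content_b": (0.90, 0.30),
    "content_c": (0.55, 0.60),
}
target_push = max(candidates, key=lambda name: recommendation_value(*candidates[name]))
print(target_push)  # "content_a" under the assumed weights and scores
```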
In some embodiments of the present application, based on the foregoing, the intent analysis module 1210 is configured to: performing voice recognition on voice to be processed to obtain text information corresponding to the voice to be processed; word segmentation is carried out on the text information, and keywords contained in the text information are obtained; matching the keywords with preset keyword templates in each field, and determining the intention matching degree of the text information and each field; and determining intention information corresponding to the voice to be processed according to the intention matching degree.
In some embodiments of the present application, based on the foregoing, the intent analysis module 1210 is configured to: comparing the keywords with preset keyword templates in each field to determine target keyword templates containing the keywords in each field; acquiring the correlation weight of keywords contained in each target keyword template in the corresponding field; and calculating the sum of the correlation weights of the keywords contained in each target keyword template to obtain the intention matching degree between the text information and each field.
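For illustration, the sketch below sums, per domain, the correlation weights of the keywords that appear in that domain's preset keyword templates and picks the best-matching domain; the domains, keywords, and weights are assumptions, not values from the embodiment.

```python
# Illustrative sketch of the intention matching degree between recognized text
# and each domain, computed as the sum of correlation weights of matched keywords.
DOMAIN_TEMPLATES = {
    "music": {"play": 0.6, "song": 0.5, "singer": 0.4},
    "video": {"play": 0.3, "movie": 0.7, "episode": 0.5},
    "chat":  {"hello": 0.5, "chat": 0.6},
}

def intention_matching_degrees(keywords: list) -> dict:
    """Sum, per domain, the correlation weights of keywords found in its templates."""
    return {
        domain: sum(weights[k] for k in keywords if k in weights)
        for domain, weights in DOMAIN_TEMPLATES.items()
    }

keywords = ["play", "song"]                      # obtained by word segmentation of the text information
degrees = intention_matching_degrees(keywords)   # e.g. {"music": 1.1, "video": 0.3, "chat": 0}
target_domain = max(degrees, key=degrees.get)    # "music", used to determine the intention information
```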
In some embodiments of the application, based on the foregoing, emotion analysis module 1220 is further configured to: performing emotion matching on the content to be pushed by adopting a pre-trained neural network model to obtain emotion matching values of the content to be pushed corresponding to preset emotion types; and generating emotion feature vectors corresponding to the content to be pushed according to emotion matching values of the content to be pushed corresponding to each preset emotion type, and respectively associating the emotion feature vectors with the content to be pushed.
In some embodiments of the present application, based on the foregoing solution, after performing emotion matching on the content to be pushed by using a pre-trained neural network model, to obtain emotion matching values of the content to be pushed corresponding to each preset emotion type, the emotion analysis module 1220 is further configured to: displaying an emotion matching value correction interface according to the correction request of the emotion matching value; and correcting the emotion matching value of the content to be pushed corresponding to each preset emotion type according to the correction information for the emotion matching value, which is obtained by the emotion matching value correction interface, so as to obtain corrected emotion matching values of the content to be pushed corresponding to each preset emotion type.
Fig. 13 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system of the electronic device shown in fig. 13 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 13, the computer system includes a central processing unit (Central Processing Unit, CPU) 1301 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a random access Memory (Random Access Memory, RAM) 1303, for example, performing the method described in the above embodiment. In the RAM 1303, various programs and data required for the system operation are also stored. The CPU 1301, ROM 1302, and RAM 1303 are connected to each other through a bus 1304. An Input/Output (I/O) interface 1305 is also connected to bus 1304.
The following components are connected to the I/O interface 1305: an input section 1306 including a keyboard, a mouse, and the like; an output portion 1307 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage portion 1308 including a hard disk or the like; and a communication section 1309 including a network interface card such as a local area network (LAN) card or a modem. The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read therefrom is installed into the storage portion 1308 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1309 and/or installed from the removable medium 1311. When executed by a Central Processing Unit (CPU) 1301, performs various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be provided in a processor, where the names of the units do not, in some cases, constitute a limitation of the units themselves.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for processing voice data, comprising:
performing intention analysis on voice to be processed, and determining intention information corresponding to the voice to be processed;
carrying out emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to each preset emotion type;
generating emotion analysis vectors of the voice to be processed according to emotion matching values of the voice to be processed corresponding to each preset emotion type;
multiplying the emotion feature vector of the content to be pushed by the data at the corresponding position in the emotion analysis vector, and adding the data products at all positions to obtain the emotion matching degree between the content to be pushed and the voice to be processed;
And determining feedback information aiming at the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
2. The method of claim 1, wherein determining feedback information for the speech to be processed based on the intent information, the emotion matching degree, and the content to be pushed comprises:
determining response information to the voice to be processed according to the intention information;
calculating the content matching degree between the content to be pushed and the sender of the voice to be processed according to the interest feature vector of the sender of the voice to be processed and the content feature vector of the content to be pushed;
determining target push content in the content to be pushed according to the emotion matching degree and the content matching degree;
and combining the response information with the target push content to generate feedback information aiming at the voice to be processed.
3. The method of claim 2, wherein determining the target push content of the content to be pushed based on the emotion matching degree and the content matching degree comprises:
acquiring importance weights corresponding to the emotion matching degrees and importance weights corresponding to the content matching degrees;
Calculating a recommendation value of the content to be pushed according to the emotion matching degree and the importance weight corresponding to the emotion matching degree and the content matching degree and the importance weight corresponding to the content matching degree;
and selecting the content to be pushed with the maximum recommended value from the content to be pushed as target push content.
4. The method of claim 1, wherein performing intent analysis on the speech to be processed to determine intent information corresponding to the speech to be processed comprises:
performing voice recognition on voice to be processed to obtain text information corresponding to the voice to be processed;
word segmentation is carried out on the text information, and keywords contained in the text information are obtained;
matching the keywords with preset keyword templates in each field, and determining the intention matching degree of the text information and each field;
and determining intention information corresponding to the voice to be processed according to the intention matching degree.
5. The method of claim 4, wherein matching the keywords with keyword templates preset in each field, and determining the degree of intended matching of the text information with each field comprises:
comparing the keywords with preset keyword templates in each field to determine target keyword templates containing the keywords in each field;
Acquiring the correlation weight of keywords contained in each target keyword template in the corresponding field;
and calculating the sum of the correlation weights of the keywords contained in each target keyword template to obtain the intention matching degree between the text information and each field.
6. The method according to claim 1, wherein the method further comprises:
performing emotion matching on the content to be pushed by adopting a pre-trained neural network model to obtain emotion matching values of the content to be pushed corresponding to preset emotion types;
and generating emotion feature vectors corresponding to the content to be pushed according to emotion matching values of the content to be pushed corresponding to each preset emotion type, and respectively associating the emotion feature vectors with the content to be pushed.
7. The method of claim 6, wherein after emotion matching the content to be pushed using a pre-trained neural network model to obtain emotion matching values for the content to be pushed for each preset emotion type, the method further comprises:
displaying an emotion matching value correction interface according to the correction request of the emotion matching value;
And correcting the emotion matching value of the content to be pushed corresponding to each preset emotion type according to the correction information for the emotion matching value, which is obtained by the emotion matching value correction interface, so as to obtain corrected emotion matching values of the content to be pushed corresponding to each preset emotion type.
8. A processing apparatus for voice data, comprising:
the intention analysis module is used for carrying out intention analysis on the voice to be processed and determining intention information corresponding to the voice to be processed;
the emotion analysis module is used for carrying out emotion recognition according to the voice to be processed to obtain emotion matching values of the voice to be processed corresponding to each preset emotion type; generating emotion analysis vectors of the voice to be processed according to emotion matching values of the voice to be processed corresponding to each preset emotion type;
the matching degree calculation module is used for multiplying the emotion feature vector of the content to be pushed by the data at the corresponding position in the emotion analysis vector, and adding the data products at all positions to obtain the emotion matching degree between the content to be pushed and the voice to be processed;
and the information determining module is used for determining feedback information aiming at the voice to be processed according to the intention information, the emotion matching degree and the content to be pushed.
9. A computer readable medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of processing speech data according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more computer programs that, when executed by the one or more processors, cause the electronic device to implement the method of processing speech data as claimed in any one of claims 1 to 7.
CN202010855563.4A 2020-08-20 2020-08-20 Voice data processing method and device Active CN111883131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010855563.4A CN111883131B (en) 2020-08-20 2020-08-20 Voice data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010855563.4A CN111883131B (en) 2020-08-20 2020-08-20 Voice data processing method and device

Publications (2)

Publication Number Publication Date
CN111883131A CN111883131A (en) 2020-11-03
CN111883131B true CN111883131B (en) 2023-10-27

Family

ID=73203606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010855563.4A Active CN111883131B (en) 2020-08-20 2020-08-20 Voice data processing method and device

Country Status (1)

Country Link
CN (1) CN111883131B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763947B (en) * 2021-01-15 2024-04-05 北京沃东天骏信息技术有限公司 Voice intention recognition method and device, electronic equipment and storage medium
CN113158052B (en) * 2021-04-23 2023-08-01 平安银行股份有限公司 Chat content recommendation method, chat content recommendation device, computer equipment and storage medium
CN113470644B (en) * 2021-06-29 2023-09-26 读书郎教育科技有限公司 Intelligent voice learning method and device based on voice recognition

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007148039A (en) * 2005-11-28 2007-06-14 Matsushita Electric Ind Co Ltd Speech translation device and speech translation method
CN108959243A (en) * 2018-05-17 2018-12-07 中国电子科技集团公司第二十八研究所 A kind of general public opinion information emotion identification method of user oriented role
CN109829117A (en) * 2019-02-27 2019-05-31 北京字节跳动网络技术有限公司 Method and apparatus for pushed information
TW201937344A (en) * 2018-03-01 2019-09-16 鴻海精密工業股份有限公司 Smart robot and man-machine interaction method
CN110705308A (en) * 2019-09-18 2020-01-17 平安科技(深圳)有限公司 Method and device for recognizing field of voice information, storage medium and electronic equipment
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111309965A (en) * 2020-03-20 2020-06-19 腾讯科技(深圳)有限公司 Audio matching method and device, computer equipment and storage medium
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN111475714A (en) * 2020-03-17 2020-07-31 北京声智科技有限公司 Information recommendation method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN111883131A (en) 2020-11-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant