CN114627881B - Voice call processing method and system based on artificial intelligence - Google Patents

Voice call processing method and system based on artificial intelligence

Info

Publication number
CN114627881B
CN114627881B (application CN202210337182A)
Authority
CN
China
Prior art keywords
audio data
subjected
processed
determining
security analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210337182.6A
Other languages
Chinese (zh)
Other versions
CN114627881A (en)
Inventor
陈晶
赵斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Caian Financial Services Group Co ltd
Original Assignee
Shanghai Caian Financial Services Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Caian Financial Services Group Co ltd
Priority to CN202210337182.6A
Publication of CN114627881A
Application granted
Publication of CN114627881B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M7/00 - Arrangements for interconnection between switching centres
    • H04M7/006 - Networks other than PSTN/ISDN providing telephone service, e.g. Voice over Internet Protocol (VoIP), including next generation networks with a packet-switched transport layer
    • H04M7/0078 - Security; Fraud detection; Fraud prevention

Abstract

In the artificial intelligence-based voice call processing method and system, the original user voiceprint descriptions determined for multiple groups of first to-be-processed fragmented audio data are sorted according to the distribution of those fragments within the session, so that a target user voiceprint description is obtained. The voice interaction terminal to be subjected to security analysis is then identified in the first to-be-processed fragmented audio data through the sorted target user voiceprint description. This allows the comparative analysis between the target user voiceprint description and the actual target user voiceprint description corresponding to the terminal to be determined, so that whether the terminal's voice session information is machine-synthesized can be determined accurately and reliably, which in turn guarantees the precision and reliability of the call security analysis condition.

Description

Voice call processing method and system based on artificial intelligence
Technical Field
The present application relates to the field of voice processing technologies, and in particular, to a voice call processing method and system based on artificial intelligence.
Background
With the rapid development of the mobile internet, voice-based interaction has become increasingly frequent. While this enriches people's lives to a certain extent, it also carries risks: voice information produced by a terminal such as a mobile phone or tablet may be machine-synthesized, enabling telephone fraud and the spread of fraudulent information. To avoid these problems, the voice information generated by such terminals must be analyzed for security. The inventors found, however, that related speech information analysis schemes leave room for improvement; in particular, current techniques for detecting machine synthesis struggle to guarantee accuracy and reliability.
Disclosure of Invention
In order to solve the technical problems in the related art, the application provides a voice call processing method and system based on artificial intelligence.
In a first aspect, an embodiment of the present application provides an artificial intelligence-based voice call processing method, which is applied to an artificial intelligence-based voice call processing cloud platform, and the method at least includes: determining first voice session information corresponding to a voice interaction terminal to be subjected to security analysis; determining multiple groups of first to-be-processed fragmented audio data from the first voice conversation information, and determining original user voiceprint descriptions of each group of first to-be-processed fragmented audio data; the original user voiceprint description comprises a user voiceprint description of the first to-be-processed fragmented audio data at a call scene level; according to the distribution of the fragmented audio data of the plurality of groups of first to-be-processed fragmented audio data in the first voice conversation information, sorting original user voiceprint descriptions of each group of first to-be-processed fragmented audio data to obtain target user voiceprint descriptions; and determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information according to the target user voiceprint description.
With this design, the determined original user voiceprint description of the first to-be-processed fragmented audio data yields not only the basic user voiceprint description (for example, descriptions reflecting the fragment's voiceprint features and/or the category of each piece of topic information in the fragment) but also the user voiceprint description at the call scene level, which improves the diversity and completeness of the determined descriptions. Sorting the multiple original user voiceprint descriptions according to the distribution of the groups of first to-be-processed fragmented audio data then produces a target user voiceprint description that carries sequence-order labels. Identifying the voice interaction terminal to be subjected to security analysis in the first to-be-processed fragmented audio data through this sorted description makes it possible to determine the comparative analysis between the target user voiceprint description and the terminal's actual target user voiceprint description, including comparison of the sequence-order labels and comparison between the user voiceprint descriptions themselves. By determining this comparative analysis, whether the terminal's voice session information is machine-synthesized can be determined accurately and reliably, guaranteeing the precision and reliability of the call security analysis condition.
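The processing chain above is described only functionally. As a rough, hypothetical sketch in Python (all names are invented here, and the trivial per-fragment statistics merely stand in for a real neural voiceprint extractor), splitting a session into fragments, describing each one, and sorting the descriptions by fragment distribution could look like:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Fragment:
    position: int          # where the fragment sits in the session
    samples: List[float]   # raw audio samples of the fragment

def extract_voiceprint(frag: Fragment) -> List[float]:
    # Stand-in feature extractor: a real system would use a trained
    # voiceprint model; here, two trivial statistics per fragment.
    n = len(frag.samples) or 1
    mean = sum(frag.samples) / n
    energy = sum(s * s for s in frag.samples) / n
    return [mean, energy]

def target_voiceprint(fragments: List[Fragment]) -> List[float]:
    # Sort the per-fragment (original) descriptions by the fragments'
    # distribution in the session, so the combined (target) description
    # carries sequence-order information.
    ordered = sorted(fragments, key=lambda f: f.position)
    desc: List[float] = []
    for frag in ordered:
        desc.extend(extract_voiceprint(frag))
    return desc
```

The sorted description could then be compared against the terminal's enrolled description to decide whether the session is machine-synthesized; that comparison step is omitted from this sketch.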
In one possible embodiment, the first voice conversation information comprises fragmented audio data of the voice interaction terminal to be subjected to security analysis, which is collected under various noise environments; the determining a plurality of groups of first to-be-processed fragmented audio data from the first voice conversation information includes: and determining at least one group of fragmented audio data corresponding to each noise environment in the multiple noise environments from the first voice conversation information to obtain multiple groups of first to-be-processed fragmented audio data.
Because different noise environments impose different effects on the voice interaction terminal to be subjected to security analysis, the user voiceprint descriptions of the first to-be-processed fragments collected from the same terminal under different noise environments deviate from one another. With fragments determined under different noise environments, the comparative analysis of the tone characteristic information of the terminal across the fragments, and the comparative analysis between the user voiceprint descriptions, can therefore be determined accurately. The call security analysis condition of the terminal can then be determined accurately from these deviations together with the comparative analysis data actually carried by the terminal.
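As a small illustration of selecting at least one fragment group per noise environment, assuming fragments arrive paired with environment labels (an assumption made for this sketch, not stated in the patent):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def group_by_noise(labelled: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    # labelled: (noise_environment, fragment_id) pairs taken from the
    # first voice session information; returns at least one fragment
    # group per observed noise environment.
    groups: Dict[str, List[str]] = defaultdict(list)
    for env, frag_id in labelled:
        groups[env].append(frag_id)
    return dict(groups)
```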
In a possible embodiment, before the determining, through the target user voiceprint description, a call security analysis condition of the voice interaction terminal to be security analyzed in the first voice session information, the method further includes: determining at least one group of second to-be-processed fragmented audio data from the first voice conversation information, respectively performing significant text content identification on each group of second to-be-processed fragmented audio data, and respectively determining significant text content corresponding to a first set interaction event of a voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data; determining a set number of target salient text contents from the salient text contents; determining a first recognition possibility corresponding to a voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data through each group of second to-be-processed fragmented audio data and target significant text content corresponding to each group of second to-be-processed fragmented audio data; the determining, according to the voiceprint description of the target user, a call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information includes: and determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice conversation information according to the first recognition possibility corresponding to each group of second fragmented audio data to be processed and the voiceprint description of the target user.
Because the description content of the topic information corresponding to the target significant text content expresses the user voiceprint description of the voice interaction terminal to be subjected to security analysis most strongly, performing significant text content recognition on a group of second to-be-processed fragmented audio data first determines the significant text content corresponding to the terminal's first set interaction event in the fragments, from which accurate target significant text content can be selected. Identifying the terminal through the target significant text content together with the second to-be-processed fragments then focuses not only on the overall user voiceprint description of the fragments but also on the portion corresponding to the target significant text content, so an accurate and reliable first recognition possibility can be obtained.
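Picking "a set number of target significant text contents" reduces, in the simplest reading, to a top-k selection by salience score. The scores and the scoring itself are assumptions of this sketch, not taken from the patent:

```python
from typing import List, Tuple

def top_significant(contents: List[Tuple[str, float]], k: int) -> List[str]:
    # contents: (significant_text_content, salience_score) pairs;
    # keep the set number k of target significant text contents.
    ranked = sorted(contents, key=lambda c: c[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```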
In a possible embodiment, determining, through the target significant text content and the second to-be-processed fragmented audio data, a first recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: determining a set word vector corresponding to the first set interaction event; determining a real word vector corresponding to the first set interaction event through the word vectors of the topic information corresponding to each target significant text content; determining a target mapping list through the set word vector and the real word vector; performing word vector mapping on the topic information corresponding to each significant text content through the target mapping list; and determining the first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data through the second to-be-processed fragmented audio data and each significant text content mapped by the word vector.
Here the set word vector is the expected word vector of the target significant text content corresponding to the first set interaction event of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data, and the real word vector is the actual word vector of that content. The determined target mapping list enables word vector mapping of the topic information corresponding to each significant text content, producing word vectors that meet the requirements, and the terminal is then identified through the mapped word vectors.
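The patent does not specify how the "target mapping list" is built from the set and real word vectors. One minimal interpretation, entirely an assumption of this sketch, is a per-dimension correction that maps the real word vector onto the set word vector and is then applied to every topic-information word vector:

```python
from typing import List

def build_mapping(set_vec: List[float], real_vec: List[float],
                  eps: float = 1e-9) -> List[float]:
    # Per-dimension scale factors taking the real word vector to the
    # set word vector; a stand-in for the "target mapping list".
    return [s / (r if abs(r) > eps else eps)
            for s, r in zip(set_vec, real_vec)]

def apply_mapping(mapping: List[float], vec: List[float]) -> List[float]:
    # Word vector mapping of one topic-information vector.
    return [m * v for m, v in zip(mapping, vec)]
```

Applying the mapping built from the pair (set, real) to the real vector itself recovers the set vector, which is the sense in which the mapped vectors "meet the requirements" here.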
In a possible embodiment, the determining, through the second to-be-processed fragmented audio data and each significant text content mapped by the word vector, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: performing word vector mapping on the second to-be-processed fragmented audio data through the target mapping list to obtain mapped fragmented audio data; extracting target fragmented audio data corresponding to the first set interaction event of the voice interaction terminal to be subjected to security analysis from the mapped fragmented audio data through each significant text content mapped by the word vector; and determining the first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data through the target fragmented audio data and the second to-be-processed fragmented audio data.
With this design, the target fragmented audio data corresponding to the first set interaction event of the voice interaction terminal to be subjected to security analysis can be obtained from the word-vector-mapped significant text contents and the mapped fragmented audio data. The target fragments provide the partial user voiceprint description that meets the requirements, while the second to-be-processed fragments provide the overall description before mapping. Using both user voiceprint descriptions together, the correspondence between the terminal's description content and its actual description content can be determined more comprehensively, so an accurate first recognition possibility is obtained.
In a possible embodiment, the determining, through the target fragmented audio data and the second to-be-processed fragmented audio data, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: determining a first attention description content of the target fragmented audio data and a second attention description content of the second to-be-processed fragmented audio data; merging the first attention description content and the second attention description content to obtain a third attention description content; and determining the first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data through the third attention description content.
In this way, merging the attention description contents yields a third attention description content that includes both the stage-level attention description content corresponding to the target fragmented audio data and the overall attention description content corresponding to the second to-be-processed fragmented audio data, improving the comprehensiveness and integrity of the attention description content used to identify the voice interaction terminal to be subjected to security analysis, so a more accurate first recognition possibility can be obtained. Moreover, processing the merged third attention description content realizes parallel analysis of the first and second attention description contents, which improves recognition efficiency.
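Merging the stage-level and overall attention description contents and scoring them in one pass might, under this sketch's assumptions (feature vectors standing in for description contents, a fixed linear scorer standing in for a trained model), look like:

```python
import math
from typing import List

def merge_attention(local_desc: List[float],
                    global_desc: List[float]) -> List[float]:
    # Third attention description content = stage-level description of
    # the target fragments plus overall description of the second
    # to-be-processed fragments.
    return local_desc + global_desc

def recognition_possibility(merged: List[float],
                            weights: List[float]) -> float:
    # A single pass over the merged description analyses both parts
    # "in parallel"; the sigmoid maps the score to a possibility.
    score = sum(w * x for w, x in zip(weights, merged))
    return 1.0 / (1.0 + math.exp(-score))
```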
In a possible embodiment, the determining, according to the first recognition possibility and the target user voiceprint description corresponding to each group of second to-be-processed fragmented audio data, a call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information includes: determining a second recognition possibility corresponding to the voice interaction terminal to be subjected to the security analysis according to the target user voiceprint description; and determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis according to the first recognition possibility and the second recognition possibility.
Therefore, through the target user voiceprint description, the combined second recognition possibility of the multiple groups of first to-be-processed fragmented audio data corresponding to the voice interaction terminal to be subjected to security analysis can be determined, and the call security analysis condition is then determined through the first recognition possibility corresponding to the groups of second to-be-processed fragmented audio data together with this combined second recognition possibility. The determined call security analysis condition thus reflects user voiceprint descriptions at multiple levels, such as the call scene level and the attention description level, which improves its comprehensiveness and accuracy.
In a possible embodiment, the determining, through the first recognition possibility and the second recognition possibility, the call security analysis condition of the voice interaction terminal to be subjected to security analysis includes: determining a first set confidence corresponding to the first recognition possibility and a second set confidence corresponding to the second recognition possibility; determining a target possibility through the first recognition possibility, the first set confidence, the second recognition possibility, and the second set confidence; and on the basis that the target possibility is greater than a set judgment value, determining that the call security analysis condition includes that the voice interaction terminal to be subjected to security analysis is a voice interaction terminal passing voiceprint security verification.
Therefore, combining the recognition possibilities through the set confidences to determine the call security analysis condition can effectively improve the accuracy and comprehensiveness of the determined call security analysis condition.
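Combining the two recognition possibilities through their set confidences and comparing the result against the set judgment value can be sketched as follows; the weighted sum is one plausible combination, as the patent does not fix the formula, and the default threshold is an assumed parameter:

```python
from typing import Tuple

def call_security_decision(p1: float, c1: float,
                           p2: float, c2: float,
                           judgment_value: float = 0.5) -> Tuple[float, bool]:
    # Target possibility = confidence-weighted blend of the first and
    # second recognition possibilities (confidences assumed to sum to 1).
    target = p1 * c1 + p2 * c2
    # The terminal passes voiceprint security verification only when the
    # target possibility exceeds the set judgment value.
    return target, target > judgment_value
```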
In a possible embodiment, the determining the first voice session information corresponding to the voice interaction terminal to be subjected to the security analysis includes: determining second voice session information corresponding to the voice interaction terminal to be subjected to security analysis; determining the original recognition condition of the voice interaction terminal to be subjected to the safety analysis through the second voice conversation information and a set tone characteristic queue; and determining the first voice session information corresponding to the voice interaction terminal to be subjected to the security analysis on the basis that the original recognition condition indicates that the voice interaction terminal to be subjected to the security analysis is the voice interaction terminal passing the voiceprint security verification.
Therefore, the voice interaction terminal to be subjected to the security analysis can be identified in advance through the second voice session information, and on the basis that the voice interaction terminal to be subjected to the security analysis does not accurately execute any set tone characteristic corresponding to the set tone characteristic queue, whether the voice session information of the voice interaction terminal to be subjected to the security analysis is machine-synthesized or not can be directly determined, so that the efficiency and the accuracy for determining the call security analysis condition can be improved.
In a possible embodiment, the determining, by using the second voice session information and the set tone feature queue, an original recognition condition of the voice interaction terminal to be subjected to security analysis includes: performing tone color feature mining on the voice interaction terminal to be subjected to security analysis in each group of third to-be-processed fragmented audio data in the second voice session information to obtain tone color features to be subjected to security analysis corresponding to the voice interaction terminal to be subjected to security analysis in the third to-be-processed fragmented audio data; and determining the original recognition condition of the voice interaction terminal to be subjected to the safety analysis according to the tone characteristic to be subjected to the safety analysis corresponding to each group of the third fragmented audio data to be processed and the set tone characteristic queue.
Therefore, by mining the tone characteristics of the voice interaction terminal to be subjected to security analysis, the tone characteristics to be security-analyzed that the terminal executes can be determined accurately, and through these tone characteristics and the set tone characteristic queue it can be determined accurately whether the terminal executes the set tone characteristics corresponding to the queue, which improves the accuracy of the original recognition condition to a certain extent.
In one possible embodiment, the set timbre feature queue comprises a first target timbre feature queue; the tone features to be subjected to the safety analysis comprise first tone features to be subjected to the safety analysis; the determining, by the tone color feature to be subjected to security analysis corresponding to each group of the third fragmented audio data to be processed and the set tone color feature queue, an original recognition condition of the voice interaction terminal to be subjected to security analysis includes: determining fourth fragmented audio data to be processed, corresponding to the first target tone characteristic queue, of the tone characteristic to be safely analyzed of the voice interaction terminal to be safely analyzed in each group of the third fragmented audio data to be processed through the first tone characteristic to be safely analyzed and the first target tone characteristic queue; and determining the original identification condition of the voice interaction terminal to be subjected to security analysis in the second voice session information through the statistic value of the fourth fragmented audio data to be processed and the statistic value of the third fragmented audio data to be processed.
In other words, if the voice interaction terminal to be subjected to security analysis in the second voice session information executes the set tone features corresponding to the set tone feature queue, it will correspond to multiple groups of fourth to-be-processed fragmented audio data, one per implementation cycle. The original recognition condition of the terminal can therefore be determined accurately through the determined statistic of the fourth to-be-processed fragmented audio data and the determined statistic of the third to-be-processed fragmented audio data.
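Deriving the original recognition condition from the statistic of matching (fourth) fragment groups against all (third) fragment groups is, in the simplest reading, a ratio test. The threshold here is an assumed parameter, not a value given by the patent:

```python
def original_recognition(matched_groups: int, total_groups: int,
                         min_ratio: float = 0.8) -> bool:
    # True when enough third to-be-processed fragment groups carry a
    # to-be-analysed tone characteristic matching the first target tone
    # characteristic queue (i.e. count as fourth to-be-processed groups).
    if total_groups == 0:
        return False
    return matched_groups / total_groups >= min_ratio
```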
In one possible embodiment, the set tone color feature queue comprises a second target tone color feature queue; the tone color features to be subjected to the safety analysis comprise second tone color features to be subjected to the safety analysis; the determining, by the tone color feature to be subjected to security analysis corresponding to each group of the third fragmented audio data to be processed and the set tone color feature queue, an original recognition condition of the voice interaction terminal to be subjected to security analysis includes: determining event evaluation of a second set interaction event of the voice interaction terminal to be subjected to security analysis corresponding to a set index through a second tone color characteristic to be subjected to security analysis of the voice interaction terminal to be subjected to security analysis in each group of third fragmented audio data to be processed; determining a first event evaluation and a second event evaluation corresponding to the voice interaction terminal to be subjected to security analysis through the event evaluation corresponding to the voice interaction terminal to be subjected to security analysis in each group of the third fragmented audio data to be processed; and determining the original recognition condition of the voice interaction terminal to be subjected to the security analysis in the second voice conversation information according to the evaluation comparison result corresponding to the first event evaluation, the second event evaluation and the second target tone characteristic queue.
Therefore, the evaluation difference between the two event evaluations can be determined through the determined first event evaluation and second event evaluation, and by comparing the evaluation comparison result with this evaluation difference, the association between the two event evaluations can be determined accurately, so that an accurate original recognition condition is obtained.
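Checking the association between the first and second event evaluations against the evaluation comparison result of the second target tone color feature queue might, as a purely hypothetical sketch (the tolerance and the numeric form of the evaluations are assumptions), be a tolerance test on the evaluation gap:

```python
def evaluations_consistent(first_eval: float, second_eval: float,
                           expected_gap: float, tol: float = 0.1) -> bool:
    # Compare the observed gap between the two event evaluations with
    # the evaluation comparison result carried by the queue.
    return abs((first_eval - second_eval) - expected_gap) <= tol
```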
In a second aspect, an embodiment of the present application further provides an artificial intelligence based voice call processing system, including an artificial intelligence based voice call processing cloud platform and a voice interaction terminal, the cloud platform being in communication connection with the voice interaction terminal. The artificial intelligence based voice call processing cloud platform is configured to: determine first voice session information corresponding to a voice interaction terminal to be subjected to security analysis; determine multiple groups of first to-be-processed fragmented audio data from the first voice session information, and determine an original user voiceprint description of each group of first to-be-processed fragmented audio data, wherein the original user voiceprint description comprises a user voiceprint description of the first to-be-processed fragmented audio data at the call scene level; sort the original user voiceprint descriptions of each group of first to-be-processed fragmented audio data according to the distribution of the multiple groups of first to-be-processed fragmented audio data in the first voice session information to obtain a target user voiceprint description; and determine the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information according to the target user voiceprint description.
In a third aspect, an embodiment of the present application further provides an artificial intelligence based voice call processing cloud platform, including a processor and a memory; the processor is in communication connection with the memory and is configured to read a computer program from the memory and execute it to implement the method described above.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic hardware structure diagram of a voice call processing cloud platform based on artificial intelligence according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a voice call processing method based on artificial intelligence according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a communication architecture of an artificial intelligence based voice call processing system according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiment provided by the embodiment of the present application can be executed in an artificial intelligence based voice call processing cloud platform, a computer device, or a similar computing device. Taking execution on an artificial intelligence based voice call processing cloud platform as an example, fig. 1 is a hardware structure block diagram of an artificial intelligence based voice call processing cloud platform implementing an artificial intelligence based voice call processing method according to an embodiment of the present application. As shown in fig. 1, the artificial intelligence based voice call processing cloud platform 10 may include one or more (only one is shown in fig. 1) processors 102 (the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may further include a transmission device 106 for communication functions. It will be understood by those of ordinary skill in the art that the structure shown in fig. 1 is merely an illustration and does not limit the structure of the above-described artificial intelligence based voice call processing cloud platform. For example, the artificial intelligence based voice call processing cloud platform 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as a computer program corresponding to an artificial intelligence based voice call processing method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, thereby implementing the above-mentioned methods. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 104 may further include memory located remotely from processor 102, which may be connected to artificial intelligence based voice call processing cloud platform 10 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the above network may include a wireless network provided by a communication provider of the artificial intelligence based voice call processing cloud platform 10. In one example, the transmission device 106 includes a network adapter (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Based on this, please refer to fig. 2, which is a schematic flowchart of a voice call processing method based on artificial intelligence according to an embodiment of the present application. The method is applied to a voice call processing cloud platform based on artificial intelligence and may at least include the technical solutions recorded in steps 11 to 13 below.
Step 11, determining first voice session information corresponding to a voice interaction terminal to be subjected to security analysis; a plurality of sets of first to-be-processed fragmented audio data are determined from the first voice conversation information, and an original user voiceprint description for each set of first to-be-processed fragmented audio data is determined.
In this embodiment of the present application, a voice interaction terminal to be subjected to security analysis may be understood as an object to be analyzed, such as: a mobile phone, a tablet computer, and the like, which are not limited in this application. The first voice session information may be understood as session information generated through the voice interactive terminal. The to-be-processed fragmented audio data may be understood as a portion of audio data in the first voice conversation information. The original user voiceprint description can be understood as the base user voiceprint description. Wherein, the original user voiceprint description comprises a user voiceprint description of the first to-be-processed fragmented audio data at the level of the call scene, and further, the voiceprint description can be understood as a sound feature of the user in the voice conversation process.
In a possible embodiment, the determining, recorded in step 11, first voice session information corresponding to the voice interaction terminal to be subjected to the security analysis may specifically include: determining second voice session information corresponding to the voice interaction terminal to be subjected to security analysis; determining the original recognition condition of the voice interaction terminal to be subjected to the safety analysis through the second voice conversation information and a set tone characteristic queue; and determining the first voice session information corresponding to the voice interaction terminal to be subjected to the security analysis on the basis that the original recognition condition indicates that the voice interaction terminal to be subjected to the security analysis is the voice interaction terminal passing the voiceprint security check.
In the embodiment of the application, the pre-recognition of the voice interaction terminal to be subjected to the security analysis can be realized through the second voice session information, and on the basis that the voice interaction terminal to be subjected to the security analysis does not accurately execute any set tone characteristic corresponding to the set tone characteristic queue, whether the voice session information of the voice interaction terminal to be subjected to the security analysis is machine-synthesized or not can be directly determined, so that the efficiency and the accuracy for determining the call security analysis condition can be improved.
In a possible embodiment, the determining, through the second voice session information and the set tone feature queue, an original recognition condition of the voice interaction terminal to be subjected to the security analysis specifically includes the following steps: performing tone color feature mining on the voice interaction terminal to be subjected to security analysis in each group of third to-be-processed fragmented audio data in the second voice session information to obtain tone color features to be subjected to security analysis corresponding to the voice interaction terminal to be subjected to security analysis in the third to-be-processed fragmented audio data; and determining the original recognition condition of the voice interaction terminal to be subjected to the safety analysis according to the tone characteristic to be subjected to the safety analysis corresponding to each group of the third fragmented audio data to be processed and the set tone characteristic queue.
Therefore, the tone characteristic to be subjected to security analysis executed by the voice interaction terminal to be subjected to security analysis can be accurately determined by mining the tone characteristic of the voice interaction terminal to be subjected to security analysis, and whether the voice interaction terminal to be subjected to security analysis executes the set tone characteristic corresponding to the set tone characteristic queue can be accurately determined through the tone characteristic to be subjected to security analysis and the set tone characteristic queue, so that the accuracy of the original recognition condition can be improved to a certain extent.
In one possible embodiment, the set tone color feature queue comprises a first target tone color feature queue, and the tone color features to be subjected to security analysis comprise a first tone color feature to be subjected to security analysis. Based on this, the determining an original recognition condition of the voice interaction terminal to be subjected to security analysis through the tone color feature to be subjected to security analysis corresponding to each set of the third fragmented audio data to be processed and the set tone color feature queue may specifically include: determining, from the third fragmented audio data to be processed, fourth fragmented audio data to be processed in which the first tone color feature to be subjected to security analysis matches the first target tone color feature queue, through the first tone color feature to be subjected to security analysis of the voice interaction terminal to be subjected to security analysis in each group of the third fragmented audio data to be processed and the first target tone color feature queue; and determining the original recognition condition of the voice interaction terminal to be subjected to security analysis in the second voice session information according to the statistic value of the fourth fragmented audio data to be processed and the statistic value of the third fragmented audio data to be processed.
In other words, if the voice interaction terminal to be subjected to the security analysis in the second voice session information executes the set tone features corresponding to the set tone feature queue, the voice interaction terminal to be subjected to the security analysis corresponds to a plurality of sets of fourth fragmented audio data to be processed corresponding to the implementation periods, so that the original identification condition of the voice interaction terminal to be subjected to the security analysis can be accurately determined through the determined statistics of the fourth fragmented audio data to be processed and the determined statistics of the third fragmented audio data to be processed.
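The statistics-based decision above can be sketched as follows. This is an illustrative reading only: the patent does not specify the feature-matching metric, so the cosine similarity and both thresholds below are assumptions, and the function and parameter names are hypothetical.

```python
import numpy as np

def original_recognition(segment_features, target_queue,
                         sim_threshold=0.8, ratio_threshold=0.5):
    """Decide the original recognition condition from per-segment timbre features.

    segment_features: one 1-D feature vector per third to-be-processed segment.
    target_queue:     1-D set timbre feature vectors (first target queue).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Fourth to-be-processed segments: segments whose timbre feature matches
    # some set feature in the first target tone color feature queue.
    matched = [
        seg for seg in segment_features
        if any(cosine(seg, tgt) >= sim_threshold for tgt in target_queue)
    ]
    # Compare the statistic value (count) of matched segments with the total.
    ratio = len(matched) / max(len(segment_features), 1)
    return "passed" if ratio >= ratio_threshold else "suspect"
```

Comparing a count ratio rather than raw counts keeps the decision independent of how many segments a particular session happens to contain.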
In one possible embodiment, the set timbre feature queue comprises a second target timbre feature queue; the tone color characteristics to be subjected to the safety analysis comprise second tone color characteristics to be subjected to the safety analysis. On this basis, determining the original recognition condition of the voice interaction terminal to be subjected to security analysis according to the tone features to be subjected to security analysis corresponding to each group of the third fragmented audio data to be processed and the set tone feature queue, which may specifically include the following contents: determining event evaluation of a second set interaction event of the voice interaction terminal to be subjected to security analysis corresponding to a set index according to a second tone characteristic to be subjected to security analysis of the voice interaction terminal to be subjected to security analysis in each group of the third fragmented audio data to be processed; determining a first event evaluation and a second event evaluation corresponding to the voice interaction terminal to be subjected to security analysis through the event evaluation corresponding to the voice interaction terminal to be subjected to security analysis in each group of the third fragmented audio data to be processed; and determining the original recognition condition of the voice interaction terminal to be subjected to the security analysis in the second voice conversation information according to the first event evaluation, the second event evaluation and the evaluation comparison result corresponding to the second target tone color characteristic queue.
Therefore, the evaluation difference between the two event evaluations can be determined through the determined first event evaluation and the second event evaluation, and the incidence relation between the two event evaluations can be accurately determined through comparing the evaluation comparison result with the evaluation difference, so that the accurate original identification condition can be obtained.
In one possible embodiment, the first voice conversation information includes fragmented audio data of the voice interaction terminal to be subjected to security analysis, which is collected under various noise environments. Based on this, the determining multiple sets of first to-be-processed fragmented audio data from the first voice session information in step 11 may specifically include the following: and determining at least one group of fragmented audio data corresponding to each noise environment in the multiple noise environments from the first voice conversation information to obtain multiple groups of first to-be-processed fragmented audio data.
Therefore, through the first to-be-processed fragmented audio data determined under different noise environments, the comparison analysis data of the tone characteristic information and of the user voiceprint descriptions of the voice interaction terminal to be subjected to security analysis in each piece of first to-be-processed fragmented audio data can be accurately determined, and the call security analysis condition of the voice interaction terminal to be subjected to security analysis can be accurately determined through the deviation between the determined comparison analysis data and the actual comparison analysis data carried by the voice interaction terminal to be subjected to security analysis.
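A minimal sketch of the per-noise-environment selection, assuming each segment record carries a noise-environment label (the record layout and the `per_env` parameter are assumptions, not given by the patent):

```python
from collections import defaultdict

def select_first_to_be_processed(session_segments, per_env=1):
    """Pick at least `per_env` fragmented-audio groups for each noise environment.

    session_segments: iterable of dicts like
        {"audio": ..., "noise_env": "street" | "office" | ...}
    """
    by_env = defaultdict(list)
    for seg in session_segments:
        by_env[seg["noise_env"]].append(seg)
    selected = []
    for segs in by_env.values():
        selected.extend(segs[:per_env])  # at least one group per environment
    return selected
```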
And step 12, according to the fragmented audio data distribution of the plurality of groups of first to-be-processed fragmented audio data in the first voice conversation information, sorting original user voiceprint descriptions of each group of first to-be-processed fragmented audio data to obtain target user voiceprint descriptions.
In the embodiment of the present application, the distribution of the fragmented audio data may be understood as an audio order of the fragmented audio data in the first voice conversation information. The target user voiceprint description may be understood as feature information obtained by combining original user voiceprint descriptions of each set of first to-be-processed fragmented audio data.
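Step 12 can be sketched as below, assuming each original voiceprint description is a fixed-size vector and "sorting" means ordering by the segment's position in the session before combining; the patent leaves the combination operator open, so concatenation here is an assumption.

```python
import numpy as np

def build_target_voiceprint(fragments):
    """fragments: list of (audio_order, original_voiceprint_vector) pairs."""
    # Order by the fragmented-audio distribution in the first voice session
    # information, then join into one target user voiceprint description.
    ordered = sorted(fragments, key=lambda f: f[0])
    return np.concatenate([vec for _, vec in ordered])
```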
And step 13, determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information according to the target user voiceprint description.
In the embodiment of the present application, the call security analysis condition may be understood as a result obtained after security analysis is performed on the voice interaction terminal through a target user voiceprint description. Such as: whether the voice session information in the voice interaction terminal is machine-synthesized. Generally, machine-synthesized voice conversation information causes telephone fraud, propagation of fraud information, and the like to some extent. The safety problem can be completely eradicated from the source through accurate and reliable detection of the voice conversation information synthesized by the machine.
In a possible embodiment, before the determining, through the target user voiceprint description, the call security analysis condition of the voice interaction terminal to be security analyzed in the first voice session information, the method may further include the following: determining at least one group of second to-be-processed fragmented audio data from the first voice conversation information, respectively performing significant text content identification on each group of second to-be-processed fragmented audio data, and respectively determining significant text content corresponding to a first set interaction event of a voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data; determining a set number of target salient text contents from the salient text contents; and determining a first recognition possibility (probability) corresponding to the voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data through each group of second to-be-processed fragmented audio data and the target significant text content corresponding to each group of second to-be-processed fragmented audio data. On this basis, the determining, through the voiceprint description of the target user, the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information, which is recorded in step 13, may specifically include: and determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information according to the first identification possibility corresponding to each group of second fragmented audio data to be processed and the target user voiceprint description.
In the embodiment of the present application, performing salient text content identification on each group of second to-be-processed fragmented audio data may be understood as performing key content identification on each group of second to-be-processed fragmented audio data. Therefore, the method can focus not only the whole user voiceprint description corresponding to the second fragmented audio data to be processed but also the user voiceprint description of the part corresponding to the target significant text content in the identification process, and further can obtain a first identification possibility which is accurate and reliable.
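As an illustration only, salient text content identification is approximated below with simple term-frequency keyword extraction over a transcribed segment; the patent does not name a concrete recognizer, and the stopword list and `set_number` default are assumptions.

```python
from collections import Counter

def salient_text_contents(transcript, set_number=3,
                          stopwords=frozenset({"the", "a", "is", "to"})):
    """Return the `set_number` target salient text contents for one segment."""
    tokens = [t.lower() for t in transcript.split() if t.lower() not in stopwords]
    # Keep the most frequent remaining terms as the key (salient) contents.
    return [word for word, _ in Counter(tokens).most_common(set_number)]
```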
In a possible embodiment, the above-mentioned recorded first recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data is determined through the target significant text content and the second to-be-processed fragmented audio data, and specifically may include the content recorded in the following steps 21 to 24.
Step 21, determining a set word vector corresponding to the first set interaction event; and determining a real word vector corresponding to the first set interaction event through a word vector of topic information (such as summarized text content) corresponding to each target significant text content.
And step 22, determining a target mapping list through the set word vector and the real word vector.
And step 23, performing word vector mapping on the topic information corresponding to each significant text content through the target mapping list.
And step 24, determining a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data through each significant text content mapped by the second to-be-processed fragmented audio data and the word vector.
In the embodiment of the present application, the set interaction event may be understood as a preset interaction event. The set word vector is the word vector corresponding to the target significant text content expected for the first set interaction event of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data. The real word vector is the real word vector corresponding to the target significant text content corresponding to the first set interaction event of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data. The target mapping list may be understood as a conversion relationship between the set word vector and the real word vector. Therefore, through the determined target mapping list, word vector mapping of the topic information corresponding to each significant text content can be achieved, and word vectors meeting requirements are obtained; since the voice interaction terminal to be subjected to security analysis is identified through the mapped word vectors, which meet the requirements, the configuration complexity of the AI model can be reduced to a certain extent, and the accuracy of the determined first recognition possibility can be improved.
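One possible reading of steps 21 to 23 is sketched below: the "target mapping list" is treated as a linear transform fitted so that real word vectors map onto the set (expected) word vectors. The patent does not fix the form of the mapping, so the least-squares fit is an illustrative choice, not the claimed construction.

```python
import numpy as np

def fit_target_mapping(real_vectors, set_vectors):
    """Solve W minimizing ||real @ W - set||; rows are word vectors."""
    W, *_ = np.linalg.lstsq(np.asarray(real_vectors, dtype=float),
                            np.asarray(set_vectors, dtype=float), rcond=None)
    return W  # the "target mapping list" under this interpretation

def map_topic_vectors(topic_vectors, W):
    """Word-vector mapping of each salient text content's topic information."""
    return np.asarray(topic_vectors, dtype=float) @ W
```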
In a possible embodiment, the determining, by step 24, the first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data according to each of the significant text contents mapped by the second to-be-processed fragmented audio data and the word vector may include the following contents: performing word vector mapping on the second to-be-processed fragmented audio data through the target mapping list to obtain mapped fragmented audio data; extracting target fragmented audio data corresponding to a first set interaction event of the voice interaction terminal to be subjected to security analysis from the mapped fragmented audio data through each significant text content mapped by the mapped fragmented audio data and the word vector; and determining a first identification possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data through the target fragmented audio data and the second to-be-processed fragmented audio data.
Therefore, through each obvious text content after word vector mapping and the mapped fragmented audio data, the target fragmented audio data which meet the requirements and correspond to the first set interaction event of the voice interaction terminal to be subjected to security analysis can be obtained, through the target fragmented audio data, the partial user voiceprint description which meets the requirements can be determined, through the second fragmented audio data to be processed, the integral user voiceprint description which is not mapped can be determined, and further through the user voiceprint descriptions corresponding to the two fragmented audio data, the corresponding situation between the description content and the actual description content of the voice interaction terminal to be subjected to security analysis can be more comprehensively determined, and further, the first recognition possibility is accurately obtained.
In a possible embodiment, the above-mentioned determining, through the target fragmented audio data and the second to-be-processed fragmented audio data, a first recognition possibility of a voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data may include: determining a first attention description content of the target fragmented audio data and a second attention description content of the second to-be-processed fragmented audio data; merging the first attention description content and the second attention description content to obtain a third attention description content; and determining a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data through the third attention description content.
In the embodiments of the present application, the attention description may be understood as a focus level for fragmented audio data. The first attention description content, the second attention description content and the third attention description content are mainly used for distinguishing the attention description content. In this way, by merging the attention description contents, the third attention description content including the stage attention description content corresponding to the target fragmented audio data and the overall attention description content corresponding to the second to-be-processed fragmented audio data can be determined, and the comprehensiveness and integrity of the attention description content for identifying the voice interaction terminal to be subjected to the security analysis can be improved, so that the more accurate first identification possibility can be obtained. And by processing the merged third attention description content, parallel analysis of the first attention description content and the second attention description content can be realized, so that the identification efficiency can be improved.
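A sketch under stated assumptions: attention description contents are modeled as flat weight vectors, "merging" is taken as concatenation so that the stage-level (target fragments) and overall (second to-be-processed fragments) descriptions feed one downstream scorer in a single pass, and the sigmoid scoring head is a toy stand-in for the unspecified recognizer.

```python
import numpy as np

def merge_attention(first_attention, second_attention):
    """Concatenate stage-level and overall attention into the third description."""
    return np.concatenate([np.ravel(first_attention), np.ravel(second_attention)])

def first_recognition_possibility(third_attention, weights):
    """Toy scoring head: sigmoid of a weighted sum of the merged description."""
    z = float(np.dot(third_attention, weights))
    return 1.0 / (1.0 + np.exp(-z))
```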
In a possible embodiment, the above-mentioned recorded call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information is determined through the first recognition possibility and the target user voiceprint description corresponding to each group of second fragmented audio data to be processed, and may specifically include the following contents: determining a second recognition possibility corresponding to the voice interaction terminal to be subjected to the security analysis according to the target user voiceprint description; and determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis according to the first recognition possibility and the second recognition possibility.
Therefore, the combined second identification possibility of the voice interaction terminal to be subjected to the security analysis corresponding to the multiple groups of first fragmented audio data to be processed can be determined through the target user voiceprint description, and the call security analysis condition can be determined through the combined second identification possibility and the first identification possibility corresponding to the group of second fragmented audio data to be processed, so that the determined call security analysis condition can reflect user voiceprint descriptions under multiple layers, and further the comprehensiveness and accuracy of the determined call security analysis condition can be improved.
In a possible embodiment, the determining, through the first recognition possibility and the second recognition possibility, of the call security analysis condition of the voice interaction terminal to be subjected to security analysis may specifically include: determining a first set confidence corresponding to the first recognition possibility and a second set confidence corresponding to the second recognition possibility; determining a target possibility from the first recognition possibility, the first set confidence, the second recognition possibility, and the second set confidence; and on the basis that the target possibility is greater than a set judgment value, determining that the call security analysis condition comprises that the voice interaction terminal to be subjected to security analysis is a voice interaction terminal passing voiceprint security verification.
Therefore, the recognition possibility is combined by setting the confidence coefficient to determine the call security analysis condition, and the accuracy and the comprehensiveness of the determined call security analysis condition can be effectively improved.
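The confidence-weighted fusion can be sketched as follows. The weighted average is one natural reading, not the claimed formula, and the default confidences and judgment value are assumptions.

```python
def call_security_analysis(p1, p2, conf1=0.4, conf2=0.6, judgment_value=0.5):
    """Fuse first/second recognition possibilities into a target possibility."""
    # Confidence-weighted average of the two recognition possibilities.
    target = (conf1 * p1 + conf2 * p2) / (conf1 + conf2)
    # Voiceprint security verification passes above the set judgment value.
    passed = target > judgment_value
    return target, passed
```

Weighting by set confidences lets the operator favor whichever of the two analyses (salient-text-based or voiceprint-based) has proven more reliable in practice.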
Based on the same or similar inventive concepts, an artificial intelligence based voice call processing system 30 is also provided, the communication architecture of which is shown in fig. 3. The system includes an artificial intelligence based voice call processing cloud platform 10 and a voice interaction terminal 20 that communicate with each other, and the artificial intelligence based voice call processing cloud platform 10 and the voice interaction terminal 20 implement or partially implement the technical solutions described in the above method embodiments when running. For example, the artificial intelligence based voice call processing cloud platform 10 is configured to: determine first voice session information corresponding to a voice interaction terminal 20 to be subjected to security analysis; determine multiple groups of first to-be-processed fragmented audio data from the first voice session information, and determine an original user voiceprint description of each group of first to-be-processed fragmented audio data, wherein the original user voiceprint description comprises a user voiceprint description of the first to-be-processed fragmented audio data at a call scene level; sort the original user voiceprint descriptions of each group of first to-be-processed fragmented audio data according to the fragmented audio data distribution of the multiple groups of first to-be-processed fragmented audio data in the first voice session information, to obtain a target user voiceprint description; and determine the call security analysis condition of the voice interaction terminal 20 to be subjected to security analysis in the first voice session information according to the target user voiceprint description.
Further, a readable storage medium is provided, on which a program is stored which, when being executed by a processor, carries out the above-mentioned method.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus and method embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (9)

1. A voice call processing method based on artificial intelligence is characterized by being applied to a voice call processing cloud platform based on artificial intelligence, and at least comprising the following steps:
determining first voice session information corresponding to a voice interaction terminal to be subjected to security analysis; determining a plurality of groups of first to-be-processed fragmented audio data from the first voice session information, and determining an original user voiceprint description for each group of first to-be-processed fragmented audio data; wherein the original user voiceprint description comprises a user voiceprint description of the first to-be-processed fragmented audio data at the call-scene level;
sorting the original user voiceprint descriptions of each group of first to-be-processed fragmented audio data according to the distribution of the plurality of groups of first to-be-processed fragmented audio data in the first voice session information, to obtain a target user voiceprint description; and determining, according to the target user voiceprint description, a call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information;
before the determining, through the target user voiceprint description, of the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information, the method further includes: determining at least one group of second to-be-processed fragmented audio data from the first voice session information, respectively performing significant text content recognition on each group of second to-be-processed fragmented audio data, and respectively determining the significant text content corresponding to a first set interaction event of the voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data; determining a set number of target significant text contents from the significant text contents; and determining, through each group of second to-be-processed fragmented audio data and the target significant text content corresponding to each group of second to-be-processed fragmented audio data, a first recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data;
the determining, according to the voiceprint description of the target user, a call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information includes: determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information according to the first recognition possibility corresponding to each group of second fragmented audio data to be processed and the voiceprint description of the target user;
the determining, through the target significant text content and the second to-be-processed fragmented audio data, of a first recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: determining a set word vector corresponding to the first set interaction event; determining a real word vector corresponding to the first set interaction event through the word vector of the topic information corresponding to each target significant text content; determining a target mapping list according to the set word vector and the real word vector; performing word vector mapping on the topic information corresponding to each significant text content through the target mapping list; and determining, through the second to-be-processed fragmented audio data and each significant text content after word vector mapping, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data;
wherein the determining, through the second to-be-processed fragmented audio data and each significant text content after word vector mapping, of a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: performing word vector mapping on the second to-be-processed fragmented audio data through the target mapping list to obtain mapped fragmented audio data; extracting, through the mapped fragmented audio data and each significant text content after word vector mapping, target fragmented audio data corresponding to the first set interaction event of the voice interaction terminal to be subjected to security analysis from the mapped fragmented audio data; and determining, through the target fragmented audio data and the second to-be-processed fragmented audio data, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data.
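The mapping step in claim 1 can be read as: derive a correction from the preset ("set") word vector and the observed ("real") word vector, then apply it to every topic-information vector. A minimal sketch under that reading — the elementwise ratio mapping, the toy vectors, and all function names are illustrative assumptions, not the patented construction:

```python
import numpy as np

def build_mapping(set_vec, real_vec):
    # Illustrative "target mapping list": an elementwise linear map that
    # carries the observed (real) word vector onto the preset (set) one.
    eps = 1e-8  # avoid division by zero
    return set_vec / (real_vec + eps)

def apply_mapping(mapping, topic_vecs):
    # Word-vector mapping of each topic-information vector derived from
    # the significant text contents.
    return [mapping * v for v in topic_vecs]

set_vec = np.array([1.0, 2.0, 4.0])    # preset vector for the first set interaction event
real_vec = np.array([2.0, 2.0, 8.0])   # vector aggregated from the target significant text contents
mapping = build_mapping(set_vec, real_vec)
mapped = apply_mapping(mapping, [real_vec, np.array([4.0, 1.0, 2.0])])
```

By construction the mapped real vector lands on the set vector; any other topic vector receives the same elementwise correction.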
2. The method according to claim 1, wherein the first voice session information comprises fragmented audio data of the voice interaction terminal to be subjected to security analysis collected under multiple noise environments; and the determining a plurality of groups of first to-be-processed fragmented audio data from the first voice session information includes: determining, from the first voice session information, at least one group of fragmented audio data corresponding to each of the multiple noise environments, to obtain the plurality of groups of first to-be-processed fragmented audio data.
3. The method according to claim 1, wherein the determining, from the target fragmented audio data and the second to-be-processed fragmented audio data, a first recognition possibility of a voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data comprises:
determining a first attention description content of the target fragmented audio data and a second attention description content of the second to-be-processed fragmented audio data;
merging the first attention description content and the second attention description content to obtain a third attention description content;
and determining, through the third attention description content, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data.
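Claim 3's merge-then-score flow can be sketched with the simplest plausible choices — concatenation for the merge and a logistic over the mean activation for the likelihood. Both operations are assumptions for illustration; the claim does not fix them:

```python
import math

def merge_attention(first, second):
    # Third attention description content: here simply the concatenation
    # of the first and second attention description contents.
    return first + second

def recognition_possibility(merged):
    # Map the merged description to a likelihood in (0, 1) via a logistic
    # over its mean activation (an illustrative scoring choice).
    mean = sum(merged) / len(merged)
    return 1.0 / (1.0 + math.exp(-mean))

third = merge_attention([0.2, 0.4], [0.6, 0.8])
p = recognition_possibility(third)
```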
4. The method according to claim 3, wherein the determining, according to the first recognition possibility corresponding to each group of second to-be-processed fragmented audio data and the target user voiceprint description, of the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information includes: determining a second recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis according to the target user voiceprint description; and determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis according to the first recognition possibility and the second recognition possibility;
wherein the determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis according to the first recognition possibility and the second recognition possibility comprises: determining a first set confidence corresponding to the first recognition possibility and a second set confidence corresponding to the second recognition possibility; determining a target possibility from the first recognition possibility, the first set confidence, the second recognition possibility, and the second set confidence; and, on the basis that the target possibility is larger than a set judgment value, determining that the call security analysis condition comprises that the voice interaction terminal to be subjected to security analysis is a voice interaction terminal passing voiceprint security verification.
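The fusion in claim 4 reads as a confidence-weighted combination of the two recognition possibilities compared against a set judgment value. A hedged sketch — the weighted-average form and the default threshold are assumptions, not the claimed formula:

```python
def call_security_analysis(p1, c1, p2, c2, judgment_value=0.5):
    # Target possibility: confidence-weighted combination of the first
    # (text-based) and second (voiceprint-based) recognition possibilities.
    target = (p1 * c1 + p2 * c2) / (c1 + c2)
    # The terminal passes voiceprint security verification when the
    # target possibility exceeds the set judgment value.
    return target > judgment_value

# Example: text-based likelihood 0.8 at confidence 0.4,
# voiceprint-based likelihood 0.7 at confidence 0.6.
passed = call_security_analysis(p1=0.8, c1=0.4, p2=0.7, c2=0.6)
```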
5. The method according to claim 4, wherein the determining the first voice session information corresponding to the voice interaction terminal to be subjected to the security analysis includes:
determining second voice session information corresponding to the voice interaction terminal to be subjected to security analysis;
determining an original recognition condition of the voice interaction terminal to be subjected to security analysis through the second voice session information and a set tone color feature queue;
and determining the first voice session information corresponding to the voice interaction terminal to be subjected to security analysis on the basis that the original recognition condition indicates that the voice interaction terminal to be subjected to security analysis is a voice interaction terminal passing voiceprint security verification.
6. The method according to claim 5, wherein the determining the original recognition condition of the voice interaction terminal to be subjected to security analysis through the second voice session information and the set tone color feature queue comprises: performing tone color feature mining on the voice interaction terminal to be subjected to security analysis in each group of third to-be-processed fragmented audio data in the second voice session information, to obtain the tone color feature to be subjected to security analysis corresponding to the voice interaction terminal to be subjected to security analysis in the third to-be-processed fragmented audio data; and determining the original recognition condition of the voice interaction terminal to be subjected to security analysis according to the tone color feature to be subjected to security analysis corresponding to each group of the third to-be-processed fragmented audio data and the set tone color feature queue;
wherein the set tone color feature queue comprises a first target tone color feature queue, and the tone color features to be subjected to security analysis comprise first tone color features to be subjected to security analysis; the determining, through the tone color feature to be subjected to security analysis corresponding to each group of the third to-be-processed fragmented audio data and the set tone color feature queue, of the original recognition condition of the voice interaction terminal to be subjected to security analysis includes: determining, through the first tone color feature to be subjected to security analysis of the voice interaction terminal to be subjected to security analysis in each group of the third to-be-processed fragmented audio data and the first target tone color feature queue, fourth to-be-processed fragmented audio data whose first tone color feature to be subjected to security analysis corresponds to the first target tone color feature queue, from the third to-be-processed fragmented audio data;
and determining the original recognition condition of the voice interaction terminal to be subjected to security analysis in the second voice session information according to the statistic value of the fourth to-be-processed fragmented audio data and the statistic value of the third to-be-processed fragmented audio data.
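The statistic-value comparison in claim 6 suggests counting how many third to-be-processed segments have a matching fourth to-be-processed counterpart and comparing that fraction against a pass ratio. A sketch under that assumed reading (the ratio form and the 0.6 threshold are illustrative, not from the claim):

```python
def original_recognition_condition(matched_count, total_count, pass_ratio=0.6):
    # matched_count: number of fourth to-be-processed segments, i.e. third
    # segments whose timbre feature matched the first target feature queue.
    # total_count: number of third to-be-processed segments overall.
    if total_count == 0:
        return False  # no audio segments, cannot recognize the terminal
    return matched_count / total_count >= pass_ratio

ok = original_recognition_condition(matched_count=7, total_count=10)
```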
7. The method of claim 6, wherein the set tone color feature queue comprises a second target tone color feature queue, and the tone color features to be subjected to security analysis comprise second tone color features to be subjected to security analysis;
the determining, through the tone color feature to be subjected to security analysis corresponding to each group of the third to-be-processed fragmented audio data and the set tone color feature queue, of the original recognition condition of the voice interaction terminal to be subjected to security analysis includes:
determining, according to the second tone color feature to be subjected to security analysis of the voice interaction terminal to be subjected to security analysis in each group of the third to-be-processed fragmented audio data, an event evaluation, corresponding to a set index, of a second set interaction event of the voice interaction terminal to be subjected to security analysis;
determining a first event evaluation and a second event evaluation corresponding to the voice interaction terminal to be subjected to security analysis through the event evaluation corresponding to the voice interaction terminal to be subjected to security analysis in each group of the third to-be-processed fragmented audio data;
and determining the original recognition condition of the voice interaction terminal to be subjected to security analysis in the second voice session information according to the first event evaluation, the second event evaluation, and the evaluation comparison result corresponding to the second target tone color feature queue.
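One plausible reading of claim 7 takes the first and second event evaluations as the extremes of the per-segment event evaluations and checks them against the evaluation comparison range of the second target tone color feature queue. A sketch under that reading — the max/min interpretation, the range check, and all names are assumptions:

```python
def claim7_recognition(event_evaluations, queue_low, queue_high):
    # First/second event evaluations taken here as the extremes of the
    # per-segment evaluations (an assumed reading of claim 7); the
    # terminal is recognized when both extremes fall inside the
    # comparison range of the second target tone color feature queue.
    first_eval = max(event_evaluations)
    second_eval = min(event_evaluations)
    return queue_low <= second_eval and first_eval <= queue_high

ok = claim7_recognition([0.55, 0.62, 0.58], queue_low=0.5, queue_high=0.7)
```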
8. An artificial intelligence based voice call processing system, comprising an artificial intelligence based voice call processing cloud platform and a voice interaction terminal, wherein the artificial intelligence based voice call processing cloud platform is in communication connection with the voice interaction terminal;
wherein the artificial intelligence based voice call processing cloud platform is configured to: determine first voice session information corresponding to a voice interaction terminal to be subjected to security analysis; determine a plurality of groups of first to-be-processed fragmented audio data from the first voice session information, and determine an original user voiceprint description for each group of first to-be-processed fragmented audio data, wherein the original user voiceprint description comprises a user voiceprint description of the first to-be-processed fragmented audio data at the call-scene level; sort the original user voiceprint descriptions of each group of first to-be-processed fragmented audio data according to the distribution of the plurality of groups of first to-be-processed fragmented audio data in the first voice session information, to obtain a target user voiceprint description; and determine, through the target user voiceprint description, a call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information;
wherein, before determining, through the target user voiceprint description, the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information, the artificial intelligence based voice call processing cloud platform is further configured to: determine at least one group of second to-be-processed fragmented audio data from the first voice session information, respectively perform significant text content recognition on each group of second to-be-processed fragmented audio data, and respectively determine the significant text content corresponding to a first set interaction event of the voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data; determine a set number of target significant text contents from the significant text contents; and determine, through each group of second to-be-processed fragmented audio data and the target significant text content corresponding to each group of second to-be-processed fragmented audio data, a first recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis in each group of second to-be-processed fragmented audio data;
the determining, according to the voiceprint description of the target user, a call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information includes: determining the call security analysis condition of the voice interaction terminal to be subjected to security analysis in the first voice session information according to the first recognition possibility corresponding to each group of second fragmented audio data to be processed and the voiceprint description of the target user;
the determining, according to the target significant text content and the second to-be-processed fragmented audio data, of a first recognition possibility corresponding to the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: determining a set word vector corresponding to the first set interaction event; determining a real word vector corresponding to the first set interaction event through the word vector of the topic information corresponding to each target significant text content; determining a target mapping list according to the set word vector and the real word vector; performing word vector mapping on the topic information corresponding to each significant text content through the target mapping list; and determining, through the second to-be-processed fragmented audio data and each significant text content after word vector mapping, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data;
wherein the determining, through the second to-be-processed fragmented audio data and each significant text content after word vector mapping, of a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data includes: performing word vector mapping on the second to-be-processed fragmented audio data through the target mapping list to obtain mapped fragmented audio data; extracting, through the mapped fragmented audio data and each significant text content after word vector mapping, target fragmented audio data corresponding to the first set interaction event of the voice interaction terminal to be subjected to security analysis from the mapped fragmented audio data; and determining, through the target fragmented audio data and the second to-be-processed fragmented audio data, a first recognition possibility of the voice interaction terminal to be subjected to security analysis in the second to-be-processed fragmented audio data.
9. A voice call processing cloud platform based on artificial intelligence is characterized by comprising a processor and a memory; the processor is connected in communication with the memory, and the processor is configured to read the computer program from the memory and execute the computer program to implement the method of any one of claims 1 to 7.
CN202210337182.6A 2022-04-01 2022-04-01 Voice call processing method and system based on artificial intelligence Active CN114627881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210337182.6A CN114627881B (en) 2022-04-01 2022-04-01 Voice call processing method and system based on artificial intelligence


Publications (2)

Publication Number Publication Date
CN114627881A CN114627881A (en) 2022-06-14
CN114627881B (en) 2022-10-04

Family

ID=81905990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210337182.6A Active CN114627881B (en) 2022-04-01 2022-04-01 Voice call processing method and system based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN114627881B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493872A (en) * 2018-12-13 2019-03-19 北京三快在线科技有限公司 Voice messaging verification method and device, electronic equipment, storage medium
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9484036B2 (en) * 2013-08-28 2016-11-01 Nuance Communications, Inc. Method and apparatus for detecting synthesized speech


Also Published As

Publication number Publication date
CN114627881A (en) 2022-06-14

Similar Documents

Publication Publication Date Title
WO2017076314A1 (en) Processing method and system for adaptive unwanted call identification
US8412530B2 (en) Method and apparatus for detection of sentiment in automated transcriptions
WO2019106517A1 (en) Automatic blocking of sensitive data contained in an audio stream
US11062706B2 (en) System and method for speaker role determination and scrubbing identifying information
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
CN108682421B (en) Voice recognition method, terminal equipment and computer readable storage medium
US11768961B2 (en) System and method for speaker role determination and scrubbing identifying information
US11238027B2 (en) Dynamic document reliability formulation
CN114244611B (en) Abnormal attack detection method, device, equipment and storage medium
CN114706945A (en) Intention recognition method and device, electronic equipment and storage medium
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
US20210050002A1 (en) Structured conversation enhancement
CN111126071A (en) Method and device for determining questioning text data and data processing method of customer service group
CN114627881B (en) Voice call processing method and system based on artificial intelligence
CN113436614A (en) Speech recognition method, apparatus, device, system and storage medium
CN110047473B (en) Man-machine cooperative interaction method and system
CN111402479A (en) Access control system management method and access control system
CN112786041B (en) Voice processing method and related equipment
CN114726635B (en) Authority verification method and device, electronic equipment and medium
CN115935358A (en) Malicious software identification method and device, electronic equipment and storage medium
CN113535925B (en) Voice broadcasting method, device, equipment and storage medium
CN114595318A (en) Customer service reply quality evaluation method and system
CN114254088A (en) Method for constructing automatic response model and automatic response method
CN112529585A (en) Interactive awakening method, device, equipment and system for risk transaction
CN110992067B (en) Message pushing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220915

Address after: Room 801, 8th Floor, No. 43, Handan Road, Hongkou District, Shanghai, 200437

Applicant after: SHANGHAI CAIAN FINANCIAL SERVICES GROUP CO.,LTD.

Address before: 665000 No. 79, Zhenxing Avenue, Simao District, Pu'er City, Yunnan Province

Applicant before: Chen Jing

GR01 Patent grant