CN115457979A - Video voice analysis, recognition and processing method and system - Google Patents
- Publication number
- CN115457979A CN115457979A CN202211158592.0A CN202211158592A CN115457979A CN 115457979 A CN115457979 A CN 115457979A CN 202211158592 A CN202211158592 A CN 202211158592A CN 115457979 A CN115457979 A CN 115457979A
- Authority
- CN
- China
- Prior art keywords
- vector
- user
- data set
- voice
- user intention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000004458 analytical method Methods 0.000 title claims abstract description 36
- 238000003672 processing method Methods 0.000 title claims abstract description 13
- 239000013598 vector Substances 0.000 claims abstract description 301
- 230000003993 interaction Effects 0.000 claims abstract description 81
- 238000000034 method Methods 0.000 claims abstract description 29
- 238000005457 optimization Methods 0.000 claims description 36
- 230000002159 abnormal effect Effects 0.000 claims description 13
- 238000012545 processing Methods 0.000 claims description 12
- 238000013473 artificial intelligence Methods 0.000 claims description 11
- 238000004590 computer program Methods 0.000 claims description 9
- 238000012216 screening Methods 0.000 claims description 8
- 238000001914 filtration Methods 0.000 claims description 2
- 238000004891 communication Methods 0.000 claims 1
- 230000008569 process Effects 0.000 abstract description 7
- 238000012986 modification Methods 0.000 description 5
- 230000004048 modification Effects 0.000 description 5
- 230000006872 improvement Effects 0.000 description 4
- 238000003860 storage Methods 0.000 description 4
- 238000012795 verification Methods 0.000 description 4
- 239000000463 material Substances 0.000 description 3
- 230000000644 propagated effect Effects 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 2
- 238000003491 array Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 239000003607 modifier Substances 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
According to the video voice analysis, recognition and processing method and system, a plurality of first user intention data sets are determined from the first voice interaction data, and the shared feature vectors among the first user intention data sets are analyzed in combination with the interference variables of the first user intention data sets. Because the first similarities of the first user intention data sets are obtained by analyzing their interference variables, the shared feature vectors among the first user intention data sets cover the interference information in the first voice audio data, and debugging the first user tendency vectors with the first similarities is equivalent to debugging the first user tendency vectors of the first user intention data sets based on that interference information. The interference-related voice data covered in the first voice audio data can therefore be weakened, which improves the accuracy of the voice analysis and, in turn, the accuracy and reliability of the matching.
Description
Technical Field
The application relates to the technical field of data analysis, recognition and processing, and in particular to a video voice analysis, recognition and processing method and system.
Background
With the continuous development of the internet, video and voice communication over the internet has become a commonly used technology. When a user conducts a video or voice call, noise may be introduced by the network, preventing the video and voice from transmitting information accurately. Therefore, the technical solution provided by the present application is needed to analyze and recognize video speech so as to ensure the accuracy of the speech optimization result.
Disclosure of Invention
In order to solve the technical problems in the related art, the application provides a video voice analysis, recognition and processing method and a system.
In a first aspect, a video voice analysis, recognition and processing method is provided, where the method at least includes: acquiring a voice vector of first voice interaction data in audio data to be processed; determining a plurality of first user intention data sets from the first voice interaction data in combination with the acquired voice vectors, and screening the first user tendency vector of each first user intention data set; determining an interfered voice abnormal data set in the first voice interaction data, and analyzing, in combination with the determined voice abnormal data set, the first interference variable by which each first user intention data set is interfered; acquiring the first similarities among the first user intention data sets determined in combination with the first interference variables obtained by the analysis; and debugging each first user tendency vector in combination with the acquired first similarities, and comparing the debugged first user tendency vectors with second user tendency vectors to obtain a voice optimization result of the first voice interaction data, wherein the second user tendency vectors are vectors obtained by debugging the user tendency vector of each second user intention data set in combination with each second similarity, and each second user intention data set is the associated data set, in previously set second voice interaction data in the key voice data, corresponding to each first user intention data set.
In an independently implemented embodiment, the comparing the debugged first user tendency vectors with the second user tendency vectors to obtain a voice optimization result of the first voice interaction data includes: analyzing the shared feature vector of the debugged first user tendency vector of each first user intention data set and the corresponding second user tendency vector, and regarding it as the shared feature vector corresponding to each first user intention data set; determining a confidence level of the first user tendency vector of each first user intention data set for the vector of the first voice interaction data based on the first interference variable of each first user intention data set; weighting the shared feature vectors corresponding to the first user intention data sets in combination with the determined confidence levels to obtain a weighted processing result, and regarding the weighted processing result as the shared feature vector of the first voice interaction data and the second voice interaction data; and determining a voice optimization result of the first voice interaction data in combination with the obtained shared feature vector.
In an independently implemented embodiment, the determining a confidence level of the first user propensity vector of each first user intent data set for the vector of the first voice interaction data based on the first interference variable of each first user intent data set comprises: determining a confidence level of the first user tendency vector of each first user intent data set for the vector of the first speech interaction data based on the first interference variable of each first user intent data set and the second interference variable of the corresponding second user intent data set.
In an independently implemented embodiment, the comparing the debugged first user tendency vectors with the second user tendency vectors to obtain a voice optimization result of the first voice interaction data includes: performing vector splicing on the first user tendency vectors of the first user intention data sets in combination with the first interference variables of the first user intention data sets to obtain a first splicing vector, and performing vector splicing on the second user tendency vectors of the second user intention data sets in combination with the second interference variables of the second user intention data sets to obtain a second splicing vector; analyzing the shared feature vector of the first splicing vector and the second splicing vector; and determining a voice optimization result of the first voice interaction data in combination with the shared feature vector obtained by the analysis.
In an independently implemented embodiment, the debugging each first user tendency vector in combination with the acquired first similarities, and comparing the debugged first user tendency vectors with the second user tendency vectors to obtain the voice optimization result of the first voice interaction data includes: loading each first user tendency vector, the first similarities among the first user intention data sets, the user tendency vector of each second user intention data set, and the second similarities among the second user intention data sets to a previously configured artificial intelligence thread, so that the artificial intelligence thread debugs each first user tendency vector based on each first similarity to obtain the debugged first user tendency vectors, debugs the user tendency vector of each second user intention data set based on each second similarity to obtain the second user tendency vectors, compares the debugged first user tendency vectors with the second user tendency vectors, and outputs a voice optimization result; and acquiring the voice optimization result output by the artificial intelligence thread.
In an independently implemented embodiment, the determining, from the first voice interaction data in combination with the obtained voice vectors, a number of first user intent data sets includes: determining the type of the associated data set to which each voice vector belongs by combining the important type of the acquired voice vector and the association condition between the type of the voice vector set in advance and the type of the associated data set of the first user intention data set; for each associated data set category, acquiring a reference location of a user intention data set belonging to the associated data set category based on location data of a voice vector belonging to the associated data set category, determining a difference between the reference location of the user intention data set belonging to the associated data set category and a reference location of a nearby user intention data set, and analyzing an associated data set vector of the user intention data set belonging to the associated data set category in combination with the determined difference between the associated data sets; based on the obtained reference positions and the analyzed associated dataset vectors, a first user intent dataset is determined to which each reference position belongs.
In a separately implemented embodiment, the filtering of the voice vectors within each first user intention data set comprises: screening the total vector set of the first voice interaction data in the audio data to be processed; determining the vector association data set corresponding to each first user intention data set in the total vector set based on the location of each first user intention data set in the first voice interaction data; updating the vector association data set of each first user intention data set according to the vectors of the previously set vector set, to generate a vector set whose vectors are those of the previously set vector set; and determining the voice vectors corresponding to the vector set of each first user intention data set, and regarding them as the voice vectors within each first user intention data set.
In a second aspect, a video voice analysis, recognition and processing system is provided, comprising a processor and a memory that communicate with each other, wherein the processor is configured to read a computer program from the memory and execute it, so as to implement the above method.
According to the video voice analysis, recognition and processing method and system provided by the embodiment of the application, a plurality of first user intention data sets are determined from the first voice interaction data, and the shared feature vectors among the first user intention data sets are then analyzed in combination with the interference variables of the first user intention data sets. Because the first similarities of the first user intention data sets are obtained by analyzing their interference variables, the shared feature vectors among the first user intention data sets cover the interference information in the first voice audio data, and debugging the first user tendency vectors with the first similarities is equivalent to debugging the first user tendency vectors of the first user intention data sets based on that interference information. The interference-related voice data covered in the first voice audio data can therefore be weakened, which improves the accuracy of the voice analysis and, in turn, the accuracy and reliability of the matching.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of a video speech analysis and recognition processing method according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions, the technical solutions of the present application are described in detail below with reference to the drawings and specific embodiments, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, and are not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
Referring to fig. 1, a video speech analysis and recognition processing method is shown, which may include the technical solutions described in the following steps S101 to S105.
S101: and acquiring a voice vector of first voice interaction data in the audio data to be processed.
S102: and combining the acquired voice vectors, determining a plurality of first user intention data sets from the first voice interaction data, and screening the first user tendency vectors of the first user intention data sets.
S103: and determining an interfered voice abnormal data set in the first voice interaction data, and analyzing a first interference variable interfered by each first user intention data set by combining the determined voice abnormal data set.
S104: and acquiring a first similarity between the first user intention data sets determined by combining the first interference variables obtained by analysis.
S105: and debugging each first user tendency vector by combining the acquired first similarity, and comparing the debugged first user tendency vector with a second user tendency vector to obtain a voice optimization result of the first voice interaction data, wherein the second user tendency vector is as follows: and debugging the user tendency vector of each second user intention data set by combining each second similarity to obtain a vector, wherein each second user intention data set is as follows: and the associated data set corresponds to each first user intention data set in the second voice interaction data set in advance in the key voice data.
The video voice analysis, recognition and processing method provided by the embodiment of the disclosure determines a plurality of first user intention data sets from the first voice interaction data, and further analyzes the shared feature vectors among the first user intention data sets in combination with their interference variables. Because the first similarities of the first user intention data sets are obtained based on the analysis of their interference variables, the shared feature vectors among the first user intention data sets cover the interference information in the first voice audio data, and debugging the first user tendency vectors with the first similarities is equivalent to debugging the first user tendency vectors of the first user intention data sets based on that interference information. The interference-related voice data covered in the first voice audio data can therefore be weakened, which improves the accuracy of the voice analysis and, in turn, the accuracy and reliability of the matching.
Exemplarily, the shared feature vector between the first user intention data sets is determined through the interference variable of each first user intention data set, so that the acquisition of the relation between the first user intention data sets is realized, the optimization of the voice is realized by using the acquired relation between the associated data sets, the accuracy of the voice analysis is further improved, and the accuracy and the reliability of the matching are further improved.
In order to clearly explain the technical solution of the embodiments of the present disclosure, the video voice analysis and recognition processing method provided by the embodiments of the present disclosure is explained step by step below.
For step S101, the audio data to be processed may be audio data covering the voice to be processed that is acquired by a mobile phone in scenarios such as voice information and video information. Optionally, the audio data to be processed in the embodiment of the present disclosure may be audio data obtained by preprocessing the collected initial audio data. For example, when the initial audio data covers a plurality of voice interaction data, in order to perform voice optimization in a more targeted manner, each associated data set belonging to the voice may be located first, the audio data belonging to each such associated data set may be clipped out, and the clipped audio data may be regarded as the audio data to be processed in the embodiment of the present disclosure.
In an embodiment of the present disclosure, a voice vector of first voice interaction data in audio data to be processed may be obtained by performing voice vector verification on the audio data to be processed, and a specific verification manner may be determined in combination with an actual scene, for example, a configured voice vector verification thread type may be used to perform voice vector verification on the audio data to be processed.
In one embodiment of the present disclosure, after verifying the voice vector of the initial voice interaction data, the state of the voice covered by the initial voice interaction data is debugged to be a previously set state based on the verified voice vector.
With respect to step S102, the first user intention data sets are determined based on the voice vectors; a first user intention data set may cover part of the voice vectors, may cover none of them, or may cover all of them.
For step S103, the voice abnormal data set interfered in the first voice interaction data is the associated data set where the interference data in the first voice interaction data is located.
Further, the interfered voice abnormal data set in the first voice interaction data can be determined through a previously configured interference check thread.
Based on the determined voice abnormal data set in the first voice interaction data, the part of each first user intention data set that belongs to the voice abnormal data set can be determined, and then the proportion of that part within the first user intention data set can be analyzed; the analyzed proportion is the first interference variable by which each first user intention data set is interfered.
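The proportion described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the patent; the function and variable names are assumptions:

```python
def interference_variable(intent_set, abnormal_set):
    """Fraction of an intent data set that falls inside the
    detected abnormal (interfered) voice data set."""
    intent = set(intent_set)
    if not intent:
        return 0.0
    overlap = intent & set(abnormal_set)
    return len(overlap) / len(intent)
```

A data set half of whose samples are interfered would thus receive an interference variable of 0.5.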
For step S104, the first interference variable of each first user intention data set reflects how much interference information is included in the first user tendency vector of that data set: the larger the interference variable, the larger the interfered portion of the first user intention data set. Since the portion interfered by the interferers cannot be used for voice optimization, a larger interfered portion means more interference information is carried into the first user tendency vector of the first user intention data set.
For any two first user intention data sets, the first similarity of each to the other reflects the degree of influence of the first user tendency vector of one data set on that of the other: the larger the interference variable of one first user intention data set, the smaller its first similarity to the other, and likewise the larger the interference variable of the other data set, the smaller its first similarity to the first.
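The inverse relationship between interference variables and first similarity described above might be modelled, purely for illustration (the reciprocal form is an assumption, not specified by the patent), as:

```python
def first_similarity(iv_a, iv_b):
    """Mutual first similarity of two intent data sets: larger
    interference variables yield a smaller similarity, symmetric
    in the two data sets."""
    return 1.0 / (1.0 + iv_a + iv_b)
```

Two undisturbed data sets get the maximal similarity of 1.0, and the similarity decays monotonically as either interference variable grows.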
In step S105, for speech, the parts of the speech do not exist independently, that is, there is a certain correlation between the first user intention data sets, and therefore, the first user tendency vector of each first user intention data set can be debugged based on the correlation between the first user intention data sets.
Optionally, for one user intention data set, the strength of debugging the first user tendency vector of the first user intention data set is related to the first similarity between the first user intention data set and other first user intention data sets, and the greater the first similarity, the greater the debugging strength.
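One way to picture this similarity-weighted debugging is as a row-normalised weighted aggregation of the tendency vectors; this is a hypothetical sketch (the weighting scheme is an assumption), not the patent's actual procedure:

```python
import numpy as np

def debug_vectors(vectors, sim):
    """Debug each tendency vector by pulling it toward the others
    in proportion to the pairwise first similarities.

    vectors: list of tendency vectors, one per intent data set.
    sim:     sim[i][j] is the first similarity between sets i and j.
    """
    V = np.asarray(vectors, dtype=float)
    S = np.asarray(sim, dtype=float)
    W = S / S.sum(axis=1, keepdims=True)  # normalise each row of weights
    return W @ V                          # similarity-weighted mixture
```

With uniform similarities, every debugged vector becomes the mean of all tendency vectors; with a diagonal similarity matrix, each vector is left unchanged.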
For each piece of audio data to be processed that requires voice optimization, in order to obtain the local information corresponding to the audio data to be processed, it is necessary to compare it with key voice data whose local information is known: determine whether the part corresponding to the first voice interaction data in the audio data to be processed is the same as the previously set second voice interaction data in the key voice data, and if so, regard the known local information as the local information corresponding to the audio data to be processed, thereby completing the voice optimization.
For the second voice interaction data in the key voice data, the second user intention data set covered by the second voice interaction data, the user tendency vector of the second user intention data set, the second similarity between the second user intention data sets, and the second user tendency vector obtained by debugging the user tendency vector based on the second similarity may be determined in advance.
Optionally, the key speech data is processed according to steps S101 to S104, so as to obtain a second user tendency vector of a second user intention data set of second speech interaction data in the key speech data.
In one possible implementation, for each first user intent data set, the second user intent data set corresponding to that first user intent data set may be: a second user intent data set located in the second voice interaction data in the same location as the first user intent data set was located in the first voice interaction data, or a second user intent data set of the same type of associated data set as the first user intent data set, or both.
Alternatively, when the associated dataset category of the first user intent dataset is the left-eye associated dataset category, the second user intent dataset corresponding to the first user intent dataset may be: the second speech interaction data has a second user intent data set of the left eye association data set type.
In an embodiment of the present disclosure, the step S105 may be implemented based on a previously configured artificial intelligence thread, including: loading each first user tendency vector, first similarity among each first user intention data set, user tendency vectors of each second user intention data set and second similarity among each second user intention data set to a previously configured artificial intelligence thread, so that the artificial intelligence thread debugs each first user tendency vector based on each first similarity to obtain debugged first user tendency vectors, debugged user tendency vectors of each second user intention data set based on each second similarity to obtain second user tendency vectors, and compares the debugged first user tendency vectors with the second user tendency vectors to output a voice optimization result; and acquiring a voice optimization result output by the artificial intelligence thread.
Optionally, the first user tendency vector of each first user intention data set, the user tendency vector of each second user intention data set, and the first similarity between the first user intention data sets and the second similarity between the second user intention data sets recorded are regarded as the calculation results of the artificial intelligence thread.
On the basis of the above method, the embodiment of the present disclosure further provides an implementation of step S105, which may specifically include the following steps.
S201: and analyzing the shared characteristic vector of the first user tendency vector and the corresponding second user tendency vector after each first user intention data set is debugged, and regarding the shared characteristic vector as the shared characteristic vector corresponding to each first user intention data set.
In this step, the shared feature vector of the debugged first user tendency vector and the corresponding second user tendency vector may be determined by combining the vector lists corresponding to the two vectors, for example by analyzing the cosine similarity between the vector list corresponding to the first user tendency vector and that corresponding to the second user tendency vector. Optionally, the larger the cosine similarity, the higher the shared feature vector.
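A minimal sketch of computing such a similarity between the two vector lists, assuming the measure intended here is the standard cosine similarity (the function name is illustrative, not from the patent):

```python
import numpy as np

def cosine_shared_feature(u, v):
    """Cosine similarity between two tendency-vector lists;
    values near 1.0 indicate a high shared feature vector."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, matching the rule that a larger cosine value corresponds to a higher shared feature vector.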
S202: a confidence level of the first user tendency vector of the respective first user intent data set for the vector of the first speech interaction data is determined based on the first interference variable of the respective first user intent data set.
In this step, as can be seen from the foregoing, the larger the first interference variable, the more interference information is included in the first user tendency vector of the first user intention data set, and therefore the smaller the confidence of that first user tendency vector for the vector of the first voice interaction data.
Optionally, in an alternative embodiment, an association between interference variables and confidence levels may be constructed first, and after the first interference variable of each first user intention data set is determined, the confidence level corresponding to that first interference variable is determined from the association.
Optionally, for some possible embodiments, each user intention data set may be evaluated based on the first disturbance variable of each user intention data set, and the result of comparing the scores of each user intention data set may be regarded as the confidence of each first user intention data set.
Optionally, in yet another implementation, the confidence level of the first user tendency vector of each first user intention data set for the vector of the first voice interaction data may also be determined based on the first interference variable of each first user intention data set and the second interference variable of the corresponding second user intention data set.
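The inverse mapping from interference variables to confidence levels described above could, purely as an illustration (the exponential form and the normalisation are assumptions, not specified by the patent), look like:

```python
import math

def confidences(interference_vars):
    """Map interference variables to confidence levels: a larger
    interference variable yields a smaller confidence, and the
    confidences over all data sets are normalised to sum to 1."""
    scores = [math.exp(-iv) for iv in interference_vars]
    total = sum(scores)
    return [s / total for s in scores]
```

This realises the score-comparison variant above: each data set is scored from its interference variable, and the normalised comparison of scores is taken as its confidence.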
S203: and weighting the shared feature vector corresponding to each first user intention data set by combining the determined confidence degrees to obtain a weighted processing result, and regarding the weighted processing result as the shared feature vector of the first voice interaction data and the second voice interaction data.
S204: and determining a voice optimization result of the first voice interaction data by combining the obtained shared feature vector.
In this step, a shared feature vector determination value may be configured; when the obtained shared feature vector is greater than the determination value, it is determined that the voice of the first voice interaction data is the same as that of the second voice interaction data, and otherwise that they are different.
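Steps S203 and S204 together amount to a confidence-weighted fusion followed by a threshold test. A hypothetical sketch, using scalar shared feature values per data set and an assumed threshold (all names and the threshold value are illustrative):

```python
def voice_optimization_result(shared_vals, confs, threshold=0.8):
    """Weight each per-data-set shared feature value by its
    confidence (S203), then compare the fused value against the
    determination value to decide same/different (S204)."""
    fused = sum(c * s for c, s in zip(confs, shared_vals))
    return "same" if fused > threshold else "different"
```

For example, shared values of 0.9 and 0.7 with confidences 0.6 and 0.4 fuse to 0.82, exceeding a 0.8 threshold.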
On the basis of the voice optimization method, the embodiment of the present disclosure further provides a voice optimization method, which implements step S105, and specifically may include the following steps.
S301: and vector splicing is carried out on the first user tendency vectors of the first user intention data sets in combination with the first interference variables of the first user intention data sets to obtain first spliced vectors, and vector splicing is carried out on the second user tendency vectors of the second user intention data sets in combination with the second interference variables of the second user intention data sets to obtain second spliced vectors.
In this step, the first user tendency vectors of the first user intention data sets may be spliced based on the first interference variables of those data sets. Generally speaking, the larger the first interference variable, the smaller the proportion the corresponding first user intention data set occupies in the spliced vector; the smaller the first interference variable, the larger that proportion. In particular, the first user tendency vectors of the first user intention data sets may be spliced end to end into a single vector.
The vector splicing of the second user tendency vectors of the second user intention data sets is similar to the splicing of the first user tendency vectors of the first user intention data sets, and is not repeated here.
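Step S301 can be sketched as a weighted concatenation in which each data set's tendency vector is scaled by a weight that shrinks as its interference variable grows. The `1/(1 + d)` weight form is an assumption; the text only fixes the inverse relation between interference variable and splicing proportion:

```python
def spliced_vector(tendency_vectors, interference_vars):
    """Concatenate per-data-set tendency vectors end to end, scaling
    each by a normalized weight that decreases as its interference
    variable grows, so noisier data sets occupy a smaller proportion."""
    weights = [1.0 / (1.0 + d) for d in interference_vars]  # assumed form
    norm = sum(weights)
    out = []
    for vec, w in zip(tendency_vectors, weights):
        out.extend(x * w / norm for x in vec)
    return out

v = spliced_vector([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0])
# The first (interference-free) data set is scaled by 2/3,
# the second (interfered) data set by 1/3.
```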
S302: the shared feature vectors of the first and second stitching vectors are analyzed.
In this step, the shared feature vector of the first splicing vector and the second splicing vector may be determined from the difference between the vector list corresponding to the first splicing vector and the vector list corresponding to the second splicing vector: the smaller that difference, the larger the shared feature vector of the two splicing vectors.
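The inverse relation just described, where a smaller difference between the two splicing vectors yields a larger shared feature value, can be sketched with any decreasing function of a vector distance; the Euclidean distance and the `1/(1 + dist)` form below are assumptions:

```python
import math

def shared_feature(first_spliced, second_spliced):
    """Scalar shared-feature value of two equal-length spliced vectors:
    the smaller their Euclidean distance, the larger the result,
    with identical vectors giving the maximum value 1.0."""
    dist = math.sqrt(sum((a - b) ** 2
                         for a, b in zip(first_spliced, second_spliced)))
    return 1.0 / (1.0 + dist)

print(shared_feature([1.0, 2.0], [1.0, 2.0]))  # -> 1.0 (identical vectors)
```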
S303: and determining a voice optimization result of the first voice interaction data by combining the shared characteristic vector obtained by analysis.
This embodiment provides a technical scheme for comparing the debugged first user tendency vector with the debugged second user tendency vector: by first splicing the vectors and then analyzing the shared feature vector, the analysis steps can be simplified and the analysis accuracy improved.
On the basis of the above speech optimization method, an embodiment of the present disclosure further provides a speech optimization method, which realizes the determination of the first user intention data set, and specifically includes the following steps.
S401: and determining the associated data set type of each voice vector by combining the important type of the acquired voice vector and the association condition between the previously set voice vector type and the associated data set type of the first user intention data set.
The association between the previously set speech vector class and the associated data set class of the first user intention data set may be a one-to-one association.
S402: for each associated data set category, acquiring a reference location of a user intention data set belonging to the associated data set category based on location data of a speech vector belonging to the associated data set category, determining a difference between the reference location of the user intention data set belonging to the associated data set category and a reference location of a nearby user intention data set, and analyzing an associated data set vector of the user intention data set belonging to the associated data set category in combination with the determined difference between the associated data sets.
In this step, for each associated data set type, the depolarization result of the speech vector belonging to the associated data set type may be analyzed, and the location corresponding to the depolarization result obtained by the analysis may be regarded as the reference location of the user intention data set belonging to the associated data set type.
After the reference locations of the user intent data sets of each of the associated data set categories are determined, differences between the associated data sets of the proximate user intent data sets may be analyzed, generally speaking, the greater the differences between the associated data sets, the greater the associated data set vector.
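One plausible reading of the "depolarization result" in S402 is a de-extremized average: per coordinate axis, drop the extreme values and average the rest. The trimming rule and the two-dimensional (x, y) location representation are both assumptions, since the text does not specify the operator:

```python
def reference_location(locations):
    """Reference location of the user intention data set for one
    associated data set category: per coordinate axis, drop the
    smallest and largest values, then average the remainder."""
    def trimmed_mean(values):
        values = sorted(values)
        if len(values) > 2:
            values = values[1:-1]  # de-extremize: drop min and max
        return sum(values) / len(values)
    return (trimmed_mean([x for x, _ in locations]),
            trimmed_mean([y for _, y in locations]))

print(reference_location([(0, 0), (1, 1), (2, 2), (100, 100)]))  # -> (1.5, 1.5)
```

Note how the outlier location (100, 100) is discarded before averaging, which matches the intent of using a depolarization result as the reference location.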
S403: based on the obtained reference locations and the analyzed associated dataset vectors, a first user intent dataset to which each reference location belongs is determined.
In this step, the first user intent data set may be rectangular, and at this time, the real-time location of the first user intent data set may be determined based on the determined reference location and the associated data set vector.
According to the video voice analysis, recognition and processing method provided by this embodiment of the disclosure, in addition to the beneficial effects of the above voice optimization method, the first user intention data sets can be determined in combination with the actual state and shape of each piece of voice interaction data. The determined first user intention data sets therefore match the actual data more closely, further improving the accuracy of the voice analysis and the accuracy and reliability of the matching.
On the basis of the voice optimization method, the embodiment of the present disclosure further provides a voice optimization method, which realizes the screening of the first user tendency vector, and specifically includes the following steps.
S501: and screening all vector sets of the first voice interaction data in the audio data to be processed.
S502: based on the location of the respective first user intent data set in the first voice interaction data, a vector association data set corresponding to the respective first user intent data set in the total set of vectors is determined.
In this step, each character in the first voice interaction data has a corresponding location in the full vector set; therefore, the vector association data set onto which each first user intention data set is projected in the full vector set can be determined from the projection relationship between the first voice interaction data and the full vector set.
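The projection described above can be sketched as an index mapping: each first user intention data set is a character span in the first voice interaction data, and each character position maps to a vector in the full vector set. The span and dictionary representations below are hypothetical illustrations; the patent only states that such a projection relationship exists:

```python
def vector_association(data_set_span, char_to_vector):
    """Project a first user intention data set, given as a half-open
    (start, end) character span in the first voice interaction data,
    onto the full vector set via the per-character correspondence."""
    start, end = data_set_span
    return [char_to_vector[i] for i in range(start, end)]

full_vector_set = {0: "v0", 1: "v1", 2: "v2", 3: "v3"}
print(vector_association((1, 3), full_vector_set))  # -> ['v1', 'v2']
```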
S503: and updating the vector association data set of each first user intention data set according to the vector of the previously set vector set to generate a vector set of which the vector is the vector of the previously set vector set.
S504: and determining a voice vector corresponding to the vector set of each first user intention data set, and regarding it as the voice vector in that first user intention data set.
In this step, the vector sets obtained in step S503 may be further processed to obtain the voice vector in each first user intention data set.
On the basis of the above, there is provided a video speech analysis, recognition and processing device 200, applied to a video speech analysis, recognition and processing system, the device comprising:
the variable determining module 210 is configured to acquire a voice vector of first voice interaction data in the audio data to be processed; determine a plurality of first user intention data sets from the first voice interaction data in combination with the acquired voice vectors, and screen first user tendency vectors of the first user intention data sets; and determine an interfered voice abnormal data set in the first voice interaction data, and analyze, in combination with the determined voice abnormal data set, the first interference variable by which each first user intention data set is interfered;
a similarity obtaining module 220, configured to obtain a first similarity between each first user intention data set determined by combining the first interference variable obtained through the analysis;
a result optimizing module 230, configured to debug each first user tendency vector in combination with the acquired first similarity, and compare the debugged first user tendency vectors with second user tendency vectors to obtain a voice optimization result of the first voice interaction data, where the second user tendency vectors are: vectors obtained by debugging the user tendency vector of each second user intention data set in combination with each second similarity, and each second user intention data set is: an associated data set in the second voice interaction data in the key voice data that corresponds in advance to a respective first user intention data set.
On the basis of the above, a video voice analysis, recognition and processing system 300 is provided, which comprises a processor 310 and a memory 320 in communication with each other, wherein the processor 310 is configured to read a computer program from the memory 320 and execute the computer program to implement the above method.
On the basis of the above, there is also provided a computer-readable storage medium on which a computer program is stored, which when executed implements the above-described method.
In summary, based on the above scheme, a plurality of first user intention data sets are determined from the first voice interaction data, and the shared feature vectors between the first user intention data sets are then analyzed in combination with the interference variable of each first user intention data set. Because the first similarity of each first user intention data set is obtained by analyzing its interference variable, the shared feature vectors between the first user intention data sets cover the interference information in the first voice interaction data, and debugging the first user tendency vectors using the first similarities is equivalent to debugging the first user tendency vector of each first user intention data set based on that interference information. In this way, the interference-related voice data covered in the first voice interaction data can be weakened, improving the accuracy of the voice analysis and the accuracy and reliability of the matching.
It should be appreciated that the system and its modules shown above may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD- or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules of the present application may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of the above hardware circuits and software (e.g., firmware).
It is to be noted that different embodiments may produce different advantages, and in different embodiments, the advantages that may be produced may be any one or combination of the above, or any other advantages that may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be considered merely illustrative and not restrictive of the broad application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.
Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of the present application may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, or Python, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, or ABAP, a dynamic programming language such as Python, Ruby, or Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), in a cloud computing environment, or as a service using, for example, software as a service (SaaS).
Additionally, the order in which elements and sequences of the processes described herein are processed, the use of alphanumeric characters, or the use of other designations, is not intended to limit the order of the processes and methods described herein, unless explicitly claimed. While certain presently contemplated useful embodiments of the invention have been discussed in the foregoing disclosure by way of various examples, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments of the disclosure. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features are required than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the numbers allow for variation in flexibility. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit-preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.
Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, and the like, cited in this application is hereby incorporated by reference in its entirety, except for any application history document that is inconsistent with or in conflict with the content of this application and except to the extent such material would limit the broadest scope of the claims now or later appended to this application. It is noted that if the descriptions, definitions, and/or use of terms in the material incorporated by reference are inconsistent with or contrary to those of this application, the descriptions, definitions, and/or use of terms in this application shall control.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present application. Other variations are also possible within the scope of the present application. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the present application may be viewed as being consistent with the teachings of the present application. Accordingly, the embodiments of the present application are not limited to only those embodiments explicitly described and depicted herein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.
Claims (8)
1. A video voice analysis recognition processing method is characterized by at least comprising the following steps:
acquiring a voice vector of first voice interaction data in the audio data to be processed; determining a plurality of first user intention data sets from the first voice interaction data in combination with the acquired voice vectors, and screening first user tendency vectors of the first user intention data sets; determining an interfered voice abnormal data set in the first voice interaction data, and analyzing, in combination with the determined voice abnormal data set, the first interference variable by which each first user intention data set is interfered;
acquiring first similarity among first user intention data sets determined by combining the first interference variables obtained by analysis;
debugging each first user tendency vector in combination with the acquired first similarity, and comparing the debugged first user tendency vectors with second user tendency vectors to obtain a voice optimization result of the first voice interaction data, wherein the second user tendency vectors are: vectors obtained by debugging the user tendency vector of each second user intention data set in combination with each second similarity, and each second user intention data set is: an associated data set in the second voice interaction data in the key voice data that corresponds in advance to a respective first user intention data set.
2. The method according to claim 1, wherein the comparing the debugged first user tendency vector with the second user tendency vector to obtain the voice optimization result of the first voice interaction data comprises:
analyzing the shared characteristic vectors of the first user tendency vectors and the corresponding second user tendency vectors after the debugging of each first user intention data set, and regarding the shared characteristic vectors as the shared characteristic vectors corresponding to each first user intention data set;
determining a confidence level of the first user tendency vector of each first user intent data set for the vector of the first voice interaction data based on the first interference variable of each first user intent data set;
weighting the shared feature vectors corresponding to the first user intention data sets by combining the determined confidence degrees to obtain weighted processing results, and regarding the weighted processing results as the shared feature vectors of the first voice interaction data and the second voice interaction data;
and determining a voice optimization result of the first voice interaction data by combining the obtained shared feature vector.
3. The method of claim 2, wherein determining a confidence level of the first user propensity vector of each first user intent data set for the vector of the first voice interaction data based on the first interference variable of each first user intent data set comprises: determining a confidence level of the first user propensity vector of each first user intent data set for the vector of the first voice interaction data based on the first interference variable of each first user intent data set and the second interference variable of the corresponding second user intent data set.
4. The method of claim 1, wherein comparing the debugged first user tendency vector with the second user tendency vector to obtain a voice optimization result of the first voice interaction data comprises:
vector splicing is carried out on the first user tendency vectors of the first user intention data sets in combination with the first interference variables of the first user intention data sets to obtain first spliced vectors, and vector splicing is carried out on the second user tendency vectors of the second user intention data sets in combination with the second interference variables of the second user intention data sets to obtain second spliced vectors;
analyzing the shared feature vector of the first and second stitching vectors;
and determining a voice optimization result of the first voice interaction data by combining the shared feature vector obtained by analysis.
5. The method according to claim 1, wherein the debugging each first user tendency vector in combination with the acquired first similarity and comparing the debugged first user tendency vector with the second user tendency vector to obtain the voice optimization result of the first voice interaction data comprises:
loading each first user tendency vector, a first similarity among each first user intention data set, a user tendency vector of each second user intention data set and a second similarity among each second user intention data set to a previously configured artificial intelligence thread, so that the artificial intelligence thread debugs each first user tendency vector based on each first similarity to obtain a debugged first user tendency vector, debugges the user tendency vector of each second user intention data set based on each second similarity to obtain a second user tendency vector, compares the debugged first user tendency vector with the second user tendency vector, and outputs a voice optimization result;
and acquiring the voice optimization result output by the artificial intelligence thread.
6. The method of claim 5, wherein determining a number of first user intent data sets from the first voice interaction data in conjunction with the obtained voice vectors comprises:
determining the type of the associated data set to which each voice vector belongs by combining the important type of the acquired voice vector and the association condition between the type of the voice vector set in advance and the type of the associated data set of the first user intention data set;
for each associated data set category, acquiring a reference location of a user intention data set belonging to the associated data set category based on location data of a voice vector belonging to the associated data set category, determining a difference between the reference location of the user intention data set belonging to the associated data set category and a reference location of a nearby user intention data set, and analyzing an associated data set vector of the user intention data set belonging to the associated data set category in combination with the determined difference between the associated data sets;
based on the obtained reference locations and the analyzed associated dataset vectors, a first user intent dataset to which each reference location belongs is determined.
7. The method of claim 5, wherein the filtering the speech vectors within each first user intent data set comprises:
screening all vector sets of first voice interaction data in the audio data to be processed;
determining vector association data sets in the total vector set corresponding to the respective first user intention data sets based on the locations of the respective first user intention data sets in the first voice interaction data;
updating the vector association data set of each first user intention data set according to the vector of the previously set vector set to generate a vector set whose vector is the vector of the previously set vector set; and determining a voice vector corresponding to the vector set of each first user intention data set, and regarding it as the voice vector in that first user intention data set.
8. A video voice analysis recognition processing system, comprising a processor and a memory in communication with each other, the processor being configured to read a computer program from the memory and execute the computer program to implement the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211158592.0A CN115457979A (en) | 2022-09-22 | 2022-09-22 | Video voice analysis, recognition and processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115457979A true CN115457979A (en) | 2022-12-09 |
Family
ID=84306176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211158592.0A Pending CN115457979A (en) | 2022-09-22 | 2022-09-22 | Video voice analysis, recognition and processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457979A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050192992A1 (en) * | 2004-03-01 | 2005-09-01 | Microsoft Corporation | Systems and methods that determine intent of data and respond to the data based on the intent |
US20160225370A1 (en) * | 2015-01-30 | 2016-08-04 | Microsoft Technology Licensing, Llc | Updating language understanding classifier models for a digital personal assistant based on crowd-sourcing |
WO2020233363A1 (en) * | 2019-05-22 | 2020-11-26 | 深圳壹账通智能科技有限公司 | Speech recognition method and device, electronic apparatus, and storage medium |
KR20210038860A (en) * | 2020-06-29 | 2021-04-08 | 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 | Intent recommendation method, apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20221209 |