CN116343751B - Voice translation-based audio analysis method and device

Voice translation-based audio analysis method and device

Info

Publication number
CN116343751B
CN116343751B (application number CN202310615745.8A)
Authority
CN
China
Prior art keywords
voice
feature
target user
data
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310615745.8A
Other languages
Chinese (zh)
Other versions
CN116343751A
Inventor
许宁涛
邝毅勋
许歆怡
邝隽涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Taiwei Software Development Co ltd
Original Assignee
Shenzhen Taiwei Software Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Taiwei Software Development Co ltd filed Critical Shenzhen Taiwei Software Development Co ltd
Priority to CN202310615745.8A
Publication of CN116343751A
Application granted
Publication of CN116343751B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses an audio analysis method and device based on voice translation, which are used to improve the accuracy of voice translation. The method comprises the following steps: inputting first voice data into a first voice analysis model for voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data from the first voice data according to the voice feature analysis result to obtain speaker voice feature data; inputting the speaker voice feature data into a second voice analysis model for language type analysis to obtain language type information; generating a target feature matrix, and inputting the target feature matrix and the voice translation requirement of each target user into a voice translation model for voice translation to obtain second voice data; and carrying out voice distribution and audio transmission of the second voice data according to the first voice channel.

Description

Voice translation-based audio analysis method and device
Technical Field
The invention relates to the field of artificial intelligence, in particular to an audio analysis method and device based on speech translation.
Background
An intelligent speech translator is an intelligent device that integrates speech recognition, machine translation, speech synthesis and other technologies; it can translate one language into another and output the corresponding speech or text result. A user can input the original text by speaking or handwriting; the device then automatically performs speech recognition and converts the input into text, translates the text into the required target language with a machine translation algorithm, and finally converts the translation result into speech output using speech synthesis technology.
However, the translation effect of existing schemes is poor, and they cannot select the corresponding voices for translation according to the requirements of users, so the accuracy of speech translation is low.
Disclosure of Invention
The invention provides an audio analysis method and device based on voice translation, which are used for improving the accuracy of voice translation.
The first aspect of the present invention provides a voice translation-based audio analysis method, which includes:
acquiring voice of N target users based on a preset voice intelligent translator to obtain first voice data, and acquiring voice translation requirements and first voice channels of each target user;
inputting the first voice data into a preset first voice analysis model to perform voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data of the first voice data according to the voice feature analysis result to obtain speaker voice feature data corresponding to each target user;
inputting the speaker voice characteristic data corresponding to each target user into a preset second voice analysis model to perform language type analysis, so as to obtain language type information corresponding to each speaker voice data;
generating a feature vector corresponding to each target user according to the speaker voice feature data and the language type information, and carrying out matrix fusion on a feature matrix corresponding to each target user to obtain a target feature matrix;
inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model to carry out voice translation, and obtaining second voice data corresponding to each target user;
and carrying out transmission channel allocation on the second voice data corresponding to each target user according to the first voice channels to obtain at least one second voice channel corresponding to each second voice data, and carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
With reference to the first aspect, in a first implementation manner of the first aspect of the present invention, the voice collecting, by the intelligent voice translator, N target users to obtain first voice data, and obtaining a voice translation requirement and a first voice channel of each target user includes:
performing voice acquisition on N target users based on a plurality of voice collectors in a preset voice intelligent translator to obtain first voice data;
Target demand information of each target user is obtained respectively, and demand analysis is carried out on the target demand information to obtain voice translation demands of each target user;
and according to the voice collectors, carrying out voice channel configuration on the target demand information of each target user to obtain a first voice channel of each target user.
With reference to the first aspect, in a second implementation manner of the first aspect of the present invention, inputting the first voice data into a preset first voice analysis model to perform voice feature analysis, to obtain a voice feature analysis result, and performing voice feature data extraction on the first voice data according to the voice feature analysis result, to obtain speaker voice feature data corresponding to each target user, where the voice feature data includes:
inputting the first voice data into a preset first voice analysis model, wherein the first voice analysis model comprises: a plurality of acoustic feature extraction modules;
performing voice feature analysis on the first voice data through the plurality of acoustic feature extraction modules to obtain a voice feature analysis result, wherein the voice feature analysis result comprises voice features of each target user;
extracting voice characteristic data from the first voice data to obtain initial voice characteristic data;
and according to the voice characteristic analysis result, carrying out speaker classification extraction on the initial voice characteristic data to obtain speaker voice characteristic data corresponding to each target user.
With reference to the first aspect, in a third implementation manner of the first aspect of the present invention, inputting the speaker voice feature data corresponding to each target user into a preset second voice analysis model to perform language type analysis, to obtain language type information corresponding to each speaker voice data includes:
inputting the speaker voice characteristic data corresponding to each target user into a preset second voice analysis model, wherein the second voice analysis model comprises the following steps: an input layer, a hidden layer, and an output layer;
performing voice attribute classification on the speaker voice characteristic data corresponding to each target user through the second voice analysis model to obtain a voice attribute predicted value corresponding to each target user;
and matching the language type information corresponding to the voice data of each speaker from a preset language type mapping model according to the voice attribute predicted value.
With reference to the first aspect, in a fourth implementation manner of the first aspect of the present invention, the generating, according to the speaker voice feature data and the language type information, a feature vector corresponding to each target user, and performing matrix fusion on a feature matrix corresponding to each target user, to obtain a target feature matrix, includes:
performing characteristic index conversion on the voice characteristic data of the speaker to obtain a first index set;
performing category index conversion on the language category information to obtain a second index set;
performing vectorization recombination on the first index set and the second index set to generate a feature vector corresponding to each target user;
and carrying out matrix fusion on the feature matrix corresponding to each target user to obtain a target feature matrix.
With reference to the first aspect, in a fifth implementation manner of the first aspect of the present invention, the inputting the target feature matrix and the speech translation requirement of each target user into a preset speech translation model to perform speech translation, to obtain second speech data corresponding to each target user includes:
inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model, wherein the voice translation model comprises the following steps: n first encoding networks, N first decoding networks, N second encoding networks, and N second decoding networks;
performing feature code conversion on the target feature matrix and the voice translation requirement of each target user through the first coding network and the first decoding network to obtain a plurality of feature code sequences;
and respectively performing voice compiling on the plurality of characteristic coding sequences through the second coding network and the second decoding network to obtain second voice data corresponding to each target user.
With reference to the first aspect, in a sixth implementation manner of the first aspect of the present invention, the allocating a transmission channel to second voice data corresponding to each target user according to the first voice channel, to obtain at least one second voice channel corresponding to each second voice data, and performing voice distribution and audio transmission on the second voice data through the at least one second voice channel includes:
according to the voice translation requirement, matching the voice channels of the second voice data from the first voice channels to obtain at least one second voice channel corresponding to each second voice data;
constructing a demand identifier and a transmission identifier of the at least one second voice channel;
and according to the requirement identification and the transmission identification, carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
A second aspect of the present invention provides a speech translation-based audio analysis apparatus, comprising:
the acquisition module is used for carrying out voice acquisition on N target users based on a preset voice intelligent translator to obtain first voice data, and acquiring voice translation requirements and first voice channels of each target user;
the extraction module is used for inputting the first voice data into a preset first voice analysis model to perform voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data of the first voice data according to the voice feature analysis result to obtain speaker voice feature data corresponding to each target user;
the analysis module is used for inputting the voice characteristic data of the speaker corresponding to each target user into a preset second voice analysis model to perform language type analysis, so as to obtain language type information corresponding to the voice data of each speaker;
the fusion module is used for generating a feature vector corresponding to each target user according to the speaker voice feature data and the language type information, and carrying out matrix fusion on the feature matrix corresponding to each target user to obtain a target feature matrix;
the translation module is used for inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model to carry out voice translation, so as to obtain second voice data corresponding to each target user;
the transmission module is used for carrying out transmission channel allocation on the second voice data corresponding to each target user according to the first voice channels to obtain at least one second voice channel corresponding to each second voice data, and carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
A third aspect of the present invention provides an audio analysis device based on speech translation, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the speech translation based audio analysis device to perform the speech translation based audio analysis method described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the above-described speech translation based audio analysis method.
According to the technical scheme provided by the invention, first voice data are input into a first voice analysis model for voice feature analysis to obtain a voice feature analysis result, and voice feature data are extracted from the first voice data according to the voice feature analysis result to obtain speaker voice feature data; the speaker voice feature data are input into a second voice analysis model for language type analysis to obtain language type information; a target feature matrix is generated, and the target feature matrix and the voice translation requirement of each target user are input into a voice translation model for voice translation to obtain second voice data. By classifying and extracting the voice data according to the voice features of the speakers, performing voice translation through a pre-constructed voice translation model, and finally distributing the translated voice data to the corresponding target users according to the voice translation requirements of the different target users, the invention realizes intelligent voice translation and improves the accuracy of voice translation.
Drawings
FIG. 1 is a schematic diagram of an embodiment of an audio analysis method based on speech translation according to an embodiment of the present invention;
FIG. 2 is a flow chart of a human voice feature analysis in an embodiment of the invention;
FIG. 3 is a flow chart of language type analysis in an embodiment of the present invention;
FIG. 4 is a flow chart of matrix fusion in an embodiment of the invention;
FIG. 5 is a schematic diagram of an embodiment of an audio analysis device based on speech translation according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an embodiment of an audio analysis device based on speech translation according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides an audio analysis method and device based on voice translation, which are used for improving the accuracy of voice translation. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to fig. 1, where an embodiment of a speech translation based audio analysis method in an embodiment of the present invention includes:
s101, carrying out voice acquisition on N target users based on a preset voice intelligent translator to obtain first voice data, and obtaining voice translation requirements and first voice channels of each target user;
It is to be understood that the execution subject of the present invention may be an audio analysis device based on speech translation, and may also be a terminal or a server, which is not limited herein. The embodiments of the invention are described by taking a server as the execution subject as an example.
Specifically, the server selects and deploys an intelligent voice translator for the required languages and connects a plurality of voice collectors associated with the target users. Each target user is provided with a voice collector, such as a microphone, and is asked to speak some common words or sentences; the recorded voice signals constitute the original voice data. The server then processes and analyzes the original voice data, including removing noise, enhancing voice quality and extracting voice features, to obtain the first voice data. The voice translation requirement of each target user is obtained by analyzing demand information collected through questionnaires provided to the target users or through their communication feedback. Finally, the server creates a corresponding first voice channel for each target user and connects the first voice channels to the preset intelligent voice translator, so as to ensure that translation and transmission can be carried out according to the translation requirements of the target users.
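As a concrete illustration of this acquisition and preprocessing step, the following sketch shows one possible way to clean a single collector's recording; the function names, parameters and the WAV-file assumption are illustrative and are not taken from the patent.

```python
# Hypothetical preprocessing sketch for one voice collector's recording.
# Assumes a mono or stereo WAV file; names and defaults are illustrative only.
import numpy as np
from scipy.io import wavfile

def preprocess_recording(path: str, pre_emphasis: float = 0.97) -> tuple[int, np.ndarray]:
    """Load a recording and return (sample_rate, cleaned_signal)."""
    sample_rate, raw = wavfile.read(path)
    signal = raw.astype(np.float32)
    if signal.ndim > 1:                      # down-mix stereo collectors to mono
        signal = signal.mean(axis=1)
    signal -= signal.mean()                  # remove DC offset
    signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])  # pre-emphasis
    peak = np.max(np.abs(signal)) or 1.0
    return sample_rate, signal / peak        # peak-normalize amplitude

# Usage (file names are placeholders for the N collectors):
# first_voice_data = [preprocess_recording(f"user_{i}.wav") for i in range(n_users)]
```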
S102, inputting the first voice data into a preset first voice analysis model to perform voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data of the first voice data according to the voice feature analysis result to obtain speaker voice feature data corresponding to each target user;
Specifically, the server inputs the first voice data into a preset first voice analysis model that comprises a plurality of acoustic feature extraction modules. These modules extract features related to the target users from the first voice data, and the server performs voice feature analysis on the first voice data through them to obtain the voice feature analysis result. The result contains the voice features of each target user, i.e. voice attributes present in the first voice data, such as the target user's gender, accent and speech-rate characteristics. The server extracts voice features from the first voice data to obtain initial voice feature data, and then performs speaker classification extraction on the initial voice feature data according to the voice feature analysis result, obtaining the speaker voice feature data corresponding to each target user. Speaker classification here uses a Gaussian Mixture Model (GMM) to separate the speech spoken by each target user in the first voice data; during speaker classification, the training data set is divided into two parts, one for training the classifier and one for testing and verification. During training, the classifier learns to recognize the voice characteristics of each target user by modeling the voice signals and to distinguish them from the voices of other speakers, so that the speaker voice feature data corresponding to each target user can be extracted from the first voice data.
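The speaker classification described above can be sketched with an off-the-shelf Gaussian Mixture Model. The example below is a minimal, assumed implementation (MFCC features, one GMM per enrolled user) rather than the patent's exact first voice analysis model.

```python
# Hypothetical GMM-based speaker classification sketch (not the patented model).
# One GaussianMixture is fit per enrolled target user; a new segment is assigned
# to the user whose model gives the highest average log-likelihood.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(signal: np.ndarray, sr: int) -> np.ndarray:
    # (frames, n_mfcc) matrix of per-frame acoustic features
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20).T

def enroll_speakers(training: dict[str, tuple[np.ndarray, int]]) -> dict[str, GaussianMixture]:
    """training maps user_id -> (signal, sample_rate) from the enrollment phase."""
    models = {}
    for user_id, (signal, sr) in training.items():
        gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
        models[user_id] = gmm.fit(mfcc_frames(signal, sr))
    return models

def classify_segment(segment: np.ndarray, sr: int, models: dict[str, GaussianMixture]) -> str:
    feats = mfcc_frames(segment, sr)
    # pick the enrolled user whose GMM explains the segment best
    return max(models, key=lambda user_id: models[user_id].score(feats))
```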
S103, inputting the speaker voice characteristic data corresponding to each target user into a preset second voice analysis model to perform language type analysis, so as to obtain language type information corresponding to each speaker voice data;
It should be noted that the preset second speech analysis model includes an input layer, a hidden layer and an output layer. The input layer receives the speaker voice feature data corresponding to each target user, which is mapped into a high-dimensional space through the nonlinear transformation of the hidden layer, and the output layer produces a voice attribute predicted value for each target user according to the learned model parameters. The server classifies the voice attributes of the speaker voice feature data corresponding to each target user through the second voice analysis model, where the voice attribute refers to the language type information of the target user. During training of the second voice analysis model, known voice attribute information is used as the supervision signal; after training is completed, the model generates a voice attribute predicted value for each target user. The predicted value is then mapped to language type information through a preset language type mapping model, which contains a set of voice attribute predicted values and the corresponding language type information; for example, if the voice attribute predicted value for English is set to 1, a predicted value of 1 is mapped to English. The server matches the language type information corresponding to each speaker's voice data from the preset language type mapping model according to the voice attribute predicted value: by inputting the predicted value into the mapping model, the language type used by each target user is obtained. According to this language type information, the corresponding translation service can be provided to translate and interpret the input voice signal.
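A minimal sketch of such a language-type classifier is shown below, assuming a small feed-forward network (one hidden layer) and an example mapping table from predicted voice-attribute values to language names; these concrete choices are assumptions and do not come from the patent.

```python
# Hypothetical sketch of the second analysis model: a network with an input, hidden
# and output layer predicts a voice-attribute value, which is then looked up in a
# language-type mapping table. Labels and table entries are examples only.
import numpy as np
from sklearn.neural_network import MLPClassifier

LANGUAGE_MAP = {0: "Chinese", 1: "English", 2: "Japanese"}   # preset mapping model (example)

def train_language_model(features: np.ndarray, attribute_labels: np.ndarray) -> MLPClassifier:
    """features: (n_samples, n_dims) speaker feature vectors; labels: integer attribute values."""
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    return model.fit(features, attribute_labels)

def predict_language(model: MLPClassifier, speaker_features: np.ndarray) -> str:
    predicted_attribute = int(model.predict(speaker_features.reshape(1, -1))[0])
    return LANGUAGE_MAP.get(predicted_attribute, "unknown")
```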
S104, generating a feature vector corresponding to each target user according to the speaker voice feature data and the language type information, and carrying out matrix fusion on a feature matrix corresponding to each target user to obtain a target feature matrix;
Specifically, the server converts the speaker voice feature data and language type information corresponding to each target user into a multidimensional feature vector through a feature index conversion operation; the vector represents the values taken under different feature indexes, where the feature indexes may include basic acoustic features (such as energy, zero-crossing rate and spectral shape), high-level acoustic features (such as Mel-frequency cepstral coefficients, MFCC) and other features related to the voice signal. The server converts the language type information corresponding to each target user into a one-hot encoded vector in which each element represents one language type; the set of candidate languages is determined from a list of known languages or by other methods. The server then performs vectorized recombination on the first index set and the second index set to generate the feature vector corresponding to each target user: the two vectors are spliced or superimposed into one larger vector by linear algebra or machine learning algorithms to represent the required feature information. Finally, the server combines the feature vectors corresponding to the target users into a feature matrix. In this embodiment, generating a feature vector for each target user from the speaker voice feature data and the language type information and performing matrix fusion converts the speaker voice feature data into numerical vectors that a computer can process and integrates the information of the different target users, thereby enabling a more accurate and efficient voice translation service.
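The vectorization and matrix fusion of step S104 can be sketched as follows; the candidate language list and the simple concatenation/stacking scheme are assumptions made for illustration.

```python
# Hypothetical sketch of step S104: splice each user's acoustic feature indexes with a
# one-hot language vector, then stack the per-user vectors into the target feature matrix.
import numpy as np

LANGUAGES = ["Chinese", "English", "Japanese"]          # example candidate list

def one_hot_language(language: str) -> np.ndarray:
    vec = np.zeros(len(LANGUAGES), dtype=np.float32)
    vec[LANGUAGES.index(language)] = 1.0
    return vec

def user_feature_vector(acoustic_features: np.ndarray, language: str) -> np.ndarray:
    # first index set (acoustic) concatenated with second index set (language category)
    return np.concatenate([acoustic_features.astype(np.float32), one_hot_language(language)])

def fuse_feature_matrix(per_user: list[tuple[np.ndarray, str]]) -> np.ndarray:
    # one row per target user -> target feature matrix
    return np.vstack([user_feature_vector(feats, lang) for feats, lang in per_user])
```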
S105, inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model to carry out voice translation, and obtaining second voice data corresponding to each target user;
It should be noted that the preset speech translation model includes N first encoding networks, N first decoding networks, N second encoding networks and N second decoding networks. The first encoding network converts the input target feature matrix and the voice translation requirement into a high-dimensional feature vector; the first decoding network converts the feature vector into a plurality of feature coding sequences; the second encoding network converts the plurality of feature coding sequences into a low-dimensional representation of the voice signal; and the second decoding network converts the low-dimensional representation into the second voice data required by the target user. The server inputs the target feature matrix and the voice translation requirement of each target user into the first encoding network and the first decoding network for feature code conversion, using an auto-encoder or another deep learning model to extract hidden representations of the input features and obtain the plurality of feature coding sequences. The speech compiling function of the second encoding network and the second decoding network is implemented with an autoregressive model, a convolutional neural network or a similar architecture. During training, known speech data sets and translation data sets are used as supervision signals to learn the model parameters and optimize model performance; after training is completed, the second voice data required by the target user can be generated from the input feature coding sequences. According to the speech translation requirements of the different target users, the server adopts a multiplexing technique to feed multiple input streams into the speech translation model simultaneously, converting each voice signal into target-language text and then into the corresponding translated second voice data.
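To make the encoder/decoder layout concrete, the following PyTorch sketch wires a per-user pipeline in the shape described above. The layer types (a linear first encoder, GRU decoders/encoders), the dimensions and the requirement embedding are assumptions made for illustration, not the patented architecture.

```python
# Hypothetical PyTorch sketch: a first encoder/decoder pair turns a user's feature row
# plus a translation-requirement embedding into a feature-code sequence, and a second
# pair compiles that sequence into speech frames.
import torch
import torch.nn as nn

class UserTranslationPipeline(nn.Module):
    def __init__(self, feat_dim: int, n_requirements: int, code_dim: int = 64,
                 n_code_steps: int = 32, frame_dim: int = 80):
        super().__init__()
        self.req_embed = nn.Embedding(n_requirements, code_dim)
        self.first_encoder = nn.Linear(feat_dim + code_dim, code_dim)      # features -> latent
        self.first_decoder = nn.GRU(code_dim, code_dim, batch_first=True)  # latent -> code sequence
        self.second_encoder = nn.GRU(code_dim, code_dim, batch_first=True) # code sequence -> low-dim rep
        self.second_decoder = nn.Linear(code_dim, frame_dim)               # rep -> speech frames
        self.n_code_steps = n_code_steps

    def forward(self, user_features: torch.Tensor, requirement_id: torch.Tensor) -> torch.Tensor:
        # user_features: (batch, feat_dim); requirement_id: (batch,)
        latent = self.first_encoder(
            torch.cat([user_features, self.req_embed(requirement_id)], dim=-1))
        steps = latent.unsqueeze(1).repeat(1, self.n_code_steps, 1)   # unroll latent over time
        codes, _ = self.first_decoder(steps)                          # feature code sequence
        rep, _ = self.second_encoder(codes)
        return self.second_decoder(rep)                               # (batch, n_code_steps, frame_dim)
```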
S106, carrying out transmission channel allocation on the second voice data corresponding to each target user according to the first voice channels to obtain at least one second voice channel corresponding to each second voice data, and carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
Specifically, the server matches the second voice data corresponding to each target user with the corresponding voice channel among the first voice channels, and determines at least one second voice channel for each piece of second voice data through a preset channel mapping table. The server then constructs a requirement identifier and a transmission identifier for the at least one second voice channel: the requirement identifier mainly carries information related to the voice translation requirement, such as the target language and the translation model; the transmission identifier carries parameters relevant to data transmission, such as code rate, frame length and sampling rate. According to the requirement identifier and the transmission identifier, the server performs voice distribution and audio transmission of the second voice data through the at least one second voice channel. The audio data are converted into digital signals and transferred and processed between devices using known network protocols or other data transfer protocols, and the server uses data compression and retransmission mechanisms to improve transmission efficiency and reliability.
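A small data-model sketch of these identifiers and the channel allocation is given below; the field names, the codec default and the allocate_channels helper are all illustrative assumptions rather than elements of the patent.

```python
# Hypothetical sketch of step S106: attach a requirement identifier (target language,
# translation model) and a transmission identifier (codec, sample rate, bit rate) to
# each allocated second voice channel before distributing the translated audio.
from dataclasses import dataclass

@dataclass
class RequirementId:
    target_language: str
    translation_model: str

@dataclass
class TransmissionId:
    codec: str
    sample_rate: int
    bit_rate_kbps: int

@dataclass
class SecondVoiceChannel:
    channel_id: int
    requirement: RequirementId
    transmission: TransmissionId

def allocate_channels(first_channels: dict[str, int],
                      requirements: dict[str, RequirementId]) -> dict[str, SecondVoiceChannel]:
    """Map each user's first channel to a second channel carrying both identifiers."""
    channels = {}
    for user_id, first_channel in first_channels.items():
        channels[user_id] = SecondVoiceChannel(
            channel_id=first_channel,               # matched from the first voice channels
            requirement=requirements[user_id],
            transmission=TransmissionId(codec="opus", sample_rate=16000, bit_rate_kbps=24),
        )
    return channels
```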
In the embodiment of the invention, the first voice data are input into a first voice analysis model for voice feature analysis to obtain a voice feature analysis result, and voice feature data are extracted from the first voice data according to the voice feature analysis result to obtain speaker voice feature data; the speaker voice feature data are input into a second voice analysis model for language type analysis to obtain language type information; a target feature matrix is generated, and the target feature matrix and the voice translation requirement of each target user are input into a voice translation model for voice translation to obtain second voice data. By classifying and extracting the voice data according to the voice features of the speakers, performing voice translation through a pre-constructed voice translation model, and finally distributing the translated voice data to the corresponding target users according to the voice translation requirements of the different target users, the invention realizes intelligent voice translation and improves the accuracy of voice translation.
In a specific embodiment, the process of executing step S101 may specifically include the following steps:
(1) Performing voice acquisition on N target users based on a plurality of voice collectors in a preset voice intelligent translator to obtain first voice data;
(2) Target demand information of each target user is obtained respectively, and demand analysis is carried out on the target demand information to obtain voice translation demands of each target user;
(3) And according to the plurality of voice collectors, carrying out voice channel configuration on the target demand information of each target user to obtain a first voice channel of each target user.
Specifically, in a multi-user scenario the server performs voice acquisition on the N target users through a plurality of voice collectors in the preset intelligent voice translator, obtaining the first voice data. This approach helps collect the voice information of each individual in a group so as to better meet their needs. Target demand information of each target user is obtained respectively, and demand analysis is carried out on the target demand information to obtain the voice translation requirement of each target user. These voice translation requirements may cover language, dialect, accent, pronunciation and similar aspects, and such information may be obtained through communication or other forms of interaction with the user: if a user understands English, the exchange can be conducted in English; otherwise it is handled by translating from another language. When the target demand information is acquired, a target user can also customize the voice translation requirement through demand feedback. The server then configures a voice channel according to the target demand information of each target user to obtain the first voice channel of each target user. Voice channel configuration means setting appropriate parameters, such as sampling rate, coding mode and transmission protocol, to ensure that voice data can be accurately collected, transmitted and recognized; the server selects these parameters according to the language, accent, pronunciation and other characteristics of the target user. For example, for some users with heavy accents, higher sampling rates and more elaborate coding schemes may be employed to increase the recognition rate of the voice data. It should be noted that the standard sampling rate adopted in this embodiment is 44.1 kHz, i.e. the sampler samples the sound signal 44100 times per second, and the resulting digitized audio signal is used for subsequent language processing and storage; depending on the language, accent and pronunciation characteristics of the target user, other sampling rates such as 8 kHz or 16 kHz may be selected. The coding modes include MP3, AAC, ALAC, FLAC and WAV, and different coding modes result in different compression quality and file sizes. When collecting user voice signals in this embodiment, high-fidelity voice data are needed, so a higher sampling rate and a near-lossless coding mode, such as a 44.1 kHz sampling rate with the WAV format, can ensure the relative accuracy of the voice signals. In some heavy-accent situations, a lower sampling rate and a compressed coding mode, such as an 8 kHz sampling rate with MP3 coding, are used instead; this reduces the file size and the time cost of storage and transmission, but because part of the voice data is lost to compression, there is a greater risk of distortion in subsequent processing. In this embodiment, an appropriate sampling rate and coding mode are determined according to the actual situation and user requirements to achieve the best processing effect.
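A small configuration sketch of this sampling-rate and codec trade-off is shown below; the profile names and the simple selection rule are illustrative assumptions rather than the patent's procedure.

```python
# Hypothetical sketch of the capture-configuration choice described above: a lossless
# high-fidelity profile by default, and a compact compressed profile when storage or
# transmission cost dominates. Names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class CaptureConfig:
    sample_rate_hz: int
    codec: str

HIGH_FIDELITY = CaptureConfig(sample_rate_hz=44100, codec="WAV")   # lossless, larger files
COMPACT       = CaptureConfig(sample_rate_hz=8000,  codec="MP3")   # smaller, some distortion risk

def choose_capture_config(need_high_fidelity: bool) -> CaptureConfig:
    # prefer lossless capture for accuracy; fall back to compressed capture to save
    # storage and transmission time, accepting a higher risk of distortion downstream
    return HIGH_FIDELITY if need_high_fidelity else COMPACT
```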
In a specific embodiment, as shown in fig. 2, the process of executing step S102 may specifically include the following steps:
S201, inputting first voice data into a preset first voice analysis model, wherein the first voice analysis model comprises: a plurality of acoustic feature extraction modules;
S202, performing voice feature analysis on the first voice data through a plurality of acoustic feature extraction modules to obtain voice feature analysis results, wherein the voice feature analysis results comprise voice features of each target user;
S203, extracting voice characteristic data of the first voice data to obtain initial voice characteristic data;
S204, according to the voice characteristic analysis result, carrying out speaker classification extraction on the initial voice characteristic data to obtain speaker voice characteristic data corresponding to each target user.
Specifically, the first speech analysis model includes a plurality of acoustic feature extraction modules, which process the voice signal with different algorithms so that voice features at different levels, such as frequency-domain, time-domain and energy features, can be extracted. The server performs voice feature analysis on the first voice data through these acoustic feature extraction modules, extracting different features from the first voice data to describe properties of the sound such as pitch and fundamental frequency. Since the voiceprint of each speaker is different, the server can distinguish the target users from one another according to the result of the voice feature analysis. After the voice feature analysis result is obtained, voice feature data can be further extracted from the preliminary voice data. MFCC (Mel-frequency cepstral coefficient) extraction is adopted: on the basis of a human-voice separation feature model, the target voice is separated from the wide range of ordinary acoustic environments, other interfering signals and background environmental sounds are isolated, and the voice feature extraction function is realized. The server then classifies and extracts the speaker voice feature data according to the corresponding voice feature analysis result: the voice signal is recognized and analyzed, and matched with the features of the target user according to the analysis result, so that the speaker voice feature data corresponding to the target user are obtained. The acoustic feature extraction module comprises several network levels: an acoustic preprocessing layer that performs basic audio signal processing on the audio signal, such as noise reduction and pre-emphasis; a raw feature extraction layer that extracts the energy of different frequency bands through a Mel filter bank and converts the time-domain signal into the frequency domain using a fast Fourier transform; a forward feature processing layer that further preprocesses the raw sound features and converts them into a series of higher-level representations using a convolutional neural network (CNN); a state modeling layer that performs more complex modeling on the forward features, using methods such as a Hidden Markov Model (HMM) or a context-independent model; and a subsequent feature processing layer that weights, corrects or enhances the features output by the state modeling layer to better fit specific application scenarios. In this way the most representative real-environment conditions can be captured, and functions such as downloading or offline storage of the user's voice are supported.
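The raw-feature layer described above (a short-time FFT followed by a Mel filter bank) can be sketched in a few lines; the frame sizes below are common defaults assumed for illustration, not values given in the patent.

```python
# Hypothetical sketch of the raw-feature extraction layer: an STFT moves the signal
# into the frequency domain and a Mel filter bank collects per-band energies, which
# downstream layers (CNN, HMM) would then consume.
import librosa
import numpy as np

def log_mel_energies(signal: np.ndarray, sr: int, n_mels: int = 40) -> np.ndarray:
    spectrum = np.abs(librosa.stft(signal, n_fft=400, hop_length=160)) ** 2  # power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=400, n_mels=n_mels)            # Mel filter bank
    return np.log(mel_fb @ spectrum + 1e-10).T                               # (frames, n_mels)
```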
In a specific embodiment, as shown in fig. 3, the process of executing step S103 may specifically include the following steps:
S301, inputting speaker voice characteristic data corresponding to each target user into a preset second voice analysis model, wherein the second voice analysis model comprises: an input layer, a hidden layer, and an output layer;
S302, performing voice attribute classification on speaker voice feature data corresponding to each target user through a second voice analysis model to obtain a voice attribute predicted value corresponding to each target user;
S303, matching the language type information corresponding to the voice data of each speaker from a preset language type mapping model according to the voice attribute predicted value.
Specifically, the server classifies the voice attributes of the speaker corresponding to each target user using the preset second voice analysis model. The second voice analysis model comprises an input layer, a hidden layer and an output layer, and is designed and optimized with deep learning and related techniques. In the second voice analysis model, the input layer receives the specific voice features of the target users, this information is processed repeatedly by the hidden layer, the evaluation results of the data features are passed to the output layer, and features of different levels are fused together for voice attribute classification, yielding the voice attribute predicted value corresponding to each target user. Voice attributes here refer to information such as pronunciation, speed and intonation; for example, some target users speak slowly with a smooth tone, while others speak faster and more forcefully, with longer or shorter utterances, so the classification targets each user's individual acoustic characteristics. Finally, the server maps the voice attribute predicted values into the preset language type mapping model to aggregate the information effectively. The language type mapping model contains different acoustic features and language features and can map the prediction results to the appropriate language types. According to the result returned by the model, the language categories of all recorded sample data in the data template set can be classified and matched uniformly, and the language type information corresponding to each speaker's voice data is obtained by matching.
In a specific embodiment, as shown in fig. 4, the process of executing step S104 may specifically include the following steps:
S401, performing feature index conversion on speaker voice feature data to obtain a first index set;
S402, performing category index conversion on the language category information to obtain a second index set;
S403, carrying out vectorization recombination on the first index set and the second index set to generate a feature vector corresponding to each target user;
S404, performing matrix fusion on the feature matrix corresponding to each target user to obtain a target feature matrix.
Specifically, the server performs feature index conversion on the speaker voice feature data, which means converting the original voice feature data into a form suitable for processing by a specific algorithm. Before feature index conversion, the speaker voice features need to be preprocessed; preprocessing preserves the information in the voice signals as much as possible while eliminating unstable and noise-affected dimensions, so that more distinctive meaning can be extracted. The server then performs category index conversion on the language type information, which means mapping the symbolic representation of a language into a mathematical form that the model can compute with. For example, when modeling across multiple language categories such as Chinese and English, each category is mapped to a different vector or matrix. Example voices are generated from the input information so that the analysis and application sample data can be tested and analyzed in a more comparable manner, and the server performs vectorized recombination on the first index set and the second index set for subsequent prediction or analysis. The server generates a specific coding scheme and completes the classification and fusion of normalized heterogeneous audio, as well as the conversion, migration and recognition of various corpus meanings and feature extraction. After the feature vector corresponding to each target user is obtained, matrix fusion needs to be performed on the feature vectors in order to effectively capture or compress the complex relationships between different feature vectors and support large-scale training. Matrix fusion methods include Joint Mapping (JM), Matrix Completion (MC) and Feature Tree (FT), among others.
In a specific embodiment, the process of executing step S105 may specifically include the following steps:
(1) Inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model, wherein the voice translation model comprises the following steps: n first encoding networks, N first decoding networks, N second encoding networks, and N second decoding networks;
(2) Performing feature code conversion on the target feature matrix and the voice translation requirement of each target user through a first coding network and a first decoding network to obtain a plurality of feature code sequences;
(3) And respectively performing voice compiling on the plurality of characteristic coding sequences through a second coding network and a second decoding network to obtain second voice data corresponding to each target user.
Specifically, the server performs a translation model flow of feature code conversion and speech compiling based on the target feature matrix and the speech translation requirement of each target user. The speech translation model comprises N first coding networks, N first decoding networks, N second coding networks and N second decoding networks, and can be applied to a plurality of speech translation scenes. The server inputs the target feature matrix and the voice translation requirement of each target user into a preset voice translation model. The target feature matrix mainly consists of voice feature sequences extracted by some voice processing technologies. And then, performing feature code conversion on the target feature matrix and the voice translation requirement of each target user in a translation model through N first coding networks and first decoding networks to obtain a plurality of feature code sequences. In the process, the target feature matrix is converted into digitized information through the processing of the first coding network and is transmitted into the first decoding network to obtain a corresponding feature coding sequence. Meanwhile, the voice translation requirement of each target user can be input as another input parameter, and then the corresponding characteristic coding sequence is obtained through the processing of the first coding network and the first decoding network. And then, respectively carrying out voice compiling on the plurality of characteristic coding sequences through N second coding networks and N second decoding networks in the voice translation model, so as to obtain second voice data corresponding to each target user. In this stage, the translation requirement of each target user is fused with the corresponding feature coding sequence, then a voice coding sequence is generated through a second coding network, and voice coding is performed through decoding of a second decoding network, so that second voice data corresponding to each target user is obtained.
In a specific embodiment, the process of executing step S106 may specifically include the following steps:
(1) According to the voice translation requirement, matching voice channels of the second voice data from the first voice channels to obtain at least one second voice channel corresponding to each second voice data;
(2) Constructing a demand identifier and a transmission identifier of at least one second voice channel;
(3) And according to the demand identifier and the transmission identifier, carrying out voice distribution and audio transmission on the second voice data through at least one second voice channel.
Specifically, the server performs voice distribution and audio transmission of the second voice data through the at least one second voice channel. The voice channel corresponding to each piece of second voice data is matched from the first voice channels; in this process, the server determines the corresponding voice channel using techniques such as voiceprint recognition, and since different people have different voice characteristics, voiceprint recognition can be carried out through acoustic analysis, modeling and similar means. The server constructs the requirement identifier and the transmission identifier of the at least one second voice channel, determining the translation content and the form of output speech that the user needs. According to the requirement identifier and the transmission identifier, the server then performs voice distribution and audio transmission of the second voice data through the at least one second voice channel, sending the analyzed voice translation result to the user while ensuring transmission stability, real-time performance and other factors. At the same time, support for a plurality of different output modes also needs to be considered; for example, compressed transmission can be selected to improve transmission efficiency when the network environment is poor.
The above description is made on the audio analysis method based on speech translation in the embodiment of the present invention, and the following description is made on the audio analysis device based on speech translation in the embodiment of the present invention, referring to fig. 5, and one embodiment of the audio analysis device based on speech translation in the embodiment of the present invention includes:
the acquisition module 501 is configured to perform voice acquisition on N target users based on a preset voice intelligent translator, obtain first voice data, and obtain a voice translation requirement and a first voice channel of each target user;
the extraction module 502 is configured to input the first voice data into a preset first voice analysis model to perform voice feature analysis, obtain a voice feature analysis result, and extract voice feature data of the first voice data according to the voice feature analysis result, so as to obtain speaker voice feature data corresponding to each target user;
the analysis module 503 is configured to input the speaker voice feature data corresponding to each target user into a preset second voice analysis model to perform language type analysis, so as to obtain language type information corresponding to each speaker voice data;
the fusion module 504 is configured to generate a feature vector corresponding to each target user according to the speaker voice feature data and the language type information, and perform matrix fusion on a feature matrix corresponding to each target user to obtain a target feature matrix;
The translation module 505 is configured to input the target feature matrix and the voice translation requirement of each target user into a preset voice translation model to perform voice translation, so as to obtain second voice data corresponding to each target user;
and the transmission module 506 is configured to perform transmission channel allocation on the second voice data corresponding to each target user according to the first voice channel, obtain at least one second voice channel corresponding to each second voice data, and perform voice distribution and audio transmission on the second voice data through the at least one second voice channel.
Through the cooperation of the above modules, the first voice data are input into a first voice analysis model for voice feature analysis to obtain a voice feature analysis result, and voice feature data are extracted from the first voice data according to the voice feature analysis result to obtain speaker voice feature data; the speaker voice feature data are input into a second voice analysis model for language type analysis to obtain language type information; a target feature matrix is generated, and the target feature matrix and the voice translation requirement of each target user are input into a voice translation model for voice translation to obtain second voice data. By classifying and extracting the voice data according to the voice features of the speakers, performing voice translation through a pre-constructed voice translation model, and finally distributing the translated voice data to the corresponding target users according to the voice translation requirements of the different target users, the invention realizes intelligent voice translation and improves the accuracy of voice translation.
The voice translation-based audio analysis apparatus in the embodiment of the present invention is described in detail above in terms of the modularized functional entity in fig. 5, and the voice translation-based audio analysis device in the embodiment of the present invention is described in detail below in terms of hardware processing.
Fig. 6 is a schematic structural diagram of a voice translation-based audio analysis device 600 according to an embodiment of the present invention. The voice translation-based audio analysis device 600 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 610, a memory 620, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 633 or data 632. The memory 620 and the storage medium 630 may be transitory or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the speech translation based audio analysis device 600. Still further, the processor 610 may be configured to communicate with the storage medium 630 and execute the series of instruction operations in the storage medium 630 on the speech translation based audio analysis device 600.
The speech translation based audio analysis device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the device structure shown in fig. 6 does not constitute a limitation of the speech translation based audio analysis device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The present invention also provides a voice translation-based audio analysis device, which includes a memory and a processor, where the memory stores computer readable instructions that, when executed by the processor, cause the processor to execute the steps of the voice translation-based audio analysis method in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, where instructions are stored in the computer readable storage medium, which when executed on a computer, cause the computer to perform the steps of the speech translation based audio analysis method.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the prior art, or in whole or in part, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. An audio analysis method based on speech translation, which is characterized by comprising the following steps:
acquiring voice of N target users based on a preset voice intelligent translator to obtain first voice data, and acquiring voice translation requirements and first voice channels of each target user;
inputting the first voice data into a preset first voice analysis model to perform voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data of the first voice data according to the voice feature analysis result to obtain speaker voice feature data corresponding to each target user;
inputting the speaker voice characteristic data corresponding to each target user into a preset second voice analysis model to perform language type analysis, so as to obtain language type information corresponding to each speaker voice data;
generating a feature vector corresponding to each target user according to the speaker voice feature data and the language type information, and carrying out matrix fusion on the feature vector corresponding to each target user to obtain a target feature matrix;
inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model to carry out voice translation, and obtaining second voice data corresponding to each target user, wherein the method specifically comprises the following steps: inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model, wherein the voice translation model comprises: N first encoding networks, N first decoding networks, N second encoding networks, and N second decoding networks; performing feature code conversion on the target feature matrix and the voice translation requirement of each target user through the first encoding networks and the first decoding networks to obtain a plurality of feature code sequences; respectively performing voice compiling on the plurality of feature code sequences through the second encoding networks and the second decoding networks to obtain second voice data corresponding to each target user;
and carrying out transmission channel allocation on the second voice data corresponding to each target user according to the first voice channels to obtain at least one second voice channel corresponding to each second voice data, and carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
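As one hedged reading of the two-stage structure recited in this claim (N first encoding/decoding networks producing feature code sequences, then N second encoding/decoding networks compiling them per user), a PyTorch sketch follows; the GRU/linear layer choices, the dimensions, and the additive conditioning on the translation requirement are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    """Hypothetical sketch: N first encoding/decoding networks produce feature
    code sequences; N second encoding/decoding networks compile them into
    per-user speech features (second voice data)."""
    def __init__(self, n_users, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.first_encoders = nn.ModuleList(
            [nn.GRU(feat_dim, hidden_dim, batch_first=True) for _ in range(n_users)])
        self.first_decoders = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_users)])
        self.second_encoders = nn.ModuleList(
            [nn.GRU(hidden_dim, hidden_dim, batch_first=True) for _ in range(n_users)])
        self.second_decoders = nn.ModuleList(
            [nn.Linear(hidden_dim, feat_dim) for _ in range(n_users)])

    def forward(self, target_feature_matrix, translation_requirements):
        # target_feature_matrix: (n_users, frames, feat_dim)
        # translation_requirements: (n_users, feat_dim), an assumed encoding of each user's requirement
        outputs = []
        for i, user_feats in enumerate(target_feature_matrix):
            x = user_feats.unsqueeze(0) + translation_requirements[i]  # condition on the requirement
            enc, _ = self.first_encoders[i](x)            # feature code conversion
            code_seq = self.first_decoders[i](enc)        # feature code sequence
            dec, _ = self.second_encoders[i](code_seq)    # voice compiling
            outputs.append(self.second_decoders[i](dec).squeeze(0))
        return outputs

# Toy usage: two users, 50 frames of 80-dimensional features each.
model = SpeechTranslationModel(n_users=2)
feats = torch.randn(2, 50, 80)
reqs = torch.randn(2, 80)
second_voice = model(feats, reqs)   # list of two (50, 80) tensors
```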
2. The voice translation-based audio analysis method according to claim 1, wherein the voice acquisition is performed on N target users by the preset voice intelligent translator to obtain first voice data, and a voice translation requirement and a first voice channel of each target user are obtained, including:
performing voice acquisition on N target users based on a plurality of voice collectors in a preset voice intelligent translator to obtain first voice data;
acquiring target demand information of each target user respectively, and performing demand analysis on the target demand information to obtain the voice translation requirement of each target user;
and according to the voice collectors, carrying out voice channel configuration on the target demand information of each target user to obtain a first voice channel of each target user.
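A minimal sketch of how collector-based acquisition, demand analysis, and first-voice-channel configuration could be wired together is given below; the Collector class, the "en->zh" demand format, and the channel naming are purely illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Collector:
    collector_id: str
    sample_rate: int = 16000

def parse_requirement(demand_text: str) -> dict:
    """Illustrative demand analysis: 'en->zh' means translate English speech into Chinese."""
    src, dst = demand_text.split("->")
    return {"source_lang": src.strip(), "target_lang": dst.strip()}

def configure_channels(collectors, demands):
    """Assign one first voice channel per target user, keyed to the collector that recorded them."""
    channels = {}
    for user, collector in collectors.items():
        channels[user] = {
            "channel_id": f"ch_{collector.collector_id}",
            "requirement": parse_requirement(demands[user]),
        }
    return channels

collectors = {"user_a": Collector("mic_01"), "user_b": Collector("mic_02")}
demands = {"user_a": "en->zh", "user_b": "zh->en"}
print(configure_channels(collectors, demands))
```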
3. The voice translation-based audio analysis method according to claim 1, wherein the inputting the first voice data into a preset first voice analysis model for voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data from the first voice data according to the voice feature analysis result to obtain speaker voice feature data corresponding to each target user includes:
inputting the first voice data into a preset first voice analysis model, wherein the first voice analysis model comprises: a plurality of acoustic feature extraction modules;
performing voice feature analysis on the first voice data through the plurality of acoustic feature extraction modules to obtain a voice feature analysis result, wherein the voice feature analysis result comprises voice features of each target user;
extracting voice characteristic data from the first voice data to obtain initial voice characteristic data;
and according to the voice characteristic analysis result, carrying out speaker classification extraction on the initial voice characteristic data to obtain speaker voice characteristic data corresponding to each target user.
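One hedged way to realize the acoustic feature extraction modules and the speaker classification extraction is sketched below using MFCC features and frame clustering; the claim does not name MFCCs or k-means, so both are assumptions.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def extract_initial_features(audio, sr, n_mfcc=13):
    """Acoustic feature extraction module: frame-level MFCCs (an assumed choice)."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                  # (frames, n_mfcc)

def split_by_speaker(features, n_speakers):
    """Rough stand-in for speaker classification extraction:
    cluster frames and return one feature block per assumed speaker."""
    labels = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(features)
    return {f"speaker_{k}": features[labels == k] for k in range(n_speakers)}

# Toy example: one second of synthetic audio, pretending it contains two speakers.
sr = 16000
audio = np.random.randn(sr).astype(np.float32)
features = extract_initial_features(audio, sr)
per_speaker = split_by_speaker(features, n_speakers=2)
print({k: v.shape for k, v in per_speaker.items()})
```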
4. The voice translation-based audio analysis method according to claim 1, wherein the inputting the speaker voice feature data corresponding to each target user into a preset second voice analysis model to perform language type analysis, to obtain language type information corresponding to each speaker voice data, includes:
inputting the speaker voice characteristic data corresponding to each target user into a preset second voice analysis model, wherein the second voice analysis model comprises: an input layer, a hidden layer, and an output layer;
performing voice attribute classification on the speaker voice characteristic data corresponding to each target user through the second voice analysis model to obtain a voice attribute predicted value corresponding to each target user;
and matching the language type information corresponding to the voice data of each speaker from a preset language type mapping model according to the voice attribute predicted value.
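The input/hidden/output layer structure and the mapping from a voice attribute predicted value to a language type could look roughly like the sketch below; the layer sizes and the LANGUAGE_MAP table are assumed for illustration.

```python
import torch
import torch.nn as nn

# Assumed stand-in for the "language type mapping model": predicted class index -> language label.
LANGUAGE_MAP = {0: "zh", 1: "en", 2: "ja"}

class LanguageTypeClassifier(nn.Module):
    """Input layer -> hidden layer -> output layer, as recited in the claim."""
    def __init__(self, feat_dim=13, hidden_dim=64, n_languages=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),      # input layer
            nn.ReLU(),                            # hidden layer activation
            nn.Linear(hidden_dim, n_languages))   # output layer

    def forward(self, speaker_features):
        return self.net(speaker_features)

model = LanguageTypeClassifier()
speaker_features = torch.randn(2, 13)              # two speakers' pooled features
predicted = model(speaker_features).argmax(dim=1)  # voice attribute predicted value
print([LANGUAGE_MAP[int(i)] for i in predicted])
```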
5. The voice translation-based audio analysis method according to claim 1, wherein the generating feature vectors corresponding to each target user according to the speaker voice feature data and the language type information, and performing matrix fusion on the feature vectors corresponding to each target user to obtain a target feature matrix, includes:
performing characteristic index conversion on the voice characteristic data of the speaker to obtain a first index set;
performing category index conversion on the language category information to obtain a second index set;
performing vectorization recombination on the first index set and the second index set to generate a feature vector corresponding to each target user;
and carrying out matrix fusion on the feature vectors corresponding to each target user to obtain a target feature matrix.
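A small numpy sketch of the index conversion, vectorized recombination, and matrix fusion steps follows; the concrete index sets (rounded feature values and a one-hot language code) are assumptions, since the claim leaves them unspecified.

```python
import numpy as np

# Assumed index table; the patent does not define the concrete index sets.
LANGUAGE_INDEX = {"zh": 0, "en": 1, "ja": 2}

def feature_index_set(speaker_features):
    """First index set: quantized feature values (an assumption)."""
    return np.round(np.asarray(speaker_features, dtype=float), 2)

def language_index_set(language):
    """Second index set: one-hot code of the language type."""
    one_hot = np.zeros(len(LANGUAGE_INDEX))
    one_hot[LANGUAGE_INDEX[language]] = 1.0
    return one_hot

def build_target_feature_matrix(per_user_features, per_user_language):
    """Vectorized recombination per user, then matrix fusion by row-stacking."""
    vectors = [np.concatenate([feature_index_set(per_user_features[u]),
                               language_index_set(per_user_language[u])])
               for u in per_user_features]
    return np.stack(vectors)

features = {"user_a": [0.12, 0.88, 0.33], "user_b": [0.45, 0.10, 0.76]}
languages = {"user_a": "en", "user_b": "zh"}
print(build_target_feature_matrix(features, languages))   # shape (2, 6)
```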
6. The voice translation-based audio analysis method according to claim 1, wherein the performing transmission channel allocation on the second voice data corresponding to each target user according to the first voice channel to obtain at least one second voice channel corresponding to each second voice data, and performing voice distribution and audio transmission on the second voice data through the at least one second voice channel includes:
according to the voice translation requirement, matching the voice channels of the second voice data from the first voice channels to obtain at least one second voice channel corresponding to each second voice data;
constructing a demand identifier and a transmission identifier of the at least one second voice channel;
and according to the requirement identification and the transmission identification, carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
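The channel matching, demand/transmission identifiers, and distribution step could be sketched as below; reusing the first voice channel as the second voice channel and the identifier formats are illustrative assumptions.

```python
import uuid

def allocate_channels(first_channels, requirements):
    """Match each user's translated audio to at least one second voice channel,
    tagging it with a demand identifier and a transmission identifier."""
    allocations = {}
    for user, requirement in requirements.items():
        allocations[user] = {
            "second_channel": first_channels[user],   # reusing the first channel is an assumption
            "demand_id": f"req-{user}-{requirement}",
            "transmission_id": uuid.uuid4().hex,
        }
    return allocations

def distribute(second_voice_data, allocations):
    """Send each user's translated audio over its allocated channel (stubbed as a dict)."""
    return {alloc["second_channel"]: {"audio": second_voice_data[user],
                                      "demand_id": alloc["demand_id"],
                                      "transmission_id": alloc["transmission_id"]}
            for user, alloc in allocations.items()}

first_channels = {"user_a": "ch_1", "user_b": "ch_2"}
requirements = {"user_a": "en->zh", "user_b": "zh->en"}
second_voice_data = {"user_a": b"...pcm bytes...", "user_b": b"...pcm bytes..."}
print(distribute(second_voice_data, allocate_channels(first_channels, requirements)))
```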
7. An audio analysis device based on speech translation, characterized in that the audio analysis device based on speech translation comprises:
the acquisition module is used for carrying out voice acquisition on N target users based on a preset voice intelligent translator to obtain first voice data, and acquiring voice translation requirements and first voice channels of each target user;
the extraction module is used for inputting the first voice data into a preset first voice analysis model to perform voice feature analysis to obtain a voice feature analysis result, and extracting voice feature data of the first voice data according to the voice feature analysis result to obtain speaker voice feature data corresponding to each target user;
the analysis module is used for inputting the voice characteristic data of the speaker corresponding to each target user into a preset second voice analysis model to perform language type analysis, so as to obtain language type information corresponding to the voice data of each speaker;
the fusion module is used for generating a feature vector corresponding to each target user according to the speaker voice feature data and the language type information, and carrying out matrix fusion on the feature vector corresponding to each target user to obtain a target feature matrix;
the translation module is configured to input the target feature matrix and a voice translation requirement of each target user into a preset voice translation model to perform voice translation, and obtain second voice data corresponding to each target user, and specifically includes: inputting the target feature matrix and the voice translation requirement of each target user into a preset voice translation model, wherein the voice translation model comprises: N first encoding networks, N first decoding networks, N second encoding networks, and N second decoding networks; performing feature code conversion on the target feature matrix and the voice translation requirement of each target user through the first encoding networks and the first decoding networks to obtain a plurality of feature code sequences; respectively performing voice compiling on the plurality of feature code sequences through the second encoding networks and the second decoding networks to obtain second voice data corresponding to each target user;
the transmission module is used for carrying out transmission channel allocation on the second voice data corresponding to each target user according to the first voice channels to obtain at least one second voice channel corresponding to each second voice data, and carrying out voice distribution and audio transmission on the second voice data through the at least one second voice channel.
8. An audio analysis device based on speech translation, characterized in that the audio analysis device based on speech translation comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the speech translation based audio analysis device to perform the speech translation based audio analysis method of any of claims 1-6.
9. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the speech translation based audio analysis method of any of claims 1-6.
CN202310615745.8A 2023-05-29 2023-05-29 Voice translation-based audio analysis method and device Active CN116343751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310615745.8A CN116343751B (en) 2023-05-29 2023-05-29 Voice translation-based audio analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310615745.8A CN116343751B (en) 2023-05-29 2023-05-29 Voice translation-based audio analysis method and device

Publications (2)

Publication Number Publication Date
CN116343751A CN116343751A (en) 2023-06-27
CN116343751B true CN116343751B (en) 2023-08-11

Family

ID=86888020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310615745.8A Active CN116343751B (en) 2023-05-29 2023-05-29 Voice translation-based audio analysis method and device

Country Status (1)

Country Link
CN (1) CN116343751B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447486A (en) * 2018-02-28 2018-08-24 科大讯飞股份有限公司 A kind of voice translation method and device
CN109686363A (en) * 2019-02-26 2019-04-26 深圳市合言信息科技有限公司 A kind of on-the-spot meeting artificial intelligence simultaneous interpretation equipment
CN111478971A (en) * 2020-04-14 2020-07-31 青岛联合视界数字传媒有限公司 Multilingual translation telephone system and translation method
WO2020205233A1 (en) * 2019-03-29 2020-10-08 Google Llc Direct speech-to-speech translation via machine learning
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN114822499A (en) * 2022-04-26 2022-07-29 北京有竹居网络技术有限公司 Model training method, speech-to-speech translation method, device and medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447486A (en) * 2018-02-28 2018-08-24 科大讯飞股份有限公司 A kind of voice translation method and device
CN109686363A (en) * 2019-02-26 2019-04-26 深圳市合言信息科技有限公司 A kind of on-the-spot meeting artificial intelligence simultaneous interpretation equipment
WO2020205233A1 (en) * 2019-03-29 2020-10-08 Google Llc Direct speech-to-speech translation via machine learning
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN111478971A (en) * 2020-04-14 2020-07-31 青岛联合视界数字传媒有限公司 Multilingual translation telephone system and translation method
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN114822499A (en) * 2022-04-26 2022-07-29 北京有竹居网络技术有限公司 Model training method, speech-to-speech translation method, device and medium

Also Published As

Publication number Publication date
CN116343751A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
KR20170009338A (en) Modeling apparatus for voice recognition and method and apparatus for voice recognition
CN1237259A (en) Process for adaption of hidden markov sound model in speech recognition system
CN112765323B (en) Voice emotion recognition method based on multi-mode feature extraction and fusion
KR102221513B1 (en) Voice emotion recognition method and system
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
Ghule et al. Feature extraction techniques for speech recognition: A review
CN111461173A (en) Attention mechanism-based multi-speaker clustering system and method
KR20190112682A (en) Data mining apparatus, method and system for speech recognition using the same
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
Nahar et al. A holy quran reader/reciter identification system using support vector machine
CN112397054A (en) Power dispatching voice recognition method
CN114420169B (en) Emotion recognition method and device and robot
JPH08123462A (en) Speech recognition device
Kumar et al. Machine learning based speech emotions recognition system
CN116343751B (en) Voice translation-based audio analysis method and device
CN116682463A (en) Multi-mode emotion recognition method and system
KR102429365B1 (en) System and method for analyzing emotion of speech
Daouad et al. An automatic speech recognition system for isolated Amazigh word using 1D & 2D CNN-LSTM architecture
CN113555022A (en) Voice-based same-person identification method, device, equipment and storage medium
CN113990325A (en) Streaming voice recognition method and device, electronic equipment and storage medium
CN113782005A (en) Voice recognition method and device, storage medium and electronic equipment
Lingam Speaker based language independent isolated speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant