CN116524931A

CN116524931A - System, method, electronic equipment and medium for converting voice of 5G rich media message into text

Info

Publication number: CN116524931A
Application number: CN202310437314.7A
Authority: CN
Inventors: 沈浩; 黄海量; 吴东进; 韩松乔; 吴优
Original assignee: Shanghai Zhixun Information Technology Co ltd
Current assignee: Shanghai Zhixun Information Technology Co ltd
Priority date: 2023-04-23
Filing date: 2023-04-23
Publication date: 2023-08-01

Abstract

The embodiment of the invention discloses a system, a method, electronic equipment and a medium for converting voice of a 5G rich media message into text. The method comprises the following steps: acquiring video information in the 5G rich media message, constructing a video message information set Y based on the video information, and constructing a training set based on the video message information set Y; establishing an improved CTC model, wherein the improved CTC model is formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function; training the improved CTC model with the training set to obtain a trained improved CTC model; and converting the video message information set Y into a video text set y through the trained improved CTC model. The method solves the prior-art problem that the recognition time of the CTC model grows as noise in the voice environment increases.

Description

System, method, electronic equipment and medium for converting voice of 5G rich media message into text

Technical Field

The invention relates to the technical field of computers, in particular to a system, a method, electronic equipment and a medium for converting voice of a 5G rich media message into text.

Background

The 5G rich media message is a great leap in the communication capability of the short message industry. Compared with the traditional text short message, the 5G rich media message supports more media formats and richer forms of expression: it can carry rich media information such as long text, pictures, voice and video, and it also provides user interaction and feedback capabilities such as official accounts and mini programs, so that its application scenarios, content quality and range of application are greatly improved.

5G rich media messages are often used for marketing video material, whose background sound contains abundant noise. Although the decoding scheme of the traditional CTC model effectively solves the misalignment problem of attention-based voice recognition methods, as the noise in the voice environment increases, the probability of the feasible paths that the traditional CTC model finds through maximum likelihood estimation grows exponentially, and once the traditional CTC model finds a dominant feasible path, the error information concentrates near that path, so that the recognition time is greatly prolonged. Considering the high efficiency and timeliness required for 5G rich media message transmission, a deep learning neural network algorithm with an excessively long processing time cannot be used for the voice-to-text conversion of 5G rich media messages.

Therefore, there is a need for a voice-to-text method for 5G rich media messages with short processing time.

Disclosure of Invention

The embodiment of the invention aims to provide a system, a method, electronic equipment and a medium for converting voice of a 5G rich media message into text, which are used for solving the prior-art problem that the recognition time of a CTC model grows as noise in the voice environment increases.

In order to achieve the above objective, an embodiment of the present invention provides a method for converting voice of a 5G rich media message into text, where the method specifically includes:

acquiring video information in a 5G rich media message, constructing a video message information set Y based on the video information, and constructing a training set based on the video message information set Y;

establishing an improved CTC model, wherein the improved CTC model is formed by introducing a maximum entropy function on the basis of the CTC model to improve an original loss function of CTC in the CTC model;

training the improved CTC model through the training set to obtain a trained improved CTC model;

and converting the video message information set Y into a video text set y through the trained improved CTC model.

Based on the technical scheme, the invention can also be improved as follows:

further, the voice-to-text method for the 5G rich media message further comprises the following steps:

acquiring a text message in the 5G rich media message;

constructing a text message information set X based on the text message and the video text set y;

constructing a sensitive word variant recognition model;

and sequentially inputting the text messages in the text message information set X into the sensitive word variant recognition model to judge whether all the text messages in the text message information set X are all compliant.

acquiring a picture message in a 5G rich media message, and constructing a picture message information set Z based on the picture message;

constructing a bad picture classification model and a feature extraction model;

performing feature extraction on the picture message information set Z based on the feature extraction model to obtain a picture depth feature set z;

and sequentially inputting the pictures in the picture depth feature set z into the bad picture classification model to judge whether all the pictures in the picture depth feature set z are all compliant.

and when all the text messages in the text message information set X are in compliance and all the pictures in the picture depth feature set z are in compliance, judging that the 5G rich media message can be normally sent.

Further, establishing the improved CTC model, wherein the improved CTC model is formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function, comprises the following steps:

improving the original CTC loss function through formula 1:

$$L_{\text{improved}} = L_{\text{CTC}} - \alpha \, H(p(\pi \mid l, X)) \qquad (1)$$

where $L_{\text{improved}}$ is the loss function of the improved CTC model, $L_{\text{CTC}}$ is the original CTC loss function, $\alpha$ is the coefficient of the maximum conditional entropy regularization, and $H(p(\pi \mid l, X))$ is the entropy of the feasible paths for the given input sequence and target sequence.

solving for $H(p(\pi \mid l, X))$ by equation 2;

where $p(\pi \mid l, X)$ represents the conditional probability of a feasible path $\pi$ given the 5G voice information $X$ and the true output $l$;

$\log p(\pi \mid X)$ represents the logarithm of the conditional probability of a feasible path $\pi$ given the 5G voice information $X$; and the summation term represents the sum of all output conditional probabilities of the 5G voice information $X$, regardless of whether the true output $l$ is given.

Further, the training the improved CTC model by the training set to obtain a trained improved CTC model includes:

dividing the video message information set Y into a training set, a testing set and a verification set;

training the improved CTC model based on the training set;

performing performance verification on the improved CTC model based on the verification set, and storing the improved CTC model meeting performance conditions;

and evaluating a conversion result of the improved CTC model based on the test set.
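The text does not name the metric used to evaluate the conversion result on the test set. As a minimal sketch, assuming a character error rate based on edit distance is an acceptable stand-in, the evaluation could look like this. The function names `edit_distance` and `character_error_rate` are hypothetical and not taken from the patent.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

A lower value means the converted text is closer to the reference transcript; the threshold that counts as meeting the performance conditions is left open by the text.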

A 5G rich media message voice to text system comprising:

the acquisition module is used for acquiring video information in the 5G rich media message;

the first construction module is used for constructing a video message information set Y based on the video information and constructing a training set based on the video message information set Y;

an improved CTC model, formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function;

the training module is used for training the improved CTC model through the training set to obtain a trained improved CTC model;

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when the computer program is executed.

A non-transitory computer readable medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method.

The embodiment of the invention has the following advantages:

According to the voice-to-text method for the 5G rich media message provided by the embodiment of the invention, video information in the 5G rich media message is obtained, a video message information set Y is constructed based on the video information, and a training set is constructed based on the video message information set Y; an improved CTC model is established, wherein the improved CTC model is formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function; the improved CTC model is trained with the training set to obtain a trained improved CTC model; and the video message information set Y is converted into a video text set y through the trained improved CTC model, so that the prior-art problem that the recognition time of the CTC model grows as noise in the voice environment increases is solved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.

The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the ambit of the technical disclosure.

FIG. 1 is a flow chart of the voice-to-text method of the present invention for a 5G rich media message;

FIG. 2 is a first architecture diagram of the 5G rich media message voice to text system of the present invention;

FIG. 3 is a second architecture diagram of the 5G rich media message voice to text system of the present invention;

FIG. 4 is a graph showing the result of the loss function comparison experiment of the present invention;

FIG. 5 is a schematic diagram of the physical structure of an electronic device according to the present invention.

Wherein the reference numerals are as follows:

the system comprises an acquisition module 10, a first construction module 20, an improved CTC model 30, a sensitive word variant recognition model 40, a feature extraction model 50, a bad picture classification model 60, an electronic device 70, a processor 701, a memory 702 and a bus 703.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following detailed description of specific embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.

Examples

FIG. 1 is a flowchart of an embodiment of the voice-to-text method for a 5G rich media message. As shown in FIG. 1, the method provided by the embodiment of the invention comprises the following steps:

S101, acquiring video information in a 5G rich media message, constructing a video message information set Y based on the video information, and constructing a training set based on the video message information set Y;

Specifically, a 5G rich media message is obtained. Because the maximum capacity of a single 5G rich media message is 3M, one 5G rich media message can contain several text messages, several pictures and several video/audio segments. A 5G rich media message may be denoted as $T_{xyz}$; $T$ may comprise a text message information set $X(x_1, x_2, \ldots)$, a video message information set $Y(y_1, y_2, \ldots)$ and a picture message information set $Z(z_1, z_2, \ldots)$.

The video message information set $Y$ contains video content and audio content, and both must be checked in the security compliance check. The video message information set $Y(y_1, y_2, \ldots)$ is therefore converted into a video text set $y$ and a video sample picture set $y'$, and a training set is constructed based on the video message information set $Y$.
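For illustration only, the message structure described above might be represented as follows. The class name `RichMediaMessage`, its field names, and the reading of "3M" as 3 × 1024 × 1024 bytes are assumptions made for this sketch; the patent does not define a data format.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: "3M" is read as 3 * 1024 * 1024 bytes.
MAX_MESSAGE_BYTES = 3 * 1024 * 1024


@dataclass
class RichMediaMessage:
    """Hypothetical container for one 5G rich media message T."""
    texts: List[str] = field(default_factory=list)       # text set X
    videos: List[bytes] = field(default_factory=list)    # video/audio set Y
    pictures: List[bytes] = field(default_factory=list)  # picture set Z

    def payload_size(self) -> int:
        """Total size in bytes of all text, video and picture parts."""
        text_bytes = sum(len(t.encode("utf-8")) for t in self.texts)
        return text_bytes + sum(map(len, self.videos)) + sum(map(len, self.pictures))

    def within_capacity(self) -> bool:
        """True if the message fits the assumed single-message capacity."""
        return self.payload_size() <= MAX_MESSAGE_BYTES
```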

S102, constructing an improved CTC model, wherein the improved CTC model is formed by introducing a maximum entropy function on the basis of the CTC model to improve an original loss function of CTC in the CTC model;

Specifically, CTC (Connectionist Temporal Classification) can be understood as a neural-network-based method for classifying time series. Assume that the 5G voice information is an input sequence $X_{1:T}$ of length $T$. The conditional probability of the true output $l$ given by the CTC speech-to-text model is:

$$p(l \mid X_{1:T}) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid X_{1:T})$$

where $\pi$ represents a label path that the CTC model can observe across the whole sequence through maximum likelihood estimation; $B$ is the many-to-one mapping from all feasible paths $\pi$ to the true output $l$; and $p(\pi \mid X_{1:T})$ is the conditional probability of a feasible path $\pi$ given the 5G voice information input sequence $X_{1:T}$. The loss incurred when a path $\pi$ is selected increases exponentially when the noise in the 5G voice information is too loud.

The summation $\sum_{\pi \in B^{-1}(l)} p(\pi \mid X_{1:T})$ is the sum of the conditional probabilities of all feasible paths given the 5G voice information input sequence $X_{1:T}$; this sum, $p(l \mid X_{1:T})$, equals the conditional probability of the true output $l$ given the input sequence $X_{1:T}$. The core of the CTC model is to obtain the best speech recognition result by maximizing this conditional probability.

The original CTC loss function $L_{\text{CTC}}$ expands exponentially when the noise is too loud, because it is obtained by taking the logarithm of the conditional probability $p(l \mid X_{1:T})$, namely:

$$L_{\text{CTC}} = -\log p(l \mid X_{1:T})$$

When the 5G voice information is noisy, $L_{\text{CTC}}$ expands exponentially, and the CTC model needs a great deal of time to maximize the conditional probability.
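The sum over feasible paths described above can be checked on a toy input by brute force. The sketch below assumes that label index 0 is the CTC blank and that `probs[t][k]` holds the per-frame probability of label `k`; these conventions and all function names are illustrative and not taken from the patent. Real implementations use the forward-backward recursion rather than enumeration, which is only practical for very short sequences.

```python
import itertools
import math

BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def collapse(path):
    """Many-to-one mapping B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def ctc_probability(probs, target):
    """p(l | X): sum of p(pi | X) over every path pi that collapses to l."""
    n_frames, n_labels = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_labels), repeat=n_frames):
        if collapse(path) == tuple(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


def ctc_loss(probs, target):
    """Negative log-likelihood of the target sequence."""
    return -math.log(ctc_probability(probs, target))
```

With two frames and uniform probabilities over {blank, 1}, three of the four paths collapse to the target (1), so `ctc_probability([[0.5, 0.5], [0.5, 0.5]], [1])` evaluates to 0.75.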

The original CTC loss function is improved through formula 1:

$$L_{\text{improved}} = L_{\text{CTC}} - \alpha \, H(p(\pi \mid l, X)) \qquad (1)$$

where $L_{\text{improved}}$ is the loss function of the improved CTC model 30, $L_{\text{CTC}}$ is the original CTC loss function, $\alpha$ is the coefficient of the maximum conditional entropy regularization, and $H(p(\pi \mid l, X))$ is the entropy of the feasible paths for the given input sequence and target sequence.

FIG. 4 shows the result of the loss function comparison experiment. The improvement can be understood as subtracting a fluctuating entropy term from the original exponential CTC loss function, so that the values of the loss function proposed by this patent are confined to a bounded interval starting at 0, which prevents the loss function from expanding without limit when the 5G voice information contains a large amount of noise. The video text set y can thus be obtained quickly and accurately from the video message information set Y.
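The subtraction of an entropy term from the original CTC loss can be sketched numerically on a toy input. This is a brute-force illustration under stated assumptions, not the patented implementation: label 0 is taken as the blank, `probs[t][k]` is the per-frame probability of label `k`, the entropy is the Shannon entropy of the feasible paths given the target, and all function names are hypothetical.

```python
import itertools
import math

BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def collapse(path):
    """Many-to-one mapping B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def feasible_path_probs(probs, target):
    """p(pi | X) for every path pi that collapses to the target."""
    n_frames, n_labels = len(probs), len(probs[0])
    return [math.prod(probs[t][s] for t, s in enumerate(path))
            for path in itertools.product(range(n_labels), repeat=n_frames)
            if collapse(path) == tuple(target)]


def improved_ctc_loss(probs, target, alpha):
    """Return (improved loss, original CTC loss, path entropy)."""
    path_probs = feasible_path_probs(probs, target)
    p_target = sum(path_probs)                       # p(l | X)
    ctc = -math.log(p_target)                        # original CTC loss
    posteriors = [p / p_target for p in path_probs]  # p(pi | l, X)
    entropy = -sum(q * math.log(q) for q in posteriors if q > 0.0)
    return ctc - alpha * entropy, ctc, entropy
```

For two uniform frames and target (1), the three feasible paths are equally likely, so the entropy is log 3 and the improved loss is the original loss minus α · log 3.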

$H(p(\pi \mid l, X))$ is solved by equation 2;

where $\log p(\pi \mid X)$ represents the logarithm of the conditional probability of a feasible path $\pi$ given the 5G voice information $X$, and the summation term represents the sum of all output conditional probabilities of the 5G voice information $X$, regardless of whether the true output $l$ is given.
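As a worked sketch, assuming equation 2 follows the standard Shannon entropy over the feasible paths, the entropy term can be expanded in terms of the quantities named above. The notation $B^{-1}(l)$ for the set of paths that map to $l$ is an assumption introduced here, and the exact form printed in the patent may differ.

```latex
% Entropy of the feasible paths, given the input X and the target l
H\bigl(p(\pi \mid l, X)\bigr)
  = -\sum_{\pi \in B^{-1}(l)} p(\pi \mid l, X)\,\log p(\pi \mid l, X)

% Bayes' rule restricted to feasible paths
p(\pi \mid l, X) = \frac{p(\pi \mid X)}{p(l \mid X)},
\qquad
p(l \mid X) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid X)

% Substituting and using the fact that the posteriors sum to one
H\bigl(p(\pi \mid l, X)\bigr)
  = -\frac{1}{p(l \mid X)} \sum_{\pi \in B^{-1}(l)} p(\pi \mid X)\,\log p(\pi \mid X)
    + \log p(l \mid X)
```

The last line only needs $p(\pi \mid X)$, $\log p(\pi \mid X)$ and the sum $p(l \mid X)$, which matches the three quantities the text lists for equation 2.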

The loss function used in the present invention may be selected from L1Loss, MSELoss, CrossEntropyLoss and the like, without a large difference in the final effect of the improved CTC model 30.

S103, training the improved CTC model through a training set to obtain a trained improved CTC model;

specifically, the video message information set Y is divided into a training set, a testing set and a verification set;

training the improved CTC model 30 based on the training set; performing performance verification on the improved CTC model 30 based on the verification set, and storing the improved CTC model 30 that meets the performance conditions; and evaluating the conversion results of the improved CTC model 30 based on the test set.
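The division of the video message information set Y into training, verification and test sets could be sketched as follows. The 80/10/10 proportions, the fixed random seed and the function name `split_dataset` are assumptions chosen for illustration; the patent does not state split ratios.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the items and split them into (training, verification, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    training = items[:n_train]
    verification = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return training, verification, test
```

Keeping the three subsets disjoint ensures that the model stored after verification is evaluated on samples it has not seen.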

S104, converting the video message information set Y into a video text set y through the trained improved CTC model.
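The patent does not specify how the trained model's per-frame outputs are decoded into text. A minimal sketch, assuming best-path (greedy) decoding with label 0 as the blank, is shown here; the function name `greedy_ctc_decode` and the `alphabet` argument are hypothetical.

```python
BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def greedy_ctc_decode(probs, alphabet):
    """Pick the most likely label per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    chars, prev = [], None
    for label in best_path:
        if label != prev and label != BLANK:
            chars.append(alphabet[label])
        prev = label
    return "".join(chars)
```

Beam search is a common alternative when accuracy matters more than speed; greedy decoding is used here only because it is the shortest correct example.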

The voice-to-text method for the 5G rich media message further comprises the following steps:

acquiring a text message in the 5G rich media message; constructing a text message information set X based on the text message and the video text set y;

constructing a sensitive word variant recognition model 40; preferably, the sensitive word variant recognition model 40 is a TextCNN model, and bad short-text recognition based on the TextCNN model is already used in short message text review.

Besides the TextCNN model, the sensitive word variant recognition model 40 used in the invention can be replaced by models such as CRNN and LSTM+CTC, with no great difference in recognition effect.

First, the 5G rich media message to be processed is preprocessed, including digit character normalization, English character normalization, traditional-to-simplified Chinese conversion, special-meaning symbol processing, symbol noise removal, unified continuous digital payment representation, character string segmentation and the like.
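A few of the preprocessing steps listed above can be sketched with the Python standard library. This sketch covers only digit and English character normalization and symbol noise removal; traditional-to-simplified conversion and string segmentation need a conversion dictionary and a tokenizer, which are outside its scope. The function name `normalize_text` and the exact rules are assumptions.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalise digits and letters and strip symbol noise from a short text."""
    # NFKC folds full-width digits and letters into their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Lower-case so that case variants of English words match.
    text = text.lower()
    # Remove symbol noise: keep only word characters (including CJK) and spaces.
    text = re.sub(r"[^\w\s]", "", text).replace("_", "")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

For example, a variant such as `加*微*信` is reduced to `加微信`, so that a downstream sensitive-word model sees the unobfuscated form.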

Second, the short text is vectorized through word2vec, the text vector undergoes high-dimensional convolution and expansion in the convolution layer, the sensitive vocabulary vectors are activated by a pooling layer and a fully connected layer, and the hit probability of the sensitive vocabulary is calculated through the SoftMax function. The SoftMax function expression chosen here is as follows:

$$\mathrm{SoftMax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

where $x$ represents a word vector and $x_i$ is its $i$-th component.
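The SoftMax step can be sketched in a few lines. Subtracting the maximum score before exponentiating is a standard numerical-stability measure added here; it does not change the result. The function name `softmax` and the use of plain Python lists are choices made for this illustration.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    peak = max(scores)                            # shift to avoid overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In a two-class setting (sensitive versus not sensitive), the first output can be read as the hit probability of the sensitive vocabulary.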

Finally, the text messages in the text message information set X are sequentially input into the sensitive word variant recognition model 40 to judge whether all of them are compliant. If a text message is judged non-compliant, it is handed over to manual judgment or an early warning is raised. If the text messages are judged compliant, the subsequent judging process is entered.

constructing a bad picture classification model 60 and a feature extraction model 50;

performing feature extraction on the picture message information set Z based on the feature extraction model 50 to obtain a picture depth feature set z; preferably, the original image feature extraction methods used in the invention are LBP, HOG and SIFT; other similar feature extraction algorithms can be substituted without greatly affecting the final effect of the bad picture classification model 60.
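Of the feature extraction methods named above, LBP is the simplest to show in full. The sketch below computes the basic 3×3 local binary pattern histogram of a grayscale image given as a list of rows; the neighbour ordering, the `>=` comparison and the function names are conventions assumed for this example, and production code would normally use an image library instead.

```python
# Clockwise neighbour offsets starting at the top-left pixel (assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, row, col):
    """8-bit LBP code of one interior pixel."""
    center = img[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for row in range(1, len(img) - 1):
        for col in range(1, len(img[0]) - 1):
            hist[lbp_code(img, row, col)] += 1
    return hist
```

The resulting histogram is a fixed-length feature vector that could be passed to a classifier regardless of the image size.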

The pictures in the picture depth feature set z are sequentially input into the bad picture classification model 60 to judge whether all the pictures in the picture depth feature set z are compliant. If a picture, or any characteristic information in the picture, is judged non-compliant, the picture is judged non-compliant. If the picture and all characteristic information in the picture are judged compliant, the picture is judged compliant.
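The overall decision rule, in which a message is sent only when every text and every picture is compliant and is otherwise referred for manual judgment, can be sketched as follows. The function name `review_message`, the two classifier callbacks and the return strings are hypothetical placeholders for the models 40 and 60.

```python
from typing import Callable, Iterable, Sequence


def review_message(
    texts: Iterable[str],
    picture_features: Iterable[Sequence[float]],
    text_is_compliant: Callable[[str], bool],
    picture_is_compliant: Callable[[Sequence[float]], bool],
) -> str:
    """Return 'send' if every part is compliant, else 'manual_review'."""
    texts_ok = all(text_is_compliant(t) for t in texts)
    pictures_ok = all(picture_is_compliant(f) for f in picture_features)
    return "send" if texts_ok and pictures_ok else "manual_review"
```

Here the text list is assumed to already include the video text set y produced by the speech-to-text step, so that spoken content passes through the same check as written content.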

In the voice-to-text method for the 5G rich media message provided by the embodiment of the invention, video information in the 5G rich media message is obtained, a video message information set Y is constructed based on the video information, and a training set is constructed based on the video message information set Y; an improved CTC model 30 is constructed, wherein the improved CTC model 30 is formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function; the improved CTC model 30 is trained with the training set to obtain a trained improved CTC model 30; and the video message information set Y is converted into a video text set y through the trained improved CTC model 30. This solves the prior-art problem that the recognition time of the CTC model grows as noise in the voice environment increases.

Compared with the traditional text or picture auditing method, the voice-to-text method for the 5G rich media message has the following advantages.

The audit dimensions are more complete. Traditional text or picture auditing methods can only audit a single type of information, so the audited content is limited. The invention creatively uses multi-modal recognition algorithms to provide a security auditing framework covering the video, text and picture content of 5G messages, and solves the security auditing problem of 5G rich media messages as a whole.

The loss function proposed by this patent is obtained by subtracting a fluctuating entropy term from the original exponential CTC loss function, so that its values are confined to a bounded interval starting at 0, which avoids the unlimited expansion of the loss function caused by a large amount of noise in the 5G voice information. The video text set y can thus be obtained quickly and accurately from the video message information set Y.

FIGS. 2-3 are architecture diagrams of an embodiment of the voice-to-text system for 5G rich media messages according to the present invention. As shown in FIGS. 2-3, the system provided by the embodiment of the invention includes:

an acquiring module 10, configured to acquire video information in the 5G rich media message;

a first construction module 20, configured to construct a video message information set Y based on the video information, and construct a training set based on the video message information set Y; dividing the video message information set Y into a training set, a testing set and a verification set; training the improved CTC model 30 based on the training set;

performing performance verification on the improved CTC model 30 based on the verification set, and storing the improved CTC model 30 that meets the performance conditions; and evaluating the conversion results of the improved CTC model 30 based on the test set.

an improved CTC model 30, wherein the improved CTC model 30 is formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function;

the original CTC loss function is improved through formula 1:

$$L_{\text{improved}} = L_{\text{CTC}} - \alpha \, H(p(\pi \mid l, X)) \qquad (1)$$

where $L_{\text{improved}}$ is the loss function of the improved CTC model 30, $L_{\text{CTC}}$ is the original CTC loss function, $\alpha$ is the coefficient of the maximum conditional entropy regularization, and $H(p(\pi \mid l, X))$ is the entropy of the feasible paths for the given input sequence and target sequence.

Solving for H (p (pi|l, X)) by equation 2;

The training module is used for training the improved CTC model 30 through the training set to obtain a trained improved CTC model 30;

the video message information set Y is converted into a video text set y through the trained improved CTC model 30.

The 5G rich media message voice-to-text system further comprises:

the acquisition module 10 is further configured to:

acquiring a text message and a picture message in the 5G rich media message;

the first building block 20 is further configured to:

constructing a picture message information set Z based on the picture message;

the sensitive word variant recognition model 40, into which the text messages in the text message information set X are sequentially input to judge whether all the text messages in the text message information set X are compliant;

the feature extraction model 50, configured to perform feature extraction on the picture message information set Z to obtain a picture depth feature set z;

and the bad picture classification model 60, into which the pictures in the picture depth feature set z are sequentially input to judge whether all the pictures in the picture depth feature set z are compliant.

In the voice-to-text system for 5G rich media messages provided by the embodiment of the invention, the acquisition module 10 acquires the video information in the 5G rich media message; the first construction module 20 constructs a video message information set Y based on the video information and constructs a training set based on the video message information set Y; the improved CTC model 30 is formed by introducing a maximum entropy function on the basis of the CTC model to improve the original CTC loss function; the training module trains the improved CTC model 30 with the training set to obtain a trained improved CTC model 30; and the video message information set Y is converted into a video text set y through the trained improved CTC model 30. The system solves the prior-art problem that the recognition time of the CTC model grows as noise in the voice environment increases.

Fig. 5 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention, as shown in fig. 5, an electronic device 70 includes: a processor 701, a memory 702, and a bus 703;

wherein, the processor 701 and the memory 702 complete communication with each other through the bus 703;

the processor 701 is configured to invoke program instructions in the memory 702 to perform the methods provided by the above-described method embodiments, for example, including: acquiring video information in a 5G rich media message, constructing a video message information set Y based on the video information, and constructing a training set based on the video message information set Y; constructing an improved CTC model 30, wherein the improved CTC model 30 is formed by introducing a maximum entropy function on the basis of a CTC model to improve the original CTC loss function; training the improved CTC model 30 with the training set to obtain a trained improved CTC model 30; and converting the video message information set Y into a video text set y through the trained improved CTC model 30.

The present embodiment provides a non-transitory computer readable medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: acquiring video information in a 5G rich media message, constructing a video message information set Y based on the video information, and constructing a training set based on the video message information set Y; constructing an improved CTC model 30, wherein the improved CTC model 30 is formed by introducing a maximum entropy function on the basis of a CTC model to improve the original CTC loss function; training the improved CTC model 30 with the training set to obtain a trained improved CTC model 30; and converting the video message information set Y into a video text set y through the trained improved CTC model 30.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.

The apparatus embodiments described above are merely illustrative, wherein elements illustrated as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable medium such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the embodiments or the methods of some parts of the embodiments.

While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims

1. A method for converting voice of a 5G rich media message into text, the method comprising:

2. The method for voice-to-text of a 5G rich media message of claim 1, further comprising:

acquiring a text message in the 5G rich media message;

constructing a sensitive word variant recognition model;

3. The method for voice-to-text of a 5G rich media message of claim 2, further comprising:

constructing a bad picture classification model and a feature extraction model;

4. The voice-to-text method of claim 3, wherein the voice-to-text method of 5G rich media message further comprises:

5. The method for converting voice into text of 5G rich media message according to claim 1, wherein the constructing an improved CTC model, wherein the improving CTC model is to introduce a maximum entropy function based on a CTC model to improve a CTC original loss function in the CTC model, comprises:

improving the original loss function of the CTC through a formula 1;

6. The method for converting voice into text of 5G rich media message according to claim 1, wherein the constructing an improved CTC model, wherein the improving CTC model is to introduce a maximum entropy function based on a CTC model to improve a CTC original loss function in the CTC model, comprises:

solving for $H(p(\pi \mid l, X))$ by the following equation 2;

where $p(\pi \mid l, X)$ represents the conditional probability of a feasible path $\pi$ given the 5G voice information $X$ and the true output $l$;

$\log p(\pi \mid X)$ represents the logarithm of the conditional probability of a feasible path $\pi$ given the 5G voice information $X$;

and the summation term represents the sum of all output conditional probabilities of the 5G voice information $X$, regardless of whether the true output $l$ is given.

7. The method for voice-to-text of a 5G rich media message of claim 1, wherein training the modified CTC model via the training set results in a trained modified CTC model comprising:

training the improved CTC model based on the training set;

8. A 5G rich media message voice to text system comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when the computer program is executed.

10. A non-transitory computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1 to 7.