CN114125506A - Voice auditing method and device - Google Patents


Info

Publication number
CN114125506A
Authority
CN
China
Prior art keywords
voice
voice data
information
auditing
quality information
Prior art date
Legal status
Granted
Application number
CN202010887653.1A
Other languages
Chinese (zh)
Other versions
CN114125506B
Inventor
雒晓帆
余帆帆
费凡
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202010887653.1A
Publication of CN114125506A
Application granted
Publication of CN114125506B
Legal status: Active

Classifications

    • H04N21/235: Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/233: Processing of audio elementary streams
    • H04N21/4788: Supplemental services communicating with other users, e.g. chatting
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L25/12: Speech or voice analysis techniques in which the extracted parameters are prediction coefficients
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

An embodiment of the specification provides a voice auditing method and device. The voice auditing method includes: acquiring voice data to be recognized; performing text processing on the voice data to obtain text information of the voice data; performing sound quality processing on the voice data to obtain sound quality information of the voice data; and determining that the voice data passes the audit when the text information and the sound quality information meet a preset audit requirement. By acquiring the text information and sound quality information of the voice data to be recognized and using them as two criteria, the voice auditing method audits the voice data quickly and accurately, ensures that the voice data can be displayed in a video as a safe and compliant voice barrage, and improves the user's sense of participation when watching the video.

Description

Voice auditing method and device
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to a voice auditing method. One or more embodiments of the present specification also relate to a voice auditing apparatus, a computing device, and a computer-readable storage medium.
Background
A barrage is a user comment displayed in a video. It gives viewers a sense of real-time interaction and can greatly increase their interest and participation in watching the video. At present, barrages in the video field are mainly text: a video platform audits the text comments sent by viewers so that a user's comments can be shown to the streamer or to other users as text barrages. For voice barrages sent by users, however, there is currently no particularly suitable auditing scheme to ensure their compliance.
Therefore, it is desirable to provide a voice auditing method capable of quickly and accurately auditing voice barrages.
Disclosure of Invention
In view of this, the present specification provides a voice auditing method. One or more embodiments of the present specification also relate to a voice auditing apparatus, a computing device, and a computer-readable storage medium, so as to address the defect in the prior art that voice barrages cannot be audited to ensure their compliance.
According to a first aspect of embodiments of the present specification, there is provided a voice audit method, including:
acquiring voice data to be recognized;
performing text processing on the voice data to obtain text information of the voice data;
performing sound quality processing on the voice data to obtain sound quality information of the voice data;
and determining that the voice data passes the audit when the text information and the sound quality information meet a preset audit requirement.
According to a second aspect of embodiments of the present specification, there is provided a voice audit apparatus including:
the acquisition module is configured to acquire voice data to be recognized;
the text information obtaining module is configured to perform text processing on the voice data to obtain text information of the voice data;
the sound quality information obtaining module is configured to perform sound quality processing on the voice data to obtain sound quality information of the voice data;
and the auditing module is configured to determine that the voice data passes the audit when the text information and the sound quality information meet a preset audit requirement.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions, wherein the processor realizes the steps of the voice auditing method when executing the computer-executable instructions.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the voice audit method.
One embodiment of the present specification implements a voice auditing method and apparatus. The voice auditing method includes: acquiring voice data to be recognized; performing text processing on the voice data to obtain text information of the voice data; performing sound quality processing on the voice data to obtain sound quality information of the voice data; and determining that the voice data passes the audit when the text information and the sound quality information meet a preset audit requirement. By acquiring the text information and sound quality information of the voice data to be recognized and using them as two criteria, the voice auditing method audits the voice data quickly and accurately, ensures that the voice data can be displayed in a video as a safe and compliant voice barrage, and improves the user's sense of participation when watching the video.
Drawings
FIG. 1 is a system architecture diagram of a voice audit method according to an embodiment of the present specification;
FIG. 2 is a flow diagram of a method for voice auditing according to one embodiment of the present description;
FIG. 3 is a flowchart illustrating the application of the voice auditing method to auditing voice barrages in the video field according to an embodiment of the present specification;
FIG. 4 is a schematic flowchart of model training and model application in the voice auditing method according to an embodiment of the present specification;
FIG. 5 is a schematic structural diagram of a voice auditing apparatus according to an embodiment of the present specification;
FIG. 6 is a block diagram of a computing device according to an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. The specification may, however, be embodied in many forms other than those described herein, and those skilled in the art can make similar extensions without departing from its spirit and scope; the specification is therefore not limited to the specific embodiments disclosed below.
The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of the present specification, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Voice barrage: a barrage generally refers to comments that appear along the playback timeline while a video is watched, and is generally text; a voice barrage specifically refers to a barrage whose content is audio, generated by sending a voice recording.
Barrage auditing: because barrages are sent freely by users and their content is unconstrained, in order to create and maintain a healthy network environment, a video platform usually displays a barrage on the video interface only after its content has been audited and approved, so that users can browse it openly.
In this specification, a voice audit method is provided. One or more embodiments of the present specification also relate to a voice auditing apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
The voice auditing method provided by the embodiments of the present specification can be applied in any field where voice needs to be audited, such as auditing voice barrages in the video field, auditing voice barrages in the audio field, auditing voice conversations in the communications field, auditing voice messages in the self-media field, and so on. For ease of understanding, the embodiments of the present specification take the application of the voice auditing method to auditing voice barrages in the video field as an example, but are not limited to it.
In that case, the voice data to be recognized that is acquired in the voice auditing method can be understood as a voice barrage.
In specific implementations, the voice barrage of the embodiments of the present disclosure may be presented on clients such as large-screen video playing devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, e-book readers, and other display terminals.
In addition, the voice barrage of the embodiments of the present disclosure may be applied to any video or audio capable of presenting a voice barrage; for example, a voice barrage may be presented in a live video, an on-demand video, or a recorded video, and in audio for listening to songs or audiobooks online or offline.
Referring to fig. 1, fig. 1 is a system architecture diagram illustrating a voice auditing method according to an embodiment of the present specification.
In fig. 1, a user a watches a video a through a client a, in a playing interface of the video a, the user a sends a voice bullet screen through the client a, the client a transmits the voice bullet screen to a server corresponding to the video a, the server performs text processing and sound quality processing on the voice bullet screen to obtain text information and sound quality information corresponding to the voice bullet screen, the server audits the voice bullet screen according to the text information and the sound quality information corresponding to the voice bullet screen, and under the condition that it is determined that the voice bullet screen meets the audit requirement of the current video a, the server sends the voice bullet screen to a user B watching the video a through the client B at the same time and a user C watching the video a through the client C at the same time.
Referring to fig. 2, fig. 2 is a flowchart illustrating a voice auditing method according to an embodiment of the present specification, including the following steps:
step 202: and acquiring voice data to be recognized.
The voice data to be recognized may be understood as a voice barrage to be audited, and includes, but is not limited to, voice data produced in any language or dialect.
Taking the application of the voice auditing method to auditing voice barrages in the video field as an example: if the video is a live video, the voice data to be recognized may be a voice barrage generated by the user's client in real time while the user watches the live stream, and the server that acquires the voice barrage is the live video's server. In practical applications, the video is not limited to live video and may also include video on demand, recorded video, and the like.
In a specific implementation, a live video may face multiple clients, and the live video's server may receive voice barrages sent by multiple clients at the same time. In that case, the server may store the voice barrages sent by the multiple clients in sequence to form an audit queue, and then take each voice barrage to be recognized from the audit queue for subsequent auditing.
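As an illustration only, since the patent does not prescribe an implementation, such a server-side audit queue could be sketched in Python as follows; the function names and callback signatures are assumptions:

```python
import queue

def audit_worker(audit_queue: queue.Queue, audit_fn, publish_fn) -> None:
    """Drain the audit queue: audit each voice barrage in arrival order,
    and publish the ones that pass to the other viewers."""
    while True:
        client_id, voice_bytes = audit_queue.get()
        if audit_fn(voice_bytes):        # text + sound-quality checks, see below
            publish_fn(client_id, voice_bytes)
        audit_queue.task_done()

# Usage: the server enqueues each barrage as it arrives from a client.
barrages: queue.Queue = queue.Queue()
barrages.put(("client_a", b"\x00\x01"))  # placeholder audio bytes
```

The FIFO queue preserves arrival order, matching the "stored in sequence" behavior described above.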
Step 204: perform text processing on the voice data to obtain the text information of the voice data.
Auditing voice data directly is not intuitive and can be inefficient (with manual auditing, a reviewer has to listen to the audio file corresponding to the voice data bit by bit). Converting the acquired voice data to be recognized into text information before auditing therefore improves the auditing speed of the voice data. Specifically, the voice data to be recognized is converted into text information in the following manner:
the performing text processing on the voice data to obtain text information of the voice data includes:
preprocessing the voice data, and extracting the voice features of the preprocessed voice data;
inputting the voice features into an acoustic model to obtain phoneme information corresponding to the voice features;
determining characters corresponding to the phoneme information in a character library based on a preset search algorithm, and performing semantic analysis on the characters according to a language model to obtain text information of the voice data.
The preprocessing of the voice data and the extraction of the voice features of the preprocessed voice data include:
performing silence detection on the voice data, and segmenting the voice data into a plurality of voice segments at the detected silence points;
and extracting the voice features of each voice segment based on a preset feature extraction algorithm.
In practical applications, the internal structure of a long piece of voice data is complex. If the phoneme information of a whole sentence of voice data were obtained directly from the acoustic model, the model would have to account for the ordering and causal relations between all of the words in the sentence during recognition, which lowers recognition efficiency; when the voice data is spoken quickly, the recognition error rate also rises.
In the embodiments of the present specification, after the voice data to be recognized is acquired, silence detection is performed so that the voice data can be segmented at the silence points into a number of short voice segments, each of which approximates a single voice frame. The voice feature of each segment is then extracted with a preset extraction algorithm, and the acoustic model can recognize the voice feature of each segment more quickly and accurately to obtain the corresponding phoneme information. On the basis of accurate phoneme information for the voice data, the subsequent recognition of the text information of the voice data from that phoneme information is accurate as well.
In a specific implementation, the preset feature extraction algorithm includes a linear prediction cepstral coefficient (LPCC) algorithm or a mel-frequency cepstral coefficient (MFCC) algorithm.
In practice, the voice data is sound, and sound is an analog signal: its time-domain waveform only represents how sound pressure changes over time and does not characterize the voice well. Converting the sound waveform into an acoustic feature vector with the LPCC or MFCC algorithm approximates the real voice data more effectively without distorting it. Both algorithms are based on the cepstrum and agree better with the principles of human hearing, making them among the more effective voice feature extraction algorithms.
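To make this step concrete, here is a hedged sketch of silence-based segmentation plus MFCC extraction; the patent names no library, so librosa and the parameter values below are assumptions:

```python
import librosa
import numpy as np

def extract_segment_features(path: str, sr: int = 16000, n_mfcc: int = 13):
    """Split audio at silence points, then compute one MFCC vector per segment."""
    y, sr = librosa.load(path, sr=sr)
    # Silence detection: stretches quieter than top_db below the peak mark cut points.
    intervals = librosa.effects.split(y, top_db=30)
    features = []
    for start, end in intervals:
        segment = y[start:end]
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)
        features.append(mfcc.mean(axis=1))  # mean-pool frames into one vector
    return np.stack(features)
```

Mean-pooling each segment into one vector is a simplification; a production recognizer would keep the frame-level features.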
In practical applications, before the voice features are input into the acoustic model to obtain the phoneme information corresponding to the voice features, the method further includes:
acquiring a voice data sample;
performing silence detection on the voice data sample, and segmenting the voice data sample into a plurality of voice segment samples at the detected silence points;
extracting a voice feature sample of each voice segment sample based on a preset feature extraction algorithm;
and training an initial acoustic model according to the voice feature sample and the phoneme information sample corresponding to the voice feature sample to obtain the acoustic model.
The acoustic model takes the voice feature sample as input and outputs the phoneme information sample corresponding to the voice feature sample.
A phoneme is the smallest unit of speech, divided according to the natural properties of the speech: analyzed by the articulatory actions within a syllable, one action forms one phoneme.
In the embodiments of the present specification, the phoneme information is the identification of how something is pronounced; for Chinese, it is the pinyin corresponding to a Chinese character. Phoneme information may include one or more phoneme units, each corresponding to one character, and each phoneme unit may be composed of one or more pronunciation identifiers. For Chinese, the pronunciation identifiers are the initial and the final of each pinyin syllable; for example, the phoneme unit of '我' (I) is 'wo'.
Specifically, in order to obtain the phoneme information corresponding to a voice feature more quickly and accurately, the embodiments of the present specification use a deep learning model for this step: the acoustic model is a hidden Markov model (HMM), a deep neural network (DNN) model, or a convolutional neural network (CNN) model.
In a specific implementation, voice data is acquired from a pre-existing sample database, and the phoneme information corresponding to the voice features of the acquired voice data is determined by human experience in the manner described above. The voice data serves as the voice data samples, the voice features corresponding to the voice data serve as voice feature samples (the sample input data), and the phoneme information serves as phoneme information samples (the sample output data). Training samples composed of the voice feature samples and their corresponding phoneme information samples are used to train the initial acoustic model, yielding a trained acoustic model that takes voice features as input and outputs the phoneme information corresponding to them.
In practice, the more training samples there are, the better the acoustic model trains. Therefore, in the embodiments of the present specification, silence detection is performed on a voice data sample, the sample is divided into a plurality of voice segment samples at the silence points, and the voice feature sample of each segment is then extracted accurately with the preset feature extraction algorithm. Training the acoustic model on many groups of voice feature samples and their corresponding phoneme information not only strengthens the training effect of the acoustic model but also improves its recognition accuracy. The preset feature extraction algorithm is the same as in the embodiment above and is not described again here.
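Purely as an illustrative sketch, the supervised training described above could be wired up as follows; the patent leaves the model family open among HMM, DNN, and CNN, and the small feed-forward network, data shapes, and label scheme below are assumptions standing in for the DNN option:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# X: voice feature samples, one vector per voice segment (see the sketch above).
# y: phoneme information samples, e.g. indices into a pinyin inventory.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))       # placeholder feature samples
y = rng.integers(0, 10, size=200)    # placeholder phoneme-class labels

# A small feed-forward network stands in for the DNN acoustic model.
acoustic_model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
acoustic_model.fit(X, y)

phonemes = acoustic_model.predict(X[:5])  # phoneme information for new features
```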
Specifically, after the phoneme information corresponding to the voice features is obtained from the acoustic model, the character corresponding to each piece of phoneme information can be determined in the character library with a preset search algorithm; all of the characters are then input into the language model, which performs semantic analysis on them to produce semantically accurate text information for the voice data.
The preset search algorithm includes, but is not limited to, a frame-synchronous breadth-first search algorithm or a frame-asynchronous depth-first search algorithm; frame-synchronous breadth-first search includes, but is not limited to, the frame-synchronous Viterbi search algorithm, and frame-asynchronous depth-first search includes, but is not limited to, the frame-asynchronous stack search algorithm and the A* algorithm. The character library is a preset character database (it may be understood as an electronic dictionary or the like) containing the character or word corresponding to each piece of phoneme information. For example, for the phoneme information 'wo', the corresponding characters in the character library include 我 (I), 喔 (oh), and 握 (hold).
The language model is trained in advance on a large amount of text information. Based on the language model, the probability that individual characters or words are correlated can be obtained, and the model can reasonably analyze the semantics and grammar of a piece of voice data through its context structure, thereby determining the accurate text information corresponding to that voice data.
Specifically, after the phoneme information corresponding to the voice features of a piece of voice data is obtained, the preset search algorithm determines an optimal search path to quickly find the character corresponding to each piece of phoneme information in the character library, and the characters are then input into the language model for semantic analysis, yielding accurate text information for the voice data.
In the embodiments of the present specification, the segmentation and feature extraction of the voice data allow the phoneme information to be acquired accurately from the acoustic model. That accuracy in turn affects how accurately the search algorithm determines the characters for the phoneme information in the character library; and on the basis of accurate characters for the phoneme information of the voice data, the text information obtained after the language model's semantic analysis is accurate as well.
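The decoding just described can be illustrated with a toy beam search; the two-entry dictionary, the bigram scores, and the beam width are all assumptions for demonstration, not the patent's algorithm:

```python
import math

# Hypothetical character library: phoneme unit -> candidate characters.
LEXICON = {"wo": ["我", "喔", "握"], "men": ["们", "门"]}
# Hypothetical bigram language-model probabilities for adjacent characters.
BIGRAM = {("我", "们"): 0.9, ("握", "门"): 0.05}

def decode(phonemes, beam_width: int = 3, default_p: float = 1e-3) -> str:
    """Search characters for a phoneme sequence, scored by the bigram LM."""
    beams = [([], 0.0)]  # (character sequence, log probability)
    for unit in phonemes:
        candidates = []
        for seq, logp in beams:
            for char in LEXICON.get(unit, []):
                p = BIGRAM.get((seq[-1], char), default_p) if seq else default_p
                candidates.append((seq + [char], logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return "".join(beams[0][0])

print(decode(["wo", "men"]))  # -> 我们, the highest-scoring path
```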
Step 206: perform sound quality processing on the voice data to obtain the sound quality information of the voice data.
Specifically, in order to obtain the sound quality information of the voice data efficiently, a sound quality detection model trained in advance may be used to perform sound quality processing on the voice data. The specific implementation is as follows:
Performing sound quality processing on the voice data to obtain the sound quality information of the voice data includes:
inputting the voice data into a pre-trained sound quality detection model to obtain the sound quality information of the voice data.
In a specific implementation, before the voice data is input into the pre-trained sound quality detection model to obtain its sound quality information, the method further includes:
acquiring a voice data sample and the sound quality information corresponding to the voice data sample, wherein the sound quality information includes the volume, timbre, and waveform envelope of the voice data sample;
and training an initial sound quality detection model based on the voice data sample and the sound quality information corresponding to it, to obtain the sound quality detection model.
The sound quality detection model takes the voice data sample as input and outputs the sound quality information corresponding to the voice data sample.
The waveform envelope refers to the transients at the start and end of the amplitude of a single sound as it is produced, that is, the envelope of the waveform. Variations in the waveform envelope also affect the timbre of the sound.
In practical applications, voice data samples and the sound quality information corresponding to each sample are acquired from a pre-established sample database, or the voice data samples are acquired from the internet and the sound quality information corresponding to each is determined by human experience. The voice data samples and their corresponding sound quality information form the training samples, on which the initial sound quality detection model is trained to obtain the trained sound quality detection model.
In a specific implementation, auditing voice data through its text information alone is not enough: often the text information corresponding to the voice data is unobjectionable, but the sound itself may have a low-quality timbre (shrill, frightening, or the like) that could distress the users who receive the voice data. To ensure the safety of the voice data, therefore, not only the text information corresponding to the voice data but also the timbre of its sound must be audited.
In the embodiments of the present specification, because the sound quality detection model is established in advance, the sound quality information corresponding to voice data can later be obtained quickly and accurately directly from the model, which ensures that the audit of the voice data also covers its sound quality, improving quality and enhancing user experience.
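A minimal sketch of such a sound quality detection model follows; the feature triple, the random forest, and the labels are illustrative assumptions, since the patent only fixes the inputs (voice data samples) and outputs (sound quality information):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: [volume, dominant frequency, envelope attack time] for one sample
# (computed as sketched further below); labels: 1 = acceptable sound quality.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))      # placeholder sound-quality features
y = rng.integers(0, 2, size=300)   # placeholder human-assigned labels

quality_model = RandomForestClassifier(n_estimators=100).fit(X, y)
verdict = quality_model.predict(X[:1])[0]  # sound-quality verdict for new audio
```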
In another implementation of the present specification, performing sound quality processing on the voice data to obtain the sound quality information of the voice data includes:
performing sound quality processing on the voice data, and determining the amplitude and frequency spectrum of the sound of the voice data and the transients at the start and end of the amplitude;
obtaining the volume of the voice data according to the amplitude of the sound of the voice data;
obtaining the timbre of the voice data according to the frequency spectrum of the sound of the voice data;
and obtaining the waveform envelope of the voice data according to the transients at the start and end of the amplitude of the sound of the voice data.
Specifically, by acquiring the amplitude of the sound of the voice data, its frequency spectrum, the transients at the start and end of the amplitude, and so on, the sound quality information of the voice data (volume, timbre, waveform envelope) is obtained.
Before the sound quality processing, the voice data may also be denoised and otherwise processed to ensure that more accurate sound quality information of the voice data is obtained.
In the embodiments of the present specification, sound quality processing is performed on the voice data to obtain its sound waveform diagram. The volume of the voice data is obtained from the amplitude of the sound in the waveform diagram, the timbre from the frequency spectrum of the sound, and the waveform envelope from the transient forms at the start and end of the amplitude of the sound. The volume, timbre, waveform envelope, and so on then support the subsequent fast and accurate auditing of the sound of the voice data.
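As a hedged numerical sketch of these three measurements (the RMS volume, the dominant-frequency timbre proxy, and the Hilbert envelope are common estimators, not ones the patent prescribes):

```python
import numpy as np
from scipy.signal import hilbert

def sound_quality_features(y: np.ndarray, sr: int):
    """Volume from amplitude, a timbre proxy from the spectrum, and the envelope."""
    # Volume: root-mean-square amplitude of the waveform.
    volume = float(np.sqrt(np.mean(y ** 2)))
    # Timbre proxy: dominant frequency of the magnitude spectrum.
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    dominant_freq = float(freqs[np.argmax(spectrum)])
    # Waveform envelope: magnitude of the analytic signal, which traces the
    # transients at the start and end of the sound's amplitude.
    envelope = np.abs(hilbert(y))
    return volume, dominant_freq, envelope
```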
Step 208: determine that the voice data passes the audit when the text information and the sound quality information meet the preset audit requirement.
The preset audit requirement can be set according to the actual application scene. Still taking the voice auditing method as applied to auditing voice barrages in the video field: the preset audit requirement can be the requirement that the currently playing video places on text information and sound quality information; for example, the text information must not contain sensitive words from the preset sensitive word lexicon of the currently playing video, the sound quality information must match the sound quality information of the currently playing video, and so on.
Specifically, determining that the voice data passes the audit when the text information and the sound quality information meet the preset audit requirement includes:
determining that the voice data passes the audit when the text information is matched against the keywords in the preset lexicon and the sound quality information matches the preset sound quality information.
The preset lexicon may be understood as a preset sensitive word lexicon containing preset key sensitive words, such as words involving unhealthy sexual content or words with violent tendencies. The preset sound quality information may be determined by the actual application scene: in a video playing scene, it is the sound quality information of the currently playing video or sound quality information better than it; in a music playing scene, it is the sound quality information of the currently playing music or sound quality information better than it.
In practical applications, voice data with good sound quality and pleasing timbre can be collected from the internet, and big-data analysis can determine what sound quality information such high-quality voice data has. Based on different scenes, such as singing and dubbing, the sound quality information of high-quality voice data is characterized along dimensions such as gender and timbre, and a voice data sample library is established. In actual use, the sound quality information obtained from the voice data can then be matched against the high-quality sound quality information for the same scene to audit the voice data.
For example, if the application scene of the acquired voice data is a singing video, then after the sound quality information corresponding to the voice data is obtained, it is matched with the high-quality sound quality information for the singing scene in the voice data sample library (that is, the voice data is matched against the pitch, timbre, waveform envelope, or musical rhythm of the current singing video); if they match, the voice data can pass the audit.
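One way to realize such scene matching (cosine similarity against a stored reference profile is an assumption here, not the patent's method) is:

```python
import numpy as np

def matches_scene_profile(features: np.ndarray, profile: np.ndarray,
                          min_similarity: float = 0.9) -> bool:
    """Compare a barrage's sound-quality vector to the scene's reference profile."""
    cos = float(np.dot(features, profile)
                / (np.linalg.norm(features) * np.linalg.norm(profile)))
    return cos >= min_similarity

print(matches_scene_profile(np.array([0.5, 220.0, 0.1]),
                            np.array([0.6, 230.0, 0.1])))  # True
```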
In other realizable scenes, a threshold for the sound quality information may be preset, and the voice data is determined to pass the audit when its sound quality information is greater than or equal to the preset threshold.
In addition, since violating keywords are far fewer than non-violating ones, the preset audit requirement can also be set to detect whether violating keywords are absent from the text information. On this basis, a violation lexicon can be created from the preset violating keywords, and it is then judged whether the text information matches the violation lexicon and whether the sound quality information matches the preset sound quality information.
If the text information does not match the violation lexicon and the sound quality information matches the preset sound quality information, the text information and the sound quality information meet the preset audit requirement, which indicates that the voice data is compliant, and the voice data can be determined to pass the audit.
If the text information matches the violation lexicon, then regardless of whether the sound quality information matches the preset sound quality information, the text information does not meet the preset audit requirement, which indicates that the voice data contains unacceptable voice content, such as advertising or abuse. In that case the voice data can be rejected directly without further review, which greatly reduces the audit workload.
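Putting the two criteria together, a hedged end-to-end audit decision might read as follows; the lexicon contents and the stand-in sound-quality thresholds are illustrative assumptions:

```python
VIOLATION_LEXICON = {"advert", "abuse"}  # hypothetical sensitive words

def audit_barrage(text: str, volume: float, dominant_freq: float,
                  min_volume: float = 0.01, max_freq: float = 4000.0) -> bool:
    """Pass only if no violating keyword appears and the sound quality is acceptable."""
    if any(word in text for word in VIOLATION_LEXICON):
        return False  # text check fails: reject directly, no further review
    # Stand-in for matching against the preset sound quality information.
    return volume >= min_volume and dominant_freq <= max_freq

print(audit_barrage("hello everyone", volume=0.05, dominant_freq=220.0))  # True
```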
According to the voice auditing method provided by the embodiments of the present specification, the text information and sound quality information of the voice data to be recognized are acquired, and the voice data is audited quickly and accurately against these two criteria, ensuring that the voice data can be displayed in a video as a safe and compliant voice barrage and improving the user's sense of participation when watching the video.
In another embodiment of this specification, after determining that the voice data passes the audit, the method further includes:
and sending the voice data to a corresponding video playing platform.
Specifically, when the voice data passes the audit, it is sent to the corresponding video playing platform, so that other clients of the video playing platform can receive the voice data; this enables user interaction and improves user participation.
Referring to fig. 3, the voice auditing method is described further by taking its application to auditing voice barrages in the video field as an example. Fig. 3 shows a flowchart of the processing procedure of a voice auditing method provided by an embodiment of the present specification, which includes the following steps:
wherein, the voice data to be recognized is the voice barrage.
Step 302: and the client receives a voice bullet screen generated by clicking a voice bullet screen recording button on a video watching interface of the user to record voice.
Step 304: the client sends the voice bullet screen to the video server, and the video server stores the voice bullet screen after receiving the voice bullet screen.
Step 306: and the video server adds the voice barrage to an audit queue.
Step 308: the video server acquires the voice bullet screen from the audit queue, performs text processing on the voice bullet screen to acquire text information of the voice bullet screen, and performs tone quality processing on the voice bullet screen to acquire tone quality information of the voice bullet screen.
Referring to fig. 4, fig. 4 is a schematic specific flowchart of model training and model application in the voice auditing method according to an embodiment of the present disclosure.
Specifically, fig. 4 shows the video server's steps for training the acoustic model and the language model in the voice auditing method of the embodiments of the present specification, and the specific steps for converting the voice barrage into text information based on the trained acoustic model and language model.
The training of the acoustic model and the language model comprises the following steps:
the method comprises the following steps: a voice data sample is obtained from a voice database.
Step two: and segmenting and extracting the acquired voice data samples to obtain voice feature samples corresponding to the voice data samples and phoneme information samples corresponding to each voice feature sample, and forming acoustic model training data.
Step three: and training the initial acoustic model according to the acoustic model training data to obtain the trained acoustic model.
Step four: and acquiring text samples and the probability of mutual correlation of single characters or words in each text sample from a text database to form language model training data.
Step five: and training the initial language model based on the language model training data to obtain the trained language model.
The acoustic model training in the step one to the step three and the language model training in the step four to the step five are not performed in sequence, and may also be performed simultaneously, which is not limited in this specification.
The steps of converting the voice barrage into text information based on the trained acoustic model and language model are as follows:
step six: and acquiring the voice barrage, and segmenting the voice barrage into a plurality of voice fragments based on the mute point of the voice barrage.
Step seven: and extracting the voice characteristics corresponding to the voice segments based on a characteristic extraction algorithm.
Step eight: inputting the voice characteristics into a recognition network consisting of an acoustic model, a dictionary and a language model, performing voice decoding on the voice characteristics through the acoustic model to obtain phoneme information of the voice characteristics, acquiring characters corresponding to the phoneme information in the dictionary based on a search algorithm, and finally inputting the characters into the language model to obtain final text information of the voice bullet screen.
The dictionary may be understood as a preset text library, or may be other electronic dictionaries, that is, text libraries for implementing text query based on phoneme information.
Step nine: and outputting the text information to realize the follow-up examination of the text information.
Step 310: the video server audits the voice barrage based on the text information and the sound quality information of the voice barrage.
In practical applications, to further ensure the accuracy of auditing the voice barrage, while the audit of step 308 is carried out, the text information of the voice barrage can also be checked for whether its sentences read smoothly and whether its semantics are ambiguous. Text information whose sentences or semantics are unclear is screened out, the voice barrages corresponding to that text information are identified, and those barrages finally receive a deeper audit by manual listening. This not only improves audit efficiency but also strengthens the audit of these voice barrages with human experience, making the voice barrage scene more standard.
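As one hedged way to realize this routing (the per-character language-model score and the threshold are assumptions), unclear text can be diverted to the manual queue:

```python
def route_barrage(text: str, lm_logprob_per_char: float,
                  fluency_threshold: float = -6.0) -> str:
    """Send low-fluency text (poor language-model score) to manual audition."""
    if lm_logprob_per_char < fluency_threshold:
        return "manual_review"   # sentences unclear: a human listens again
    return "auto_audit"          # fluent text: automatic audit result stands

print(route_barrage("我们", lm_logprob_per_char=-1.2))  # -> auto_audit
```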
Step 312: and the video server adds the voice barrage which passes the examination to a barrage list so as to sequentially display the video viewing interface of the user client in the barrage list.
With the auditing method provided by the embodiments of the present specification, auditing the voice barrages sent by clients while a video is watched prevents advertisements, unhealthy language, or content unsuited to the currently playing video from appearing in voice barrages. Specifically, on the one hand, the voice barrage is converted into text information by artificial intelligence, and sensitive words in the sensitive word lexicon are filtered out; on the other hand, the sound quality information of the voice barrage is audited, avoiding shrill, piercing voices that make listeners uncomfortable (in practice the sound quality information can be detected quickly with a machine learning model obtained by training), so that voice barrages are filtered. To guarantee the screening quality, the voice barrages that pass the screening should be listened to again by human reviewers, further ensuring the compliance and safety of the voice barrages, building a better viewing environment for users who watch videos, and strengthening their watching and interaction experience.
Corresponding to the above method embodiment, this specification further provides an embodiment of a voice auditing apparatus, and fig. 5 shows a schematic structural diagram of a voice auditing apparatus provided in an embodiment of this specification. As shown in fig. 5, the apparatus includes:
an obtaining module 502 configured to obtain voice data to be recognized;
a text information obtaining module 504 configured to perform text processing on the voice data to obtain text information of the voice data;
a sound quality information obtaining module 506, configured to perform sound quality processing on the voice data to obtain the sound quality information of the voice data;
an auditing module 508, configured to determine that the voice data passes the audit when the text information and the sound quality information meet a preset audit requirement.
Optionally, the text information obtaining module 504 is further configured to:
preprocessing the voice data, and extracting the voice features of the preprocessed voice data;
inputting the voice features into an acoustic model to obtain phoneme information corresponding to the voice features;
determining characters corresponding to the phoneme information in a character library based on a preset search algorithm, and performing semantic analysis on the characters according to a language model to obtain text information of the voice data.
Optionally, the text information obtaining module 504 is further configured to:
performing silence detection on the voice data, and segmenting the voice data into a plurality of voice segments at the detected silence points;
and extracting the voice characteristics of each voice segment based on a preset characteristic extraction algorithm.
Optionally, the apparatus further includes:
a first sample acquisition module configured to acquire a voice data sample;
a segmentation module configured to perform silence detection on the voice data sample and segment the voice data sample into a plurality of voice segment samples at the detected silence points;
the extraction module is configured to extract a voice feature sample of each voice segment sample based on a preset feature extraction algorithm;
and the acoustic model training module is configured to train an initial acoustic model according to the voice feature sample and the phoneme information sample corresponding to the voice feature sample to obtain the acoustic model.
Optionally, the sound quality information obtaining module 506 is further configured to:
and inputting the voice data into a pre-trained voice quality detection model to obtain voice quality information of the voice data.
Optionally, the apparatus further includes:
the second sample acquisition module is configured to acquire a voice data sample and the sound quality information corresponding to the voice data sample, wherein the sound quality information includes the volume, timbre, and waveform envelope of the voice data sample;
and the sound quality detection model training module is configured to train an initial sound quality detection model based on the voice data sample and the sound quality information corresponding to it, to obtain the sound quality detection model.
Optionally, the sound quality information obtaining module 506 is further configured to:
performing sound quality processing on the voice data, and determining the amplitude and frequency spectrum of the sound of the voice data and the transients at the start and end of the amplitude;
obtaining the volume of the voice data according to the amplitude of the sound of the voice data;
obtaining the timbre of the voice data according to the frequency spectrum of the sound of the voice data;
and obtaining the waveform envelope of the voice data according to the transients at the start and end of the amplitude of the sound of the voice data.
Optionally, the auditing module 508 is further configured to:
and determining that the voice data passes the audit under the condition that the text information is matched with the keywords in the preset word bank and the tone quality information is matched with the preset tone quality information.
Optionally, the apparatus further includes:
and the sending module is configured to send the voice data to a corresponding video playing platform.
Optionally, the preset feature extraction algorithm includes a linear prediction cepstrum coefficient algorithm or a mel-frequency cepstrum coefficient algorithm.
The voice auditing device provided by the embodiments of the present specification acquires the text information and the sound quality information of the voice data to be recognized and audits the voice data quickly and accurately using these two criteria, ensuring that the voice data can be displayed in a video as a safe and compliant voice barrage and improving the user's sense of participation when watching the video.
The foregoing is a schematic scheme of a voice auditing apparatus according to this embodiment. It should be noted that the technical solution of the voice auditing apparatus and the technical solution of the voice auditing method belong to the same concept, and details that are not described in detail in the technical solution of the voice auditing apparatus can be referred to the description of the technical solution of the voice auditing method.
FIG. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present description. The components of the computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to store data.
Computing device 600 also includes an access device 640 that enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 640 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 6 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
The processor 620 is configured to execute computer-executable instructions, wherein the processor implements the steps of the voice auditing method when executing the computer-executable instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the voice auditing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the voice auditing method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the voice auditing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned voice auditing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned voice auditing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added to or removed from according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the embodiments are not limited by the order of the acts described, because some steps may be performed in other orders or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by every embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in describing the specification. The alternative embodiments are not described exhaustively, and the invention is not limited to the precise embodiments disclosed; obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to understand and utilize the embodiments. The specification is limited only by the claims and their full scope and equivalents.

Claims (13)

1. A voice auditing method, comprising:
acquiring voice data to be recognized;
performing text processing on the voice data to obtain text information of the voice data;
performing voice quality processing on the voice data to obtain voice quality information of the voice data;
and determining that the voice data passes the audit under the condition that the text information and the voice quality information meet the preset audit requirement.
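For illustration, a minimal Python sketch of the claim 1 flow follows. The transcribe() and analyze_quality() helpers are hypothetical stand-ins for the text-processing and voice-quality-processing steps (detailed in claims 2-7), and the volume bounds are assumed values, not the patented requirement.

```python
import numpy as np

# Hypothetical preset requirement: a non-empty transcript plus a volume
# inside an assumed acceptable range.
MIN_VOLUME, MAX_VOLUME = 0.01, 0.9

def transcribe(samples: np.ndarray) -> str:
    # Stand-in for the text-processing pipeline of claims 2-3.
    return "hello world"

def analyze_quality(samples: np.ndarray) -> dict:
    # Stand-in for the voice-quality pipeline of claims 5-7; volume as RMS.
    return {"volume": float(np.sqrt(np.mean(samples ** 2)))}

def audit_voice(samples: np.ndarray) -> bool:
    text_info = transcribe(samples)          # text processing step
    quality_info = analyze_quality(samples)  # voice quality processing step
    # Pass only when both kinds of information meet the preset requirement.
    return bool(text_info) and MIN_VOLUME <= quality_info["volume"] <= MAX_VOLUME

print(audit_voice(0.1 * np.random.randn(16000)))  # ~1 s of audio at 16 kHz
```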
2. The voice auditing method according to claim 1, wherein performing text processing on the voice data to obtain the text information of the voice data comprises:
preprocessing the voice data, and extracting voice features of the preprocessed voice data;
inputting the voice features into an acoustic model to obtain phoneme information corresponding to the voice features;
determining characters corresponding to the phoneme information in a character library based on a preset search algorithm, and performing semantic analysis on the characters according to a language model to obtain text information of the voice data.
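A toy sketch of the claim 2 decoding chain, reduced to its simplest form: frame-level acoustic scores are collapsed into a phoneme sequence, which is then looked up in a word library. The phoneme set, lexicon, and greedy collapse are illustrative assumptions; a real system would use beam search and a language model for the semantic analysis step.

```python
import numpy as np

PHONEMES = ["h", "eh", "l", "ow", "_"]        # "_" marks blank/silence
LEXICON = {("h", "eh", "l", "ow"): "hello"}   # toy character/word library

def greedy_phonemes(scores: np.ndarray) -> list:
    """Collapse frame-level acoustic scores (frames x phonemes) into a
    phoneme sequence by taking the best phoneme per frame and merging
    repeats and blanks."""
    seq, prev = [], None
    for i in scores.argmax(axis=1):
        p = PHONEMES[i]
        if p != prev and p != "_":
            seq.append(p)
        prev = p
    return seq

rng = np.random.default_rng(1)
scores = rng.random((20, len(PHONEMES)))      # fake acoustic-model output
phonemes = greedy_phonemes(scores)
print(phonemes, "->", LEXICON.get(tuple(phonemes), "<unknown>"))
```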
3. The voice auditing method according to claim 2, wherein preprocessing the voice data and extracting the voice features of the preprocessed voice data comprises:
performing silence point detection on the voice data, and segmenting the voice data into a plurality of voice segments according to the detected silence points;
and extracting the voice characteristics of each voice segment based on a preset characteristic extraction algorithm.
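One plausible reading of the claim 3 preprocessing, assuming the librosa library: non-silent intervals are found by an energy threshold (the silence points lie between them), and MFCC features, one of the claim 10 options, are extracted per segment. The 30 dB threshold and 13 coefficients are assumed defaults.

```python
import librosa
import numpy as np

def segment_and_extract(y: np.ndarray, sr: int, n_mfcc: int = 13):
    # Silence point detection: keep intervals whose energy is within 30 dB
    # of the peak; the gaps between them are treated as silence points.
    intervals = librosa.effects.split(y, top_db=30)
    features = []
    for start, end in intervals:
        # MFCCs as the preset feature extraction algorithm (claim 10).
        features.append(librosa.feature.mfcc(y=y[start:end], sr=sr, n_mfcc=n_mfcc))
    return intervals, features

# Synthetic example: a tone, half a second of silence, then another tone.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
y = np.concatenate([tone, np.zeros(sr // 2, dtype=np.float32), tone])
intervals, features = segment_and_extract(y, sr)
print(len(features), "voice segments;", features[0].shape, "MFCC matrix each")
```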
4. The voice auditing method according to claim 2 or 3, further comprising, before inputting the voice features into the acoustic model to obtain the phoneme information corresponding to the voice features:
acquiring a voice data sample;
performing silence point detection on the voice data sample, and segmenting the voice data sample into a plurality of voice segment samples according to the detected silence points;
extracting a voice feature sample of each voice segment sample based on a preset feature extraction algorithm;
and training an initial acoustic model according to the voice feature sample and the phoneme information sample corresponding to the voice feature sample to obtain the acoustic model.
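A toy sketch of the claim 4 training step, assuming PyTorch: a frame-level classifier is fitted to map voice feature samples to phoneme labels. The network shape, optimizer, and synthetic data are illustrative assumptions; the patent does not prescribe a particular model family.

```python
import torch
import torch.nn as nn

N_FEATURES, N_PHONEMES = 13, 40        # e.g. 13 MFCCs, 40 phoneme classes

model = nn.Sequential(                 # the "initial acoustic model"
    nn.Linear(N_FEATURES, 128),
    nn.ReLU(),
    nn.Linear(128, N_PHONEMES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Voice feature samples and their corresponding phoneme information samples
# (synthetic here; in practice they come from segmented, labeled speech).
features = torch.randn(256, N_FEATURES)
labels = torch.randint(0, N_PHONEMES, (256,))

for epoch in range(5):                 # training yields "the acoustic model"
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
print("final loss:", round(loss.item(), 3))
```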
5. The voice auditing method according to claim 1, 2 or 3, wherein performing voice quality processing on the voice data to obtain the voice quality information of the voice data comprises:
and inputting the voice data into a pre-trained voice quality detection model to obtain voice quality information of the voice data.
6. The voice auditing method according to claim 5, further comprising, before inputting the voice data into the pre-trained voice quality detection model to obtain the voice quality information of the voice data:
acquiring a voice data sample and voice quality information corresponding to the voice data sample, wherein the voice quality information comprises the volume, tone and waveform envelope of the voice data sample;
and training an initial voice quality detection model based on the voice data sample and the voice quality information corresponding to the voice data sample to obtain the voice quality detection model.
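An illustrative sketch of the claim 6 training step, assuming scikit-learn: a regressor is fitted from simple per-clip statistics to the three quality targets named in claim 6 (volume, tone, waveform envelope, each reduced to a scalar here). The inputs, targets, and model choice are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
clips = rng.normal(size=(100, 16000))          # 100 synthetic voice samples

# Per-clip input statistics standing in for whatever the model consumes.
X = np.stack([clips.std(axis=1), np.abs(clips).max(axis=1)], axis=1)

# Quality targets per claim 6, reduced to one scalar each.
volume = np.sqrt((clips ** 2).mean(axis=1))                     # RMS volume
tone = np.abs(np.fft.rfft(clips, axis=1)).argmax(axis=1) * 1.0  # dominant bin
envelope = np.abs(clips)[:, :160].mean(axis=1)                  # early envelope
y = np.stack([volume, tone, envelope], axis=1)

quality_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(quality_model.predict(X[:1]))            # [volume, tone, envelope]
```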
7. The voice auditing method according to claim 1, 2 or 3, wherein performing voice quality processing on the voice data to obtain the voice quality information of the voice data comprises:
performing voice quality processing on the voice data, and determining the amplitude of the sound, the frequency spectrum, and the transients at the start and end of the amplitude of the sound of the voice data;
obtaining the volume of the voice data according to the amplitude of the sound of the voice data;
obtaining the tone of the voice data according to the frequency spectrum of the voice data;
and obtaining the waveform envelope of the voice data according to the transients at the start and end of the amplitude of the sound of the voice data.
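A plain-NumPy sketch of the claim 7 computations: volume from the RMS amplitude, tone from the dominant frequency of the spectrum, and the waveform envelope (attack and decay transients) from short-window peak amplitudes at the start and end of the signal. The 10 ms window size is an assumed value.

```python
import numpy as np

def voice_quality_info(y: np.ndarray, sr: int) -> dict:
    volume = float(np.sqrt(np.mean(y ** 2)))   # amplitude (RMS) -> volume
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    tone = float(freqs[np.argmax(spectrum)])   # dominant frequency -> tone
    # Waveform envelope from 10 ms peak-amplitude windows; the attack and
    # decay transients are the windows at the start and end of the signal.
    win = sr // 100
    n = len(y) // win
    envelope = np.abs(y[:n * win]).reshape(n, win).max(axis=1)
    return {"volume": volume, "tone_hz": tone,
            "attack": float(envelope[0]), "decay": float(envelope[-1])}

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr) * np.linspace(0.0, 1.0, sr)
print(voice_quality_info(y, sr))               # rising envelope: quiet attack
```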
8. The voice auditing method according to claim 1, 2 or 3, wherein determining that the voice data passes the audit under the condition that the text information and the voice quality information meet the preset audit requirement comprises:
and determining that the voice data passes the audit under the condition that the text information matches keywords in a preset word bank and the voice quality information matches preset voice quality information.
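A compact sketch of the claim 8 decision rule: the audit passes only when the recognized text matches the preset word bank and the measured quality falls within a preset quality profile. The word bank and numeric bounds here are hypothetical.

```python
PRESET_WORD_BANK = {"welcome", "subscribe"}                 # hypothetical
PRESET_QUALITY = {"volume": (0.05, 0.8), "tone_hz": (80.0, 1000.0)}

def passes_audit(text_info: str, quality_info: dict) -> bool:
    keyword_match = any(w in text_info for w in PRESET_WORD_BANK)
    quality_match = all(lo <= quality_info[k] <= hi
                        for k, (lo, hi) in PRESET_QUALITY.items())
    return keyword_match and quality_match

print(passes_audit("welcome to the channel", {"volume": 0.2, "tone_hz": 220.0}))
```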
9. The voice auditing method according to claim 1, 2 or 3, further comprising, after determining that the voice data passes the audit:
and sending the voice data to a corresponding video playing platform.
10. The voice auditing method according to claim 3, wherein the preset feature extraction algorithm comprises a linear prediction cepstral coefficient (LPCC) algorithm or a Mel-frequency cepstral coefficient (MFCC) algorithm.
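A sketch contrasting the two claim 10 options, assuming librosa: MFCCs come directly from librosa.feature.mfcc, while LPCCs are derived here from librosa's LPC coefficients via the standard LPC-to-cepstrum recursion. The order of 12 and the 13 coefficients are conventional defaults, not values from the patent.

```python
import librosa
import numpy as np

def lpcc(y: np.ndarray, order: int = 12) -> np.ndarray:
    a = librosa.lpc(y, order=order)            # [1, a_1, ..., a_p] of A(z)
    c = np.zeros(order + 1)
    for n in range(1, order + 1):              # cepstrum of 1/A(z)
        c[n] = -a[n] - sum(k * c[k] * a[n - k] for k in range(1, n)) / n
    return c[1:]                               # LPCC coefficients c_1..c_p

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
print("MFCC:", mfcc.shape, "LPCC:", lpcc(y).shape)
```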
11. A voice auditing apparatus, comprising:
an acquisition module, configured to acquire voice data to be recognized;
a text information obtaining module, configured to perform text processing on the voice data to obtain text information of the voice data;
a voice quality information obtaining module, configured to perform voice quality processing on the voice data to obtain voice quality information of the voice data;
and an auditing module, configured to determine that the voice data passes the audit under the condition that the text information and the voice quality information meet the preset audit requirement.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, wherein the processor, when executing the computer-executable instructions, implements the steps of the voice auditing method of any one of claims 1 to 10.
13. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the voice auditing method of any one of claims 1 to 10.
CN202010887653.1A 2020-08-28 2020-08-28 Voice auditing method and device Active CN114125506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010887653.1A CN114125506B (en) 2020-08-28 2020-08-28 Voice auditing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010887653.1A CN114125506B (en) 2020-08-28 2020-08-28 Voice auditing method and device

Publications (2)

Publication Number Publication Date
CN114125506A true CN114125506A (en) 2022-03-01
CN114125506B (en) 2024-03-19

Family

ID=80375148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010887653.1A Active CN114125506B (en) 2020-08-28 2020-08-28 Voice auditing method and device

Country Status (1)

Country Link
CN (1) CN114125506B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114329075A (en) * 2022-03-15 2022-04-12 飞狐信息技术(天津)有限公司 Method and device for determining playing page
CN114666618B (en) * 2022-03-15 2023-10-13 广州欢城文化传媒有限公司 Audio auditing method, device, equipment and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679729A (en) * 2015-02-13 2015-06-03 广州市讯飞樽鸿信息技术有限公司 Recorded message effective processing method and system
CN107480152A (en) * 2016-06-08 2017-12-15 北京新岸线网络技术有限公司 A kind of audio analysis and search method and system
CN109065069A (en) * 2018-10-10 2018-12-21 广州市百果园信息技术有限公司 A kind of audio-frequency detection, device, equipment and storage medium
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information
CN110991427A (en) * 2019-12-25 2020-04-10 北京百度网讯科技有限公司 Emotion recognition method and device for video and computer equipment
CN111402887A (en) * 2018-12-17 2020-07-10 北京未来媒体科技股份有限公司 Method and device for escaping characters by voice
CN111462735A (en) * 2020-04-10 2020-07-28 网易(杭州)网络有限公司 Voice detection method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN114125506B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
US11475897B2 (en) Method and apparatus for response using voice matching user category
CN110517689B (en) Voice data processing method, device and storage medium
Tagliasacchi et al. Pre-training audio representations with self-supervision
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN109256133A (en) A kind of voice interactive method, device, equipment and storage medium
CN112735371B (en) Method and device for generating speaker video based on text information
Kopparapu Non-linguistic analysis of call center conversations
CN112116903A (en) Method and device for generating speech synthesis model, storage medium and electronic equipment
Wang et al. Comic-guided speech synthesis
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN114125506B (en) Voice auditing method and device
CN112185363A (en) Audio processing method and device
Xin et al. Exploring the effectiveness of self-supervised learning and classifier chains in emotion recognition of nonverbal vocalizations
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
US20140074478A1 (en) System and method for digitally replicating speech
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN112235183B (en) Communication message processing method and device and instant communication client
WO2022041192A1 (en) Voice message processing method and device, and instant messaging client
Gao Audio deepfake detection based on differences in human and machine generated speech
Li et al. Audio-Journey: Efficient Visual+LLM-aided Audio Encodec Diffusion
Mahum et al. Text to speech synthesis using deep learning
Noriy et al. EMNS/Imz/Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels
Larisa et al. Speech Emotion Recognition Using 1D/2D Convolutional Neural Networks
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
Liu et al. M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant