CN114125506B - Voice auditing method and device - Google Patents

Voice auditing method and device

Info

Publication number
CN114125506B
Authority
CN
China
Prior art keywords
voice
voice data
auditing
information
quality information
Prior art date
Legal status
Active
Application number
CN202010887653.1A
Other languages
Chinese (zh)
Other versions
CN114125506A (en)
Inventor
雒晓帆
余帆帆
费凡
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202010887653.1A
Publication of CN114125506A
Application granted
Publication of CN114125506B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/233 Processing of audio elementary streams
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/4788 Supplemental services communicating with other users, e.g. chatting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L25/12 Speech or voice analysis techniques characterised by the extracted parameters being prediction coefficients
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

Embodiments of this specification provide a voice auditing method and apparatus. The voice auditing method comprises: acquiring voice data to be recognized; performing text processing on the voice data to obtain text information of the voice data; performing sound-quality processing on the voice data to obtain sound-quality information of the voice data; and determining that the voice data passes the audit when both the text information and the sound-quality information meet preset audit requirements. By obtaining the text information and sound-quality information of the voice data to be recognized and judging the voice data against these two criteria, the method audits the voice data quickly and accurately, ensuring that only safe, compliant voice barrages are displayed in the video and improving users' sense of participation when watching it.

Description

Voice auditing method and device
Technical Field
Embodiments of this specification relate to the field of computer technology, and in particular to a voice auditing method. One or more embodiments of this specification also relate to a voice auditing apparatus, a computing device, and a computer-readable storage medium.
Background
A barrage is a user comment displayed over a video. In the video domain, barrages give viewers a sense of real-time interaction, which greatly increases their interest and engagement while watching. At present, barrages are mainly text: a video platform reviews the text comments sent by viewers and, once they pass review, displays them to the host or other users as text barrages. For the voice barrages sent by users, however, no well-suited review scheme currently exists to ensure their compliance.
It is therefore desirable to provide a voice auditing method that can audit voice barrages quickly and accurately.
Disclosure of Invention
In view of this, embodiments of this specification provide a voice auditing method. One or more embodiments of this specification also relate to a voice auditing apparatus, a computing device, and a computer-readable storage medium, so as to address the technical defect that, in the prior art, voice barrages cannot be audited to ensure their compliance.
According to a first aspect of embodiments of the present disclosure, there is provided a voice auditing method, including:
acquiring voice data to be recognized;
performing text processing on the voice data to obtain text information of the voice data;
performing sound-quality processing on the voice data to obtain sound-quality information of the voice data;
and determining that the voice data passes the audit when the text information and the sound-quality information meet preset audit requirements.
According to a second aspect of embodiments of the present specification, there is provided a voice auditing apparatus, comprising:
an acquisition module configured to acquire voice data to be recognized;
a text information obtaining module configured to perform text processing on the voice data to obtain text information of the voice data;
a sound-quality information obtaining module configured to perform sound-quality processing on the voice data to obtain sound-quality information of the voice data;
and an auditing module configured to determine that the voice data passes the audit when the text information and the sound-quality information meet preset audit requirements.
According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor, when executing the computer-executable instructions, implements the steps of the voice auditing method.
According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the voice auditing method.
One embodiment of this specification provides a voice auditing method and apparatus. The voice auditing method comprises: acquiring voice data to be recognized; performing text processing on the voice data to obtain text information of the voice data; performing sound-quality processing on the voice data to obtain sound-quality information of the voice data; and determining that the voice data passes the audit when the text information and the sound-quality information meet preset audit requirements. By obtaining the text information and sound-quality information of the voice data to be recognized and judging the voice data against these two criteria, the method audits the voice data quickly and accurately, ensuring that only safe, compliant voice barrages are displayed in the video and improving users' sense of participation when watching it.
Drawings
FIG. 1 is a system architecture diagram of a voice auditing method according to an embodiment of this specification;
FIG. 2 is a flowchart of a voice auditing method according to an embodiment of this specification;
FIG. 3 is a flowchart of a voice auditing method applied to auditing voice barrages in the video domain according to an embodiment of this specification;
FIG. 4 is a schematic flowchart of model training and model application in the voice auditing method according to an embodiment of this specification;
FIG. 5 is a schematic structural diagram of a voice auditing apparatus according to an embodiment of this specification;
FIG. 6 is a block diagram of a computing device according to an embodiment of this specification.
Detailed Description
In the following description, numerous specific details are set forth to facilitate a thorough understanding of this specification. This specification, however, can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its substance; this specification is therefore not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second", and similarly "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, terms related to one or more embodiments of the present specification will be explained.
Voice barrage: a barrage (bullet screen) generally refers to comments, usually text, that appear over a video along its playback timeline while the video is being watched; a voice barrage specifically refers to a barrage whose content is audio sent by voice.
Barrage auditing: because barrages are sent by users at will and their content is unrestricted, a video platform normally reviews barrage content before displaying it on the video interface for open browsing, in order to create and maintain a healthy network environment.
This specification provides a voice auditing method. One or more embodiments of this specification also relate to a voice auditing apparatus, a computing device, and a computer-readable storage medium, each of which is described in detail in the following embodiments.
The voice auditing method provided by the embodiments of this specification can be applied in any field where voice needs to be audited, for example auditing voice barrages in the video domain, auditing voice barrages in the audio domain, auditing voice dialogues in the communications domain, and auditing voice messages in the self-media domain. For ease of understanding, the embodiments of this specification are described in detail by taking the application of the voice auditing method to auditing voice barrages in the video domain as an example, though the method is not limited to that use.
When the voice auditing method is applied to auditing voice barrages in the video domain, the voice data to be recognized acquired in the method can be understood as a voice barrage.
In particular, the voice barrage of the embodiments of this specification may be presented on clients such as large video playback devices, game consoles, desktop computers, smart phones, tablets, MP3 (MPEG-1 Audio Layer III) players, MP4 (MPEG-4 Part 14) players, laptop computers, e-book readers, and other display terminals.
In addition, the voice barrage of the embodiments of this specification may appear in any video or audio that can present a voice barrage; for example, a voice barrage may be presented in live, on-demand, or recorded video, or in audio such as songs and audiobooks played online or offline.
Referring to FIG. 1, FIG. 1 illustrates a system architecture diagram of a voice auditing method according to an embodiment of this specification.
In FIG. 1, user A watches video A through client A and sends a voice barrage from the playback interface of video A. Client A transmits the voice barrage to the server corresponding to video A. The server performs text processing and sound-quality processing on the voice barrage to obtain its text information and sound-quality information, and audits the voice barrage based on both. When it determines that the voice barrage meets the audit requirements of the currently playing video A, the server sends the voice barrage to user B, who is watching video A through client B, and to user C, who is watching video A through client C.
Referring to FIG. 2, FIG. 2 shows a flowchart of a voice auditing method according to an embodiment of this specification, which includes the following steps:
Step 202: acquire voice data to be recognized.
The voice data to be recognized can be understood as a voice barrage to be recognized, including but not limited to voice data produced in any language or dialect.
Taking the auditing of voice barrages in the video domain as an example, if the video is a live video, the voice data to be recognized may be a voice barrage that the client generates in real time while the user watches the live broadcast, and the server that acquires the voice barrage is the live video's server. In practice the video is not limited to live video and may also include on-demand video, recorded video, and the like.
In particular, a live video may face multiple clients, so its server may receive voice barrages sent by multiple clients at the same time. In that case the server can store the incoming voice barrages in order to form an audit queue, and then fetch each voice barrage to be recognized from the queue for subsequent auditing, as sketched below.
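By way of illustration only, and not as part of the claimed method, the following Python sketch shows one way a live-video server might buffer incoming voice barrages into a first-in-first-out audit queue; the names `Barrage`, `audit_queue`, and the handler functions are assumptions introduced here.

```python
# Minimal sketch, assuming a single-process server; all names below are
# illustrative and do not come from the patent.
import queue
from dataclasses import dataclass

@dataclass
class Barrage:
    client_id: str
    audio: bytes              # raw voice data as received from the client

audit_queue: "queue.Queue[Barrage]" = queue.Queue()

def on_barrage_received(client_id: str, audio: bytes) -> None:
    """Called once per incoming barrage; arrival order is preserved."""
    audit_queue.put(Barrage(client_id, audio))

def next_barrage_to_audit() -> Barrage:
    """The auditing worker pulls barrages in FIFO order for review."""
    return audit_queue.get()
```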
Step 204: and carrying out text processing on the voice data to obtain text information of the voice data.
Auditing voice data directly is not as intuitive as auditing text and may be inefficient (a manual audit would require listening to the audio file corresponding to the voice data). Converting the acquired voice data to be recognized into text information for auditing therefore speeds up the audit. Specifically, the voice data to be recognized is converted into text information as follows.
Performing text processing on the voice data to obtain the text information of the voice data includes:
preprocessing the voice data and extracting voice features from the preprocessed voice data;
inputting the voice features into an acoustic model to obtain phoneme information corresponding to the voice features;
determining, based on a preset search algorithm, the characters corresponding to the phoneme information in a text library, and performing semantic analysis on the characters according to a language model to obtain the text information of the voice data.
Preprocessing the voice data and extracting voice features from the preprocessed voice data includes:
detecting silence points in the voice data and segmenting the voice data into multiple voice segments at the silence points;
extracting the voice features of each voice segment based on a preset feature extraction algorithm.
In practice, longer voice data has a complex internal structure. If the phoneme information of a whole sentence of voice data were obtained directly from the acoustic model, the model would have to account for the ordering and causal relations among all the words in the sentence during recognition; recognition would be inefficient, and the error rate would be high when the speech rate is fast.
In the embodiments of this specification, after the voice data to be recognized is acquired, silence-point detection is performed on it, and the voice data is segmented at the silence points into multiple short voice segments, so that each segment effectively becomes one speech frame. The voice features of each segment are then extracted with a preset extraction algorithm, after which the acoustic model can recognize the features of each segment faster and more accurately to obtain the corresponding phoneme information.
In specific implementation, the preset feature extraction algorithm includes a linear prediction cepstral coefficient algorithm or a mel frequency cepstral coefficient algorithm.
In practice, voice data is sound, and sound is analog: its time-domain waveform only describes how sound pressure varies over time and represents the characteristics of speech poorly. The speech waveform is therefore converted into acoustic feature vectors by the linear prediction cepstral coefficient (LPCC) algorithm or the mel-frequency cepstral coefficient (MFCC) algorithm, which characterize the voice data effectively and without distortion. Both algorithms are cepstrum-based and conform to human auditory principles, which makes them effective speech feature extraction algorithms. A sketch of this preprocessing appears below.
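As a hedged illustration of the preprocessing just described, the sketch below splits speech at silence points and extracts MFCC features per segment using the librosa library; the 16 kHz sample rate, the `top_db` silence threshold, and `n_mfcc=13` are illustrative choices, not values specified by the patent.

```python
# Sketch only: silence-point segmentation followed by per-segment MFCC
# extraction. Parameter values are assumptions, not the patent's.
import librosa

def segment_and_extract(path: str, top_db: float = 30.0):
    y, sr = librosa.load(path, sr=16000)          # mono, 16 kHz
    # Regions quieter than `top_db` below the peak are treated as silence
    # points; `split` returns the non-silent intervals between them.
    intervals = librosa.effects.split(y, top_db=top_db)
    features = []
    for start, end in intervals:
        segment = y[start:end]
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)
        features.append(mfcc.T)                   # (frames, 13) per segment
    return features
```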
In practice, before the voice features are input into the acoustic model to obtain the corresponding phoneme information, the method further includes:
acquiring voice data samples;
detecting silence points in the voice data samples and segmenting each sample into multiple voice segment samples at the silence points;
extracting the voice feature sample of each voice segment sample based on a preset feature extraction algorithm;
training an initial acoustic model according to the voice feature samples and the phoneme information samples corresponding to them to obtain the acoustic model.
The acoustic model takes a voice feature sample as input and outputs the phoneme information sample corresponding to it.
A phoneme is the smallest phonetic unit obtained by dividing speech according to its natural attributes. It is analyzed in terms of the articulatory actions within a syllable, with one action constituting one phoneme.
In the embodiments of this specification, phoneme information is information representing the identifier composition of a pronunciation; for Chinese, for example, the phoneme information is the pinyin corresponding to the characters. Phoneme information may include one or more phoneme units, each corresponding to one character, and each phoneme unit may be composed of the identifiers of one or more pronunciations. For Chinese, the pronunciation identifiers are the initial and final of each pinyin syllable; for example, the phoneme unit corresponding to the Chinese character for "I" is "wo".
Specifically, to obtain the phoneme information corresponding to the voice features faster and more accurately, the embodiments of this specification use a deep learning model: the acoustic model may be a hidden Markov model (HMM), a DNN-HMM model (a deep neural network combined with an HMM), or a convolutional neural network (CNN).
In a specific implementation, voice data is acquired from a pre-existing sample database, the voice features of the acquired voice data are determined from human experience, and the corresponding phoneme information is determined for those features in the manner described above. The voice data serve as the voice data samples, the corresponding voice features as the voice feature samples (the sample input data), and the phoneme information as the phoneme information samples (the sample output data). The voice feature samples, together with the phoneme information sample corresponding to each of them, form the training samples on which the initial acoustic model is trained. The trained acoustic model takes voice features as input and outputs the phoneme information corresponding to them.
In practice, the more training samples there are, the better the acoustic model trains. In the embodiments of this specification, silence-point detection is performed on the voice data samples, each sample is segmented at the silence points into multiple voice segment samples, and the voice feature sample of each segment is then extracted accurately with the preset feature extraction algorithm. Training the acoustic model on many groups of voice feature samples and their corresponding phoneme information strengthens the training effect and improves the model's recognition accuracy. The preset feature extraction algorithm is the same as in the above embodiment and is not repeated here. A hedged training sketch follows.
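The patent names HMM, DNN-HMM, and CNN acoustic models; purely as a simplified stand-in, the sketch below trains a small feed-forward classifier that maps per-frame speech features to phoneme IDs. The feature and phoneme-set sizes and the training interface are assumptions.

```python
# Illustrative training step for an acoustic model: a feed-forward
# classifier from MFCC frames to phoneme IDs (a stand-in for the HMM,
# DNN-HMM, or CNN models named in the text).
import torch
from torch import nn

N_MFCC, N_PHONEMES = 13, 60       # assumed feature and phoneme-set sizes

model = nn.Sequential(
    nn.Linear(N_MFCC, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_PHONEMES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(frames: torch.Tensor, phoneme_ids: torch.Tensor) -> float:
    """frames: (N, 13) voice feature samples; phoneme_ids: (N,) labels."""
    optimizer.zero_grad()
    loss = loss_fn(model(frames), phoneme_ids)
    loss.backward()
    optimizer.step()
    return loss.item()
```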
Specifically, after the phoneme information corresponding to the voice features is obtained from the acoustic model, the characters corresponding to each piece of phoneme information can be determined in a text library based on a preset search algorithm; all the characters are then fed into the language model, which performs semantic analysis on them to obtain semantically accurate text information for the voice data.
The preset search algorithm includes a time-synchronous breadth-first search algorithm or a time-asynchronous depth-first search algorithm: the former includes the frame-synchronous Viterbi search algorithm, and the latter includes the frame-asynchronous stack search algorithm and the A* algorithm. The text library is a preset text database, which can be understood as an electronic dictionary or the like, containing the characters or words corresponding to each piece of phoneme information; for example, for the phoneme information "wo", the corresponding characters in the text library include 我 (I), 窝 (nest), 握 (hold), and so on.
The language model is trained in advance on a large amount of text. From it, the probability that individual characters or words are associated with one another can be obtained, and through context it can reasonably analyze the semantics and grammar of a passage of voice data so as to determine the accurate text information corresponding to it.
Specifically, after the phoneme information corresponding to the voice features of a piece of voice data is obtained, an optimal search path is determined by the preset search algorithm, the characters corresponding to each piece of phoneme information are quickly found in the text library, and those characters are fed into the language model for semantic analysis to obtain accurate text information for the voice data.
In the embodiments of this specification, segmenting the voice data and extracting its features allows the acoustic model to obtain the phoneme information accurately; that accuracy in turn affects how accurately the search algorithm determines the characters for the phoneme information in the text library, and the accuracy of those characters underpins the accuracy of the text information produced by the language model's semantic analysis. A minimal decoding sketch follows.
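To make the decoding step concrete, the sketch below looks up candidate characters for each phoneme unit in a toy text library and lets a language-model score pick the most plausible sentence. The two-entry lexicon, the bigram scores, and the brute-force enumeration (standing in for the Viterbi or A* search named above) are all assumptions.

```python
# Toy decoding sketch: text-library lookup plus language-model rescoring.
# Brute-force enumeration stands in for Viterbi/A* search; the lexicon and
# bigram scores are made up for illustration.
from itertools import product

LEXICON = {"wo": ["我", "窝", "握"], "men": ["们", "门"]}

def lm_score(sentence: str) -> float:
    """Stand-in for a trained language model; higher means more plausible."""
    bigram_scores = {"我们": 0.9, "窝们": 0.01, "握门": 0.01}
    return sum(bigram_scores.get(sentence[i:i + 2], 0.001)
               for i in range(len(sentence) - 1))

def decode(phonemes: list) -> str:
    candidates = ("".join(c) for c in product(*(LEXICON[p] for p in phonemes)))
    return max(candidates, key=lm_score)

print(decode(["wo", "men"]))      # -> "我们"
```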
Step 206: and carrying out tone quality processing on the voice data to obtain tone quality information of the voice data.
Specifically, in order to obtain the tone quality information of the voice data with high efficiency, a pre-trained tone quality detection model may be used to implement tone quality processing on the voice data, so as to obtain the tone quality information of the voice data, where a specific implementation manner is as follows:
The voice quality processing is performed on the voice data, and the obtaining of the voice quality information of the voice data includes:
and inputting the voice data into a pre-trained tone quality detection model to obtain tone quality information of the voice data.
In a specific implementation, before the voice data is input into the pre-trained sound-quality detection model to obtain its sound-quality information, the method further includes:
acquiring voice data samples and the sound-quality information corresponding to them, where the sound-quality information includes the volume, timbre, and waveform envelope of each sample;
training an initial sound-quality detection model based on the voice data samples and their corresponding sound-quality information to obtain the sound-quality detection model.
The sound-quality detection model takes a voice data sample as input and outputs the sound-quality information corresponding to it.
The waveform envelope refers to the transients with which the amplitude of a tone starts and ends when a sound is produced, i.e., the envelope of the waveform. These envelope variations also affect the timbre of the sound.
In practice, voice data samples and the sound-quality information corresponding to each sample are obtained from a pre-established sample database, or the samples are collected from the internet and their sound-quality information is determined from human experience. The samples and their corresponding sound-quality information form the training data on which the initial sound-quality detection model is trained to obtain the trained model.
In a specific implementation, verifying only the text information of voice data is not enough: in many cases the text is unproblematic, yet the audio itself is of low quality, for example shrill or harsh, which can distress the users who receive it. To ensure the safety of the voice data, its sound quality must therefore be audited in addition to its text information.
In the embodiments of this specification, the sound-quality detection model is built in advance, so that in subsequent use the sound-quality information corresponding to voice data can be obtained quickly and accurately directly from the model. This enables the sound-quality judgment, safeguards the audit quality of the voice data, and improves the user experience. A hedged sketch of the model in use follows.
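The patent only states that the sound-quality detection model takes voice data as input and outputs sound-quality information; the sketch below is one assumed shape for such a model at inference time, with `QualityNet` and its three-score output invented here for illustration.

```python
# Assumed inference-time sketch of a pre-trained sound-quality detection
# model; architecture and output layout are illustrative, not the patent's.
import torch
from torch import nn

class QualityNet(nn.Module):
    def __init__(self, n_samples: int = 400):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_samples, 128), nn.ReLU(),
            nn.Linear(128, 3),        # volume, timbre, envelope scores
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.backbone(waveform)

quality_model = QualityNet()
quality_model.eval()                  # weights assumed pre-trained

with torch.no_grad():
    waveform = torch.randn(1, 400)    # placeholder voice data
    volume, timbre, envelope = quality_model(waveform).squeeze().tolist()
```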
In another implementation of this specification, performing sound-quality processing on the voice data to obtain its sound-quality information includes:
performing sound-quality processing on the voice data and determining the amplitude, the frequency spectrum, and the starting and ending transients of its sound;
obtaining the volume of the voice data from the amplitude of its sound;
obtaining the timbre of the voice data from its frequency spectrum;
obtaining the waveform envelope of the voice data from the starting and ending transients of its amplitude.
Specifically, sound-quality information such as the volume, timbre, and waveform envelope of the voice data is obtained by measuring the amplitude of its sound, its frequency spectrum, and the transients with which the amplitude starts and ends.
Before the sound-quality processing, the voice data may also be denoised and otherwise cleaned so that more accurate sound-quality information can be obtained.
In the embodiments of this specification, sound-quality processing yields a sound-wave diagram of the voice data: the volume is obtained from the amplitude of the sound in the diagram, the timbre from its frequency spectrum, and the waveform envelope from the momentary forms with which the amplitude starts and ends. The volume, timbre, and waveform envelope then allow the sound of the voice data to be audited quickly and accurately; a minimal sketch of these measurements follows.
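A minimal numpy sketch of the three measurements, assuming 10 ms analysis windows and the spectral centroid as a single-number proxy for timbre (both assumptions introduced here):

```python
# Sketch: volume from amplitude, timbre from the spectrum, and the
# waveform envelope from the start/end transients of the amplitude.
import numpy as np

def sound_quality_info(y: np.ndarray, sr: int = 16000):
    # Volume: summarize the amplitude of the sound as an RMS level.
    volume = float(np.sqrt(np.mean(y ** 2)))
    # Timbre: derived from the frequency spectrum; the spectral centroid
    # is used here as one common single-number proxy.
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    timbre = float((freqs * spectrum).sum() / (spectrum.sum() + 1e-12))
    # Waveform envelope: peak |y| per 10 ms window, exposing the attack
    # (starting transient) and decay (ending transient) of the amplitude.
    win = sr // 100
    pad = (-len(y)) % win
    frames = np.pad(np.abs(y), (0, pad)).reshape(-1, win)
    envelope = frames.max(axis=1)
    return volume, timbre, envelope
```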
Step 208: and under the condition that the text information and the tone quality information meet the preset auditing requirements, determining that the voice data pass auditing.
The preset auditing requirements can be set according to actual application scenes, and the voice auditing method is still applied to scenes for auditing voice barrages in the video field, so that the preset auditing requirements can be auditing requirements of text information and voice quality information of the currently played video, for example, the text information cannot contain sensitive words in a preset sensitive word stock of the currently played video, and the voice quality information is matched with the voice quality information of the currently played video.
Specifically, determining that the voice data passes the audit when the text information and the sound-quality information meet the preset audit requirements includes:
determining that the voice data passes the audit when the text information passes matching against the keywords in the preset word stock (i.e., hits no sensitive keyword) and the sound-quality information matches the preset sound-quality information.
The preset word stock can be understood as a preset sensitive word stock containing a number of preset key sensitive words, such as sensitive words involving unhealthy or pornographic content or words with violent tendencies. The preset sound-quality information can be determined according to the actual application scenario: in a video playback scenario it is the sound-quality information of the currently playing video, or sound-quality information superior to it; in a music playback scenario it is the sound-quality information of the currently playing music, or sound-quality information superior to it.
In practice, voice data with good sound quality can be collected from the internet and analyzed with big data to determine what sound-quality information it has. From dimensions such as gender and timbre, and for different scenarios such as singing and dubbing, the sound-quality information of high-quality voice data is determined and a voice data sample library is built. In actual use, the sound-quality information of newly acquired voice data can then be matched against the high-quality sound-quality information for the same scenario to audit the voice data.
For example, if the application scenario of the acquired voice data is a singing video, then after the sound-quality information corresponding to the voice data is obtained, it is matched against the high-quality sound-quality information for the singing scenario in the sample library; that is, the volume, timbre, and waveform envelope (i.e., musical rhythm) of the voice data are matched against those of the current singing video. If they match, the voice data passes the audit.
In other realizable scenarios, a sound-quality threshold may be preset, and the voice data is determined to pass the audit when its sound-quality information is equal to or greater than the preset threshold.
In addition, since violation keywords are far fewer than non-violation keywords, the preset audit requirement can be set to detect that no violation keyword is present in the text information. On this basis, a violation word stock can be created from preset violation keywords, and it is then judged whether the text information matches the violation word stock and whether the sound-quality information matches the preset sound-quality information.
If the text information does not match the violation word stock and the sound-quality information matches the preset sound-quality information, the text information and the sound-quality information meet the preset audit requirements, indicating that the voice data is compliant, and the voice data can be determined to pass the audit.
If the text information matches the violation word stock, then whether or not the sound-quality information matches the preset sound-quality information, the preset audit requirements are not met, indicating that the voice data contains non-compliant content such as advertising or abuse. The voice data can then be rejected directly without further auditing, which greatly reduces the audit workload. A decision sketch under these requirements follows.
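Under the requirements just stated, the audit decision can be sketched as follows; the word list, the numeric quality score, and the tolerance are illustrative assumptions, not values from the patent.

```python
# Decision sketch: reject on any violation keyword, otherwise pass only
# when the sound quality matches the preset for the current scene.
SENSITIVE_WORDS = {"bad_word_1", "bad_word_2"}   # assumed violation stock

def audit(text: str, quality: float, preset_quality: float,
          tolerance: float = 0.1) -> bool:
    if any(word in text for word in SENSITIVE_WORDS):
        return False              # non-compliant text: reject outright
    # Sound quality matches when it is within the tolerance of, or better
    # than, the preset sound-quality value for the current scene.
    return quality >= preset_quality - tolerance
```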
In the voice auditing method provided by the embodiments of this specification, the text information and sound-quality information of the voice data to be recognized are obtained, and the voice data is judged against these two criteria. The voice data is thereby audited quickly and accurately, ensuring that only safe, compliant voice barrages are displayed in the video and improving users' sense of participation when watching it.
In another embodiment of this specification, after it is determined that the voice data passes the audit, the method further includes:
sending the voice data to the corresponding video playback platform.
Specifically, when the voice data passes the audit, it is sent to the corresponding video playback platform so that the platform's other clients can receive it, enabling user interaction and increasing user participation.
Referring to FIG. 3, the voice auditing method provided by the embodiments of this specification is further described by taking its application to auditing voice barrages in the video domain as an example. FIG. 3 shows a process flowchart of a voice auditing method according to an embodiment of this specification, which specifically includes the following steps:
Here, the voice data to be recognized is the voice barrage.
Step 302: the user clicks the voice-barrage recording button on the client's video-watching interface to record speech, and the client receives the voice barrage thus generated.
Step 304: the client sends the voice barrage to the video server, which receives the voice barrage and stores its file.
Step 306: the video server adds the voice barrage to an audit queue.
Step 308: the video server fetches the voice barrage from the audit queue, obtains the text information of the voice barrage by performing text processing on it, and obtains the sound-quality information of the voice barrage by performing sound-quality processing on it.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of model training and model application in the voice auditing method according to an embodiment of this specification.
Specifically, FIG. 4 shows the video server's training steps for the acoustic model and the language model, and the specific steps for converting a voice barrage into text information based on the trained models, in the voice auditing method of this embodiment.
The training steps of the acoustic model and the language model are as follows:
Step one: obtain voice data samples from a voice database.
Step two: segment the acquired voice data samples and extract their features to obtain the voice feature samples corresponding to the voice data samples and the phoneme information sample corresponding to each voice feature sample, thereby forming the acoustic-model training data.
Step three: train the initial acoustic model on the acoustic-model training data to obtain the trained acoustic model.
Step four: obtain text samples, and the probabilities that individual characters or words in each text sample are associated with one another, from a text database to form the language-model training data.
Step five: train the initial language model on the language-model training data to obtain the trained language model.
The acoustic-model training of steps one to three and the language-model training of steps four to five may also be performed simultaneously; the embodiments of this specification place no restriction on this.
The step of converting the voice barrage into text information based on the trained acoustic model and language model is as follows:
step six: and acquiring a voice barrage, and dividing the voice barrage into a plurality of voice fragments based on the mute point of the voice barrage.
Step seven: and extracting the voice characteristics corresponding to the voice fragments based on a characteristic extraction algorithm.
Step eight: inputting the voice characteristics into a recognition network formed by an acoustic model, a dictionary and a language model, decoding the voice of the voice characteristics through the acoustic model to obtain the phoneme information of the voice characteristics, obtaining characters corresponding to the phoneme information in the dictionary based on a search algorithm, and finally inputting the characters into the language model to obtain the final text information of the voice barrage.
The dictionary may be understood as a preset text library, or may be another electronic dictionary, that is, a text library capable of implementing text query based on phoneme information.
Step nine: outputting the text information to realize the subsequent auditing of the text information.
Step 310: and the video server side audits the voice barrage based on the text information and the sound quality information of the voice barrage.
In practical application, in order to further ensure the auditing accuracy of the voice barrage, the auditing of the voice barrage is realized through step 308, and meanwhile, whether the sentence is smooth or whether the semantic ambiguity exists in the text information of the voice barrage can also be audited, so that the text information of which the sentence is not smooth or the semantic ambiguity is screened out from the text information of the voice barrage, the voice barrage corresponding to the text information is determined, and finally, the deep auditing is performed through a manual listening test mode; not only can the auditing efficiency be improved, but also the voice barrage can be enhanced and audited based on manual experience, so that the voice barrage scene is promoted to be more standard.
Step 312: and the video server adds the audited voice barrage to a barrage list so as to sequentially display a video watching interface at the user client in the barrage list.
In the auditing method provided by the embodiments of this specification, auditing the voice barrages sent from clients while a video is watched prevents advertisements, unhealthy language, or content inappropriate for the currently playing video from appearing in the voice barrage. Specifically, on the one hand, the voice barrage is converted into text information by artificial intelligence and sensitive words from the sensitive word stock are filtered out; on the other hand, the sound-quality information of the voice barrage is audited to keep shrill, overly loud audio that would discomfort listeners out of the barrage, and in practice the sound-quality information can be detected quickly with a trained machine-learning model to filter the barrages. To guarantee screening quality, the screened-out voice barrages can be listened to again by human reviewers, further ensuring the compliance and safety of the voice barrages, creating a better viewing environment for users who watch videos, and enhancing their viewing and interaction experience.
Corresponding to the above method embodiments, this specification further provides embodiments of a voice auditing apparatus, and FIG. 5 shows a schematic structural diagram of a voice auditing apparatus according to an embodiment of this specification. As shown in FIG. 5, the apparatus includes:
an acquisition module 502 configured to acquire voice data to be recognized;
a text information obtaining module 504 configured to perform text processing on the voice data to obtain text information of the voice data;
a sound-quality information obtaining module 506 configured to perform sound-quality processing on the voice data to obtain sound-quality information of the voice data;
and an auditing module 508 configured to determine that the voice data passes the audit when the text information and the sound-quality information meet preset audit requirements.
Optionally, the text information obtaining module 504 is further configured to:
preprocess the voice data and extract voice features from the preprocessed voice data;
input the voice features into an acoustic model to obtain phoneme information corresponding to the voice features;
determine, based on a preset search algorithm, the characters corresponding to the phoneme information in a text library, and perform semantic analysis on the characters according to a language model to obtain the text information of the voice data.
Optionally, the text information obtaining module 504 is further configured to:
detect silence points in the voice data and segment the voice data into multiple voice segments at the silence points;
extract the voice features of each voice segment based on a preset feature extraction algorithm.
Optionally, the apparatus further includes:
a first sample acquisition module configured to acquire voice data samples;
a segmentation module configured to detect silence points in the voice data samples and segment each sample into multiple voice segment samples at the silence points;
an extraction module configured to extract the voice feature sample of each voice segment sample based on a preset feature extraction algorithm;
and an acoustic model training module configured to train an initial acoustic model according to the voice feature samples and the phoneme information samples corresponding to them to obtain the acoustic model.
Optionally, the sound-quality information obtaining module 506 is further configured to:
input the voice data into a pre-trained sound-quality detection model to obtain the sound-quality information of the voice data.
Optionally, the apparatus further includes:
a second sample acquisition module configured to acquire voice data samples and the sound-quality information corresponding to them, where the sound-quality information includes the volume, timbre, and waveform envelope of each sample;
and a sound-quality detection model training module configured to train an initial sound-quality detection model based on the voice data samples and their corresponding sound-quality information to obtain the sound-quality detection model.
Optionally, the sound-quality information obtaining module 506 is further configured to:
perform sound-quality processing on the voice data and determine the amplitude, the frequency spectrum, and the starting and ending transients of its sound;
obtain the volume of the voice data from the amplitude of its sound;
obtain the timbre of the voice data from its frequency spectrum;
obtain the waveform envelope of the voice data from the starting and ending transients of its amplitude.
Optionally, the auditing module 508 is further configured to:
determine that the voice data passes the audit when the text information passes matching against the keywords in the preset word stock and the sound-quality information matches the preset sound-quality information.
Optionally, the apparatus further includes:
a sending module configured to send the voice data to the corresponding video playback platform.
Optionally, the preset feature extraction algorithm includes a linear prediction cepstral coefficient algorithm or a mel frequency cepstral coefficient algorithm.
The voice auditing apparatus provided by the embodiments of this specification obtains the text information and sound-quality information of the voice data to be recognized and judges the voice data against these two criteria, thereby auditing the voice data quickly and accurately, ensuring that only safe, compliant voice barrages are displayed in the video and improving users' sense of participation when watching it.
The foregoing is a schematic solution of the voice auditing apparatus of this embodiment. It should be noted that the technical solution of the voice auditing apparatus and the technical solution of the voice auditing method belong to the same concept; for details not described in the apparatus solution, refer to the description of the method solution.
Fig. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present description. The components of computing device 600 include, but are not limited to, memory 610 and processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to hold data.
Computing device 600 also includes access device 640, access device 640 enabling computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 640 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of this specification, the above components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, for example by a bus. It should be understood that the block diagram of the computing device shown in FIG. 6 is for exemplary purposes only and does not limit the scope of this specification. Those skilled in the art may add or replace components as needed.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
The processor 620 is configured to execute computer-executable instructions, and when executing them the processor implements the steps of the voice auditing method.
The foregoing is a schematic solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the voice auditing method belong to the same concept; for details not described in the computing device solution, refer to the description of the method solution.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the voice auditing method.
The foregoing is an exemplary solution of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the voice auditing method belong to the same concept; for details not described in the storage medium solution, refer to the description of the method solution.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be adapted as appropriate to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of action combinations; however, those skilled in the art should understand that the embodiments are not limited by the order of the actions described, as some steps may be performed in another order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely intended to help clarify the present specification. The embodiments are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and their full scope and equivalents.

Claims (13)

1. A voice auditing method, comprising:
acquiring voice data to be recognized;
obtaining phoneme information corresponding to voice features of the voice data, wherein the phoneme information is information composed of identifiers representing pronunciation, the phoneme information comprises one or more phoneme units, and each phoneme unit corresponds to a word;
determining characters corresponding to the phoneme information in a character library based on a preset search algorithm, and performing semantic analysis on the characters according to a language model to obtain text information of the voice data;
performing sound quality processing on the voice data to obtain sound quality information of the voice data; and
determining that the voice data passes the audit in a case where the text information and the sound quality information meet preset auditing requirements, wherein the preset auditing requirements comprise that the sound quality information matches preset sound quality information, and the preset sound quality information is determined based on a playing scene of the voice data.
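By way of a non-limiting illustration only, the following Python sketch traces the flow of claim 1 from end to end. Every identifier in it (recognize_text, extract_sound_quality, PRESET_QUALITY_BY_SCENE, the scene names and thresholds) is a hypothetical stand-in introduced for this sketch, not something defined in the specification or the claims.

# Hypothetical end-to-end sketch of claim 1; every name here is illustrative.

PRESET_QUALITY_BY_SCENE = {            # assumed preset sound quality per playing scene
    "music_live": {"min_volume": 0.10},
    "chat_room": {"min_volume": 0.02},
}

def recognize_text(samples):
    """Stand-in for phoneme extraction, character lookup, and language-model analysis."""
    return "hello world"               # placeholder transcript

def extract_sound_quality(samples):
    """Stand-in for sound quality processing (volume, timbre, waveform envelope)."""
    return {"volume": max(abs(s) for s in samples)}

def audit(samples, playing_scene):
    text = recognize_text(samples)                    # text information
    quality = extract_sound_quality(samples)          # sound quality information
    preset = PRESET_QUALITY_BY_SCENE[playing_scene]   # preset depends on the playing scene
    text_ok = bool(text.strip())                      # stand-in for the text requirement
    quality_ok = quality["volume"] >= preset["min_volume"]
    return text_ok and quality_ok                     # True means the audit passes

print(audit([0.0, 0.2, -0.3, 0.1], "chat_room"))      # True

The point of the sketch is only the shape of the decision: the text check and the sound quality check are independent, and the quality preset is looked up by playing scene before the comparison is made.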
2. The voice auditing method according to claim 1, wherein obtaining the phoneme information corresponding to the voice features of the voice data comprises:
preprocessing the voice data, and extracting the voice features of the preprocessed voice data; and
inputting the voice features into an acoustic model to obtain the phoneme information corresponding to the voice features.
3. The voice auditing method according to claim 2, wherein preprocessing the voice data and extracting the voice features of the preprocessed voice data comprises:
detecting silence points of the voice data, and dividing the voice data into a plurality of voice segments according to the silence points; and
extracting the voice features of each voice segment based on a preset feature extraction algorithm.
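One plausible, non-limiting reading of the silence point detection recited above is a short-time energy gate: frames whose energy falls below a threshold act as silence points, and the voiced runs between them become the voice segments. The sketch below assumes that reading; the frame size and energy threshold are illustrative.

import numpy as np

def split_on_silence(samples: np.ndarray, sr: int,
                     frame_ms: int = 25, energy_thresh: float = 1e-4):
    """Divide a mono waveform into voice segments at low-energy (silence) frames."""
    frame_len = int(sr * frame_ms / 1000)
    segments, current = [], []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if np.mean(frame ** 2) > energy_thresh:   # voiced frame: extend current segment
            current.append(frame)
        elif current:                             # silence point: close the open segment
            segments.append(np.concatenate(current))
            current = []
    if current:                                   # flush a trailing voiced segment
        segments.append(np.concatenate(current))
    return segments

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
wave = np.concatenate([tone, np.zeros(sr), tone])  # speech, silence, speech
print(len(split_on_silence(wave, sr)))             # 2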
4. The voice auditing method according to claim 2 or 3, wherein before the voice features are input into the acoustic model to obtain the phoneme information corresponding to the voice features, the method further comprises:
acquiring a voice data sample;
performing silence point detection on the voice data sample, and dividing the voice data sample into a plurality of voice segment samples according to the silence points;
extracting a voice feature sample of each voice segment sample based on the preset feature extraction algorithm; and
training an initial acoustic model according to the voice feature samples and phoneme information samples corresponding to the voice feature samples, to obtain the acoustic model.
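As a toy, non-limiting illustration of this training step, the sketch below fits a frame-level classifier from voice feature samples to phoneme labels on synthetic data. A production acoustic model would be trained on real labelled speech, typically with a sequence model; the feature dimension, label set and classifier here are all assumptions.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for claim 4: fit an "acoustic model" mapping per-frame voice
# feature vectors to phoneme labels. The features and labels are synthetic.
rng = np.random.default_rng(0)
n_frames, n_dims = 300, 13                        # e.g. 13 cepstral coefficients per frame
features = rng.normal(size=(n_frames, n_dims))    # voice feature samples
phonemes = rng.choice(["a", "o", "sil"], size=n_frames)  # phoneme information samples

acoustic_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
acoustic_model.fit(features, phonemes)            # initial model -> trained acoustic model

print(acoustic_model.predict(features[:3]))       # phoneme units for three frames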
5. The voice auditing method according to claim 1, 2 or 3, wherein performing the sound quality processing on the voice data to obtain the sound quality information of the voice data comprises:
inputting the voice data into a pre-trained sound quality detection model to obtain the sound quality information of the voice data.
6. The voice auditing method according to claim 5, wherein before the voice data is input into the pre-trained sound quality detection model to obtain the sound quality information of the voice data, the method further comprises:
acquiring a voice data sample and sound quality information corresponding to the voice data sample, wherein the sound quality information comprises the volume, timbre and waveform envelope of the voice data sample; and
training an initial sound quality detection model based on the voice data sample and the sound quality information corresponding to the voice data sample, to obtain the sound quality detection model.
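Under similarly simple assumptions, the training of claim 6 could look like the following sketch: synthetic waveforms paired with a known sound quality label (reduced here to a single volume value) are used to fit a regressor, whose predict call then stands in for the trained sound quality detection model of claim 5. All data and model choices are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for claim 6: pair synthetic waveforms with a known sound
# quality label (reduced here to a single volume value) and fit a model.
rng = np.random.default_rng(1)
volumes = rng.uniform(0.1, 1.0, size=50)          # labelled volume per sample
waves = [v * np.sin(2 * np.pi * 440 * np.arange(800) / 16000) for v in volumes]
features = np.array([[np.max(np.abs(w)), np.mean(w ** 2)] for w in waves])

quality_model = RandomForestRegressor(n_estimators=50, random_state=0)
quality_model.fit(features, volumes)              # initial model -> sound quality detection model

print(round(float(quality_model.predict(features[:1])[0]), 2))  # close to volumes[0]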
7. The voice auditing method according to claim 1, 2 or 3, wherein performing the sound quality processing on the voice data to obtain the sound quality information of the voice data comprises:
performing sound quality processing on the voice data, and determining the amplitude, the frequency spectrum, and the starting and ending transients of the voice data;
obtaining the volume of the voice data according to the amplitude of the voice data;
obtaining the timbre of the voice data according to the frequency spectrum of the voice data; and
obtaining the waveform envelope of the voice data according to the starting and ending transients of the amplitude of the voice data.
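Read literally, claim 7 derives the volume from the amplitude, the timbre from the frequency spectrum, and the waveform envelope from the amplitude transients. The sketch below computes simple proxies for each quantity; the particular definitions chosen (peak amplitude for volume, dominant spectral peak as a timbre proxy, framewise peak tracking for the envelope) are assumptions made for illustration, not definitions from the specification.

import numpy as np

def sound_quality_info(samples: np.ndarray, sr: int) -> dict:
    volume = float(np.max(np.abs(samples)))            # amplitude -> volume
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    dominant_hz = float(freqs[np.argmax(spectrum)])    # spectrum -> timbre proxy
    frame = max(1, sr // 100)                          # 10 ms frames for the envelope
    envelope = [float(np.max(np.abs(samples[i:i + frame])))
                for i in range(0, len(samples), frame)]  # transients -> waveform envelope
    return {"volume": volume, "dominant_hz": dominant_hz, "envelope": envelope}

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
info = sound_quality_info(tone, sr)
print(info["volume"], info["dominant_hz"])             # ~0.5 440.0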
8. The voice auditing method according to claim 1, 2 or 3, wherein determining that the voice data passes the audit in the case where the text information and the sound quality information meet the preset auditing requirements comprises:
determining that the voice data passes the audit in a case where the text information matches keywords in a preset word stock and the sound quality information matches the preset sound quality information.
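Claim 8 reads as a conjunction of two matches, which the small predicate below makes concrete: the text information must match keywords from the preset word stock, and each measured sound quality value must fall within a tolerance of the scene's preset value. The word stock, the tolerance and the quality fields are invented for this sketch.

# A literal reading of claim 8 as a predicate. The word stock, the matching
# tolerance, and the quality fields are invented for illustration.

PRESET_WORD_STOCK = {"welcome", "subscribe"}    # assumed compliant keywords

def matches_preset(quality: dict, preset: dict, tol: float = 0.1) -> bool:
    """Each measured quality value must lie within tol of the preset value."""
    return all(abs(quality[key] - value) <= tol for key, value in preset.items())

def audit_passes(text: str, quality: dict, preset_quality: dict) -> bool:
    text_ok = any(keyword in text for keyword in PRESET_WORD_STOCK)
    return text_ok and matches_preset(quality, preset_quality)

print(audit_passes("welcome everyone", {"volume": 0.45}, {"volume": 0.5}))  # True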
9. The voice auditing method according to claim 1, 2 or 3, wherein after determining that the voice data passes the audit, the method further comprises:
sending the voice data to a corresponding video playing platform.
10. The voice auditing method according to claim 3, wherein the preset feature extraction algorithm comprises a linear prediction cepstral coefficient (LPCC) algorithm or a Mel-frequency cepstral coefficient (MFCC) algorithm.
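For the MFCC branch, a standard implementation such as librosa's can be used; the non-limiting sketch below extracts 13 coefficients per frame from a synthetic tone. The sampling rate and coefficient count are conventional choices rather than values fixed by the claim.

import numpy as np
import librosa

# MFCC branch of claim 10 via librosa; parameters are conventional, not claimed.
sr = 16000
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 coefficients per frame
print(mfcc.shape)                                     # (13, number_of_frames)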
11. A voice auditing apparatus, comprising:
an acquisition module configured to acquire voice data to be recognized;
a text information obtaining module configured to obtain phoneme information corresponding to voice features of the voice data, wherein the phoneme information is information composed of identifiers representing pronunciation, the phoneme information comprises one or more phoneme units, and each phoneme unit corresponds to a word; and further configured to determine characters corresponding to the phoneme information in a character library based on a preset search algorithm, and to perform semantic analysis on the characters according to a language model to obtain text information of the voice data;
a sound quality information obtaining module configured to perform sound quality processing on the voice data to obtain sound quality information of the voice data; and
an auditing module configured to determine that the voice data passes the audit in a case where the text information and the sound quality information meet preset auditing requirements, wherein the preset auditing requirements comprise that the sound quality information matches preset sound quality information, and the preset sound quality information is determined based on a playing scene of the voice data.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, wherein the processor, when executing the computer-executable instructions, implements the steps of the voice auditing method of any one of claims 1 to 10.
13. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the voice auditing method of any one of claims 1 to 10.
CN202010887653.1A 2020-08-28 2020-08-28 Voice auditing method and device Active CN114125506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010887653.1A CN114125506B (en) 2020-08-28 2020-08-28 Voice auditing method and device

Publications (2)

Publication Number Publication Date
CN114125506A CN114125506A (en) 2022-03-01
CN114125506B true CN114125506B (en) 2024-03-19

Family

ID=80375148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010887653.1A Active CN114125506B (en) 2020-08-28 2020-08-28 Voice auditing method and device

Country Status (1)

Country Link
CN (1) CN114125506B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666618B (en) * 2022-03-15 2023-10-13 广州欢城文化传媒有限公司 Audio auditing method, device, equipment and readable storage medium
CN114329075B (en) * 2022-03-15 2022-05-27 飞狐信息技术(天津)有限公司 Method and device for determining playing page

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679729A (en) * 2015-02-13 2015-06-03 广州市讯飞樽鸿信息技术有限公司 Recorded message effective processing method and system
CN107480152A (en) * 2016-06-08 2017-12-15 北京新岸线网络技术有限公司 A kind of audio analysis and search method and system
CN109065069A (en) * 2018-10-10 2018-12-21 广州市百果园信息技术有限公司 A kind of audio-frequency detection, device, equipment and storage medium
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information
CN110991427A (en) * 2019-12-25 2020-04-10 北京百度网讯科技有限公司 Emotion recognition method and device for video and computer equipment
CN111402887A (en) * 2018-12-17 2020-07-10 北京未来媒体科技股份有限公司 Method and device for escaping characters by voice
CN111462735A (en) * 2020-04-10 2020-07-28 网易(杭州)网络有限公司 Voice detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114125506A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN110517689B (en) Voice data processing method, device and storage medium
US8447592B2 (en) Methods and apparatus for formant-based voice systems
Vestman et al. Voice mimicry attacks assisted by automatic speaker verification
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
US9311914B2 (en) Method and apparatus for enhanced phonetic indexing and search
CN112735371B (en) Method and device for generating speaker video based on text information
CN112185363B (en) Audio processing method and device
Yang et al. Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset
CN114125506B (en) Voice auditing method and device
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN114708869A (en) Voice interaction method and device and electric appliance
Cuccovillo et al. Open challenges in synthetic speech detection
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
Xin et al. Exploring the effectiveness of self-supervised learning and classifier chains in emotion recognition of nonverbal vocalizations
CN111737515B (en) Audio fingerprint extraction method and device, computer equipment and readable storage medium
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
Reimao Synthetic speech detection using deep neural networks
US20140074478A1 (en) System and method for digitally replicating speech
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN112423000B (en) Data processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant