CN117809655A - Audio processing method, device, equipment and storage medium


Info

Publication number
CN117809655A
Authority
CN
China
Prior art keywords
voice
content
model
text
audio
Prior art date
Legal status
Pending
Application number
CN202311841788.4A
Other languages
Chinese (zh)
Inventor
轩晓光
劳振锋
陈传艺
黄杰雄
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202311841788.4A
Publication of CN117809655A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an audio processing method, device, equipment and storage medium, belonging to the field of artificial intelligence. The method comprises the following steps: acquiring a human voice segment to be predicted in audio; recognizing the human voice segment to be predicted to obtain text recognition content, wherein the text recognition content is the text content corresponding to the human voice segment to be predicted; and detecting the text recognition content to obtain a detection result of the text recognition content. By acquiring the human voice segments to be predicted in the audio, the audio can be checked in a targeted manner; by recognizing the human voice segments, the voice content can be converted into text form; and by detecting the text recognition content converted into text form, the detection result of the audio can be obtained directly, which improves the accuracy and efficiency of audio detection.

Description

Audio processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to an audio processing method, apparatus, device, and storage medium.
Background
In processing audio, it is often necessary to identify the content of the audio and determine whether the audio belongs to an advertisement.
In the related art, when identifying the content of audio, features are generally extracted from the advertisement portions of known audio to build an audio advertisement feature library. For unknown audio, the features of the suspected advertisement portion are extracted and matched against the features in the audio advertisement feature library; if the matching succeeds, the unknown audio is identified as an advertisement.
However, the feature matching method depends heavily on the audio advertisement feature library. If the library is small or not updated in time, it cannot determine whether new, unknown audio belongs to an advertisement, and advertisements may be missed. Therefore, how to identify and judge audio efficiently and accurately is a problem to be solved.
Disclosure of Invention
The application provides an audio processing method, an audio processing device, audio processing equipment and a storage medium, wherein the technical scheme is as follows:
according to an aspect of the present application, there is provided an audio processing method, the method including:
acquiring a human voice segment to be predicted in audio;
recognizing the human voice segment to be predicted to obtain text recognition content, wherein the text recognition content is the text content corresponding to the human voice segment to be predicted;
and detecting the text recognition content to obtain a detection result of the text recognition content.
According to another aspect of the present application, there is provided an audio processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire the human voice segment to be predicted in the audio;
a recognition module, configured to recognize the human voice segment to be predicted to obtain text recognition content, wherein the text recognition content is the text content corresponding to the human voice segment to be predicted;
and a detection module, configured to detect the text recognition content to obtain a detection result of the text recognition content.
According to an aspect of the present application, there is provided an audio processing method, the method including: dividing the audio into a plurality of segments, the plurality of segments including speech segments and non-speech segments; and marking the speech segments among the plurality of segments based on the neural network model, and outputting the speech segments.
According to an aspect of the present application, there is provided an audio processing method, the method including: inputting the speech segment into the human voice separation model, wherein the speech segment comprises the human voice segment and a background disturbance sound segment, and the background disturbance sound segment consists of the segments in the speech segment other than the human voice segment; and separating the human voice segment from the background disturbance sound segment based on the human voice separation model, and outputting the human voice segment to be predicted.
According to an aspect of the present application, there is provided a computer device comprising: a processor and a memory, wherein at least one section of program is stored in the memory; the processor is configured to execute the at least one program in the memory to implement the audio processing method.
According to an aspect of the present application, there is provided a computer-readable storage medium having stored therein executable instructions that are loaded and executed by a processor to implement the above-described audio processing method.
According to an aspect of the present application, there is provided a computer program product comprising computer instructions stored in a computer readable storage medium, from which a processor reads and executes the computer instructions to implement the above-mentioned audio processing method.
The beneficial effects brought by the technical solutions provided in this application include at least the following:
by recognizing the human voice segments to be predicted in the audio, text recognition content can be obtained, the text recognition content being the text content corresponding to the human voice segments to be predicted, and by detecting the text recognition content, the detection result of the text recognition content can be obtained. By acquiring the human voice segments in the audio, the audio can be checked in a targeted manner; by recognizing the human voice segments, the voice content can be converted into text form; and by detecting the text recognition content converted into text form, the detection result of the audio can be obtained directly. This audio detection method does not require matching searches in a limited audio advertisement feature library, and does not require manual review, which saves labor time and cost and improves the accuracy and efficiency of audio detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of an audio processing method provided in an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of an audio processing method provided in an exemplary embodiment of the present application;
FIG. 6 is a flowchart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 8 is a flowchart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 9 is a flowchart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a model training method for a generic speech recognition model provided in an exemplary embodiment of the present application;
FIG. 11 is a flowchart of an audio processing method provided by an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of a model training method for a generic large language model provided in an exemplary embodiment of the present application;
FIG. 13 is a block diagram of an audio processing apparatus according to an exemplary embodiment of the present application;
FIG. 14 is a block diagram of a server according to an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions. For example, information such as audio referred to in this application is obtained with sufficient authorization.
It should be understood that, although the terms first, second, etc. may be used in this disclosure to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
First, description is made of related terms related to the present application:
artificial intelligence (Artificial Intelligence, AI): the system is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Neural network model: an artificial neural network formed by interconnecting n neurons, where n is a positive integer. In this application, the neural network model is an artificial neural network used to identify the speech segments in audio. Illustratively, a neural network model can be divided into an input layer, a hidden layer and an output layer: the terminal inputs a viewfinder image into the input layer of the neural network model, the hidden layer downsamples the input viewfinder image, i.e., performs convolution calculation on the pixels in the viewfinder image, and the recognized portrait type is finally output through the output layer. The neural network model includes at least one of a CNN (Convolutional Neural Network) model, an FCN (Fully Convolutional Network) model, a DNN (Deep Neural Network) model, an RNN (Recurrent Neural Network) model, an embedding model, a GBDT (Gradient Boosting Decision Tree) model, an LR (Logistic Regression) model, and the like.
Human voice separation model: a deep learning model that aims to separate specific sound components, such as music, speech or other sounds, from a mixed audio signal. Such models have important applications in fields such as audio processing, speech recognition and music processing. In the present application, the human voice separation model may be a deep learning model used to separate the human voice segments from the speech segments. The human voice separation model includes at least one of a BSRNN (Blind Source Separation with Recurrent Neural Networks) model, a ResUNet (residual-connection-based deep learning) model, a Transformer encoder-decoder (self-attention-based deep learning) model, and the like.
Speech recognition model: a deep learning model for converting a human speech signal into recognizable text data. The speech recognition model may convert the sound signal into text or commands to enable understanding and processing of the speech signal. In this application, a speech recognition model is a deep learning model that converts segments of human voice into text recognition content.
Large language model: generally refers to a model with a large number of parameters and deep network layers. A large model is a machine learning model with a large number of parameters that requires substantial computing resources. Such models require large amounts of data and computing power during training and have millions to billions of parameters. Large models are designed to improve the representation capability and performance of the model, so as to better capture the patterns and rules in the data when handling complex tasks.
FIG. 1 illustrates a schematic diagram of a computer system provided in one embodiment of the present application. The computer system may be implemented as a system architecture for an audio processing method. The computer system may include: a terminal 100 and a server 120, wherein the terminal 100 and the server 120 are connected through a communication network 140.
After the terminal 100 sends the audio to the server 120 through the communication network 140 and the server 120 obtains the audio, the server 120 first separates out the speech segments in the audio through the neural network model, then inputs the speech segments into the human voice separation model to extract the human voice segments; the speech recognition model recognizes the human voice segments as text recognition content, and the large language model detects the text recognition content to obtain a detection result. The server 120 returns the detection result to the terminal 100 through the communication network 140.
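As a minimal, non-authoritative sketch (not part of the original disclosure), the server-side pipeline described above could be wired together roughly as follows in Python; all class and method names here are hypothetical placeholders for the four models:

# Hypothetical sketch of the server-side pipeline: VAD neural network model,
# human voice separation model, speech recognition model, large language model.
from typing import List

def process_audio(audio_path: str, vad_model, separation_model, asr_model, llm) -> List[str]:
    # Return one detection result per speech segment found in the audio.
    results: List[str] = []
    speech_segments = vad_model.detect_speech(audio_path)      # neural network model
    for segment in speech_segments:
        vocal = separation_model.extract_vocals(segment)        # human voice separation model
        text = asr_model.transcribe(vocal)                       # speech recognition model
        results.append(llm.classify(text))                       # large language model
    return results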
The terminal 100 may be an electronic device such as a mobile phone, a tablet computer, a vehicle-mounted terminal (car machine), a wearable device, or a personal computer (Personal Computer, PC). A client running a target application may be installed on the terminal 100; the target application may be an application for audio processing or another application provided with an audio processing function, which is not limited in this application. In addition, the form of the target application is not limited, and includes, but is not limited to, an application (App) installed in the terminal 100, an applet, and the like, and may also be in the form of a web page.
The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The server 120 may be a background server of the target application, configured to provide background services for the client of the target application.
According to the audio processing method provided by the embodiment of the application, the execution main body of each step can be computer equipment, and the computer equipment refers to electronic equipment with data calculation, processing and storage capabilities. Taking the implementation environment of the solution shown in fig. 1 as an example, the terminal 100 may perform the audio processing method (for example, the client of the terminal 100 that installs the running target application program performs the audio processing method), or the server 120 may perform the audio processing method, or the terminal 100 and the server 120 may perform the audio processing method in an interactive and coordinated manner, which is not limited in this application.
Fig. 2 shows a schematic diagram of an audio processing method according to an embodiment of the present application.
When auditing audio, it is generally necessary to identify the content of the audio and judge whether the audio belongs to an audio advertisement. In order to improve the accuracy and efficiency of audio auditing, the application builds an automatic audio advertisement auditing system that uses a plurality of modules to audit the audio, thereby improving the accuracy and efficiency of audio detection.
(1) Audio 10 is acquired.
Audio 10 is a signal that digitally records sound. Alternatively, the audio 10 may be at least one of various types of sound such as human voice, songs, sounds in natural environments, and the like. Optionally, the audio 10 includes various characteristics of sound, such as at least one of pitch, volume, tempo, and the like. Audio 10 may be stored in various file formats, such as MP3 (Moving Picture Experts Group Audio Layer III) and WAV (Waveform Audio File Format). In some embodiments, audio 10 refers to an audio file that contains speech or singing.
(2) The audio 10 is input into the neural network model 20, and the audio 10 is detected segment by segment based on the neural network model 20.
In some embodiments, the audio 10 is divided into a plurality of segments based on a predetermined time interval, and the neural network model 20 performs segment-by-segment detection on the plurality of segments in the audio 10, each segment being classified as a speech segment 21 or a non-speech segment 22. The speech segment 21 refers to the part of the audio 10 that contains speech, i.e. the segment that is voiced. The non-speech segments 22 are segments of the audio 10 that have no sound, alternatively the non-speech segments 22 may be silence segments. The neural network model 20 detects the audio 10 segment by segment, determines whether each segment contains speech, and marks the speech segment 21 containing speech in the audio 10.
(3) The speech segment 21 is input into the human voice separation model 30, and the human voice separation model 30 is invoked to extract the human voice segment 32 from the speech segment 21.
In some embodiments, the speech segment 21 contains both human voice and background disturbance sound. The background disturbance sound is the sound other than human voice in the speech segment 21, such as at least one of background noise, accompaniment music, traffic noise, and the like. Through training, the human voice separation model 30 can separate the human voice in the speech segment 21 from the background disturbance sound and output a clean human voice segment 32.
(4) The speech recognition model 40 is invoked to perform text recognition on the human voice segment 32 to obtain the text recognition content 42.
The human voice segment 32 is input into the speech recognition model 40, which recognizes the audio content in the human voice segment 32 as the corresponding text content and outputs it as the text recognition content 42.
In some embodiments, the text recognition content 42 is advertising content containing specific terminology, such as brand names and product names. To improve the accuracy with which the speech recognition model 40 recognizes advertising content, the speech recognition model 40 may be fine-tuned on advertising content. Fine-tuning means that, based on the existing speech recognition model 40, further training is performed using a data set of advertising content, so that the model meets the requirement of accurately recognizing advertising content. During fine-tuning, the speech recognition model 40 adjusts its parameters according to the advertising content data set, thereby improving its ability to recognize advertising content.
(5) The large language model 50 is invoked to detect the text recognition content 42 to obtain the detection result 52.
In some embodiments, the large language model 50 may be fine-tuned based on data samples that include text recognition content 42 and detection results 52. Optionally, the format of the data samples is as follows:
{ prompt: text recognition content 42, label: detection result 52 }
wherein prompt indicates the input content, namely the text recognition content 42, which may be the advertisement content to be recognized, and label represents the detection result 52 of the text recognition content.
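As a concrete (hypothetical) illustration of this format, a single fine-tuning sample could be represented in Python as follows; the texts are invented examples:

# Hypothetical example of one fine-tuning sample in the {prompt, label} format.
sample = {
    "prompt": "A brand new XYZ car, equipped with an advanced power system. "
              "Purchase now to enjoy one year of free maintenance.",
    "label": "offending advertisement",
}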
In some embodiments, the large language model 50 is fine-tuned based on the data samples. During training, the large language model 50 learns to predict the corresponding detection result 52 from the text recognition content 42 in the prompt, thereby continuously improving the accuracy and generalization capability of the large language model 50 in the advertisement content recognition task.
Alternatively, in a binary classification task, the detection result 52 may be one of two categories, such as "offending advertisement" and "non-offending advertisement". In a multi-classification task, the detection result 52 may be one of a plurality of categories, such as "make-up advertisement", "harassment advertisement", "dummy advertisement", and the like. The large language model 50 outputs the corresponding result based on the input text recognition content 42.
Prediction for segments of human voice in audio:
fig. 3 shows a flowchart of an audio processing method according to an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 210: acquiring a voice fragment to be detected in audio;
where audio is a signal in which sound is digitally recorded.
Alternatively, the audio may be at least one of various types of sounds such as human voice, songs, sounds in natural environment, and the like. The audio contains various characteristics of sound such as at least one of pitch, volume, tempo, etc. Audio may be stored in various file formats, such as MP3, WAV.
A human voice segment refers to a segment extracted from the audio that contains only the human voice portion.
In some embodiments, the human voice segment is the portion of the audio that is related to human voice. Alternatively, the human voice segment may be at least one of a single word, a phrase, a sentence, a continuous paragraph of speech, or the like.
In some embodiments, the human voice segment to be predicted refers to a human voice segment extracted from the audio for prediction. The human voice segment to be predicted may be used for performing speech recognition tasks.
Step 220: recognizing the human voice segment to be predicted to obtain text recognition content;
the text recognition content is the text content corresponding to the human voice segment to be predicted.
Optionally, speech recognition is used to convert the human voice segment into the text recognition content. Speech recognition analyzes the human voice segment and converts the voice information therein into the corresponding text content.
Alternatively, the text content may be at least one of an utterance, a sentence, a phrase, or the like contained in the speech segment.
Illustratively, the audio contains a segment of voice that is "weather today" and is processed by speech recognition to be converted into text form "weather today".
Step 230: detecting the text recognition content to obtain a detection result of the text recognition content.
The detection result includes a judgment result and an evaluation result of the text recognition content.
Alternatively, in a binary classification task, the detection result may be one of two categories, such as "offending advertisement" and "non-offending advertisement". In a multi-classification task, the detection result may be one of a plurality of categories, such as at least one of "make-up advertisement", "harassment advertisement", "dummy advertisement", and the like.
In summary, according to the method provided by the application, by recognizing the human voice segment to be predicted in the audio, text recognition content can be obtained, the text recognition content being the text content corresponding to the human voice segment to be predicted; and by detecting the text recognition content, the detection result of the text recognition content can be obtained. The audio is audited through different modules, which improves the accuracy and efficiency of audio detection.
Fig. 4 shows a flowchart of an audio processing method according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 3, step 210 may be implemented as steps 211, 212:
Step 211: detecting the audio based on the neural network model, and outputting the speech segments in the audio;
illustratively, the neural network model is an artificial neural network for identifying speech segments in audio. Alternatively, the neural network model may be at least one of an RNN model, a CNN model, an FCN model, and the like.
Illustratively, the neural network model is exemplified as the CNN model.
In one possible implementation, firstly, audio is input into a convolutional neural network model, input audio data is preprocessed, and the audio signal is converted into a digital form; secondly, the convolutional neural network model performs feature extraction on the input audio signal, and relevant features of the audio signal in time and frequency are extracted through a convolutional layer and a pooling layer; the convolutional neural network model is then trained using a labeled set of audio data that contains the audio and corresponding voice activity tags, i.e., whether each point in time in the audio is a segment of voice.
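A minimal sketch of the preprocessing and time-frequency feature extraction step described above, assuming the librosa library is used (the patent does not name a specific toolkit, and the parameter values are only examples):

# Sketch: convert an audio file into a log-mel spectrogram that a CNN-based
# voice activity detector could consume.
import librosa
import numpy as np

y, sr = librosa.load("input.wav", sr=16000)                    # waveform in digital form
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)    # time-frequency features
log_mel = librosa.power_to_db(mel)                             # shape: (n_mels, n_frames)
features = np.expand_dims(log_mel, axis=0)                     # add a channel axis for the CNN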
In some embodiments, audio is input into a neural network model that detects speech segments in the audio, and the result output by the neural network model is typically a binary sequence that indicates whether each point in time in the audio is a speech segment. Optionally, the speech segments in the audio are extracted from the binary sequence.
For example, for an output binary sequence, the speech segments may be divided according to consecutive time points of 1 or 0. When the consecutive time points are 1, the segment is indicated as a voice segment; when the consecutive time points are 0, this segment is indicated as a non-speech segment. Corresponding speech segments may be extracted from the audio according to the partitioning.
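A minimal sketch of turning such a binary frame sequence into time segments (the 30 ms frame length and the function name are assumptions for illustration):

# Sketch: group consecutive 1-frames of a binary VAD output into
# (start_time, end_time) speech segments, in seconds.
def binary_sequence_to_segments(frames, frame_ms=30):
    segments = []
    start = None
    for i, flag in enumerate(frames):
        if flag == 1 and start is None:
            start = i
        elif flag == 0 and start is not None:
            segments.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000.0, len(frames) * frame_ms / 1000.0))
    return segments

# Example: binary_sequence_to_segments([0, 1, 1, 1, 0, 0, 1, 1]) -> [(0.03, 0.12), (0.18, 0.24)]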
In some embodiments, the audio may be divided into a plurality of segments, each of which is detected by the neural network model, and speech segments in the audio are output.
Dividing the audio into a plurality of segments;
marking out the voice segments in the plurality of segments based on the neural network model, and outputting the voice segments.
In one alternative example, the audio is divided into a plurality of segments.
In some embodiments, the audio is divided into a plurality of segments based on a preset time interval (e.g., 30 milliseconds), and the neural network model performs segment-by-segment detection on the plurality of segments, each segment being classified as a speech segment or a non-speech segment. A speech segment refers to a portion of the audio that contains speech, i.e., a segment with sound. A non-speech segment is a segment of the audio without sound; optionally, a non-speech segment is a silence segment.
In some embodiments, the audio is divided into a plurality of segments by a sliding window. First, the size of the window, i.e., the length of time for each segment in the audio, is determined, and illustratively, a length of time that varies from tens to hundreds of milliseconds may be selected as the window size. Next, starting from the starting position of the audio, a length of time of one window size is selected as the first segment, the window is moved back a fixed time interval to obtain the next segment, and this process is repeated until the entire audio is covered. By this sliding window method, the audio can be divided into a plurality of segments of a fixed time length.
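A minimal sketch of this sliding-window division, assuming the audio has already been loaded as a sample array (the window and hop sizes are illustrative):

# Sketch: divide a waveform into fixed-length windows with a fixed hop.
# Non-overlapping 30 ms windows are used as an example; any trailing partial
# window is dropped in this sketch.
def split_into_windows(samples, sample_rate=16000, window_ms=30, hop_ms=30):
    window = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + window] for i in range(0, len(samples) - window + 1, hop)]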
In an alternative example, speech segments in the plurality of segments are labeled based on a neural network model, and the speech segments are output.
In some embodiments, sample speech segments and sample non-speech segments are used as the training data set; the training data set is labeled so that the sample speech segments and the sample non-speech segments are distinguished. Alternatively, a binary classification approach may be used, with the sample speech segments as positive examples and the sample non-speech segments as negative examples. The universal neural network model is trained using the labeled training data set: the training data set is input into the universal neural network model, a prediction result is output, and the model parameters of the universal neural network model are updated according to the difference between the prediction result and the training data set, so as to obtain the trained neural network model.
In some embodiments, the trained neural network model performs segment-by-segment detection of the plurality of segments in the audio, and the plurality of segments are input into the neural network model. Optionally, a threshold is set to determine whether each segment is a speech segment, and if the probability value is higher than or equal to the threshold, the segment is marked as a speech segment; if the probability value is below the threshold, the segment is marked as a non-speech segment. All the voice fragments in the input audio can be finally obtained through the discrimination and marking of the fragments.
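Continuing the sketch, the per-segment probabilities output by the trained model could be thresholded as follows (the threshold value of 0.5 is only an example):

# Sketch: mark each segment as speech (1) or non-speech (0), and keep only
# the segments whose probability is at or above the threshold.
def mark_speech_segments(probabilities, threshold=0.5):
    return [1 if p >= threshold else 0 for p in probabilities]

def keep_speech_segments(segments, probabilities, threshold=0.5):
    return [seg for seg, p in zip(segments, probabilities) if p >= threshold]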
Step 212: extracting the human voice segments from the speech segments based on the human voice separation model to obtain the human voice segments to be predicted.
By way of example, the human voice separation model may be a deep learning model for separating human voice segments in a speech segment.
In some embodiments, the speech segments are typically mixed from multiple sources, and the target source (human voice segment) in the speech segments is separated from other sources of noise for subsequent processing and analysis.
In some embodiments, the speech segments may be input into a human voice separation model, which extracts human voice segments from the speech segments.
Inputting the speech segments into a human voice separation model;
and separating the human voice segment from the background disturbance sound segment based on the human voice separation model, and outputting the human voice segment to be predicted.
In an alternative example, a speech segment is input into the human voice separation model.
The speech segment comprises a human voice segment and a background disturbance sound segment, wherein the background disturbance sound segment consists of the segments in the speech segment other than the human voice segment.
The background disturbance sound segment refers to the portion of the entire speech segment other than the human voice segment. Alternatively, the background disturbance sound may come from at least one of background noise, accompaniment music, traffic noise, and the like.
In some embodiments, the background disturbance sound is separated from the human voice segment by a human voice separation model.
In some embodiments, when a speech segment is input into the human voice separation model, the entire speech segment, including the human voice segment and the background disturbance sound segment, is input. The human voice separation model analyzes and processes the input speech segment, separates the human voice segment from the background disturbance sound segment, and outputs the human voice segment to be predicted, which contains only human voice.
In an alternative example, the human voice segment and the background disturbance sound segment are separated based on a human voice separation model, and the human voice segment to be predicted is output.
In some embodiments, the universal human voice separation model is trained by taking the sample background disturbance sound segment and the sample human voice segment as training data sets, and the human voice separation model after training better separates the human voice segment and the background disturbance sound segment in the voice segment. By providing diversified data in the training process, the universal human voice separation model can learn the characteristics of human voice and background disturbance sound in different scenes, so that the effect of separating the human voice fragments from the background disturbance sound fragments in the voice fragments is improved.
Optionally, the human voice separation model includes at least one of a BSRNN model, a ResUNet model, a Transformer encoder-decoder model, and other model types.
The following describes the training process of the separation model of human voice.
In some embodiments, a sample voice segment is obtained, wherein the sample voice segment comprises a sample human voice segment and a sample background disturbance sound segment; inputting the sample voice fragment into a sample voice separation model, and simultaneously providing the sample voice fragment as a target output; and training the sample human voice separation model by comparing the difference between the output of the sample human voice separation model and the target output to obtain the trained human voice separation model capable of separating human voice fragments.
In some embodiments, firstly, a model structure of a general human voice separation model is initialized, and secondly, a sample human voice segment and a sample background disturbance sound segment are acquired. The sample background disturbance sound segment comprises at least one of a plurality of types, such as background noise, accompaniment music, traffic noise and the like, and at least one background disturbance sound segment type is randomly selected from the sample background disturbance sound segment and added into the sample voice segment, so that a sample voice segment comprising the sample voice segment and the sample background disturbance sound segment is obtained.
In some embodiments, the sample speech segment is input into the universal human voice separation model, and a prediction result corresponding to the sample speech segment is output. The model parameters of the universal human voice separation model are updated by gradient descent according to the difference between the prediction result and the sample human voice segment until convergence is reached, and the universal human voice separation model that meets the convergence expectation is finally used as the trained human voice separation model.
In some embodiments, a loss value between the prediction result and the sample human voice segment is calculated by a loss function, thereby updating model parameters of the universal human voice separation model according to the loss value.
Optionally, the loss function includes at least one of a squared error loss function (L2 loss), an absolute error loss function (L1 loss), a cross-entropy loss function, and other loss function types.
Schematically, referring to fig. 5, which shows a schematic diagram of a training process of the human voice separation model provided by an exemplary embodiment of the present application. As shown in fig. 5, a sample human voice segment 401 is obtained, and a sample background disturbance sound segment 402 is superimposed on the sample human voice segment 401 to obtain a sample speech segment 403. The sample speech segment 403 is input into the universal human voice separation model 404, and a prediction result 405 is output. The squared error loss between the prediction result 405 and the sample human voice segment 401 is calculated through the L2 loss function, and the universal human voice separation model 404 is trained accordingly to obtain the trained human voice separation model 406.
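A minimal training-loop sketch of this procedure, assuming PyTorch and a generic separation model (the patent does not prescribe a framework; names are placeholders):

# Sketch of the fine-tuning step in fig. 5: mix the clean vocal sample 401 with
# a background disturbance sample 402, predict the vocals, and minimise the L2
# loss between the prediction 405 and the clean vocal 401.
import torch

def train_step(separation_model, optimizer, vocal_batch, noise_batch):
    mixture = vocal_batch + noise_batch                               # sample speech segment 403
    prediction = separation_model(mixture)                            # prediction result 405
    loss = torch.nn.functional.mse_loss(prediction, vocal_batch)      # L2 loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()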
In summary, according to the method provided by the application, the audio is detected based on the neural network model, the audio is divided into a plurality of segments, and the speech segments among the plurality of segments are output; the human voice separation model then extracts the human voice segments from the speech segments output by the neural network model to obtain the human voice segments to be predicted. To better separate the human voice segments from the speech segments, the human voice separation model is trained: in the training stage, the speech segments comprising the human voice segments and the background disturbance sound segments are used as input, while the human voice segments are provided as the target output. By comparing the difference between the output of the human voice separation model and the target output, and optimizing the human voice separation model with a suitable loss function, the human voice separation model can learn how to accurately separate the human voice segments and the background disturbance sound segments in the speech segments.
FIG. 6 shows a flowchart of an audio processing method provided by an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 3, step 220 may be implemented as steps 221, 222:
Step 221: inputting the human voice segment to be predicted into the speech recognition model;
illustratively, the speech recognition model is a deep learning model that converts segments of human voice into text recognition content.
Optionally, the speech recognition model includes at least one of a Paraformer model, a Conformer model, a Transducer model, and the like.
In some embodiments, the segment of human voice to be predicted is passed as input to an already trained speech recognition model that will recognize the segment of human voice to be predicted.
In some embodiments, before inputting the segment of human voice to be predicted into the speech recognition model, it is necessary to check whether the sampling rate of the segment of human voice to be predicted is consistent with the requirements of the speech recognition model. If not, sampling rate conversion is needed, and the sampling rate of the voice fragments to be predicted is adjusted to be compatible with the voice recognition model.
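A minimal sketch of such a sampling-rate check and conversion, assuming librosa is available (the 16 kHz target rate is only an example):

# Sketch: resample a human voice segment if its sampling rate does not match
# the rate expected by the speech recognition model.
import librosa

def ensure_sample_rate(samples, current_sr, target_sr=16000):
    if current_sr == target_sr:
        return samples, current_sr
    resampled = librosa.resample(samples, orig_sr=current_sr, target_sr=target_sr)
    return resampled, target_sr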
Step 222: recognizing the human voice segment to be predicted based on the speech recognition model to obtain the text recognition content.
The text recognition content is the text content corresponding to the human voice segment to be predicted.
In some embodiments, the human voice segment to be predicted is input into the speech recognition model, which recognizes the audio content in the human voice segment as the corresponding text content and outputs it as the text recognition content. Alternatively, the text content may be a passage of text, a sentence, or a word, and is the textual representation of the input human voice segment.
For example, a piece of audio contains a human voice segment, and the speech recognition model can be used to recognize the human voice segment to be predicted to obtain the text recognition content. For example, the human voice segment in the audio is the sentence "the weather is good today", and the corresponding text content "the weather is good today" can be obtained through the speech recognition model. The input human voice segment is recognized using the speech recognition model to obtain the corresponding text recognition content.
In an alternative example, keywords in the voice segments to be predicted are identified, and the voice segments corresponding to the keywords are obtained.
The keywords refer to words or phrases related to the identification content in the voice fragments to be predicted. Keywords play a key role in the speech recognition model, and the keywords can guide the speech recognition model to perform more accurate text recognition. By way of example, in a segment of human voice about weather, the keywords may be "weather", "clear", "temperature", etc. By recognizing these keywords, the speech recognition model can more accurately understand the content of the human voice segment and convert it into corresponding text recognition content.
In some embodiments, the segment of voice to be predicted is a segment of advertising content, and the keywords refer to words or phrases in the segment of voice to be predicted that are related to the advertising content. Illustratively, the keywords are proper nouns related to the advertising content, such as brand names, product names, and the like. Keywords may be specific and may vary from advertising content to advertising content. By way of example, taking a piece of voice to be predicted as a mobile advertisement, the keywords may include at least one of a mobile brand, model, feature, etc. Taking a voice segment to be predicted as an automobile advertisement as an example, the keywords may cover at least one of automobile brands, automobile types, performance characteristics and the like.
In some embodiments, a keyword list containing proper nouns such as brand names, product names, etc. may be constructed. In the text results outputted by the speech recognition model, it is searched whether these keywords are included, thereby determining whether these proper nouns are mentioned in the human voice segment.
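A minimal sketch of such a keyword-list search over the text output by the speech recognition model (the keyword list is an invented example):

# Sketch: check which proper nouns from a keyword list appear in the
# recognized text.
KEYWORDS = ["XYZ car", "ABC phone", "free maintenance"]

def find_keywords(recognized_text, keywords=KEYWORDS):
    return [kw for kw in keywords if kw in recognized_text]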
In an alternative example, the voice recognition model is used for recognizing the voice fragments corresponding to the keywords, so that text recognition content is obtained.
In some embodiments, voice segments corresponding to the keywords are identified based on a voice identification model, text conversion is performed, and the voice segments are converted into corresponding text identification content.
Illustratively, the human voice segment in the audio of a certain car brand is: "This is a brand new XYZ car, bringing you an unprecedented driving experience." The input human voice segment is processed through the speech recognition model, and if the speech recognition model recognizes the keyword "XYZ car", the text recognition content is obtained: "This is a brand new XYZ car, bringing you an unprecedented driving experience.", wherein "XYZ car" is the proper noun corresponding to the keyword.
In summary, according to the method provided by the application, the human voice segment to be predicted is input into the speech recognition model, and the human voice segment to be predicted is recognized based on the speech recognition model, so that the text recognition content can be obtained. In one possible implementation, the speech recognition model can recognize the human voice segment as text recognition content by performing keyword recognition on the human voice segment. Detecting keywords in the human voice segments through the speech recognition model improves the auditing accuracy of audio advertisements.
It should be noted that the speech recognition model in the embodiments of the present application is a speech recognition model fine-tuned from the general speech recognition model, and the fine-tuned speech recognition model can improve the recognition rate of the audio.
Fig. 7 shows a flowchart of an audio processing method according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 3, step 230 may be implemented as steps 231, 232:
step 231: inputting text recognition content into a large language model;
illustratively, a large language model is a model that detects text recognition content.
Alternatively, the large language model is an open-source large language model, such as at least one of ChatGLM-6B, GPT (Generative Pre-trained Transformer), LLaMA, GLM (General Language Model), and the like.
Step 232: detecting the text recognition content based on the large language model to obtain the detection result of the text recognition content.
In some embodiments, after the text recognition content is detected based on the large language model, a detection result is obtained. The detection result includes a judgment and evaluation of whether the text recognition content is compliant.
In some embodiments, the input prompt is determined based on the text recognition content, the text recognition content is input into the large language model according to the input prompt, and the large language model detects the text recognition content according to the input prompt, thereby obtaining the detection result of the text recognition content.
The input prompt refers to a text segment provided as a starting input to the large language model when the detection result is generated. The input prompts may be determined based on the text recognition content in order to direct the large language model to generate a detection of the text recognition content.
Alternatively, the input prompt may be at least one of a keyword, phrase, or question description associated with the text recognition content for directing the large language model to generate a detection of the text recognition content.
Illustratively, the text recognition content is "A brand new XYZ car, equipped with an advanced power system, providing a top-level driving experience. Unique exterior design, attractive appearance. Purchase now to enjoy one year of free maintenance." An input prompt is determined based on the text recognition content; the input prompt may be a task description for the text recognition content, guiding the large language model to generate the detection result of the text recognition content. For example, the input prompt is as follows: "Text recognition content: 'A brand new XYZ car, equipped with an advanced power system, providing a top-level driving experience. Unique exterior design, attractive appearance. Purchase now to enjoy one year of free maintenance.' Please perform detection according to the text recognition content and output the detection result of the text recognition content."
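A minimal sketch of how such an input prompt could be assembled before being sent to the large language model (the template wording is an assumption, not the patent's prescribed text):

# Sketch: wrap the text recognition content in a task description that guides
# the large language model to output a detection result.
PROMPT_TEMPLATE = (
    "Text recognition content: '{content}' "
    "Please perform detection according to the text recognition content "
    "and output the detection result of the text recognition content."
)

def build_prompt(text_recognition_content: str) -> str:
    return PROMPT_TEMPLATE.format(content=text_recognition_content)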
In some embodiments, the prediction task of the large language model is a binary classification task or a multi-classification task. Alternatively, in a binary classification task, the detection result may be one of two categories, such as "offending advertisement" and "non-offending advertisement". In a multi-classification task, the detection result may be one of a plurality of categories, such as at least one of "make-up advertisement", "harassment advertisement", "dummy advertisement", and the like. The large language model outputs the corresponding detection result according to the input text recognition content.
In an alternative example, matching detection is performed on the text recognition content based on the large language model, so as to obtain a detection result after the text recognition content is matched.
Wherein the matching detection comprises at least one of: semantic detection; form detection; and content detection.
The semantic detection is to analyze semantic information of text recognition content and judge whether meaning and expression mode of the text recognition content accord with preset semantic requirements. For example, it is detected whether the text contains semantic features such as negative emotions, positive recommendations, etc. The form detection is to detect the form of the text recognition content, and mainly comprises matching detection in terms of word collocation and the like. For example, it is detected whether the text conforms to a preset vocabulary combination. The content detection is to perform matching detection on the text identification content and judge whether the text identification content contains preset information or keywords. For example, it is detected whether the text contains contents such as sensitive vocabulary, product name, etc.
In some embodiments, the matching detection of the analyzed text recognition content by the large language model may be implemented in different manners, such as at least one of semantic detection, formal detection, and content detection. The matching method is selected according to the requirements, and can be used in one mode alone or can be combined with a plurality of modes to carry out comprehensive detection. The present application is not limited in this regard.
Illustratively, suppose a test is performed to determine whether a piece of text recognition content meets the content requirements of an automobile advertisement. The text recognition content is: "A brand new XYZ car, equipped with an advanced power system, providing a top-level driving experience. Unique exterior design, attractive appearance. Purchase now to enjoy one year of free maintenance." The large language model can perform matching detection on this text recognition content: it performs semantic detection based on "advanced power system" and "top-level driving experience" and judges that the text recognition content belongs to the automotive field; it performs content detection based on the "XYZ car" mentioned in the text recognition content and can judge that the text recognition content is an automobile advertisement.
In summary, according to the method provided by the application, the text recognition content is input into the large language model, and the text recognition content is detected based on the large language model, so that the detection result of the text recognition content is obtained. In one possible implementation manner, the text recognition content is subjected to matching detection based on a large language model, the matching detection comprises at least one of semantic detection, form detection and content detection, and the large language model obtains a detection result after the text recognition content is matched according to the matching detection. The method greatly improves the accuracy of the large language model in detecting the text recognition content.
It should be noted that the large language model in the embodiments of the present application is a large language model fine-tuned from a general large language model, and the fine-tuned large language model can improve the detection rate of the audio.
It should be noted that, in the embodiment of the present application, audio of any one language may be detected, and a detection result corresponding to the audio may be obtained.
Training for audio models:
fig. 8 shows a flowchart of an audio processing method according to an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 310: acquiring a first data sample and a second data sample;
The first data sample comprises a data set of advertising content and is used to train the speech recognition model. The second data sample follows a preset data format that indicates the input of the second data sample, and the second data sample is used to train the large language model.
Step 320: inputting the first data sample into a general speech recognition model for training to obtain a speech recognition model;
the generic speech recognition model is a model that has been trained on a large scale and can convert speech into text. The generic speech recognition model learns acoustic features and speech patterns from a large-scale speech data set, thereby realizing recognition of speech.
In some embodiments, the generic speech recognition model is typically trained from a large number of speech data sets that contain various speech samples. Illustratively, the different speech samples encompass at least one of different types of human voice, different speech rates, different tones, and background noise, among others. The generic speech recognition model may convert the input speech into a corresponding text output. Alternatively, the generic speech recognition model may recognize and understand the content of a variety of speech, including at least one of words, phrases, sentences, and the like.
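As a concrete illustration only (this application does not name a particular model), an openly available speech recognition model such as Whisper can play the role of a generic speech recognition model; the model size and file path below are assumptions.

import whisper

model = whisper.load_model("base")               # load a pre-trained generic speech recognition model
result = model.transcribe("voice_segment.wav")   # recognize the speech in an audio file
print(result["text"])                            # the corresponding text output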
In some embodiments, fine tuning the generic speech recognition model means retraining the generic speech recognition model using a data set of a domain or task to adapt the generic speech recognition model to the speech recognition needs of the domain or task.
In some embodiments, with advertising content serving as the first data sample, fine-tuning is performed by inputting the first data sample into the generic speech recognition model. During fine-tuning, the speech recognition model gradually learns and improves its recognition accuracy for advertising content. By fine-tuning on advertising content, the generic speech recognition model becomes better suited to recognizing human voice segments that contain advertising content. When the fine-tuned speech recognition model is used to recognize the voice segment to be predicted, it can more accurately transcribe the text recognition content corresponding to that voice segment.
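An illustrative sketch of what such a first data sample could look like is given below; the file paths, field names, and advertisement sentences are assumptions made purely for illustration.

# Illustrative structure of a first data sample: human voice segments that
# contain advertising content, paired with reference transcripts.
first_data_sample = [
    {"audio": "ads/clip_0001.wav",
     "text": "Brand new XYZ automobile, buy now and enjoy a discount."},
    {"audio": "ads/clip_0002.wav",
     "text": "Limited-time promotion with one year of free maintenance included."},
]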
Step 330: and inputting the second data sample into the universal large language model for training to obtain the large language model.
The general large language model is a model which is trained on a large scale and has the capability of understanding and generating natural language.
In some embodiments, the generic large language model is typically trained from a large number of text data sets. Alternatively, the generic large language model can understand the semantics, context, and sentence structure of the input text and analyze the input text; alternatively, the generic large language model can retrieve relevant information from a large corpus and generate corresponding answers based on keywords or questions in the input text. The input of the generic large language model may be at least one of a text, a question, a sentence, etc. The output of the generic large language model is the result of processing the input text.
In some embodiments, fine-tuning the generic large language model means retraining the generic large language model using a dataset of a domain or task to adapt the generic large language model to the language processing needs of the domain or task.
In some embodiments, the second data sample is input into the generic large language model according to the preset data format for retraining, so as to fine-tune the large language model and make it better suited to understanding text recognition content and outputting the corresponding detection result for the text recognition content.
In some embodiments, the second data sample may comprise a labeled data set.
Alternatively, in a binary classification task, where the detection result is one of two classes, namely "offending advertisement" and "non-offending advertisement", the second data sample needs to contain both a positive sample data set and a negative sample data set. The positive sample data set refers to a data set whose detection results are "non-offending advertisement", and the negative sample data set refers to a data set whose detection results are "offending advertisement". The labeled data set may be used to fine-tune the large language model.
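A minimal sketch of such a labeled data set is shown below, using the prompt/label style of the preset data format described later in this embodiment; the example sentences are invented for illustration only.

# Illustrative labeled data set for the binary classification task.
positive_samples = [  # detection result: "non-offending advertisement"
    {"prompt": "The brand new XYZ automobile is equipped with an advanced power system.",
     "label": "non-offending advertisement"},
]
negative_samples = [  # detection result: "offending advertisement"
    {"prompt": "Guaranteed cure within three days, effective for everyone.",
     "label": "offending advertisement"},
]
second_data_sample = positive_samples + negative_samples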
In some embodiments, the fine-tuning process inputs the sample data set into the large language model for training and adjusts the parameters of the large language model so that the detection result of the text recognition content is judged more accurately. During fine-tuning, the pre-training capability and context understanding capability of the large language model can be leveraged, thereby improving the recognition accuracy of the large language model for text recognition content.
In summary, according to the method provided by the application, the first data sample and the second data sample are obtained; the speech recognition model is fine-tuned using the first data sample to obtain the fine-tuned speech recognition model, and the large language model is fine-tuned using the second data sample to obtain the fine-tuned large language model. The fine-tuned speech recognition model improves the understanding and recognition of human voice segments and is better suited to the recognition requirements of advertising content; the fine-tuned large language model improves the detection of text recognition content and is better suited to the detection requirements of advertising content. By optimizing and fine-tuning the two key components, namely the speech recognition model and the large language model, the audio auditing system can perform audio auditing more accurately and smoothly.
Fig. 9 shows a flowchart of a model training method of an audio processing model according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 8, step 320 may be implemented as steps 321, 322:
step 321: inputting the first data sample into a general speech recognition model to obtain a first prediction result;
wherein the first data sample comprises a data set of advertising content and the first prediction is a prediction of advertising content.
Alternatively, the first data sample may include various types of advertising content from different industries, such as product introductions, promotional information, and branding. Alternatively, the advertising content may contain industry terms, brand names, product features, and the like.
In some embodiments, the first data sample is a data set of human voice segments containing advertising content.
In some embodiments, a first data sample is acquired, the first data sample is input into the generic speech recognition model, and a first prediction result is output; the generic speech recognition model is then trained based on the error between the first data sample and the first prediction result to obtain the fine-tuned speech recognition model.
Step 322: model parameters of the generic speech recognition model are updated based on an error between the first data sample and the first prediction result.
In some embodiments, the first data sample is input into a general speech recognition model, a first prediction result corresponding to the first data sample is output, the first prediction result is compared with the first data sample, and a loss value is obtained, wherein the loss value is used for indicating an error between the first prediction result and the first data sample.
In some embodiments, a loss value between the first prediction result and the first data sample is calculated by a loss function, whereby model parameters of the generic speech recognition model are updated according to the loss value.
Optionally, the loss function includes at least one of the following loss function types: a squared error loss function (L2 loss), a regression loss function (L1 loss), a cross-entropy loss function, and the like.
Schematically, referring to fig. 10, a schematic diagram of the training process of the generic speech recognition model according to an exemplary embodiment of the present application is shown. As shown in fig. 10, a first data sample 501 is acquired and input into the generic speech recognition model 502, which outputs a first prediction result 503. The squared error loss between the first prediction result 503 and the first data sample 501 is calculated through the L2 loss function, and the generic speech recognition model 502 is trained accordingly to obtain the fine-tuned speech recognition model 504. The speech recognition model 504 is used to convert the human voice segment to be detected into text recognition content.
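A minimal PyTorch sketch of one such fine-tuning step is shown below; the placeholder linear model, tensor shapes, and optimizer settings are assumptions for illustration and would be replaced by the actual speech recognition model and its input features.

import torch
import torch.nn as nn

generic_speech_recognition_model = nn.Linear(80, 80)   # placeholder standing in for model 502
optimizer = torch.optim.Adam(generic_speech_recognition_model.parameters(), lr=1e-4)
l2_loss = nn.MSELoss()                                  # squared error (L2) loss function

def fine_tune_step(features: torch.Tensor, reference: torch.Tensor) -> float:
    optimizer.zero_grad()
    prediction = generic_speech_recognition_model(features)  # first prediction result 503
    loss = l2_loss(prediction, reference)                     # error against first data sample 501
    loss.backward()
    optimizer.step()                                          # update model parameters
    return loss.item()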
It should be noted that, the training method of the general speech recognition model provided in this embodiment is only schematically illustrated, and does not limit the training method of the general speech recognition model.
In summary, according to the method provided by the application, the first data sample is input into the generic speech recognition model to obtain a prediction result for the first data sample, where the first data sample comprises a data set of advertising content. During training, the model parameters of the generic speech recognition model are updated according to the error between the first data sample and the first prediction result, so that the generic speech recognition model can better adapt to the recognition requirements of advertising content, improving the performance and generalization capability of the speech recognition model.
FIG. 11 illustrates a flow chart of a model training method for an audio processing model provided in an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 8, step 330 may be implemented as steps 331 and 332:
step 331: inputting the first field and the second field into a general large language model based on a preset data format to obtain a second prediction result;
the first field is used for indicating the input content of the text identification content, and the second field is used for indicating the detection result of the text identification content.
The preset data format is used to indicate the input of the second data sample.
In some embodiments, the generic large language model may be fine-tuned based on a second data sample that includes the first field and the second field. Optionally, the preset data format is as follows:
{ prompt: text identification content, label: detection result }
The prompt field indicates the input content, namely the text recognition content, which may be the advertisement content to be recognized; the label field represents the detection result of the text recognition content. The prompt is used to guide the input of the generic large language model, and the label is used to supervise the output of the generic large language model.
In some embodiments, by training the generic large language model using such prompt-label pairs as the second data samples, the generic large language model can learn how to output the corresponding detection result based on the input text recognition content.
Step 332: model parameters of the generic large language model are updated based on errors between the second data samples and the second prediction result.
In some embodiments, the second data sample is input into a generic large language model, a second prediction result corresponding to the second data sample is output, the second prediction result is compared with a second field in the second data sample, and a loss value is obtained, where the loss value is used to indicate an error between the second prediction result and the second data sample.
In some embodiments, a loss value between the second prediction result and the second field in the second data sample is calculated by a loss function, whereby model parameters of the generic large language model are updated according to the loss value.
Optionally, the loss function includes at least one of the following loss function types: a squared error loss function (L2 loss), a regression loss function (L1 loss), a cross-entropy loss function, and the like.
Schematically, referring to fig. 12, a schematic diagram of the training process of the generic large language model provided in an exemplary embodiment of the present application is shown. As shown in fig. 12, a second data sample 601 is acquired and input into the generic large language model 602, which outputs a second prediction result 603. The squared error loss between the second prediction result 603 and the second field in the second data sample 601 is calculated through the L2 loss function, and the generic large language model 602 is trained accordingly to obtain the fine-tuned large language model 604. The large language model 604 is used to detect text recognition content.
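A minimal sketch of one such fine-tuning step is shown below. It uses the Hugging Face transformers library and token-level cross-entropy, one of the loss options listed above, because the target is text; the "gpt2" checkpoint, the masking scheme, and the optimizer settings are assumptions for illustration rather than anything prescribed by this application.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")        # stands in for model 602
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fine_tune_step(sample: dict) -> float:
    # sample follows the preset data format: {"prompt": ..., "label": ...}
    prompt_ids = tokenizer(sample["prompt"], return_tensors="pt").input_ids
    label_ids = tokenizer(sample["label"], return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100                 # compute the loss only on the label field
    optimizer.zero_grad()
    outputs = model(input_ids=input_ids, labels=labels)     # second prediction result 603
    outputs.loss.backward()                                  # error against the second field
    optimizer.step()
    return outputs.loss.item()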
The training method of the general large language model provided in this embodiment is only schematically illustrated, and does not limit the training method of the general large language model.
In summary, according to the method provided by the application, the second data sample is input into the generic large language model based on the preset data format to obtain a prediction result for the second data sample. By training the generic large language model with the second data sample and updating the model parameters, the accuracy with which the large language model outputs detection results for text recognition content can be improved. The fine-tuned large language model can automatically classify and judge text recognition content.
It will be appreciated by those skilled in the art that the above embodiments may be implemented independently, or the above embodiments may be freely combined to form new embodiments to implement the audio processing method of the present application.
Fig. 13 is a block diagram illustrating an audio processing apparatus according to an exemplary embodiment of the present application. The apparatus has the function of implementing the above audio processing method examples, and the function may be implemented by hardware or by hardware executing corresponding software. The apparatus may be the server described above or may be provided in the server. As shown in fig. 13, the apparatus 1500 may include: an acquisition module 1510, an identification module 1520, and a detection module 1530:
An obtaining module 1510, configured to obtain a voice segment to be predicted in audio;
an identifying module 1520, configured to identify the voice segment to be predicted, to obtain text identifying content, where the text identifying content is text content corresponding to the voice segment to be predicted;
and the detection module 1530 is configured to detect the text recognition content, and obtain a detection result of the text recognition content.
In some embodiments, the recognition module 1520 includes an input sub-module and a recognition sub-module:
the input sub-module is used for inputting the voice fragment to be predicted into a voice recognition model;
and the recognition sub-module is used for recognizing the voice fragments to be predicted based on the voice recognition model to obtain the text recognition content.
In some embodiments, the recognition submodule includes a recognition unit:
the identification unit is used for identifying the keywords in the voice segments to be predicted to obtain voice segments corresponding to the keywords;
and the recognition unit is used for recognizing the voice fragments corresponding to the keywords based on the voice recognition model to obtain the text recognition content.
In some embodiments, the detection module 1530 includes an input sub-module and a detection sub-module:
An input sub-module for inputting the text recognition content into a large language model;
and the detection sub-module is used for detecting the text recognition content based on the large language model to obtain a detection result of the text recognition content.
In some embodiments, the detection sub-module further comprises a detection sub-unit:
the detection subunit is used for carrying out matching detection on the text recognition content based on the large language model to obtain a detection result after the text recognition content is matched;
wherein the match detection comprises at least one of: semantic detection; form detection; content detection.
In some embodiments, the acquisition module 1510 further includes an acquisition sub-module.
An acquisition sub-module for acquiring a first data sample comprising a data set of advertising content and a second data sample comprising a preset data format for indicating an input of the second data sample;
in some embodiments, apparatus 1500 further comprises a training module.
The training module is used for inputting the first data sample into a general speech recognition model for training to obtain a fine-tuned speech recognition model;
And the training module is used for inputting the second data sample into the universal large language model for training to obtain the trimmed large language model.
In some embodiments, the training module includes an input sub-module and an update sub-module:
the input sub-module is used for inputting the first data sample into the general speech recognition model to obtain a first prediction result, wherein the first prediction result is a prediction result of the advertisement content;
and the updating sub-module is used for updating the model parameters of the universal voice recognition model based on the error between the first data sample and the first prediction result.
The input sub-module is used for inputting the first field and the second field into the universal large language model based on the preset data format to obtain a second prediction result, wherein the second prediction result is a prediction result of the second data sample;
an updating sub-module for updating model parameters of the generic large language model based on an error between the second data sample and the second prediction result;
the first field is used for indicating the input content of the text recognition content, and the second field is used for indicating the detection result of the text recognition content.
In some embodiments, the acquisition module 1510 includes an output sub-module and an extraction sub-module:
the output sub-module is used for detecting the audio based on the neural network model and outputting a voice segment in the audio;
and the extraction sub-module is used for extracting the voice segment based on the voice separation model to obtain the voice segment to be predicted.
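A minimal sketch of how the module division of apparatus 1500 might be mirrored in code is shown below; the class name, method names, and the interfaces of the stand-in components are assumptions for illustration.

class AudioProcessingApparatus:
    def __init__(self, vad_model, separation_model, speech_model, llm):
        self.vad_model = vad_model                # neural network model for detecting voice in audio
        self.separation_model = separation_model  # voice separation model
        self.speech_model = speech_model          # fine-tuned speech recognition model
        self.llm = llm                            # fine-tuned large language model

    def acquire(self, audio):
        # Acquisition module 1510: obtain the voice segment to be predicted.
        voice_segment = self.vad_model.detect(audio)
        return self.separation_model.extract(voice_segment)

    def recognize(self, segment):
        # Recognition module 1520: convert the segment into text recognition content.
        return self.speech_model.transcribe(segment)

    def detect(self, text_recognition_content):
        # Detection module 1530: obtain the detection result of the text recognition content.
        return self.llm.classify(text_recognition_content)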
It should be noted that, when the apparatus provided in the foregoing embodiments performs its functions, the division into the above functional modules is merely used as an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to perform all or part of the functions described above.
With respect to the apparatus in the above embodiments, the specific manner in which the respective modules perform the operations has been described in detail in the embodiments regarding the method; the technical effects achieved by the execution of the operations by the respective modules are the same as those in the embodiments related to the method, and will not be described in detail herein.
The embodiment of the application also provides a computer device, which comprises: a processor and a memory, the memory storing a computer program; the processor is configured to execute the computer program in the memory to implement the audio processing method or the model training method of the audio processing model provided in the above method embodiments. Optionally, the computer device is a server.
Illustratively, fig. 14 is a block diagram of a server provided in an exemplary embodiment of the present application.
In general, the server 2300 includes: a processor 2301 and a memory 2302.
The processor 2301 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 2301 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 2301 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also referred to as a central processing unit (Central Processing Unit, CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 2301 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU) responsible for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 2301 may also include an artificial intelligence (Artificial Intelligence, AI) processor for processing computing operations related to machine learning.
Memory 2302 may include one or more computer-readable storage media, which may be non-transitory. Memory 2302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 2302 is used to store at least one instruction for execution by processor 2301 to implement the audio processing methods provided by the method embodiments herein.
In some embodiments, server 2300 may further optionally include: an input interface 2303 and an output interface 2304. The processor 2301 and the memory 2302 may be connected to the input interface 2303 and the output interface 2304 through buses or signal lines. The respective peripheral devices may be connected to the input interface 2303 and the output interface 2304 through buses, signal lines, or a circuit board. Input interface 2303, output interface 2304 may be used to connect at least one Input/Output (I/O) related peripheral device to processor 2301 and memory 2302. In some embodiments, the processor 2301, memory 2302, and input interface 2303, output interface 2304 are integrated on the same chip or circuit board; in some other embodiments, the processor 2301, the memory 2302, and either or both of the input interface 2303 and the output interface 2304 may be implemented on separate chips or circuit boards, which are not limited in this application.
Those skilled in the art will appreciate that the structures shown above are not limiting of server 2300 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a computer program product is also provided. The computer program product comprises computer instructions stored in a computer-readable storage medium. The processor of the computer device reads and executes the computer instructions from the computer-readable storage medium to implement the audio processing method provided by the above-mentioned method embodiments.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein a computer program that is loaded and executed by a processor to implement the audio processing method provided by the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (13)

1. A method of audio processing, the method comprising:
acquiring a voice segment to be predicted in audio;
identifying the voice segment to be predicted to obtain text recognition content, wherein the text recognition content is text content corresponding to the voice segment to be predicted;
and detecting the text recognition content to obtain a detection result of the text recognition content.
2. The method according to claim 1, wherein the identifying the voice segment to be predicted to obtain text recognition content includes:
inputting the voice segment to be predicted into a speech recognition model;
and identifying the voice segment to be predicted based on the speech recognition model to obtain the text recognition content.
3. The method according to claim 2, wherein the identifying the voice segments to be predicted based on the speech recognition model to obtain the text recognition content includes:
identifying keywords in the voice segments to be predicted to obtain voice segments corresponding to the keywords;
and identifying the voice segments corresponding to the keywords based on the speech recognition model to obtain the text recognition content.
4. A method according to any one of claims 1 to 3, wherein the detecting the text recognition content to obtain the detection result of the text recognition content includes:
inputting the text recognition content into a large language model;
And detecting the text recognition content based on the large language model to obtain a detection result of the text recognition content.
5. The method of claim 4, wherein the detecting the text recognition content based on the large language model to obtain the detection result of the text recognition content comprises:
performing matching detection on the text recognition content based on the large language model to obtain a detection result after the text recognition content is matched;
wherein the match detection comprises at least one of: semantic detection; form detection; content detection.
6. The method according to any one of claims 1 to 5, further comprising:
obtaining a first data sample and a second data sample, wherein the first data sample comprises a data set of advertisement content, the second data sample comprises a preset data format, and the preset data format is used for indicating the input of the second data sample;
inputting the first data sample into a general speech recognition model for training to obtain the speech recognition model;
and inputting the second data sample into a general large language model for training to obtain the large language model.
7. The method of claim 6, wherein the inputting the first data sample into a general speech recognition model for training to obtain the speech recognition model comprises:
inputting the first data sample into the general speech recognition model to obtain a first prediction result, wherein the first prediction result is a prediction result of the advertisement content;
model parameters of the generic speech recognition model are updated based on an error between the first data sample and the first prediction result.
8. The method of claim 6, wherein the second data sample comprises a first field and a second field;
the step of inputting the second data sample into a general large language model for training to obtain the large language model comprises the following steps:
inputting the first field and the second field into the general large language model based on the preset data format to obtain a second prediction result, wherein the second prediction result is a prediction result of the second data sample;
updating model parameters of the generic large language model based on an error between the second data sample and the second prediction result;
The first field is used for indicating the input content of the text recognition content, and the second field is used for indicating the detection result of the text recognition content.
9. The method according to any one of claims 1 to 8, wherein the acquiring a voice segment to be predicted in audio comprises:
detecting the audio based on a neural network model, and outputting a voice segment in the audio;
and extracting the voice segment based on a voice separation model to obtain the voice segment to be predicted.
10. An audio processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring the voice segment to be predicted in the audio;
the recognition module is used for recognizing the voice segment to be predicted to obtain text recognition content, wherein the text recognition content is text content corresponding to the voice segment to be predicted;
and the detection module is used for detecting the text recognition content to obtain a detection result of the text recognition content.
11. A computer device, the computer device comprising: a processor and a memory, wherein at least one section of program is stored in the memory; the processor is configured to execute the at least one program in the memory to implement the audio processing method according to any one of claims 1 to 9.
12. A computer readable storage medium having stored therein executable instructions that are loaded and executed by a processor to implement the audio processing method of any of the preceding claims 1 to 9.
13. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, from which a processor reads and executes them to implement the audio processing method according to any of the preceding claims 1 to 9.
CN202311841788.4A 2023-12-28 2023-12-28 Audio processing method, device, equipment and storage medium Pending CN117809655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311841788.4A CN117809655A (en) 2023-12-28 2023-12-28 Audio processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311841788.4A CN117809655A (en) 2023-12-28 2023-12-28 Audio processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117809655A true CN117809655A (en) 2024-04-02

Family

ID=90433523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311841788.4A Pending CN117809655A (en) 2023-12-28 2023-12-28 Audio processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117809655A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118248133A (en) * 2024-05-27 2024-06-25 暗物智能科技(广州)有限公司 Two-stage speech recognition method, device, computer equipment and readable storage medium
CN118248133B (en) * 2024-05-27 2024-09-20 暗物智能科技(广州)有限公司 Two-stage speech recognition method, device, computer equipment and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination