CN116564294A - Training method and device for noise identification model

Info

Publication number: CN116564294A
Application number: CN202310467856.9A
Authority: CN (China)
Prior art keywords: audio, training, marking, noise, model
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 崔午阳
Assignee: Jingdong Technology Information Technology Co Ltd
Priority date / filing date: 2023-04-27
Publication date: 2023-08-08
Application filed by Jingdong Technology Information Technology Co Ltd

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a training method and device for a noise recognition model, and relates to the technical field of big data. One embodiment of the method comprises the following steps: determining, according to an interaction-scene division rule, the interaction scene corresponding to each acquired audio segment; for each audio segment, obtaining the marking rule corresponding to its interaction scene and marking the audio segment according to that rule; performing feature extraction on each marked audio segment to obtain a training feature set; and performing model training based on the training feature set to generate a noise recognition model, which is used for noise recognition on the audio segments generated during voice interaction. By dividing audio into interaction scenes and marking according to the marking rules of the different interaction scenes before training the noise recognition model, this embodiment improves marking accuracy, balances the recognition quality of the speech and noise dimensions, and improves the training effect and recognition accuracy of the noise recognition model.

Description

Training method and device for noise identification model
Technical Field
The invention relates to the technical field of big data, and in particular to a training method and device for a noise recognition model.
Background
In an intelligent voice customer-service robot system, audio data must undergo pre-processing by a noise recognition model, and the standard for evaluating such a model is to compute accuracy separately along the two dimensions of speech and noise. In the current training method for noise recognition models, training data are generated by manual annotation, and the noise recognition model is trained on those data.
In the process of implementing the present invention, the inventor found that the prior art has at least the following problems:
manual annotation is highly subjective, and the recognition quality of the speech and noise dimensions cannot be balanced at the same time, so the training effect and recognition accuracy of the noise recognition model are poor.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a training method and device for a noise recognition model. By dividing audio into interaction scenes, marking it according to the marking rules of the different interaction scenes, and generating training data to train the noise recognition model, they improve marking accuracy while balancing the recognition quality of the speech and noise dimensions, thereby improving the training effect and recognition accuracy of the noise recognition model.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a training method of a noise recognition model.
A training method of a noise recognition model, comprising: determining, according to an interaction-scene division rule, the interaction scene corresponding to each acquired audio segment; for each audio segment, obtaining the marking rule corresponding to the interaction scene of the audio segment, and marking the audio segment according to the marking rule; performing feature extraction on each marked audio segment to obtain a training feature set; and performing model training based on the training feature set to generate a noise recognition model, which is used for noise recognition on audio segments generated during voice interaction.
Optionally, before the interaction scene corresponding to the acquired audio segment is determined according to the interaction-scene division rule, the method further includes: determining that the audio segment includes user speech audio and that the ratio of the length of the user speech audio to the length of the audio segment is greater than a preset threshold.
Optionally, the interaction scene comprises a breaking scene in which the user obtains the right to speak while customer service is speaking; in the case of the breaking scene, marking the audio segment according to the marking rule includes: marking the audio segment as noise in the case that the audio segment comprises background noise audio or background user speech audio.
Optionally, performing feature extraction on each marked audio segment includes: extracting a clean speech audio segment from each marked audio segment, and dividing the clean speech audio segment according to a preset audio-segment-length threshold to obtain a plurality of audio split segments; and performing feature extraction on each audio split segment to generate feature information.
Optionally, the interaction scene comprises a filtering scene for validity recognition of user speech; in the case of the filtering scene, marking the audio segment according to the marking rule includes: marking the audio segment as speech in the case that the audio segment comprises background user speech audio.
Optionally, performing feature extraction on each marked audio segment includes: extracting the clean speech audio segment in each marked audio segment, and performing feature extraction on the clean speech audio segment to generate feature information.
Optionally, the training feature set includes feature information and marking results, the feature information having features of a plurality of dimensions; performing model training based on the training feature set to generate the noise recognition model includes: performing model training with the feature information as training input and the marking results as training targets to obtain model parameters, and generating an importance ranking of the dimension features; and adjusting the model parameters according to the importance ranking and performing model training again based on the training feature set until the recognition accuracy of the model meets a preset requirement, taking the model whose recognition accuracy meets the preset requirement as the noise recognition model.
According to another aspect of the embodiment of the invention, a training device for a noise identification model is provided.
A training device for a noise recognition model, comprising: an interaction scene determining module, configured to determine, according to an interaction-scene division rule, the interaction scene corresponding to each acquired audio segment; a marking module, configured to obtain, for each audio segment, the marking rule corresponding to the interaction scene of the audio segment, and to mark the audio segment according to the marking rule; a training feature set generating module, configured to perform feature extraction on each marked audio segment to obtain a training feature set; and a model training module, configured to perform model training based on the training feature set to generate a noise recognition model, the noise recognition model being used for noise recognition on audio segments generated during voice interaction.
Optionally, the device further comprises an audio segment determining module configured to: determine that the audio segment includes user speech audio and that the ratio of the length of the user speech audio to the length of the audio segment is greater than a preset threshold.
Optionally, the interaction scene comprises a breaking scene in which the user obtains the right to speak while customer service is speaking; in the case of the breaking scene, the marking module is further configured to mark the audio segment as noise in the case that the audio segment comprises background noise audio or background user speech audio.
Optionally, the training feature set generating module is further configured to: extract a clean speech audio segment from each marked audio segment, and divide the clean speech audio segment according to a preset audio-segment-length threshold to obtain a plurality of audio split segments; and perform feature extraction on each audio split segment to generate feature information.
Optionally, the interaction scene comprises a filtering scene for validity recognition of user speech; in the case of the filtering scene, the marking module is further configured to mark the audio segment as speech in the case that the audio segment comprises background user speech audio.
Optionally, the training feature set generating module is further configured to: extract the clean speech audio segment in each marked audio segment, and perform feature extraction on the clean speech audio segment to generate feature information.
Optionally, the training feature set includes feature information and marking results, the feature information having features of a plurality of dimensions; the model training module is further configured to: perform model training with the feature information as training input and the marking results as training targets to obtain model parameters, and generate an importance ranking of the dimension features; and adjust the model parameters according to the importance ranking and perform model training again based on the training feature set until the recognition accuracy of the model meets a preset requirement, taking the model whose recognition accuracy meets the preset requirement as the noise recognition model.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the training method of a noise recognition model provided by the embodiments of the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method of training a noise recognition model provided by an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: the interaction scene corresponding to each acquired audio segment is determined according to the interaction-scene division rule; for each audio segment, the marking rule corresponding to its interaction scene is obtained and the audio segment is marked according to that rule; feature extraction is performed on each marked audio segment to obtain a training feature set; and model training is performed based on the training feature set to generate a noise recognition model used for noise recognition on audio segments generated during voice interaction. By dividing audio into interaction scenes and marking according to the marking rules of the different interaction scenes, training data are generated to train the noise recognition model; this improves marking accuracy, balances the recognition quality of the speech and noise dimensions, and improves the training effect and recognition accuracy of the noise recognition model.
Further effects of the above optional implementations are described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a training method of a noise recognition model according to one embodiment of the present invention;
FIG. 2 is an interaction schematic diagram of a breaking scene according to one embodiment of the present invention;
FIG. 3 is an interaction schematic diagram of a filtering scene according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of audio clip marking according to one embodiment of the present invention;
FIG. 5 is a flow diagram of a training method of a noise recognition model according to one embodiment of the invention;
FIG. 6 is a schematic diagram of the main blocks of a training apparatus of a noise recognition model according to one embodiment of the invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 8 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical scheme of the invention, the acquisition, collection, updating, analysis, processing, use, transmission, and storage of users' personal information all comply with the relevant laws and regulations, are carried out for lawful purposes, and do not violate public order and good customs. Necessary measures are taken to protect users' personal information, prevent illegal access to users' personal data, and safeguard users' personal information security, network security, and national security.
With the rapid iteration of artificial-intelligence speech and natural-language-processing technology in recent years, the value of intelligent voice customer service in replacing traditional human customer service and handling high-volume core interaction scenarios, such as bill inquiries and outbound calls, has become increasingly prominent. The speech recognition results of an intelligent voice customer-service robot system usually require a series of audio pre-processing steps to improve the system's recognition accuracy. During this pre-processing, audio enhancement is performed with signal-processing techniques; the core of the noise-reduction stage of audio enhancement is the noise recognition model, so the accuracy of the noise recognition model directly determines whether the whole intelligent voice dialogue can proceed normally.
At present, effect tuning of the noise model is iterated in two directions. One common method is noise-confidence scoring based on an acoustic model and a neural-network engine, in which each frame of audio is judged to be speech or noise and assigned a corresponding confidence score. Another reliable scheme is post-processing on top of the noise model: after the noise model computes its score, additional types of feature information are introduced and the result is re-scored and re-predicted, enabling further optimization of the effect in different scenes.
However, with either approach, the standard for evaluating the noise model requires computing accuracy/recall separately along the two dimensions of speech and noise and assessing the overall effect of the model comprehensively. Whichever scheme is used for technical iteration, it has been difficult to escape the trade-off in noise-model tuning: when the speech metric improves, the noise metric usually drops, and even with threshold adjustment the overall metrics are hard to improve substantially. To obtain good results on both speech and noise, more data must be introduced, but this leads to over-fitting of the training parameters.
The root cause of this failure to balance the speech and noise performance is that, when the data set for noise-model training is annotated, some annotation standards are hard to define, and the resulting uncertainty between the audio and its labels has a negative impact on model training. The invention therefore aims to solve the technical problem that the speech and noise performance cannot be balanced simultaneously, and provides a model training method based on multi-scene classified annotation.
FIG. 1 is a schematic diagram of the main steps of a training method of a noise recognition model according to one embodiment of the present invention.
As shown in fig. 1, the training method of the noise recognition model according to an embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: determine, according to the interaction-scene division rule, the interaction scene corresponding to the acquired audio segment. The interaction scene may include a breaking scene, in which the user obtains the right to speak while customer service is speaking, and a filtering scene, in which the validity of the user's speech is recognized.
Specifically, the breaking scene is a scene in which the user side (the person on the phone) wants to obtain the right to speak while the intelligent voice customer service is speaking; the filtering scene is a scene in which the validity of the user side's speech content is recognized.
While voice customer service is speaking, the user may at any time interrupt the customer service's ongoing utterance and insert the content he or she wants to express. In intelligent human-machine dialogue this type of scene is called a "breaking scene", i.e., the user side forcibly interrupts the system broadcast while the intelligent robot is playing its utterance.
After voice customer service finishes speaking, the system waits for the user side to speak and produces the next round of replies according to the user's speech content. In intelligent human-machine dialogue, the validity of the user side's speech affects natural-language understanding and intention recognition, so during the dialogue it is necessary to judge whether the current user's speech content is meaningful, whether it comes from the main speaker, and whether it is merely background noise. This type of user-side speech-content recognition scene is called a "filtering scene".
In one embodiment, before the interaction scene corresponding to the acquired audio segment is determined according to the interaction-scene division rule, the method further includes: determining that the audio segment includes user speech audio and that the ratio of the length of the user speech audio to the length of the audio segment is greater than a preset threshold.
Specifically, it is first determined whether the audio segment contains user speech audio (i.e., human voice). If the ratio of the length of the user speech audio to the length of the audio segment is greater than a preset threshold (e.g., 10%), the audio segment is marked as speech; if the ratio does not reach the preset threshold, or the audio segment contains no user speech audio at all, the audio segment is marked as noise. In a VAD (Voice Activity Detection) algorithm, when the audio contains large silence regions, the VAD excludes most of those silence regions from the segments it cuts, so when labeling audio segments they must first be distinguished according to the proportion of user speech audio they contain.
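To make this pre-check concrete, the following is a minimal Python sketch of the voice-ratio rule described above. The 10% threshold matches the example in the text; the interval format, function name, and helper structure are hypothetical illustrations rather than the patent's actual implementation.

```python
from typing import List, Tuple

def pre_label_by_voice_ratio(
    voice_intervals: List[Tuple[float, float]],  # VAD-detected user-speech spans (s)
    segment_length: float,                       # total audio-segment length (s)
    threshold: float = 0.10,                     # preset ratio threshold (10%)
) -> str:
    """Coarse label applied before the scene-specific marking of step S102."""
    voice_len = sum(end - start for start, end in voice_intervals)
    if voice_len > 0 and voice_len / segment_length > threshold:
        return "speech"  # enough user speech: goes on to scene-based marking
    return "noise"       # no user speech, or too little of it

# 0.3 s of voice in a 2.0 s segment -> ratio 0.15 > 0.10 -> "speech"
print(pre_label_by_voice_ratio([(0.5, 0.8)], 2.0))
```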
Step S102: for each audio segment, obtain the marking rule corresponding to the interaction scene of the audio segment, and mark the audio segment according to the marking rule.
FIG. 2 is an interaction schematic diagram of a breaking scene according to one embodiment of the invention.
As shown in fig. 2, in the breaking scene, if the user side makes a sound while the intelligent customer-service robot is broadcasting speech, the noise recognition model must identify the current user-side sound to determine whether it represents the user's real speech and intention. If the noise recognition model judges it to be speech, the user wants to interrupt the intelligent customer-service robot's utterance; if the model judges it to be noise, there is only ambient background noise or loud clamor around the user, not speech with which the user wants to interrupt customer service. Therefore, in the breaking scene, if the noise recognition model misjudges noise as speech, customer-service speech will be interrupted frequently, which is unacceptable and should be avoided as much as possible. If the model misjudges speech as noise, only part of the user-side speech fails to interrupt the broadcast, and the user had no real intention to interrupt, so this case is acceptable in a breaking scene. For the breaking scene, the optimization emphasis is therefore on improving the noise recognition model's prediction accuracy for speech.
In the case of the breaking scene, marking the audio segment according to the marking rule may include: marking the audio segment as noise in the case that the audio segment includes background noise audio or background user speech audio.
Specifically, because each audio segment involves many complications during annotation, the two cases of background noise and background human voice need to be further refined. In the voice-robot breaking scene, frequent false interruptions strongly affect system interaction, so when training data biased toward the breaking scene are generated, the annotations of audio segments that include background noise or the background voices of non-main speakers (i.e., background user speech audio) must be changed to noise.
FIG. 3 is an interaction schematic diagram of a filtering scene according to one embodiment of the present invention.
As shown in fig. 3, in the filtering scene, after the user finishes the current round of speech, the noise recognition model must identify the sound made by the user side in that round to determine whether the current user actually spoke. If the model judges it to be speech, the user's current dialogue is proceeding normally and the customer-service question should be answered; if the model judges it to be noise, the current user-side sound is not real human speech but is caused by background noise. Thus, in the filtering scene, if the noise recognition model misjudges speech as noise, the speech spoken by the current user cannot be transcribed into text by ASR (Automatic Speech Recognition), causing serious problems in the subsequent intention recognition of NLP (Natural Language Processing); this is unacceptable and should be avoided as much as possible. If the model misjudges noise as speech, then even when the user did not actually speak, the current round is transcribed into text by ASR and can still be filtered out by NLU (Natural Language Understanding) as meaningless, so normal system interaction is not affected. For the filtering scene, the optimization emphasis is therefore on improving the noise recognition model's prediction accuracy for noise.
In the case of the filtering scene, marking the audio segment according to the marking rule may include: marking the audio segment as speech in the case that the audio segment includes background user speech audio.
Specifically, because false rejections often misjudge user speech as noise, leaving the robot without user speech input during interaction and making the interaction stall, the annotations of audio segments that include background user speech audio (i.e., background human voice) must be changed to speech.
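The two scene-specific marking rules can be summarized in a short sketch. The scene names and label adjustments follow the text above; the AudioClip type, its flags, and the function name are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class AudioClip:
    has_background_noise: bool       # segment contains background noise audio
    has_background_user_voice: bool  # segment contains non-main-speaker voice
    base_label: str                  # label from the voice-ratio pre-check

def mark_clip(clip: AudioClip, scene: str) -> str:
    if scene == "breaking":
        # Breaking scene: avoid false interruptions, so anything with
        # background noise or background voices is re-marked as noise.
        if clip.has_background_noise or clip.has_background_user_voice:
            return "noise"
    elif scene == "filtering":
        # Filtering scene: avoid losing real user speech, so segments with
        # background voices are re-marked as speech.
        if clip.has_background_user_voice:
            return "speech"
    return clip.base_label

# The same raw segment can thus receive different labels per scene.
clip = AudioClip(False, True, "speech")
print(mark_clip(clip, "breaking"))   # -> "noise"
print(mark_clip(clip, "filtering"))  # -> "speech"
```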
In one embodiment, user speech audio, background noise audio, and background user speech audio may each correspond to one or more audio annotation types, where the audio annotation types include: dialect data of the main speaker; clearly audible main-speaker sound; unintelligible main-speaker sound; heavy accent; frame loss in the main speaker's data; background noise; unintelligible secondary-speaker sound; frame loss in the secondary speaker's data; clearly audible secondary-speaker sound; multiple people speaking at the same time within a period; customer-service staff speech; and synthesized speech. For example, user speech audio may correspond to the main speaker's dialect data and clearly audible main-speaker sound; background noise audio may correspond to background noise; and background user speech audio may correspond to unintelligible secondary-speaker sound, clearly audible secondary-speaker sound, and multiple people speaking at the same time within a period.
Step S103: perform feature extraction on each marked audio segment to obtain a training feature set. The training feature set may include feature information and marking results.
The feature information may have features of a plurality of dimensions and may include: a noise confidence, a language-model score, an acoustic-model score, a Bayesian minimum-risk confidence score, candidate-word results, the audio signal state, the audio start and end times, or any combination thereof. The noise confidence is extracted via MFCCs (Mel-frequency cepstral coefficients); the language-model score, acoustic-model score, Bayesian minimum-risk confidence score, and candidate-word results are extracted via ASR; and the audio signal state and the audio start and end times are extracted via VAD.
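As an illustration of how such a multi-dimensional feature vector might be assembled, here is a minimal Python sketch. It uses librosa for MFCC computation; the noise-confidence formula is a placeholder, and the asr_result / vad_info field names are hypothetical stand-ins for the ASR and VAD engines described above.

```python
import numpy as np
import librosa

def extract_features(wav_path: str, asr_result: dict, vad_info: dict) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Placeholder noise confidence derived from MFCC statistics; a real
    # system would apply a dedicated confidence model to these coefficients.
    noise_conf = float(np.mean(np.std(mfcc, axis=1)))
    return np.array([
        noise_conf,
        asr_result["lm_score"],        # language-model score
        asr_result["am_score"],        # acoustic-model score
        asr_result["mbr_confidence"],  # Bayesian minimum-risk confidence
        len(asr_result["candidates"]), # candidate-word result length
        vad_info["signal_state"],      # audio signal state
        vad_info["start_time"],        # audio start time (s)
        vad_info["end_time"],          # audio end time (s)
    ], dtype=np.float32)
```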
Fig. 4 is a schematic diagram of audio segment marking according to one embodiment of the invention.
As shown in fig. 4, the audio segments obtained by VAD are generally not very precise: a segment contains silence regions as well as a clean speech region, and the silence regions usually lie at the beginning and end of the audio (such as silence region 1 and silence region 2 in fig. 4). The clean speech in the audio segment therefore needs to be extracted to obtain a clean speech audio segment. The clean speech audio segment (e.g., "Hello, may I ask where my express parcel has been delivered?") can be obtained from the recognized time of the first word and the time of the last word.
In one embodiment, in the case of the breaking scene, performing feature extraction on each marked audio segment may include: extracting a clean speech audio segment from each marked audio segment, and dividing the clean speech audio segment according to a preset audio-segment-length threshold to obtain a plurality of audio split segments; and performing feature extraction on each audio split segment to generate feature information.
Specifically, the breaking scene is sensitive to the detection of user speech, so the amount of data in each audio stream segment should be kept as small as possible, and the segment length must be set reasonably to preserve the effect of the noise recognition model. After the clean speech audio segment is extracted from the audio segment, it is divided according to a preset audio-segment-length threshold (i.e., the breaking-scene granularity in fig. 4, which may be set to 200 ms) to obtain a plurality of audio split segments (e.g., IPU 1, IPU 2, ...). Feature extraction is then performed on each audio split segment to generate its corresponding feature information, and each audio split segment carries the mark of the audio segment it came from.
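A minimal sketch of this breaking-scene segmentation is given below: the clean speech span is cut from the first to the last recognized word and then split into fixed-length audio split segments (IPUs). The 200 ms granularity follows fig. 4; the word-timestamp format and function name are hypothetical.

```python
from typing import List, Tuple

def split_clean_speech(
    samples: List[float],                    # mono PCM samples of the VAD segment
    sample_rate: int,                        # e.g. 16000 Hz
    word_times: List[Tuple[float, float]],   # ASR (start, end) per word, seconds
    granularity_ms: int = 200,               # breaking-scene granularity from fig. 4
) -> List[List[float]]:
    # 1. Trim silence: keep audio from the first word's start to the last
    #    word's end (assumes at least one recognized word).
    start = int(word_times[0][0] * sample_rate)
    end = int(word_times[-1][1] * sample_rate)
    clean = samples[start:end]
    # 2. Split the clean span into fixed-length IPUs (IPU 1, IPU 2, ...);
    #    each IPU inherits the label of its parent audio segment.
    step = int(sample_rate * granularity_ms / 1000)
    return [clean[i:i + step] for i in range(0, len(clean), step)]
```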
In one embodiment, in the case of the filtering scene, performing feature extraction on each marked audio segment may include: extracting the clean speech audio segment in each marked audio segment, and performing feature extraction on the clean speech audio segment to generate feature information.
Specifically, in the filtering scene noise detection is applied to the whole of the speaker's utterance, so the clean speech portion of the entire utterance must be obtained and features extracted from it; that is, the filtering-scene granularity is the entire clean speech audio segment, and after the clean speech audio segment is extracted from the audio segment, feature extraction is performed directly on it.
Step S104: perform model training based on the training feature set to generate a noise recognition model, which is used for noise recognition on the audio segments generated during voice interaction.
In one embodiment, performing model training based on the training feature set to generate the noise recognition model may include: performing model training with the feature information as training input and the marking results as training targets to obtain model parameters, and generating an importance ranking of the dimension features; and adjusting the model parameters according to the importance ranking and performing model training again based on the training feature set until the recognition accuracy of the model meets a preset requirement, taking the model whose recognition accuracy meets the preset requirement as the noise recognition model.
Specifically, the training feature set may include the feature information and marking result of each training audio segment, where a training audio segment is an audio split segment or a clean speech audio segment. The model training method uses XGBoost (a gradient-boosting machine-learning algorithm); XGBoost generally performs well in terms of model stability, and its training process is relatively simple. Training and saving are performed by importing the training feature set and using XGBoost's model-saving interface (i.e., preserving the final gradient-boosted decision trees after training). The main model parameters during training may include max_depth (the deeper the tree, the easier it is to overfit), num_round (the more iterations, the slower the training), and min_child_weight (the minimum sum of sample weights in a child node: if the weight sum of a leaf node is below this value, splitting stops).
Accuracy statistics are computed separately for the multiple groups of trained models and training feature sets to verify model stability, and the importance of each dimension feature during training is recorded. Model experiments give the following importance ranking of the dimension features: language-model score ≥ acoustic-model score ≥ Bayesian score ≥ candidate-word length ≥ noise confidence > speech-signal start/end times. The model parameters are adjusted according to this ranking and the model is trained again based on the training feature set until its recognition accuracy meets the preset requirement (for example, 90%); the model whose recognition accuracy meets the preset requirement is taken as the noise recognition model.
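The following Python sketch illustrates such a training loop with the real xgboost API: train, measure accuracy, inspect feature importance, adjust parameters, and retrain until the accuracy target is met. The parameter names (max_depth, min_child_weight, num_round) and the 90% target come from the text; X and y, the adjustment rule, and the file name are assumptions.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_noise_model(X: np.ndarray, y: np.ndarray, target_acc: float = 0.90):
    """X: feature matrix from step S103; y: labels (1 = speech, 0 = noise)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    params = {"objective": "binary:logistic", "max_depth": 6, "min_child_weight": 1}
    num_round = 100
    bst = None
    for _ in range(5):  # retrain with adjusted parameters until accurate enough
        bst = xgb.train(params, xgb.DMatrix(X_tr, label=y_tr), num_boost_round=num_round)
        pred = (bst.predict(xgb.DMatrix(X_te)) > 0.5).astype(int)
        acc = accuracy_score(y_te, pred)
        importance = bst.get_score(importance_type="gain")  # per-feature importance
        print(f"accuracy={acc:.3f}, importance={importance}")
        if acc >= target_acc:
            bst.save_model("noise_model.json")  # persist the final boosted trees
            return bst
        # Simplified stand-in for the tuning guided by the importance ranking.
        params["max_depth"] += 1
        num_round += 50
    return bst
```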
The embodiment of the invention can effectively improve the robustness of the noise recognition model, thereby improving the overall interaction experience of the intelligent voice customer service dialogue system.
Fig. 5 is a flow chart of a training method of a noise recognition model according to an embodiment of the present invention.
As shown in fig. 5, the embodiment of the invention provides an overall multi-scene scheme for optimizing the effect of the noise recognition model, together with the feature selection, optimization strategies, and model training methods for the noise recognition model under different scenes. According to the interaction-scene division rule, the interaction scene corresponding to the acquired audio segment is determined, the marking rule corresponding to that interaction scene is obtained, and the audio segment is marked according to the marking rule. Feature extraction over the plurality of dimension features is performed on each marked audio segment to obtain a training feature set. Model training is performed based on the training feature set to generate the noise recognition model.
Fig. 6 is a schematic diagram of main modules of a training apparatus of a noise recognition model according to an embodiment of the present invention.
As shown in fig. 6, a training device 600 for a noise recognition model according to an embodiment of the present invention mainly includes: an interaction scene determining module 601, a marking module 602, a training feature set generating module 603, and a model training module 604.
The interaction scene determining module 601 is configured to determine the interaction scene corresponding to the acquired audio segment according to the interaction-scene division rule.
The marking module 602 is configured to obtain, for each audio segment, the marking rule corresponding to the interaction scene of the audio segment, and to mark the audio segment according to the marking rule.
The training feature set generating module 603 is configured to perform feature extraction on each marked audio segment to obtain a training feature set.
The model training module 604 is configured to perform model training based on the training feature set to generate a noise recognition model, the noise recognition model being used for noise recognition on audio segments generated during voice interaction.
In one embodiment, an audio segment determining module (not shown) may further be included, configured to: determine that the audio segment includes user speech audio and that the ratio of the length of the user speech audio to the length of the audio segment is greater than a preset threshold.
In one embodiment, the interaction scene may include a breaking scene in which the user obtains the right to speak while customer service is speaking; in the case of the breaking scene, the marking module 602 is specifically configured to mark the audio segment as noise in the case that the audio segment includes background noise audio or background user speech audio.
In one embodiment, the training feature set generating module 603 is specifically configured to: extract a clean speech audio segment from each marked audio segment, and divide the clean speech audio segment according to a preset audio-segment-length threshold to obtain a plurality of audio split segments; and perform feature extraction on each audio split segment to generate feature information.
In one embodiment, the interaction scene may include a filtering scene for validity recognition of user speech; in the case of the filtering scene, the marking module 602 is specifically configured to mark the audio segment as speech in the case that the audio segment includes background user speech audio.
In one embodiment, the training feature set generating module 603 is specifically configured to: extract the clean speech audio segment in each marked audio segment, and perform feature extraction on the clean speech audio segment to generate feature information.
In one embodiment, the training feature set may include feature information and marking results, and the feature information may have features of a plurality of dimensions; the model training module 604 is specifically configured to: perform model training with the feature information as training input and the marking results as training targets to obtain model parameters, and generate an importance ranking of the dimension features; and adjust the model parameters according to the importance ranking and perform model training again based on the training feature set until the recognition accuracy of the model meets a preset requirement, taking the model whose recognition accuracy meets the preset requirement as the noise recognition model.
In addition, the specific implementation of the training device for the noise recognition model in the embodiment of the present invention is already described in detail in the training method for the noise recognition model, so the description thereof will not be repeated here.
Fig. 7 illustrates an exemplary system architecture 700 to which the training method of a noise recognition model or the training device of a noise recognition model of embodiments of the invention may be applied.
As shown in fig. 7, a system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 is the medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 705 via the network 704 using the terminal devices 701, 702, 703 to receive or send messages or the like. Various communication client applications, such as a noise recognition type application, a voice interaction application, an intelligent customer service type application, an instant messaging tool, a mailbox client, social platform software, etc. (for example only) may be installed on the terminal devices 701, 702, 703.
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 705 may be a server providing various services, for example a background management server (by way of example only) that provides support for noise-recognition websites browsed by users with the terminal devices 701, 702, 703. From received data such as a training request for the noise recognition model, the background management server can determine, according to the interaction-scene division rule, the interaction scene corresponding to the acquired audio segment; obtain, for each audio segment, the marking rule corresponding to its interaction scene and mark the audio segment according to the marking rule; perform feature extraction on each marked audio segment to obtain a training feature set; perform model training based on the training feature set to generate a noise recognition model used for noise recognition on audio segments generated during voice interaction; and feed the processing result (such as the training result of the noise recognition model, by way of example only) back to the terminal device.
It should be noted that, the training method of the noise recognition model provided in the embodiment of the present invention is generally executed by the server 705, and accordingly, the training device of the noise recognition model is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing a terminal device or server in accordance with an embodiment of the present invention. The terminal device or server shown in fig. 8 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. The above-described functions defined in the system of the present invention are performed when the computer program is executed by the Central Processing Unit (CPU) 801.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, as: a processor comprises an interaction scene determining module, a marking module, a training feature set generating module and a model training module. The names of these modules do not constitute a limitation on the module itself in some cases, and for example, the interactive scene determining module may also be described as "a module for determining an interactive scene corresponding to the acquired audio clip according to an interactive scene dividing rule".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments or may exist alone without being assembled into the device. The computer readable medium carries one or more programs which, when executed by the device, cause the device to: determine, according to the interaction-scene division rule, the interaction scene corresponding to the acquired audio segment; obtain, for each audio segment, the marking rule corresponding to the interaction scene of the audio segment, and mark the audio segment according to the marking rule; perform feature extraction on each marked audio segment to obtain a training feature set; and perform model training based on the training feature set to generate a noise recognition model, which is used for noise recognition on audio segments generated during voice interaction.
According to the technical scheme of the embodiments of the invention, the interaction scene corresponding to the acquired audio segment is determined according to the interaction-scene division rule; for each audio segment, the marking rule corresponding to its interaction scene is obtained and the audio segment is marked according to that rule; feature extraction is performed on each marked audio segment to obtain a training feature set; and model training is performed based on the training feature set to generate a noise recognition model used for noise recognition on audio segments generated during voice interaction. By dividing audio into interaction scenes and marking according to the marking rules of the different interaction scenes, training data are generated to train the noise recognition model; this improves marking accuracy, balances the recognition quality of the speech and noise dimensions, and improves the training effect and recognition accuracy of the noise recognition model.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of training a noise recognition model, comprising:
determining, according to an interaction-scene division rule, the interaction scene corresponding to each acquired audio segment;
for each audio segment, obtaining the marking rule corresponding to the interaction scene of the audio segment, and marking the audio segment according to the marking rule;
performing feature extraction on each marked audio segment to obtain a training feature set;
and performing model training based on the training feature set to generate a noise recognition model, the noise recognition model being used for noise recognition on audio segments generated during voice interaction.
2. The method according to claim 1, wherein before the interaction scene corresponding to the acquired audio segment is determined according to the interaction-scene division rule, the method further comprises:
determining that the audio segment includes user speech audio and that a ratio of the length of the user speech audio to the length of the audio segment is greater than a preset threshold.
3. The method of claim 1, wherein the interaction scene comprises a breaking scene in which a user obtains the right to speak while customer service is speaking;
and in the case of the breaking scene, marking the audio segment according to the marking rule comprises:
marking the audio segment as noise in the case that the audio segment comprises background noise audio or background user speech audio.
4. The method according to claim 3, wherein performing feature extraction on each marked audio segment comprises:
extracting a clean speech audio segment from each marked audio segment, and dividing the clean speech audio segment according to a preset audio-segment-length threshold to obtain a plurality of audio split segments;
and performing feature extraction on each audio split segment to generate feature information.
5. The method of claim 1, wherein the interaction scene comprises a filtering scene for validity recognition of user speech;
and in the case of the filtering scene, marking the audio segment according to the marking rule comprises:
marking the audio segment as speech in the case that the audio segment comprises background user speech audio.
6. The method of claim 5, wherein performing feature extraction on each marked audio segment comprises:
extracting the clean speech audio segment in each marked audio segment, and performing feature extraction on the clean speech audio segment to generate feature information.
7. The method of claim 1, wherein the training feature set comprises feature information and marking results, the feature information having features of a plurality of dimensions;
and performing model training based on the training feature set to generate the noise recognition model comprises:
performing model training with the feature information as training input and the marking results as training targets to obtain model parameters, and generating an importance ranking of the dimension features;
and adjusting the model parameters according to the importance ranking and performing model training again based on the training feature set until the recognition accuracy of the model meets a preset requirement, and taking the model whose recognition accuracy meets the preset requirement as the noise recognition model.
8. A training device for a noise recognition model, comprising:
an interaction scene determining module, configured to determine the interaction scene corresponding to the acquired audio segment according to an interaction-scene division rule;
a marking module, configured to obtain, for each audio segment, the marking rule corresponding to the interaction scene of the audio segment, and to mark the audio segment according to the marking rule;
a training feature set generating module, configured to perform feature extraction on each marked audio segment to obtain a training feature set;
and a model training module, configured to perform model training based on the training feature set to generate a noise recognition model, the noise recognition model being used for noise recognition on audio segments generated during voice interaction.
9. An electronic device, comprising:
one or more processors;
and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202310467856.9A (filed 2023-04-27, priority 2023-04-27) Training method and device for noise identification model, Pending, published as CN116564294A (en)

Priority Applications (1)

Application Number: CN202310467856.9A
Priority Date: 2023-04-27
Filing Date: 2023-04-27
Title: Training method and device for noise identification model

Publications (1)

Publication Number: CN116564294A
Publication Date: 2023-08-08

Family ID: 87495761

Country Status (1)

CN: CN116564294A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination