KR20230133260A

KR20230133260A - Automatic voice data inspection system and method using deep learning

Info

Publication number: KR20230133260A
Application number: KR1020230120681A
Authority: KR
Inventors: 손정훈; 김기웅; 이애진
Original assignee: 써큘러스리더(주)
Priority date: 2021-10-08
Filing date: 2023-09-11
Publication date: 2023-09-19
Also published as: KR20230050890A

Abstract

An automatic voice data inspection system using deep learning is provided. An automatic voice data inspection system using deep learning according to an embodiment of the present invention includes: a voice data preprocessing unit that acquires and preprocesses n pieces of voice data and outputs voice data classified according to a preset classification; a preliminary acoustic processing unit that generates a preliminary acoustic model using the classified voice data and, using the generated preliminary acoustic model, outputs a verification set, which is a set of voice data selected from among the classified voice data; and an inspection acoustic generation unit that generates an inspection acoustic model using the verification set, uses the generated inspection acoustic model to determine whether each of the n pieces of voice data is defective, and outputs the result.

Description

Automatic voice data inspection system and method using deep learning

The present invention relates to a system and method for automatically inspecting voice data using deep learning and, in particular, to a system and method that can automatically inspect voice data using deep learning without relying on initially verified data.

AI services based on deep learning generally perform better as more high-quality data becomes available, because the amount of data usable for training grows. The importance of training data for AI services is therefore increasingly recognized. To secure such high-quality data, filtering out inaccurate data during data inspection is very important.

Conventionally, workers inspected collected voice recordings by hand to select suitable high-quality data. This increases both inspection time and cost, and the inspection criteria vary from one inspector to another.

Korean Patent No. 10-2113180

To solve the problems of the prior art described above, an embodiment of the present invention seeks to provide an automatic voice data inspection system and method using deep learning that can obtain high-quality training data through deep learning without operator involvement.

In addition, an embodiment of the present invention seeks to provide an automatic voice data inspection system and method using deep learning that can perform inspection even when no initially verified data exists.

According to one aspect of the present invention for solving the above problems, an automatic voice data inspection system using deep learning is provided. The system includes: a voice data preprocessing unit that acquires and preprocesses n pieces of voice data and outputs voice data classified according to a preset classification; a preliminary acoustic processing unit that generates a preliminary acoustic model using the classified voice data and, using the generated preliminary acoustic model, outputs a verification set, which is a set of voice data selected from among the classified voice data; and an inspection acoustic generation unit that generates an inspection acoustic model using the verification set, uses the generated inspection acoustic model to determine whether each of the n pieces of voice data is defective, and outputs the result.

The voice data preprocessing unit may include: a voice data acquisition module that acquires the n pieces of voice data; and a voice data classification module that classifies the acquired n pieces of voice data into m pre-training sets using the preset classification criterion.

The preliminary acoustic processing unit may include: a preliminary acoustic model generation module that generates m preliminary acoustic models by performing deep learning on the classified voice data; and a verification set output module that applies the n pieces of voice data to each of the m preliminary acoustic models to obtain, for each piece of voice data, m loss values with respect to the m preliminary acoustic models, and that uses the loss values to select and output the verification set from among the n pieces of voice data.

The preliminary acoustic model generation module may perform deep learning on all n pieces of voice data, stop the deep learning when the loss value converges to a specific value, and set that specific value as a threshold.

The preliminary acoustic model generation module may generate the m preliminary acoustic models by performing deep learning on each of the m pre-training sets classified using the preset classification criterion. During deep learning on the m pre-training sets, it may stop the deep learning when the loss value falls to or below the threshold and, when the loss value does not fall to or below the threshold, perform the deep learning a preset number of times.

The verification set output module may obtain, for each piece of voice data, an average loss value that is the mean of the m loss values obtained for that piece of voice data, and may output the verification set and a training set using the average loss values. The verification set may include the voice data whose average loss values fall within the bottom 10% of the average loss values, and the training set may include the voice data whose average loss values fall above the bottom 10% and within the bottom 50%.

The loss value may be obtained using Formula 1 below.

[Formula 1]

(where x: the input speech feature sequence; π: a path that can correspond to the ground-truth label; y: the probability of the corresponding label at each time step; T: the length of the speech feature sequence; l: the ground-truth label sequence; β^-1(l): the set of all possible paths that yield the ground-truth label once blanks and repeated labels are removed)
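The formula image itself is not reproduced in this text. The symbol definitions above match the standard connectionist temporal classification (CTC) loss, so, assuming that is the intended formula, it would take roughly the following form. The negative natural logarithm and the indexing y^t_{π_t} are assumptions taken from the usual CTC formulation, not from the source.

```latex
\mathcal{L}(x, l) = -\ln p(l \mid x)
                  = -\ln \sum_{\pi \in \beta^{-1}(l)} \prod_{t=1}^{T} y^{t}_{\pi_t}
```

Here y^t_{π_t} denotes the probability that the model assigns to label π_t at time step t, so the product is the probability of one path and the sum runs over every path that collapses to l.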

The inspection acoustic generation unit may include: an inspection acoustic model generation module that generates the inspection acoustic model by performing deep learning on the verification set; and an acoustic inspection result output module that outputs an acoustic inspection result, including whether each piece of voice data is defective, based on the loss value of each piece of voice data obtained by applying the n pieces of voice data to the generated inspection acoustic model.

The acoustic inspection result output module may generate the acoustic inspection result by determining a piece of voice data to be normal voice data when its loss value is less than the threshold, and to be defective voice data when its loss value is greater than or equal to the threshold.

According to one aspect of the present invention, an automatic voice data inspection method using deep learning is provided. The method includes: acquiring and preprocessing n pieces of voice data using a voice data preprocessing unit and outputting voice data classified according to a preset classification; generating, in a preliminary acoustic processing unit, a preliminary acoustic model using the classified voice data and outputting, using the generated preliminary acoustic model, a verification set, which is a set of voice data selected from among the classified voice data; and generating, in an inspection acoustic generation unit, an inspection acoustic model using the verification set, determining whether each of the n pieces of voice data is defective using the generated inspection acoustic model, and outputting the result.

The step of outputting the classified voice data may include: acquiring the n pieces of voice data; and classifying the acquired n pieces of voice data into m pre-training sets using the preset classification criterion.

The step of outputting the verification set, which is a set of voice data, may include: generating m preliminary acoustic models by performing deep learning on the classified voice data; and applying the n pieces of voice data to each of the m preliminary acoustic models to obtain, for each piece of voice data, m loss values with respect to the m preliminary acoustic models, and using the loss values to select and output the verification set from among the n pieces of voice data.

The step of generating the preliminary acoustic models may perform deep learning on all n pieces of voice data, stop the deep learning when the loss value converges to a specific value, and set that specific value as a threshold.

The step of generating the preliminary acoustic models may generate the m preliminary acoustic models by performing deep learning on each of the m pre-training sets classified using the preset classification criterion. During deep learning on the m pre-training sets, the deep learning may be stopped when the loss value falls to or below the threshold and, when the loss value does not fall to or below the threshold, performed a preset number of times.

The step of selecting and outputting the verification set may obtain, for each piece of voice data, an average loss value that is the mean of the m loss values obtained for that piece of voice data, and may output the verification set and a training set using the average loss values. The verification set may include the voice data whose average loss values fall within the bottom 10% of the average loss values, and the training set may include the voice data whose average loss values fall above the bottom 10% and within the bottom 50%.

The loss value may be obtained using Formula 2 below.

[Formula 2]

The step of determining whether each piece of voice data is defective and outputting the result may include: generating the inspection acoustic model by performing deep learning on the verification set; and outputting an acoustic inspection result, including whether each piece of voice data is defective, based on the loss value of each piece of voice data obtained by applying the n pieces of voice data to the generated inspection acoustic model.

The step of outputting the acoustic inspection result may generate the acoustic inspection result by determining a piece of voice data to be normal voice data when its loss value is less than the threshold, and to be defective voice data when its loss value is greater than or equal to the threshold.

The automatic voice data inspection system and method using deep learning according to an embodiment of the present invention can obtain high-quality training data through deep learning without operator involvement.

In addition, the automatic voice data inspection system and method using deep learning according to an embodiment of the present invention can perform inspection even when no initially verified data exists.

FIG. 1 is a block diagram showing an automatic voice data inspection system using deep learning according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the voice data preprocessing unit of FIG. 1.
FIG. 3 is a block diagram showing the preliminary acoustic processing unit of FIG. 1.
FIG. 4 is a block diagram showing the inspection acoustic generation unit of FIG. 1.
FIG. 5 is a flowchart of an automatic voice data inspection method using deep learning according to an embodiment of the present invention.
FIG. 6 is a flowchart of step S11 of FIG. 5.
FIG. 7 is a flowchart of step S12 of FIG. 5.
FIG. 8 is a flowchart of step S13 of FIG. 5.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may be implemented in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts unrelated to the description are omitted from the drawings, and identical or similar components are given the same reference numerals throughout the specification.

FIG. 1 is a block diagram showing an automatic voice data inspection system using deep learning according to an embodiment of the present invention, FIG. 2 is a block diagram showing the voice data preprocessing unit of FIG. 1, FIG. 3 is a block diagram showing the preliminary acoustic processing unit of FIG. 1, and FIG. 4 is a block diagram showing the inspection acoustic generation unit of FIG. 1. The automatic voice data inspection system using deep learning according to an embodiment of the present invention is described in detail below with reference to FIGS. 1 to 4.

Referring to FIG. 1, the automatic voice data inspection system 1 using deep learning of the present invention (hereinafter, the automatic inspection system) is configured to receive and process voice data 2 and output an evaluation result 3 for the voice data 2. The automatic inspection system 1 may acquire and process a plurality of pieces of voice data to generate a preliminary acoustic model, use the generated preliminary acoustic model to generate an inspection acoustic model, and use the generated inspection acoustic model to output a determination of whether each of the plurality of pieces of voice data is defective. To this end, the automatic inspection system 1 may include a voice data preprocessing unit 11, a preliminary acoustic processing unit 12, and an inspection acoustic generation unit 13.

The voice data preprocessing unit 11 is configured to acquire and preprocess n pieces of voice data 2 and output voice data classified according to a preset classification. The voice data preprocessing unit 11 must acquire a plurality of pieces of voice data 2, because the acoustic models are generated by comparing the processing results of multiple pieces of data.

As shown in FIG. 2, the voice data preprocessing unit 11 includes a voice data acquisition module 111 and a voice data classification module 112.

The voice data acquisition module 111 is configured to acquire the n pieces of voice data 2.

The voice data classification module 112 is configured to classify the n pieces of voice data acquired by the voice data acquisition module 111 into m pre-training sets using a preset classification criterion. The criterion is preset by the user and includes the minimum and maximum values for a single set. The voice data classification module 112 can therefore distribute the n pieces of voice data across a total of m pre-training sets so that each set holds an irregular number of pieces within the minimum-to-maximum range.
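The classification step can be sketched as a random partition with bounded set sizes. This is a minimal illustration, not the patented implementation: the function name, the shuffling, the seed, and the rule that the final set absorbs any leftover (and may therefore fall outside the range) are all assumptions, since the source only states that each set holds an irregular number of items between a minimum and a maximum.

```python
import random


def split_into_pretraining_sets(items, min_size, max_size, seed=0):
    """Randomly partition items into sets of irregular size.

    Hypothetical helper: every set except possibly the last has a size in
    [min_size, max_size]; the last set absorbs whatever is left over.
    """
    if not 0 < min_size <= max_size:
        raise ValueError("need 0 < min_size <= max_size")
    rng = random.Random(seed)
    pool = list(items)
    rng.shuffle(pool)

    sets = []
    start = 0
    while start < len(pool):
        remaining = len(pool) - start
        size = rng.randint(min_size, max_size)
        # If the leftover after this draw could not form another valid set,
        # put everything that remains into the current (final) set.
        if remaining - size < min_size:
            size = remaining
        sets.append(pool[start:start + size])
        start += size
    return sets


pretraining_sets = split_into_pretraining_sets(range(100), min_size=8, max_size=15)
m = len(pretraining_sets)  # the number of sets falls out of the random sizes
```

In this sketch m is not chosen in advance; it follows from n and the randomly drawn set sizes.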

The preliminary acoustic processing unit 12 is configured to generate a preliminary acoustic model using the classified voice data and, using the generated preliminary acoustic model, output a verification set, which is a set of voice data selected from among the classified voice data. To this end, as shown in FIG. 3, the preliminary acoustic processing unit 12 includes a preliminary acoustic model generation module 121 and a verification set output module 122.

The preliminary acoustic model generation module 121 may generate m preliminary acoustic models by performing deep learning on the classified voice data. The classified voice data it receives may be the m pre-training sets classified by the voice data classification module 112 together with the n pieces of voice data. The module first uses the n pieces of voice data to compute the threshold that serves as the reference value of the present invention. It performs a first deep learning, which is deep learning on the n pieces of voice data, and checks whether the loss value obtained during the first deep learning converges to a specific value. When the loss value converges to a specific value, the module may stop the first deep learning and set that specific value as the threshold.
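The threshold derivation can be illustrated on a recorded sequence of per-epoch loss values. The source does not say how convergence is detected, so the test used here (the change between consecutive epochs stays below `tol` for `patience` epochs in a row) and both parameter names are assumptions for the sake of a runnable sketch.

```python
def find_convergence_threshold(losses, tol=1e-3, patience=3):
    """Return (epoch_index, loss) where the loss is treated as converged.

    Hypothetical convergence rule: the absolute change between consecutive
    epochs is below tol for `patience` consecutive epochs. Returns None if
    the sequence never satisfies the rule.
    """
    stable = 0
    for i in range(1, len(losses)):
        if abs(losses[i] - losses[i - 1]) < tol:
            stable += 1
            if stable >= patience:
                return i, losses[i]
        else:
            stable = 0
    return None


first_stage_losses = [5.0, 3.0, 2.0, 1.5, 1.2, 1.2004, 1.2001, 1.2002, 1.2]
result = find_convergence_threshold(first_stage_losses)
# The loss at the convergence point would then serve as the threshold.
```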

In addition, once the preliminary acoustic model generation module 121 obtains the m pre-training sets, it may perform a second deep learning, which is deep learning on each of the m pre-training sets, to generate the m preliminary acoustic models. During the second deep learning, the module stops the deep learning for a set when the loss value of the voice data in that pre-training set falls to or below the threshold; when the loss value does not fall to or below the threshold, it may perform the deep learning a number of times preset by the user. In this way the preliminary acoustic model generation module 121 can address the overfitting problem in training.
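The stopping rule for the second deep learning amounts to a bounded training loop. The sketch below assumes a hypothetical `train_one_epoch` callable that runs one pass over a pre-training set and returns the resulting loss; the actual model and optimizer are outside the scope of the source text.

```python
def train_until_threshold(train_one_epoch, threshold, max_epochs):
    """Train until the loss is at or below threshold, or max_epochs is reached.

    train_one_epoch is a hypothetical zero-argument callable that performs
    one epoch of training and returns the loss for that epoch.
    Returns the list of losses observed.
    """
    history = []
    for _ in range(max_epochs):
        loss = train_one_epoch()
        history.append(loss)
        if loss <= threshold:
            break  # stop early instead of continuing to fit the set
    return history


# Stand-in for real training: replay a fixed loss curve.
fake_losses = iter([3.0, 2.0, 1.0, 0.5])
history = train_until_threshold(lambda: next(fake_losses), threshold=1.2, max_epochs=10)
```

Running this once per pre-training set would yield the m preliminary acoustic models, each trained for at most `max_epochs` epochs.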

The verification set output module 122 applies each of the n pieces of voice data to the m preliminary acoustic models and obtains m loss values for each piece of voice data. Specifically, the module takes one piece of voice data from among the n pieces, applies it to all m preliminary acoustic models, and obtains m loss values for that piece. Repeating this n times gives every one of the n pieces of voice data its own m loss values.

Once all n pieces of voice data have m loss values, the verification set output module 122 calculates, for each piece of voice data, the average of its m loss values. Put simply, the first voice data has loss values 1-1 through 1-m, and the nth voice data has loss values n-1 through n-m. The verification set output module 122 obtains a first average loss value, which is the average of loss values 1-1 through 1-m of the first voice data. It then performs the same calculation for the remaining voice data to obtain the first through nth average loss values.
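The two steps above (scoring every sample under every preliminary model, then averaging per sample) can be sketched as follows. `compute_loss` is a hypothetical callable standing in for evaluating one sample under one model; the samples and models in the usage lines are plain numbers chosen only so the arithmetic is easy to follow.

```python
def average_loss_per_sample(samples, models, compute_loss):
    """Return one average loss per sample, taken over all models.

    compute_loss(model, sample) is a hypothetical callable that returns the
    loss of `sample` under `model`.
    """
    averages = []
    for sample in samples:
        losses = [compute_loss(model, sample) for model in models]  # m values
        averages.append(sum(losses) / len(losses))
    return averages


# Toy stand-ins: "models" and "samples" are numbers, and the loss is their product.
toy_averages = average_loss_per_sample(
    samples=[1.0, 2.0],
    models=[1.0, 3.0],
    compute_loss=lambda model, sample: model * sample,
)
```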

Once the verification set output module 122 has obtained the n average loss values, it may use them to select and output a verification set and a training set. From among the n pieces of voice data, the module may select the verification set used to generate the inspection acoustic model and the training set used to increase the accuracy of the generated inspection acoustic model.

As an example, the verification set output module 122 may select the average loss values that fall within the bottom 10% of the n average loss values, designate the voice data having those average loss values as the voice data of the verification set, and output the verification set.

Likewise, as an example, the verification set output module 122 may select the average loss values that fall above the bottom 10% and within the bottom 50% of the n average loss values, designate the voice data having those average loss values as the voice data of the training set, and output the training set.

Here, the bottom 10% and bottom 50% are arbitrary ranges preset by the user, and the user may change these percentage ranges according to the sensitivity desired.
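The percentile-based selection can be sketched by ranking samples by average loss and slicing the ranking. The function and parameter names are assumptions, and so is the rounding: cut points are computed with integer floor division, which the source does not specify.

```python
def select_sets(avg_losses, verification_pct=10, training_pct=50):
    """Split sample indices into verification and training sets by average loss.

    Hypothetical helper: samples are ranked from lowest to highest average
    loss; the lowest verification_pct percent form the verification set and
    the samples above that, up to training_pct percent, form the training set.
    """
    order = sorted(range(len(avg_losses)), key=lambda i: avg_losses[i])
    n = len(order)
    v_end = n * verification_pct // 100
    t_end = n * training_pct // 100
    return order[:v_end], order[v_end:t_end]


# 20 samples whose average loss decreases with the index (index 19 is lowest).
example_losses = [float(19 - i) for i in range(20)]
verification_idx, training_idx = select_sets(example_losses)
```

Raising or lowering the two percentages corresponds to the user-adjustable sensitivity described above.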

Meanwhile, the loss value used in the present invention may be obtained using Equation 1 below.

Finally, the automatic inspection system 1 of the present invention may include the inspection acoustic generation unit 13, which generates an inspection acoustic model using the verification set, uses the generated inspection acoustic model to determine whether each of the n pieces of voice data is defective, and outputs the result. To this end, as shown in FIG. 4, the inspection acoustic generation unit 13 may include an inspection acoustic model generation module 131 and an acoustic inspection result output module 132.

The inspection acoustic model generation module 131 is configured to generate the inspection acoustic model by performing deep learning on the verification set and to train the inspection acoustic model by applying the training set to it. The deep learning on the verification set performed by the module may be referred to as third deep learning. The module may use a threshold during the third deep learning, and this threshold may be the one generated by the preliminary acoustic model generation module 121 described above. For example, the module may stop the third deep learning when the loss value of the voice data in the verification set falls to or below the threshold and, when the loss value does not fall to or below the threshold, perform the third deep learning a preset number of times to generate the inspection acoustic model.

After generating the inspection acoustic model through the third deep learning on the verification set, the inspection acoustic model generation module 131 may also apply the training set to the inspection acoustic model to perform training that increases its accuracy.

The acoustic inspection result output module 132 may apply the n pieces of voice data to the inspection acoustic model generated by the inspection acoustic model generation module 131 and, using the resulting loss value of each piece of voice data, output an acoustic inspection result that includes whether that piece of voice data is defective. The module may determine a piece of voice data to be normal voice data when its loss value is less than the threshold, and to be defective voice data when its loss value is greater than or equal to the threshold. In this way, the automatic inspection system 1 of the present invention can perform data inspection using deep learning even when the user provides no initially verified data.
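The final decision rule is a direct comparison of each loss value with the threshold. A minimal sketch, in which the function name and the string labels are assumptions:

```python
def inspect_voice_data(losses, threshold):
    """Label each loss as 'normal' (loss < threshold) or 'defective' (loss >= threshold)."""
    return ["normal" if loss < threshold else "defective" for loss in losses]


labels = inspect_voice_data([0.5, 1.2, 2.0], threshold=1.2)
```

Note that a loss exactly equal to the threshold is treated as defective, following the "greater than or equal to" wording above.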

Meanwhile, the automatic voice data inspection system using deep learning of the present invention may further include a component for increasing the size of the pre-training sets that the pre-acoustic model generation module 121 uses to generate the pre-acoustic models. To this end, the system may further include an unsupervised verification learning module (not shown in the drawings).

When the voice data acquisition module 111 acquires the n pieces of voice data 2, the automatic inspection system of the present invention may pass the n pieces of voice data to the unsupervised verification learning module. The unsupervised verification learning module may generate a plurality of virtual voice data items from the n pieces of voice data through a generative adversarial network (GAN). A GAN is a representative unsupervised learning method in which a generator and a discriminator respectively generate and verify data, so that virtual data that is difficult to distinguish from real data is produced.

Accordingly, by applying a GAN to the n pieces of voice data acquired in the present invention, the unsupervised verification learning module can generate a plurality of virtual voice data items. Using these items to generate the pre-acoustic models effectively increases the amount of initially acquired voice data, so that a more accurate inspection acoustic model may be generated.

Meanwhile, FIGS. 5 to 8 illustrate a method for automatically inspecting voice data using deep learning according to an embodiment of the present invention. FIG. 5 is a flowchart of the method, FIG. 6 is a flowchart of step S11 of FIG. 5, FIG. 7 is a flowchart of step S12 of FIG. 5, and FIG. 8 is a flowchart of step S13 of FIG. 5. The method is described in detail below with reference to FIGS. 5 to 8 and, for convenience of explanation, in terms of the automatic inspection system shown in FIG. 1. However, the present invention is not limited to this system and may be used in various devices, terminals, and systems capable of similar operations and processing.

Referring to FIG. 5, the automatic voice data inspection method using deep learning of the present invention (4, hereinafter the automatic inspection method) is configured to receive and process voice data and to output an evaluation result for the voice data. The automatic inspection method 4 may acquire and process a plurality of pieces of voice data to generate pre-acoustic models, use the generated pre-acoustic models to generate an inspection acoustic model, and use the generated inspection acoustic model to output a determination of whether each piece of voice data is defective. To this end, the automatic inspection method 4 may include a step of preprocessing voice data (S11), a step of performing pre-acoustic processing (S12), and a step of generating inspection acoustics (S13).

In the step of preprocessing voice data (S11), the voice data preprocessing unit acquires and preprocesses n pieces of voice data and outputs voice data classified according to a preset classification. A plurality of pieces of voice data must be acquired in step S11, because the acoustic models are generated by comparing the processing results of multiple pieces of data.

As shown in FIG. 6, the step of preprocessing voice data (S11) includes a step of acquiring voice data (S111) and a step of classifying voice data (S112).

In the step of acquiring voice data (S111), n pieces of voice data are acquired.

In the step of classifying voice data (S112), the n pieces of voice data acquired in step S111 are classified into m pre-training sets using a preset classification criterion. Step S112 uses a classification criterion preset by the user, which includes the minimum and maximum values for a single set. Step S112 can therefore distribute the n pieces of voice data across a total of m pre-training sets so that each set contains an irregular number of items within the range from the minimum to the maximum.
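One way to realize such a split is sketched below. The text does not say how m or the individual set sizes are chosen, so this sketch assumes that the minimum and maximum are bounds on the number of items per set, that m is drawn at random from the feasible range, and that items are assigned after shuffling; the function name is illustrative.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_pretraining_sets(items: Sequence[T], min_size: int,
                                max_size: int,
                                rng: random.Random) -> List[List[T]]:
    """Split `items` into sets whose sizes vary within [min_size, max_size].

    Assumption of this sketch: the number of sets m is picked at random
    among the values for which such a split is possible.
    """
    n = len(items)
    if not 1 <= min_size <= max_size:
        raise ValueError("need 1 <= min_size <= max_size")
    m_low = math.ceil(n / max_size)   # fewest sets that can hold n items
    m_high = n // min_size            # most sets that can all reach min_size
    if n == 0 or m_low > m_high:
        raise ValueError("items cannot be split within the given bounds")
    m = rng.randint(m_low, m_high)
    # Start every set at the minimum, then hand out the remaining items
    # one at a time to randomly chosen sets that still have room.
    sizes = [min_size] * m
    for _ in range(n - m * min_size):
        open_sets = [i for i, s in enumerate(sizes) if s < max_size]
        sizes[rng.choice(open_sets)] += 1
    shuffled = list(items)
    rng.shuffle(shuffled)
    sets, start = [], 0
    for size in sizes:
        sets.append(shuffled[start:start + size])
        start += size
    return sets
```

For example, `split_into_pretraining_sets(list(range(23)), 4, 7, random.Random(0))` yields between four and five sets, each holding 4 to 7 items, that together contain every item exactly once.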

In the step of performing pre-acoustic processing (S12), the pre-acoustic processing unit generates pre-acoustic models from the classified voice data and uses the generated pre-acoustic models to output a verification set, which is a set of voice data selected from the classified voice data. To this end, as shown in FIG. 7, step S12 includes a step of generating pre-acoustic models (S121) and a step of outputting the verification set (S122).

In the step of generating pre-acoustic models (S121), m pre-acoustic models may be generated by performing deep learning on the classified voice data. The classified voice data received in step S121 may be the m pre-training sets classified in step S112 together with the n pieces of voice data. Step S121 uses the n pieces of voice data to compute the threshold, which serves as the reference value of the present invention. Specifically, step S121 performs first deep learning, which is deep learning on the n pieces of voice data, and checks whether the loss value obtained during the first deep learning converges to a specific value. When the loss value converges to a specific value, step S121 may stop the first deep learning and set that value as the threshold.
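The text does not define how convergence is detected. The sketch below assumes one simple criterion: the loss is treated as converged when the last `window` loss values differ from one another by less than `tol`. The callback `train_one_epoch`, the parameters `window`, `tol`, and `max_epochs`, and the function name are all assumptions made for illustration.

```python
from typing import Callable, List, Optional


def find_threshold(train_one_epoch: Callable[[], float],
                   window: int = 3,
                   tol: float = 1e-3,
                   max_epochs: int = 1000) -> Optional[float]:
    """Run training passes until the loss stops changing, then return the
    converged loss value for use as the threshold.

    Convergence test (an assumption of this sketch): the spread of the
    most recent `window` loss values is smaller than `tol`.
    Returns None if no convergence is seen within `max_epochs` passes.
    """
    history: List[float] = []
    for _ in range(max_epochs):
        history.append(train_one_epoch())
        recent = history[-window:]
        if len(recent) == window and max(recent) - min(recent) < tol:
            return recent[-1]
    return None
```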

In addition, once the m pre-training sets are obtained, step S121 may perform second deep learning, which is deep learning on each of the m pre-training sets, to generate the m pre-acoustic models. During the second deep learning, step S121 stops the deep learning for a set when the loss value of the voice data in that set falls to or below the threshold; when the loss value does not fall to or below the threshold, it performs the deep learning for the number of iterations preset by the user. In this way, step S121 can address the problem of overfitting during training.

In the step of outputting the verification set (S122), the n pieces of voice data are each applied to the m pre-acoustic models, and m loss values are obtained for each piece of voice data. Specifically, step S122 applies one piece of voice data from among the n pieces to all m pre-acoustic models and obtains the m loss values for that piece. Repeating this n times gives every one of the n pieces of voice data its own m loss values.

Once all n pieces of voice data have m loss values, step S122 calculates, for each piece of voice data, the average of its m loss values. Put simply, the first voice data has loss values 1-1 to 1-m, and the n-th voice data has loss values n-1 to n-m. Step S122 obtains a first average loss value, which is the average of loss values 1-1 to 1-m of the first voice data, and then performs the same calculation for the remaining voice data to obtain the first to n-th average loss values.
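The per-item averaging can be sketched as below. Each pre-acoustic model is represented here by a hypothetical callable that maps one item to its loss value; that interface is an assumption of the sketch, not something the text specifies.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


def average_losses(samples: Sequence[S],
                   models: Sequence[Callable[[S], float]]) -> List[float]:
    """For each sample, evaluate all m models and average the m losses.

    `models` holds hypothetical loss functions, one per pre-acoustic model.
    Returns n averages, in the same order as `samples`.
    """
    if not models:
        raise ValueError("at least one model is required")
    averages = []
    for sample in samples:
        losses = [model(sample) for model in models]  # m loss values
        averages.append(sum(losses) / len(losses))
    return averages
```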

Once the n average loss values are obtained, step S122 may use them to select and output a verification set and a training set. From the n pieces of voice data, step S122 may select a verification set for generating the inspection acoustic model and a training set for increasing the accuracy of the generated inspection acoustic model.

For example, step S122 may select the average loss values that fall within the lowest 10% of the n average loss values, designate the voice data having those average loss values as the members of the verification set, and output the verification set.

Likewise, for example, step S122 may select the average loss values that fall in the range above the lowest 10% and up to the lowest 50% of the n average loss values, designate the voice data having those average loss values as the members of the training set, and output the training set.
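The two percentile cuts can be sketched as a single ranking step. The text does not say how fractional counts are rounded, so this sketch assumes rounding down; the function name and the percentage parameters are illustrative.

```python
from typing import List, Sequence, Tuple


def select_sets(avg_losses: Sequence[float],
                verification_pct: int = 10,
                training_pct: int = 50) -> Tuple[List[int], List[int]]:
    """Rank items by average loss (lowest first) and return two lists of
    item indices: the lowest `verification_pct` percent, and the items
    above that cut up to the lowest `training_pct` percent.

    Counts are rounded down (an assumption of this sketch).
    """
    n = len(avg_losses)
    order = sorted(range(n), key=lambda i: avg_losses[i])
    v_end = n * verification_pct // 100
    t_end = n * training_pct // 100
    return order[:v_end], order[v_end:t_end]
```

With ten items whose average losses are `[0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]`, the verification set is item 1 (loss 0.1) and the training set is items 5, 3, 7, and 2 (losses 0.2 to 0.5).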

Meanwhile, the loss value used in the present invention may be obtained using Equation 1 above.
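For orientation, the symbols defined for this loss value in the claims (an input speech feature sequence x, a path π, per-time-step label probabilities y, a sequence length T, a correct label sequence l, and the path set β⁻¹(l)) are the ones used in the connectionist temporal classification (CTC) objective. Assuming the loss follows that standard form, which is an assumption of this sketch and not a statement of the exact equation in the specification, it would be written as:

```latex
p(l \mid x) = \sum_{\pi \in \beta^{-1}(l)} \prod_{t=1}^{T} y^{t}_{\pi_t},
\qquad
\mathcal{L}(x, l) = -\ln p(l \mid x)
```

Here $y^{t}_{\pi_t}$ is the probability of the label that path $\pi$ emits at time step $t$, and the sum runs over every path that reduces to $l$ once blanks and repeated labels are removed. A low loss therefore means the model assigns high probability to the item's transcript.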

Finally, the automatic inspection method 4 of the present invention may include a step of generating inspection acoustics (S13), in which the inspection acoustic generation unit uses the verification set to generate an inspection acoustic model, uses the generated inspection acoustic model to determine whether each of the n pieces of voice data is defective, and outputs the result. To this end, as shown in FIG. 8, step S13 may include a step of generating the inspection acoustic model (S131) and a step of outputting the acoustic inspection result (S132).

In the step of generating the inspection acoustic model (S131), the inspection acoustic model is generated by performing deep learning on the verification set and is then trained by applying the training set to it. The deep learning performed on the verification set in step S131 may be referred to as third deep learning. For example, step S131 may use a threshold in the third deep learning process, and this threshold may be the threshold generated in step S121 described above. Step S131 may stop the third deep learning when the loss value of the voice data in the verification set falls to or below the threshold; when the loss value does not fall to or below the threshold, it may perform the third deep learning a preset number of times to generate the inspection acoustic model.

Once step S131 has generated the inspection acoustic model through the third deep learning on the verification set, it may also apply the training set to the inspection acoustic model to perform training that increases the accuracy of the model.

In the step of outputting the acoustic inspection result (S132), the n pieces of voice data are applied to the inspection acoustic model generated in step S131 and, using the loss value obtained for each piece of voice data, an acoustic inspection result indicating whether that voice data is defective may be output. Step S132 may determine a piece of voice data to be normal voice data when its loss value is less than the threshold, and to be defective voice data when its loss value is greater than or equal to the threshold. In this way, the automatic inspection method 4 of the present invention can perform data inspection using deep learning even when no initially verified data is provided by the user.

Meanwhile, the automatic voice data inspection method using deep learning of the present invention may further include a component for increasing the size of the pre-training sets used to generate the pre-acoustic models in step S121.

To this end, the automatic voice data inspection method using deep learning of the present invention may further include a step of performing unsupervised verification learning (not shown in the drawings).

When the n pieces of voice data are acquired in step S111, the automatic inspection method of the present invention may pass the n pieces of voice data to the step of performing unsupervised verification learning. In this step, a plurality of virtual voice data items may be generated from the n pieces of voice data through a generative adversarial network (GAN). A GAN is a representative unsupervised learning method in which a generator and a discriminator respectively generate and verify data, so that virtual data that is difficult to distinguish from real data is produced.

Accordingly, by applying a GAN to the n pieces of voice data acquired in the present invention, the step of performing unsupervised verification learning can generate a plurality of virtual voice data items. Using these items to generate the pre-acoustic models effectively increases the amount of initially acquired voice data, so that a more accurate inspection acoustic model may be generated.

An embodiment of the present invention has been described above, but the spirit of the present invention is not limited to the embodiment presented in this specification. A person skilled in the art who understands the spirit of the present invention may readily propose other embodiments by adding, changing, or deleting components within the scope of the same spirit, and such embodiments also fall within the scope of the spirit of the present invention.

1: Automatic voice data inspection system using deep learning
2: Voice data
3: Evaluation result
4: Automatic voice data inspection method using deep learning
11: Voice data preprocessing unit
12: Pre-acoustic processing unit
13: Inspection acoustic generation unit
111: Voice data acquisition module
112: Voice data classification module
121: Pre-acoustic model generation module
122: Verification set output module
131: Inspection acoustic model generation module
132: Acoustic inspection result output module

Claims

An automatic voice data inspection system using deep learning, comprising:
a voice data preprocessing unit that acquires and preprocesses n pieces of voice data and outputs voice data classified according to a preset classification;
a pre-acoustic processing unit that generates a pre-acoustic model using the classified voice data and, using the generated pre-acoustic model, outputs a verification set, which is a set of voice data selected from among the classified voice data; and
an inspection acoustic generation unit that generates an inspection acoustic model using the verification set, determines whether each of the n pieces of voice data is defective using the generated inspection acoustic model, and outputs the result.

The automatic voice data inspection system using deep learning according to claim 1,
wherein the voice data preprocessing unit comprises:
a voice data acquisition module that acquires the n pieces of voice data; and
a voice data classification module that classifies the acquired n pieces of voice data into m pre-training sets using a preset classification criterion.

The automatic voice data inspection system using deep learning according to claim 1,
wherein the pre-acoustic processing unit comprises:
a pre-acoustic model generation module that generates m pre-acoustic models by performing deep learning on the classified voice data; and
a verification set output module that applies the n pieces of voice data to each of the m pre-acoustic models to obtain, for each piece of voice data, m loss values with respect to the m pre-acoustic models, and uses the loss values to select and output the verification set from among the n pieces of voice data.

The automatic voice data inspection system using deep learning according to claim 3,
wherein the pre-acoustic model generation module performs deep learning on all of the n pieces of voice data, stops the deep learning when the loss value converges to a specific value, and sets the specific value as a threshold.

The automatic voice data inspection system using deep learning according to claim 4,
wherein the pre-acoustic model generation module performs deep learning on the m pre-training sets classified using the preset classification criterion to generate each of the m pre-acoustic models, stops the deep learning when the loss value falls to or below the threshold during the deep learning on the m pre-training sets, and performs the deep learning a preset number of times when the loss value does not fall to or below the threshold.

The automatic voice data inspection system using deep learning according to claim 3,
wherein the verification set output module obtains, for each piece of voice data, an average loss value that is the average of the m loss values obtained for that piece, and outputs the verification set and a training set using the average loss value of each piece of voice data,
wherein the verification set includes the voice data whose average loss value falls within the lowest 10% of the average loss values, and
wherein the training set includes the voice data whose average loss value falls in the range above the lowest 10% and up to the lowest 50% of the average loss values.

The automatic voice data inspection system using deep learning according to claim 6,
wherein the loss value is obtained using Equation 1 below.
[Equation 1]

(where x: input speech feature sequence; π: a path that can correspond to the correct label; y: probability of the corresponding label at each time step; T: length of the speech feature sequence; l: correct label sequence; β⁻¹(l): set of all paths that yield the correct label once blank and repeated labels are removed)

The automatic voice data inspection system using deep learning according to claim 1,
wherein the inspection acoustic generation unit comprises:
an inspection acoustic model generation module that generates the inspection acoustic model by performing deep learning on the verification set; and
an acoustic inspection result output module that outputs an acoustic inspection result, including whether each piece of voice data is defective, based on the loss value of each piece of voice data obtained by applying the n pieces of voice data to the generated inspection acoustic model.

The automatic voice data inspection system using deep learning according to claim 8,
wherein the acoustic inspection result output module generates the acoustic inspection result by determining a piece of voice data to be normal voice data when its loss value is less than the threshold, and to be defective voice data when its loss value is greater than or equal to the threshold.

An automatic voice data inspection method using deep learning, comprising:
acquiring and preprocessing n pieces of voice data in a voice data preprocessing unit and outputting voice data classified according to a preset classification;
generating, in a pre-acoustic processing unit, a pre-acoustic model using the classified voice data and outputting, using the generated pre-acoustic model, a verification set, which is a set of voice data selected from among the classified voice data; and
generating, in an inspection acoustic generation unit, an inspection acoustic model using the verification set, determining whether each of the n pieces of voice data is defective using the generated inspection acoustic model, and outputting the result.

The automatic voice data inspection method using deep learning according to claim 10,
wherein the step of outputting the classified voice data comprises:
acquiring the n pieces of voice data; and
classifying the acquired n pieces of voice data into m pre-training sets using a preset classification criterion.

The automatic voice data inspection method using deep learning according to claim 10,
wherein the step of outputting the verification set comprises:
generating m pre-acoustic models by performing deep learning on the classified voice data; and
applying the n pieces of voice data to each of the m pre-acoustic models to obtain, for each piece of voice data, m loss values with respect to the m pre-acoustic models, and using the loss values to select and output the verification set from among the n pieces of voice data.

The automatic voice data inspection method using deep learning according to claim 12,
wherein the step of generating the pre-acoustic models performs deep learning on all of the n pieces of voice data, stops the deep learning when the loss value converges to a specific value, and sets the specific value as a threshold.

The automatic voice data inspection method using deep learning according to claim 13,
wherein the step of generating the pre-acoustic models performs deep learning on the m pre-training sets classified using the preset classification criterion to generate each of the m pre-acoustic models, stops the deep learning when the loss value falls to or below the threshold during the deep learning on the m pre-training sets, and performs the deep learning a preset number of times when the loss value does not fall to or below the threshold.

The automatic voice data inspection method using deep learning according to claim 12,
wherein the step of selecting and outputting the verification set obtains, for each piece of voice data, an average loss value that is the average of the m loss values obtained for that piece, and outputs the verification set and a training set using the average loss value of each piece of voice data,
wherein the verification set includes the voice data whose average loss value falls within the lowest 10% of the average loss values, and
wherein the training set includes the voice data whose average loss value falls in the range above the lowest 10% and up to the lowest 50% of the average loss values.

The automatic voice data inspection method using deep learning according to claim 15,
wherein the loss value is obtained using Equation 2 below.
[Equation 2]

(where x: input speech feature sequence; π: a path that can correspond to the correct label; y: probability of the corresponding label at each time step; T: length of the speech feature sequence; l: correct label sequence; β⁻¹(l): set of all paths that yield the correct label once blank and repeated labels are removed)

The automatic voice data inspection method using deep learning according to claim 10,
wherein the step of determining whether the voice data is defective and outputting the result comprises:
generating the inspection acoustic model by performing deep learning on the verification set; and
outputting an acoustic inspection result, including whether each piece of voice data is defective, based on the loss value of each piece of voice data obtained by applying the n pieces of voice data to the generated inspection acoustic model.

The automatic voice data inspection method using deep learning according to claim 17,
wherein the step of outputting the acoustic inspection result generates the acoustic inspection result by determining a piece of voice data to be normal voice data when its loss value is less than the threshold, and to be defective voice data when its loss value is greater than or equal to the threshold.