KR102596521B1

KR102596521B1 - Method and system for analyzing language development disorder and behavior development disorder by processing video information input to the camera and audio information input to the microphone in real time

Info

Publication number: KR102596521B1
Application number: KR1020210067967A
Authority: KR
Inventors: 고연정; 윤재민
Original assignee: 보리 주식회사
Priority date: 2020-05-27
Filing date: 2021-05-26
Publication date: 2023-10-31
Also published as: KR20210146825A

Abstract

The present invention relates to a method and system for analyzing language disorders and behavioral disorders. More specifically, it relates to a method and system that processes video information input from a camera and audio information input from a microphone in real time, performs language analysis (morphological, syntactic, and semantic analysis) on the speech-recognized sentences to analyze articulation-phonological disorders, language development disorders, and fluency disorders, analyzes behavioral disorders from the video, and provides the analyzed results as statistics.
To this end, video and audio are captured in real time through a camera and microphone and transmitted to the server as a stream. A face recognition function checks whether the face is registered and, if not, the face is registered. If the face is registered, the system checks whether the person is registered as a speaker and, if not, registers the person as a speaker. Once the speaker is registered, per-speaker speech recognition is performed from the video, language is analyzed on the speech-recognized sentences, behavioral disorders are analyzed from the video, and statistical values are calculated from the language analysis results. For audio-only input, the audio sent by the audio transmission unit is received in real time, the system checks whether the speaker is registered, and the speaker is registered if not. The invention thus provides a system and method for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and audio information input from a microphone in real time.

Description

Method and system for analyzing language development disorder and behavior development disorder by processing video information input to the camera and audio information input to the microphone in real time

The present invention relates to a method and system for analyzing language disorders and behavioral disorders. More specifically, it relates to a method and system that processes video information input from a camera and audio information input from a microphone in real time, performs language analysis (morphological, syntactic, and semantic analysis) on the speech-recognized sentences to analyze articulation-phonological disorders, language development disorders, and fluency disorders, analyzes behavioral disorders from the video, and provides the analyzed results as statistics.

For accurate assessment of language development disorders and behavioral development disorders, technology that analyzes the subject's video and audio in real time is most important.

To process video and audio in real time, the video and audio data must be delivered from the client to the server by streaming, and the speakers must be accurately separated from the delivered video and audio so that each speaker's utterances can be analyzed.

Conventionally, play between a child and a parent, or between a child and a therapist, was recorded as video and manually uploaded to a server. A person then analyzed the video to detect speech and separate the speakers, after which speech recognition was performed, fluency was analyzed manually, and the results were output as graphs. Analyzing language development disorders therefore took considerable time and cost.

As prior art, a speech therapy method and device for children with developmental disabilities has already been developed (see Patent No. 2152500). When customizing therapy for an individual child, it searches a database of speech therapy histories, extracts the few treatment histories most similar to the child's current symptoms and speech state, and reprocesses them to create and provide a new individualized speech therapy curriculum.

KR Patent No. 2152500 (2020.08.31)

The present invention is proposed to solve the problems of the prior art described above. It concerns a method and system that processes video and audio in real time to analyze language development disorders and behavioral development disorders: without separate video or audio recording, it receives video and audio together, or audio alone, from various multimedia devices and automatically performs speech recognition for each speaker in real time.

To achieve this object, the system for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and audio information input from a microphone in real time consists of a client and a server.

The client comprises a video/audio transmission unit that captures video and audio in real time through a camera and microphone and transmits them to the server as a stream, an audio transmission unit that captures audio in real time through a microphone and transmits it to the server as a stream, and a result display unit that displays the language analysis or behavior analysis results as a graph.

The server comprises a video/audio reception unit that receives in real time the video and audio sent by the client's video/audio transmission unit, a face analysis unit that determines by a face recognition function whether a face is registered, a face registration unit that registers the face if it is not registered, a face speaker analysis unit that determines by a speaker recognition function whether a speaker is registered, a face speaker registration unit that registers the speaker if the speaker is not registered, a video speaker speech recognition unit that performs per-speaker speech recognition from the video if the speaker is registered, a language analysis unit that analyzes language on the speech-recognized sentences, a behavior analysis unit that analyzes behavioral disorders from the video, an analysis result unit that calculates statistical values from the language analysis results, an audio reception unit that receives in real time the audio sent by the client's audio transmission unit, an audio speaker analysis unit that checks whether the speaker is registered, and a speaker registration unit that registers the speaker if the speaker is not registered.

Here, the video speaker speech recognition unit of the server detects speech in the video by separating it into minimum units, classifies two or more speakers using consecutive combinations of the minimum units, matches the speech of the combination with the most accurate speaker classification to a speaker, and performs speech recognition for each speaker.

Likewise, the audio speaker speech recognition unit of the server detects speech in the audio by separating it into minimum units, classifies two or more speakers using consecutive combinations of the minimum units, matches the speech of the combination with the most accurate speaker classification to a speaker, and performs speech recognition for each speaker.

The behavior analysis unit of the server performs language analysis and behavior analysis together when analyzing sudden shouting. When no video information is available, it analyzes the audio alone, using a separate machine-learning sound classifier to classify sounds such as screams and shouts.

The language analysis unit of the server analyzes articulation-phonological disorders, language development disorders, and fluency disorders on the speech-recognized sentences, using a morphological analyzer, a syntactic analyzer, and a semantic analyzer according to the analysis indicators.

Meanwhile, the method of analyzing language development disorders and behavioral development disorders by processing video information input from a camera and audio information input from a microphone in real time comprises: the client capturing video and audio in real time through a camera and microphone and transmitting them to the server as a stream; the server receiving the transmitted video and audio in real time; determining by a face recognition function whether the face is registered; registering the face if it is not registered; determining by a speaker recognition function whether the speaker is registered; registering the speaker if the speaker is not registered; performing per-speaker speech recognition from the video if the speaker is registered; analyzing language on the speech-recognized sentences; analyzing behavioral disorders from the video; calculating statistical values from the language analysis results; and displaying the language analysis or behavior analysis results as a graph.

The method further comprises: the client capturing audio in real time through a microphone and transmitting it to the server as a stream; the server receiving the transmitted audio in real time; checking whether the speaker is registered; registering the speaker if the speaker is not registered; performing per-speaker speech recognition from the audio if the speaker is registered; analyzing language on the speech-recognized sentences; calculating statistical values from the language analysis results; and displaying the language analysis results as a graph.

The method and system of the present invention, configured as described above, provide the following useful effects.

1) Language development disorders and behavioral development disorders are analyzed automatically whether video and audio are input together or audio alone is input.

2) Video and audio are processed in real time, without separate video or audio recording, to analyze language development disorders and behavioral development disorders.

3) Video and audio can be collected in real time from various multimedia devices such as robots, tablets, smartphones, electronic pens, holograms, digital signage, TVs, and cars to analyze language development disorders and behavioral development disorders.

Figure 1 is a block diagram of a system for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and audio information input from a microphone in real time according to the present invention;
Figure 2 is a schematic diagram of speaker identification and speech recognition from an audio stream according to the present invention;
Figure 3 is a procedure diagram of speaker identification and speech recognition from an audio stream according to the present invention;
Figure 4 is a summary screen of the result display unit according to the present invention;
Figure 5 is a graph screen of the result display unit according to the present invention.

Hereinafter, preferred embodiments in which the object of the present invention can be concretely realized are described in detail with reference to the attached drawings. In describing the embodiments, the same names are used for the same components, and redundant descriptions are omitted.

Figure 1 is a block diagram of a system for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and audio information input from a microphone in real time according to the present invention.

As shown in Figure 1, the system according to the present invention consists of a client (100) and a server (200).

The client (100) consists of an audio transmission unit (101), a video/audio transmission unit (102), and a result display unit (103).

The server (200) consists of an audio reception unit (201), a video/audio reception unit (202), a face analysis unit (203), a face registration unit (204), a face speaker analysis unit (205), a face speaker registration unit (206), a video speaker speech recognition unit (207), a language analysis unit (208), a behavior analysis unit (209), an analysis result unit (210), an audio speaker analysis unit (211), an audio speaker registration unit (212), and an audio speaker speech recognition unit (212).

The video/audio transmission unit (102) of the client (100) captures video and audio in real time through a camera and microphone and transmits them to the server (200) as a stream.

The video/audio reception unit (202) of the server (200) receives in real time the video and audio sent by the video/audio transmission unit (102) and passes them to the face analysis unit (203).

The face analysis unit (203) determines by a face recognition function whether the face is registered.

The face registration unit (204) registers the face if it is not registered.

The face speaker analysis unit (205) determines by a speaker recognition function whether the speaker is registered.

The face speaker registration unit (206) registers the speaker if the speaker is not registered.

The video speaker speech recognition unit (207) performs per-speaker speech recognition from the video if the speaker is registered.

That is, as shown in Figures 2 and 3, the server (200) detects speech in the video by separating it into minimum units, classifies two or more speakers using consecutive combinations of the minimum units, matches the speech of the combination with the most accurate speaker classification to a speaker, and performs speech recognition for each speaker.

The language analysis unit (208) analyzes language on the speech-recognized sentences.

The behavior analysis unit (209) analyzes behavioral disorders from the video.

The analysis result unit (210) calculates statistical values from the language analysis results.

The result display unit (103) displays the language analysis or behavior analysis results as a graph.

The audio transmission unit (101) captures audio in real time through the microphone of the client (100) and transmits it to the server (200) as a stream.

The audio reception unit (201) receives in real time the audio sent by the audio transmission unit (101).

The audio speaker analysis unit (211) checks whether the speaker is registered.

The speaker registration unit (212) registers the speaker if the speaker is not registered.

The audio speaker speech recognition unit (212) performs per-speaker speech recognition from the audio if the speaker is registered.

That is, as shown in Figures 2 and 3, the server (200) detects speech in the audio by separating it into minimum units, classifies two or more speakers using consecutive combinations of the minimum units, matches the speech of the combination with the most accurate speaker classification to a speaker, and performs speech recognition for each speaker.

As shown in Figure 1, the method of the present invention for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and audio information input from a microphone in real time begins by capturing video and audio in real time through a camera and microphone and transmitting them to the server as a stream.

The video and audio sent by the client are received in real time.

A face recognition function determines whether the face is registered.

If the face is not registered, it is registered.

A speaker recognition function determines whether the speaker is registered.

If the speaker is not registered, the speaker is registered.

If the speaker is registered, per-speaker speech recognition is performed from the video.

Language is analyzed on the speech-recognized sentences.

Behavioral disorders are analyzed from the video.

Statistical values are calculated from the language analysis results.

Finally, the language analysis or behavior analysis results are displayed as a graph.

For audio-only input, audio is captured in real time through the client-side microphone and transmitted to the server as a stream.

The audio sent by the client is received in real time.

The system checks whether the speaker is registered.

If the speaker is not registered, the speaker is registered.

If the speaker is registered, per-speaker speech recognition is performed from the audio.

Finally, the language analysis results are displayed as a graph.
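The registration checks in the two paths above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Registry` class, its set-based storage, and the string identifiers are hypothetical stand-ins for the outputs of the face recognition and speaker recognition functions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Registry:
    """Hypothetical in-memory store of registered faces and speakers."""
    faces: set = field(default_factory=set)
    speakers: set = field(default_factory=set)


def ensure_registered(reg: Registry, speaker_id: str,
                      face_id: Optional[str] = None) -> list:
    """Apply the checks in order and return the registration actions taken.

    face_id is None on the audio-only path, which skips the face checks.
    """
    actions = []
    if face_id is not None and face_id not in reg.faces:
        reg.faces.add(face_id)            # face not registered: register it
        actions.append("register_face")
    if speaker_id not in reg.speakers:
        reg.speakers.add(speaker_id)      # speaker not registered: register
        actions.append("register_speaker")
    return actions
```

An empty result means both checks passed, so per-speaker speech recognition can proceed without any registration step.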

Category: Articulation-phonological disorder
Details:
- Whole-word accuracy (PWC)
- Phonological mean length of utterance (PMLU)
- Whole-word proximity (PWP)
- Speech intelligibility
- Consonant accuracy
- Vowel accuracy
- Articulation accuracy
- Word intelligibility
- Progressive assimilation: a following phoneme changes under the influence of the preceding consonant or vowel, as in 칼날 [칼랄]
- Regressive assimilation: a preceding phoneme changes under the influence of the following consonant or vowel, as in 편리 [펼리], 같이 [가치], and 독립 [동닙]
- Accuracy of plosives, fricatives, affricates, nasals, and liquids

Category: Language development disorder
Details:
- Analysis of the semantic relations of utterances
- Lexical diversity of utterances (TTR)
- Mean length of utterance in words (MLUw)
- Mean length of utterance in morphemes (MLUm)

Category: Fluency disorder
Details:
- Stuttering type: repetition, prolongation, block
- Stuttering rate
- Total disfluency index (sum of the repetition, block, and prolongation disfluency indices)

Table 1 above describes the language analysis.

The language analysis covers articulation-phonological disorders, language development disorders, and fluency disorders.

The language analysis applies a morphological analyzer, a syntactic analyzer, and a semantic analyzer to the speech-recognized sentences according to the analysis indicators (PWC, PMLU, PWP, and so on), and each indicator is computed by its own formula.

For articulation-phonological disorders, the analysis covers whole-word accuracy (PWC), phonological mean length of utterance (PMLU), whole-word proximity (PWP), speech intelligibility, consonant accuracy, vowel accuracy, articulation accuracy, word intelligibility, progressive assimilation, regressive assimilation, and the accuracy of plosives, fricatives, affricates, nasals, and liquids.

For example, whole-word accuracy (PWC) is (number of correctly pronounced words / total number of words).
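The PWC ratio can be written as a short function. This is a sketch only: the function name, the assumption that target and produced words are already aligned one-to-one, and the use of plain string equality to decide whether a word is "correct" are all illustrative choices not specified in the text.

```python
def whole_word_accuracy(target_words, produced_words):
    """PWC = correctly pronounced words / total words.

    Both lists hold pronunciation transcriptions and are assumed to be
    aligned, so position i in each list refers to the same word.
    """
    if len(target_words) != len(produced_words):
        raise ValueError("word lists must be aligned one-to-one")
    if not target_words:
        return 0.0
    correct = sum(t == p for t, p in zip(target_words, produced_words))
    return correct / len(target_words)
```

With three target words of which two are produced correctly, the function returns 2/3.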

For language development disorders, the analysis covers the semantic relations of utterances, the lexical diversity of utterances (TTR), mean length of utterance in words (MLUw), and mean length of utterance in morphemes (MLUm).
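Three of these indicators have standard definitions that are easy to sketch: TTR is distinct words divided by total words, MLUw is total words divided by the number of utterances, and MLUm is total morphemes divided by the number of utterances. The input layout below (each word given as a list of morphemes, as a morphological analyzer might return) and the function name are assumptions for illustration.

```python
def lexical_measures(utterances):
    """Compute TTR, MLUw, and MLUm.

    utterances: list of utterances; each utterance is a list of words;
    each word is a list of morpheme strings (assumed analyzer output).
    """
    n_utt = len(utterances)
    words = ["".join(w) for u in utterances for w in u]
    morphemes = [m for u in utterances for w in u for m in w]
    if n_utt == 0 or not words:
        return {"TTR": 0.0, "MLUw": 0.0, "MLUm": 0.0}
    return {
        "TTR": len(set(words)) / len(words),   # word types / word tokens
        "MLUw": len(words) / n_utt,            # words per utterance
        "MLUm": len(morphemes) / n_utt,        # morphemes per utterance
    }
```

For two utterances containing four words (three distinct) and seven morphemes, this gives TTR 0.75, MLUw 2.0, and MLUm 3.5.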

For fluency disorders, the analysis covers stuttering type, stuttering rate, and the total disfluency index.

The stuttering type analysis checks for repetitions, prolongations, and blocks, and the total disfluency index is the sum of the repetition, block, and prolongation disfluency indices.
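The summation can be sketched as below. The text does not define how each per-type index is computed, so this example assumes, purely for illustration, that an index is the number of occurrences per 100 syllables; the function name and event labels are likewise hypothetical.

```python
def disfluency_summary(events, total_syllables):
    """Sum per-type disfluency indices into a total disfluency index.

    events: list of labels, each "repetition", "prolongation", or "block".
    Assumption: each index = occurrences per 100 syllables.
    """
    if total_syllables <= 0:
        raise ValueError("total_syllables must be positive")
    counts = {"repetition": 0, "prolongation": 0, "block": 0}
    for label in events:
        if label not in counts:
            raise ValueError(f"unknown disfluency type: {label}")
        counts[label] += 1
    indices = {k: 100.0 * v / total_syllables for k, v in counts.items()}
    return {"indices": indices, "total": sum(indices.values())}
```

Whatever formula each index actually follows, only the final `sum` step reflects what the text states.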

Category: Behavioral disorder
Details:
- Constantly looking around
- Suddenly clapping hands
- Jumping in place
- Shaking the head back and forth or side to side
- Turning the head from side to side
- Rocking the upper body back and forth or side to side
- Constantly moving the arms and legs
- Suddenly standing up from a seated position
- Pushing another person
- Hitting another person
- Suddenly shouting

Table 2 above describes the behavior analysis.

The behavior analysis examines real-time video to determine whether the subject shows a behavioral disorder.

The types of behavioral disorder include constantly looking around, suddenly clapping hands, jumping in place, shaking the head back and forth or side to side, turning the head from side to side, rocking the upper body back and forth or side to side, constantly moving the arms and legs, suddenly standing up from a seated position, pushing another person, hitting another person, and suddenly shouting.

In particular, sudden shouting is analyzed by performing language analysis and behavior analysis together. When no video information is available it is analyzed from the audio alone, which requires a separate machine-learning sound classifier to classify screams, shouts, and similar sounds.

Figure 2 is a schematic diagram of speaker identification and speech recognition from an audio stream according to the present invention.

The video speaker speech recognition unit (207) and the audio speaker speech recognition unit (212) both separate the minimum units of speech (V1, V2, V3, and so on) from the audio stream (301) (302).

Various speech detection algorithms are used for this separation. Common detection methods include those based on Mahalanobis distance, zero-crossing rate (ZCR), energy, MFCC, and neural network algorithms, and one or more methods are often combined.
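As one concrete combination, short-time energy and ZCR can be thresholded per frame to find runs of speech. The sketch below is illustrative only: the frame length, both thresholds, and the function names are assumptions, and real systems typically tune them or use one of the other listed methods.

```python
def frame_features(samples, frame_len):
    """Yield (short-time energy, zero-crossing rate) per non-overlapping frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        yield energy, crossings / (frame_len - 1)


def detect_units(samples, frame_len=4, energy_min=0.01, zcr_max=0.6):
    """Return (start_frame, end_frame) pairs, end exclusive, of speech runs.

    A frame counts as speech when its energy is high enough and its ZCR is
    low enough. frame_len must be at least 2.
    """
    units, start, i = [], None, -1
    for i, (energy, zcr) in enumerate(frame_features(samples, frame_len)):
        is_speech = energy >= energy_min and zcr <= zcr_max
        if is_speech and start is None:
            start = i                     # a speech run begins
        elif not is_speech and start is not None:
            units.append((start, i))      # the run ended at frame i
            start = None
    if start is not None:
        units.append((start, i + 1))      # run continues to the last frame
    return units
```

Each returned pair plays the role of one minimum unit (V1, V2, and so on), expressed in frame indices.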

After the speech is separated into minimum units, the utterances of the same speaker are bundled in sequence (303).

In the figure, V1 and V2 are classified as Speaker1, and V3 and V4 as Speaker2.

Speech recognition (304) is then performed on each bundle of speech from the same speaker.
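The bundling step (303) can be illustrated with Python's `itertools.groupby`, which groups consecutive items that share a key. The tuple layout and function name here are assumptions; the speaker labels are taken as already assigned.

```python
from itertools import groupby


def bundle_by_speaker(units):
    """Group consecutive units with the same speaker label.

    units: list of (unit_id, speaker_label) pairs in time order.
    Returns a list of (speaker_label, [unit_ids]) bundles.
    """
    return [(speaker, [unit_id for unit_id, _ in run])
            for speaker, run in groupby(units, key=lambda u: u[1])]
```

Applied to the example in the figure, V1 and V2 form one Speaker1 bundle and V3 and V4 form one Speaker2 bundle. If Speaker1 spoke again later, that would start a new bundle rather than merge with the first.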

화자인식에는 문장종속 화자인식과 문장독립 화자인식이 있으며, 본 발명에서는 두가지 방법을 다 사용할 수 있다.Speaker recognition includes sentence-dependent speaker recognition and sentence-independent speaker recognition, and both methods can be used in the present invention.

That is, sentence-independent speaker recognition compares the input minimum units of voice with one another to determine whether they come from the same speaker.

Sentence-dependent speaker recognition first registers each speaker's voice in advance; each input minimum unit of voice is then compared against the registered speakers to determine which of them it belongs to. Speaker recognition uses one or more of various methods in combination, such as GMM (Gaussian Mixture Model), CSS (Cosine Similarity Scoring), SVM (Support Vector Machine), and DNN (Deep Neural Network).
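To make the register-then-compare idea concrete, here is a minimal sketch of cosine similarity scoring against pre-registered speakers. It assumes each voice unit and each registered speaker is already represented by a fixed-length feature vector; how those vectors are produced is outside the sketch. The names identify and enrolled and the 0.7 threshold are assumptions for illustration, not values from the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def identify(unit_vec, enrolled, threshold=0.7):
    """Return (speaker, score) for the best-matching registered speaker.

    The speaker is None when no registered speaker reaches the threshold,
    which is the case where a registration step would be needed.
    """
    best, best_score = None, -1.0
    for name, ref in enrolled.items():
        score = cosine(unit_vec, ref)
        if score > best_score:
            best, best_score = name, score
    return (best if best_score >= threshold else None), best_score

enrolled = {"Speaker1": [1.0, 0.0, 0.0], "Speaker2": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], enrolled)[0])  # Speaker1
print(identify([0.0, 0.0, 1.0], enrolled)[0])  # None
```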

Figure 3 is a flowchart of the procedure for speaker identification and voice recognition from a voice stream according to the present invention.

The procedure for speaker identification and voice recognition from a voice stream is as follows.

First, a video stream is received (401), and the voice stream is separated from the video (402).

The separated voice stream is then taken as input, and the voice is detected by separating it into minimum units (404).

When only a voice stream is input (408), that voice stream is taken as input directly, and the voice is detected by separating it into minimum units (404).

Next, two or more speakers are identified from consecutive combinations of the detected minimum units (405), and the voices in the combination that yields the most accurate speaker classification are matched to the speakers (406).

Finally, voice recognition is performed for each speaker (407).
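The patent does not spell out how the "most accurate combination" in steps 405 and 406 is chosen. The sketch below shows one possible reading, offered purely as an assumption: enumerate every way to cut the sequence of minimum units into consecutive spans, score each span's mean feature vector against the registered speakers by cosine similarity, and keep the split with the highest length-weighted score. All function names and the scoring rule are hypothetical, and the brute-force search is exponential in the number of units, so it only suits short sequences.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def contiguous_partitions(n):
    """Yield every split of range(n) into consecutive (start, end) spans, end exclusive."""
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield list(zip(bounds, bounds[1:]))

def best_speaker(span_vectors, enrolled):
    """Score a span's mean vector against each registered speaker; return (name, score).

    Assumes at least one registered speaker.
    """
    dim = len(span_vectors[0])
    mean = [sum(v[d] for v in span_vectors) / len(span_vectors) for d in range(dim)]
    return max(((name, cosine(mean, ref)) for name, ref in enrolled.items()),
               key=lambda pair: pair[1])

def match_units_to_speakers(units, enrolled):
    """Return (assignment, score) for the highest-scoring consecutive split."""
    n = len(units)
    best_score, best_assignment = -1.0, []
    for spans in contiguous_partitions(n):
        labeled = [(best_speaker(units[s:e], enrolled), (s, e)) for s, e in spans]
        # Weight each span's similarity by the number of units it covers.
        score = sum(sim * (e - s) for (_, sim), (s, e) in labeled) / n
        if score > best_score:
            best_score = score
            best_assignment = [(name, span) for (name, _), span in labeled]
    return best_assignment, best_score

enrolled = {"Speaker1": [1.0, 0.0], "Speaker2": [0.0, 1.0]}
units = [[1.0, 0.4], [1.0, -0.4], [0.4, 1.0], [-0.4, 1.0]]  # stand-ins for V1..V4
assignment, score = match_units_to_speakers(units, enrolled)
print(assignment)  # [('Speaker1', (0, 2)), ('Speaker2', (2, 4))]
```

In this toy input, averaging V1 with V2 and V3 with V4 cancels the per-unit noise, so the split into two spans scores higher than treating each unit separately, which mirrors the Speaker1/Speaker2 grouping shown in Figure 2.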

Figure 4 shows the summary screen of the result display unit according to the present invention.

The summary screen of the result display unit can present the levels of language comprehension and language expression as developmental ages.

It can also express the degree of articulation and phonological disorder, language development disorder, fluency disorder, behavioral disorder, and so on as specific numerical values.

Figure 5 shows the graph screen of the result display unit according to the present invention.

The graph screen of the result display unit displays graphs of the degree of articulation and phonological disorder, language development disorder, fluency disorder, behavioral disorder, and so on by age, by period, and by level.

For example, suppose the current treatment subject has a chronological age of 6 years and a language development age of 31 months, with an overall word accuracy of 60%. This is about 20 percentage points lower than the 80% overall word accuracy of typically developing children. The period statistics show, by day, week, month, and year, how the language and behavioral disorders are improving.
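The figures in this example can be reproduced with simple arithmetic. The short sketch below is illustrative only: the function names are hypothetical, and the delay in months is derived here from the stated ages rather than being an indicator the patent defines.

```python
def language_delay_months(chronological_years, language_age_months):
    """Gap between chronological age and language development age, in months."""
    return chronological_years * 12 - language_age_months

def accuracy_gap_points(subject_pct, reference_pct):
    """Difference in overall word accuracy, in percentage points."""
    return reference_pct - subject_pct

print(language_delay_months(6, 31))  # 41
print(accuracy_gap_points(60, 80))   # 20
```

A 6-year-old is 72 months old, so a language development age of 31 months corresponds to a gap of 41 months, and 60% against 80% is a gap of 20 percentage points.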

As described above, in the present invention, video and audio are captured in real time through the client-side camera and microphone and transmitted to the server as a stream. The face recognition function checks whether the face is registered; if it is not, the face is registered. If the face is registered, the system checks whether the person is registered as a speaker; if not, the person is registered as a speaker. Once the person is registered as a speaker, per-speaker voice recognition is performed on the video. Language analysis (morphological, syntactic, and semantic analysis) is then performed on the recognized sentences to analyze articulation and phonological disorders, language development disorders, and fluency disorders; behavioral disorders are analyzed from the video; and the analysis results are provided as statistics.

Alternatively, when only voice is captured in real time through the client-side microphone and transmitted to the server as a stream, the system checks whether the speaker is registered; if not, the speaker is registered. Once the speaker is registered, per-speaker voice recognition is performed on the voice, followed by the language analysis step, and the analysis results are provided as statistics.
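The two server-side paths summarized above (video with audio, and audio only) can be laid out as a small control-flow sketch. This is a simplified illustration under stated assumptions: registered faces and speakers are modeled as plain sets of identifiers, processing is assumed to continue after each registration, and the step names and the function route_stream are hypothetical labels rather than the patent's units.

```python
def route_stream(person_id, has_video, faces, speakers):
    """Return the ordered processing steps for one incoming stream.

    faces and speakers are sets of registered identifiers and are
    updated in place when a registration step runs.
    """
    steps = []
    if has_video and person_id not in faces:
        faces.add(person_id)
        steps.append("register_face")
    if person_id not in speakers:
        speakers.add(person_id)
        steps.append("register_speaker")
    steps.append("speaker_voice_recognition")
    steps.append("language_analysis")
    if has_video:
        steps.append("behavior_analysis")   # behavior is analyzed from video only
    steps.append("statistics")
    return steps

faces, speakers = set(), set()
print(route_stream("child-01", True, faces, speakers))
# ['register_face', 'register_speaker', 'speaker_voice_recognition',
#  'language_analysis', 'behavior_analysis', 'statistics']
print(route_stream("child-01", False, faces, speakers))
# ['speaker_voice_recognition', 'language_analysis', 'statistics']
```

A real system would decide registration by matching face and voice features rather than by a known identifier; the sets here only stand in for that lookup.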

In the per-speaker voice recognition process for the video or voice, the server detects the voice by separating it into minimum units, classifies two or more speakers from consecutive combinations of those minimum units, matches the voices in the combination that yields the most accurate speaker classification to the speakers, and performs voice recognition for each speaker.

Preferred embodiments of the present invention have been described above. It is obvious to those of ordinary skill in the art that, in addition to these embodiments, the present invention can be embodied in other specific forms without departing from its spirit or scope.

Therefore, the embodiments described above should be regarded as illustrative rather than restrictive, and the present invention is not limited to the above description but may be modified within the scope of the appended claims and their equivalents.

100...client 200...server

Claims

A system for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and voice information input from a microphone in real time, the system comprising:
a client comprising
a video/audio transmission unit that captures video and audio in real time through a camera and a microphone and transmits them to a server as a stream,
a voice transmission unit that captures voice in real time through the microphone and transmits it to the server as a stream, and
a result display unit that displays language analysis or behavior analysis results as a graph; and
the server, comprising
a video/audio receiving unit that receives, in real time, the video and audio transmitted from the video/audio transmission unit of the client,
a face analysis unit that uses a face recognition function to determine whether a face is registered,
a face registration unit that registers the face if the face is not registered,
a face speaker analysis unit that uses a speaker recognition function to determine whether the person is registered as a speaker,
a face speaker registration unit that registers the person as a speaker if the person is not registered as a speaker,
a video speaker voice recognition unit that, if the person is registered as a speaker, performs per-speaker voice recognition on the video,
a language analysis unit that analyzes language in the voice-recognized sentences,
a behavior analysis unit that analyzes behavioral disorders from the video,
an analysis result unit that calculates statistical values from the language analysis results,
a voice receiving unit that receives, in real time, the voice transmitted from the voice transmission unit of the client, and
a voice speaker analysis unit that checks whether the speaker is registered as a speaker,
the server further being configured to register the speaker if the speaker is not registered as a speaker;
wherein the video speaker voice recognition unit of the server detects the voice by separating it from the video into minimum units, classifies two or more speakers from consecutive combinations of the minimum units, matches the voices in the combination that yields the most accurate speaker classification to the speakers, and performs voice recognition for each speaker, and
wherein the speaker voice recognition unit of the server detects the voice by separating the corresponding voice into minimum units, classifies two or more speakers from consecutive combinations of the minimum units, matches the voices in the combination that yields the most accurate speaker classification to the speakers, and performs voice recognition for each speaker.

(Deleted)

The system of claim 1 for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and voice information input from a microphone in real time,
wherein the behavior analysis unit of the server performs language analysis and behavior analysis simultaneously when analyzing sudden shouting and, when there is no video information, analyzes the voice using a separate machine-learning sound classifier that classifies sounds including screams and shouts.

The system of claim 1 for analyzing language development disorders and behavioral development disorders by processing video information input from a camera and voice information input from a microphone in real time,
wherein the language analysis unit of the server analyzes the voice-recognized sentences for articulation and phonological disorders, language development disorders, and fluency disorders, using a morphological analyzer, a syntactic analyzer, and a semantic analyzer according to analysis indicators.

(Deleted)