KR20240021622A - Method and server for processing voices embedded in videos

Info

Publication number: KR20240021622A
Application number: KR1020220100177A
Authority: KR
Inventors: 김덕석; 김재경; 박준혁
Original assignee: 주식회사 엠티이지
Priority date: 2022-08-10
Filing date: 2022-08-10
Publication date: 2024-02-19

Abstract

Disclosed are a method, a server, and a device, the method comprising: obtaining a target video that relates to a medical practice and includes voice; obtaining a white keyword that is highly relevant to the medical practice; determining, among a plurality of wordings included in the voice, a white wording corresponding to the white keyword; and performing voice processing so that problem sections, that is, sections of the voice other than the white section corresponding to the white wording, are muted.

Description

Method and server for processing voices embedded in videos

The present disclosure relates to a method of processing voice included in a medical video. More specifically, it relates to determining main sections in the voice across the entire video, distinguishing main sections from non-main sections, muting the voice in the non-main sections, and providing voice signal information included in the voice-processed video to a user account over a mobile communication network between the user account and a network server.

Endoscopic equipment fitted with cameras is now used in many fields, and imaging devices that are inserted into a subject's body or that film a procedure or surgery have become smaller and more common, so video is increasingly captured during medical and surgical operations. With advances in big data and artificial intelligence, such video is also being processed into medical information content, and there is growing demand for fast search and analysis of the key situations in a video, for purposes such as research on standardizing techniques and checking how tools are used during surgery. In surgical evaluations such as medical review and assessment, videos have so far been analyzed and evaluated on the basis of review-request materials consisting of text and images. Because such evaluation relies on limited information, accurate verification is difficult.

Voice signals produced during a procedure or surgery can be important, but a video may also contain unnecessary voice. Such a video may carry unnecessary data volume and may raise privacy concerns. It can therefore be more efficient to obtain information by editing the video or by selectively consuming the audio and video data of only some sections. If the voice in a video is limited to meaningful voice, a user listening to or reviewing the video can check only the important and necessary parts. Accordingly, there is a need for a method that distinguishes main sections from non-main sections and provides only the voice of the main sections.

Korean Patent Application Publication No. 10-2021-0120936 (2021.10.07), "Voice interaction method, device, electronic device, readable storage medium, and computer program product"

The problem to be solved by the present disclosure is to provide a method and service that, in processing voice included in a video, determines and distinguishes main and non-main sections according to the voice in the video, mutes the voice signal in the non-main sections so that only meaningful voice is obtained, and provides the resulting video through communication with a user account over a mobile communication network.

The problems to be solved by the present disclosure are not limited to the technical problems described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a method by which a device and/or server processes voice included in a video according to a first aspect of the present disclosure may include: obtaining a target video that relates to a medical practice and includes voice; obtaining a white keyword that is highly relevant to the medical practice; determining, among a plurality of wordings included in the voice, a white wording corresponding to the white keyword; and performing voice processing so that problem sections, that is, sections of the voice other than the white section corresponding to the white wording, are muted.

In addition, the white keyword may include a medical term, and the performing of the voice processing may mute the voice for every section of the voice other than the white section.

In addition, the white section may include a section corresponding to a sentence that contains the white wording.
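The white-section and problem-section steps described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes speech recognition has already produced sentences with timestamps, and it matches keywords by simple case-insensitive substring search. The `Sentence` type, the function names, and the sample transcript are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def white_sections(sentences, white_keywords):
    """Return (start, end) spans of sentences that contain any white keyword."""
    keywords = [k.lower() for k in white_keywords]
    return [(s.start, s.end) for s in sentences
            if any(k in s.text.lower() for k in keywords)]


def problem_sections(white, duration):
    """Complement of the white spans within [0, duration]; these spans get muted."""
    muted, cursor = [], 0.0
    for start, end in sorted(white):
        if start > cursor:
            muted.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        muted.append((cursor, duration))
    return muted


# Hypothetical transcript of a 12-second clip.
transcript = [
    Sentence(0.0, 4.0, "Did you watch the game last night?"),
    Sentence(4.0, 9.0, "Clamp the cystic artery before the incision."),
    Sentence(9.0, 12.0, "I am so tired today."),
]
white = white_sections(transcript, ["artery", "incision"])
muted = problem_sections(white, duration=12.0)
```

Here the whole sentence containing a white wording is kept, which corresponds to the sentence-level white section, and everything else in the clip falls into the muted problem sections.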

In addition, the method may further include: obtaining a black keyword that has low relevance to the medical practice; and determining, among the plurality of wordings included in the voice, a black wording corresponding to the black keyword, wherein the performing of the voice processing may perform voice processing so that a black section corresponding to the black wording is muted.

In addition, the method may further include: determining, for a plurality of time points, the medical-practice contribution of each of a plurality of people appearing in the target video; and performing voice processing so that voice obtained from a person whose medical-practice contribution is below a preset level is muted.
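One way to read the contribution-based step is sketched below. The text does not say how contribution is scored or how speakers are separated, so this sketch assumes both are available upstream: each voice segment carries a speaker label, and each speaker has contribution scores sampled at several time points. The names `contribution_at` and `low_contribution_spans`, the score scale, and the threshold are hypothetical.

```python
from bisect import bisect_right


def contribution_at(samples, t):
    """samples: time-sorted list of (time, score).

    Returns the latest score sampled at or before time t, or 0.0 if none exists.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t) - 1
    return samples[i][1] if i >= 0 else 0.0


def low_contribution_spans(segments, contributions, threshold):
    """Return spans spoken by people whose contribution is below the threshold."""
    return [(start, end) for start, end, speaker in segments
            if contribution_at(contributions.get(speaker, []), start) < threshold]


# Hypothetical diarized segments: (start, end, speaker).
segments = [(0.0, 3.0, "surgeon"), (3.0, 5.0, "visitor"), (5.0, 8.0, "nurse")]
# Hypothetical contribution scores sampled at several time points.
contributions = {
    "surgeon": [(0.0, 0.9)],
    "nurse": [(0.0, 0.2), (5.0, 0.7)],
    "visitor": [(0.0, 0.1)],
}
muted = low_contribution_spans(segments, contributions, threshold=0.5)
```

Because the score is looked up per time point, the nurse's segment at 5.0 s is kept even though the nurse's earlier score was below the threshold.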

In addition, the method may further include: storing the white wording corresponding to the white keyword and converting it into text; determining whether the white keyword corresponds to a preset surgery; and, when the white keyword corresponds to the preset surgery, providing an index result for the white wording converted into text.
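The indexing step can be illustrated with a small sketch. It assumes the white wordings have already been converted to text and time-stamped, and it treats "corresponds to a preset surgery" as membership in a keyword set chosen for that surgery. That reading, the `build_index` name, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(hits, preset_surgery_keywords):
    """hits: list of (keyword, start_seconds, text) for detected white wordings.

    Returns {keyword: [(start_seconds, text), ...]} restricted to keywords
    that belong to the preset surgery.
    """
    index = defaultdict(list)
    for keyword, start, text in hits:
        if keyword in preset_surgery_keywords:
            index[keyword].append((start, text))
    return dict(index)


# Hypothetical detected white wordings.
hits = [
    ("incision", 4.0, "Begin the incision."),
    ("suture", 50.0, "Prepare the suture."),
    ("incision", 61.5, "Extend the incision."),
]
index = build_index(hits, preset_surgery_keywords={"incision"})
```

A user interface could then list each indexed keyword with its timestamps so that the user can jump to the matching section of the video.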

In addition, the method may further include: obtaining a black keyword that has low relevance to the medical practice; determining, among the plurality of wordings included in the voice, a black wording corresponding to the black keyword; and determining a security level for the voice processing, wherein the performing of the voice processing may mute, based on the security level, the voice for at least one of: every section other than the white section, a black section corresponding to the black wording, and a black video section determined through image analysis.

In addition, when the black section or the black video section overlaps the white section, the voice for the overlapping section may be muted based on the security level.
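The security-level behavior can be sketched as a policy over time spans. The passage above states only that the muted set and the treatment of overlaps depend on the security level, so the three-level mapping below is entirely hypothetical, as are the function names. Black video sections from image analysis would be handled the same way as black sections and are omitted for brevity.

```python
def merge(spans):
    """Merge overlapping or touching (start, end) spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(spans, holes):
    """Remove every hole span from every span and return the remaining pieces."""
    out = []
    for start, end in spans:
        pieces = [(start, end)]
        for hole_start, hole_end in holes:
            remaining = []
            for ps, pe in pieces:
                if hole_end <= ps or hole_start >= pe:  # no overlap with this piece
                    remaining.append((ps, pe))
                    continue
                if ps < hole_start:
                    remaining.append((ps, hole_start))
                if hole_end < pe:
                    remaining.append((hole_end, pe))
            pieces = remaining
        out.extend(pieces)
    return out


def mute_plan(white, black, duration, security_level):
    """Hypothetical policy (not taken from the disclosure):

    level 1: mute black spans, but keep any part that overlaps a white span;
    level 2: mute black spans in full, so overlaps with white spans are muted;
    level 3: mute everything outside white spans, plus black spans in full.
    """
    if security_level == 1:
        return merge(subtract(black, white))
    if security_level == 2:
        return merge(black)
    non_white = subtract([(0.0, duration)], white)
    return merge(non_white + black)


white = [(4.0, 9.0)]
black = [(8.0, 11.0)]  # overlaps the white span between 8.0 s and 9.0 s
plans = {level: mute_plan(white, black, 12.0, level) for level in (1, 2, 3)}
```

In this sketch the overlapping second between 8.0 s and 9.0 s stays audible at level 1 and is muted at levels 2 and 3.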

A device and/or server according to a second aspect of the present disclosure may include: a receiving unit that obtains a target video that relates to a medical practice and includes voice; and a processor that obtains a white keyword that is highly relevant to the medical practice, determines, among a plurality of wordings included in the voice, a white wording corresponding to the white keyword, and performs voice processing so that problem sections, that is, sections of the voice other than the white section corresponding to the white wording, are muted.

In addition, the white keyword may include a medical term, and the processor may mute the voice for every section of the voice other than the white section.

A third aspect of the present disclosure may include a non-transitory computer-readable recording medium on which a program for implementing the first aspect is recorded.

According to an embodiment of the present disclosure, meaningful voice regions of a surgical video are arranged in time series and provided, so the user receives only voice from which unnecessary voice signals have been filtered out, which can improve user satisfaction.

In addition, because voice with private or subjective content in a surgical video is filtered out, efficiency can be improved and concerns about intruding on the personal conversations of doctors or nurses can be reduced.

In addition, because voice regions containing medical terms are identified and only the voice of those regions is provided, voice sections can be determined with high accuracy, which can improve efficiency.

In addition, because voice regions containing private or subjective terms are identified and muted, voice in unnecessary regions can be filtered out, which can improve efficiency.

In addition, because only the voices of the medical staff taking part in the surgery are separated out of the surgical video and provided, voice that is highly relevant to the surgery and highly accurate can be provided, which can improve efficiency.

In addition, because index results are provided so that sections associated with medical terms can be offered according to the user's selection, the user can check the video and voice of the region of interest, which can improve user satisfaction.

In addition, because the voice is muted in different ways depending on the security level of the video as input by the user, the range of voice provided can be adjusted. The user is given a mute region and range adjusted to the importance and security of the video, which can improve satisfaction.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a schematic configuration diagram of a system, including a device and/or a server, that processes voice included in a video according to an embodiment.
FIG. 2 is a block diagram schematically showing the configuration of a server according to an embodiment.
FIG. 3 is a flowchart illustrating each step in which a device and/or server operates according to an embodiment.
FIG. 4 is a flowchart illustrating each step in which a device and/or server performs voice processing according to an embodiment.
FIG. 5 is a diagram illustrating an example in which a device and/or server determines a mute section according to an embodiment.
FIG. 6 is a diagram illustrating an example in which a device and/or server determines an overlapping section to be a mute section according to a security level, according to an embodiment.
FIG. 7 is a diagram illustrating an example in which a device and/or server provides index results according to an embodiment.

The advantages and features of the present disclosure, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure is complete and so that those skilled in the art are fully informed of the scope of the present disclosure.

The terms used in this specification are for describing the embodiments and are not intended to limit the present disclosure. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used in the specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more elements other than those mentioned. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the mentioned elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are not limited by these terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present disclosure.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those skilled in the art. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one element and other elements as shown in the drawings. Spatially relative terms should be understood to cover different orientations of the elements in use or operation in addition to the orientation shown in the drawings. For example, if an element shown in a drawing is turned over, an element described as "below" or "beneath" another element may be placed "above" that other element. Accordingly, the exemplary term "below" may encompass both the below and above orientations. Elements may also be oriented in other directions, and spatially relative terms may be interpreted according to the orientation.

Hereinafter, embodiments will be described in detail with reference to the drawings.

FIG. 1 is a diagram illustrating a system, including a device 100 and/or a server 190, that processes voice included in a video according to an embodiment.

As shown in FIG. 1, the voice processing system may include an information acquisition device 110, a device 100, an internal server 190, an external server 130, a storage medium 140, a communication device 150, a virtual server 160, a user terminal 170, a network, and the like.

However, those of ordinary skill in the art will understand that general-purpose components other than those shown in FIG. 1 may further be included in the voice processing system. For example, the voice processing system may further include a blockchain server (not shown) that operates in conjunction with the network. Those of ordinary skill in the art will also understand that, according to another embodiment, some of the components shown in FIG. 1 may be omitted.

The voice processing apparatus 10 according to an embodiment, which includes the device 100 and/or the internal server 190, may obtain information related to a medical practice such as surgery from the information acquisition device 110. The information acquisition device 110 may include, but is not limited to, an imaging device, a recording device, and a biosignal acquisition device. Biosignals may include, without limitation, signals obtained from a living body, such as body temperature, pulse, respiration, blood pressure, electromyography, and brain wave signals. An imaging device, as one example of the information acquisition device 110, may include, but is not limited to, a first imaging device that films the operating room as a whole (e.g., CCTV) and a second imaging device that films the surgical site closely (e.g., an endoscope).

The device 100 according to an embodiment may obtain images (video, still images, etc.) related to a medical practice such as surgery from the information acquisition device 110. The device 100 may process the voice included in the obtained images. Voice processing according to an embodiment may include, but is not limited to, naming, encoding, storing, transmitting, editing, muting, and generating metadata for each image.

The voice processing apparatus 10 according to an embodiment may transmit the medical-practice-related information obtained from the information acquisition device 110 to the network, either as is or after updating it. The information that the voice processing apparatus 10 transmits to the network may be forwarded through the network to the external devices 130, 140, 150, 160, and 170. For example, the network may forward the information transmitted by the device 100, as is or after updating it, to the external server 130, the storage medium 140, the communication device 150, the virtual server 160, the user terminal 170, and the like. The device 100 may also receive information (e.g., feedback information, update requests) from the external devices 130, 140, 150, 160, and 170. The communication device 150 may refer, without limitation, to any device used for communication (e.g., a gateway), and may communicate with devices that are not directly connected to the network, such as the user terminal 180.

The device 100 and/or the server 190 according to an embodiment may include an input unit, an output unit, a processor, a memory, and the like, as described later, and may also include a display device (not shown). For example, through the display device the user can check the communication status, memory usage, power status (e.g., the battery's state of charge, whether external power is supplied), thumbnails of stored videos, and the current operating mode. The display device may be a liquid crystal display, a thin film transistor-liquid crystal display, an organic light-emitting diode display, a flexible display, a 3D display, an electrophoretic display, or the like. Depending on the implementation, the display device may include two or more displays. When the display and a touchpad form a layered structure and are configured as a touch screen, the display can be used as an input device as well as an output device.

In addition, the network may carry communication over wired or wireless links. For example, the network may be implemented as a kind of server and may include a Wi-Fi chip, a Bluetooth chip, a wireless communication chip, an NFC chip, and the like. The device 100 and/or the server 190 may likewise communicate with various external devices using a Wi-Fi chip, a Bluetooth chip, a wireless communication chip, an NFC chip, and the like. The Wi-Fi chip and the Bluetooth chip communicate using Wi-Fi and Bluetooth, respectively. When a Wi-Fi chip or a Bluetooth chip is used, connection information such as an SSID and a session key is exchanged first, a communication connection is established using that information, and various information is then transmitted and received. The wireless communication chip may communicate according to various communication standards such as IEEE, ZigBee, 3G (3rd Generation), 3GPP (3rd Generation Partnership Project), and LTE (Long Term Evolution). The NFC chip may operate in the NFC (Near Field Communication) method, which uses the 13.56 MHz band among various RF-ID frequency bands such as 135 kHz, 13.56 MHz, 433 MHz, 860 to 960 MHz, and 2.45 GHz.

The input unit according to an embodiment may refer to a means by which a user inputs data for controlling the device 100 and/or the server 190. For example, the input unit may include, but is not limited to, a key pad, a dome switch, a touch pad (contact capacitive, pressure resistive, infrared sensing, surface ultrasonic conduction, integral tension measurement, piezo effect, etc.), a jog wheel, and a jog switch.

The output unit according to an embodiment may output an audio signal, a video signal, or a vibration signal, and may include a display device, a sound output device, a vibration motor, and the like.

The user terminal 170 according to an embodiment may include, but is not limited to, various wired and wireless communication devices such as a smartphone, a SmartPad, and a tablet PC.

The voice processing apparatus 10 according to an embodiment may update the medical-practice-related information obtained from the information acquisition device 110. For example, the device 100 may perform naming, encoding, storing, transmitting, editing, metadata generation, and the like on the images obtained from the information acquisition device 110. As one example, the device 100 may name an image file using the metadata of the obtained image (e.g., its creation time). As another example, the voice processing apparatus 10 may classify the images related to a medical practice that are obtained from the information acquisition device 110. Using a trained AI model, the voice processing apparatus 10 may classify images and voice signals related to a medical practice by various criteria such as the type of surgery, the operator, and the location of the surgery.
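As a small illustration of naming a file from its metadata, the sketch below builds a file name from a creation time. The naming pattern, the `source` prefix, and the function name are hypothetical; the text above states only that metadata such as the creation time may be used for naming.

```python
from datetime import datetime


def name_from_metadata(created_at, source="endoscope", ext="mp4"):
    """Build a sortable file name from the creation time in the metadata."""
    return f"{source}_{created_at:%Y%m%dT%H%M%S}.{ext}"


# Hypothetical creation time read from the video's metadata.
name = name_from_metadata(datetime(2022, 8, 10, 9, 30, 0))
```

Putting the timestamp in year-month-day order means that sorting the names alphabetically also sorts the files chronologically.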

In one embodiment, the device 100 and the server 190 may operate in conjunction, and the operations that the voice processing apparatus 10 performs for voice processing may be carried out by the server 190 or by the device 100. For example, the device 100 may operate as the server 190. The description below refers uniformly to the server 190, but in actual operation the acting entity may be understood to be the device 100 or the voice processing apparatus 10 rather than the server 190.

FIG. 2 is a block diagram schematically showing the configuration of the server 190 according to an embodiment.

Referring to FIG. 2, the server 190 may include a receiving unit 210, a processor 220, an output unit 230, and a memory 240.

The receiving unit 210 according to an embodiment may obtain, from a user account or the information acquisition device 110, a target video that relates to a medical practice and includes voice. In one embodiment, the target video is a surgery video and may include the entire length of the surgery video together with the voice recorded in it.

The processor 220 according to an embodiment may obtain a white keyword that is highly relevant to the medical practice. The processor 220 may also determine, among a plurality of wordings included in the voice of the target video obtained by the receiving unit 210, a white wording corresponding to the white keyword. The processor 220 may further determine, within the entire voice, the white section corresponding to the white wording, and may perform voice processing so that the problem sections outside the white section are muted.
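Once the problem sections are known, the muting itself can be as simple as zeroing the audio samples that fall inside them. The sketch below assumes the audio is available as a plain list of PCM samples at a known sample rate; the `apply_mute` name and the toy signal are hypothetical, and a real system would more likely operate on decoded audio frames.

```python
def apply_mute(samples, sample_rate, muted_spans):
    """Return a copy of samples with every (start, end) span in seconds set to zero."""
    out = list(samples)
    for start, end in muted_spans:
        lo = max(0, int(start * sample_rate))
        hi = min(len(out), int(end * sample_rate))
        out[lo:hi] = [0] * max(0, hi - lo)
    return out


# Toy signal: 8 samples at 4 Hz, i.e. 2 seconds of audio.
samples = [5, 5, 5, 5, 5, 5, 5, 5]
result = apply_mute(samples, sample_rate=4, muted_spans=[(0.0, 0.5), (1.5, 2.0)])
```

The input list is left unchanged, so the original audio remains available if a different set of muted spans needs to be applied later.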

The output unit 230 according to an embodiment may output the voice-processed video for the target video, obtained from the processor 220, to the user account. The user can thus monitor the voice-processed video.

The memory 240 according to an embodiment may store various information generated in the process of obtaining the voice-processed video, such as the target video obtained from the receiving unit 210 and the white keywords, white wordings, white sections, and problem sections obtained from the processor 220.

In addition, while the server 190 obtains the target video from the user account or the information acquisition device 110 through the receiving unit 210 and outputs the voice-processed video for the target video to the user account through the output unit 230, these elements may be connected through any combination of conventional networks, such as the Internet or a mobile communication network, without particular limitation.

In addition, those of ordinary skill in the art will understand that general-purpose components other than those shown in FIG. 2 may be further included in the server 190. For example, the server 190 may further include a plurality of memories (not shown) that store information about the target video, obtained from the user account or the information acquisition device 110, that is related to a medical practice and includes audio. Those of ordinary skill in the art will also understand that, according to another embodiment, some of the components shown in FIG. 2 may be omitted.

The server 190 according to an embodiment may be used by a user and may interoperate with any kind of handheld wireless communication device equipped with a touch screen panel, such as a mobile phone, smartphone, PDA (Personal Digital Assistant), PMP (Portable Multimedia Player), or tablet PC. It may also be included in, or interoperate with, a device on which applications can be installed and run, such as a desktop PC, tablet PC, laptop PC, or an IPTV including a set-top box.

The server 190 may be implemented as a terminal, such as a computer, that operates through a computer program for realizing the functions described in this specification.

The server 190 according to an embodiment may include, but is not limited to, a system (not shown) that processes audio included in a video, a related server (not shown), and a related device. The server 190 according to an embodiment may support an application that performs voice processing.

The following description focuses on an embodiment in which the server 190 independently processes the audio included in a video; as described above, however, the processing may also be performed in conjunction with a device. That is, the server 190 and the device 100 according to an embodiment may be implemented in an integrated manner in terms of their functions, the device 100 may be omitted from the description, and the disclosure is not limited to any one embodiment.

FIG. 3 is a flowchart showing each step of the operation of the device 100 and/or the server 190 according to an embodiment.

In step S310, the server 190 may obtain a target video that is related to a medical practice and includes audio. In one embodiment, the target video may include a surgery video and may include every section of the surgery video, including its audio. Accordingly, the server 190 may obtain, from the user account or the information acquisition device 110, a target video that includes the audio of the entire duration of the surgery.

In step S320, the server 190 may obtain white keywords that are highly relevant to the medical practice. The white keywords may be determined by an artificial-intelligence-based algorithm that identifies keywords related to medical practice, or may be preset by an administrator. The white keywords may also be determined differently depending on the type of surgery and the surgical area shown in the surgery video. In one embodiment, the white keywords may include medical terms, meaning the terms for medical procedures and surgery that are used during surgery according to the type of surgery and the surgical area.

In step S330, the server 190 may determine, among the plurality of wordings included in the audio, a white wording corresponding to a white keyword. For example, a white keyword may be the same word as a wording, and a white wording may be a wording, among the plurality of wordings in the audio of the video, that contains a previously obtained white keyword or is identical to it. The server 190 may thereby determine the white wordings in the audio of the video.
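The matching in step S330 can be sketched as follows. This is only an illustration under stated assumptions: the audio is assumed to have already been transcribed into wordings with timestamps (the specification does not name a speech recognizer), and the `Wording` type and the `find_white_wordings` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wording:
    """Hypothetical container for one recognized wording (not from the patent)."""
    text: str     # recognized word or phrase
    start: float  # start time in seconds
    end: float    # end time in seconds


def find_white_wordings(wordings, white_keywords):
    """Return the wordings that contain, or are identical to, a white keyword."""
    keywords = [k.lower() for k in white_keywords]
    return [w for w in wordings if any(k in w.text.lower() for k in keywords)]


# Example transcript; the keywords are illustrative, not taken from the patent.
wordings = [
    Wording("scalpel", 1.0, 1.5),
    Wording("lunch", 4.0, 4.5),
    Wording("suture needle", 9.0, 10.0),
]
white = find_white_wordings(wordings, ["scalpel", "suture"])
```

Substring matching covers both cases named in the text: a wording identical to the keyword and a wording that contains it.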

In step S340, the server 190 may perform voice processing so that, within the entire audio, problem sections other than the white section corresponding to the white wording are muted. In one embodiment, the white section may include a section defined by a preset range. For example, the server 190 may obtain the time point at which a white wording occurs in the audio of the video and determine, as the white section, the span covering a preset time range before and after that time point. In another embodiment, the white section may include a section corresponding to the sentence that contains the white wording. For example, the server 190 may obtain the position at which a white wording occurs in the audio of the video and determine the sentence containing the white wording as the white section. The number and position of sentences may be preset. In one embodiment, the server 190 may determine only the sentence containing the white wording as the white section, or may determine the sentence containing the white wording, the immediately preceding sentence, and the immediately following sentence as the white section according to the preset number of sentences and the preset position. For example, if the preset number of sentences is 2 and the preset position is 'before', two sentences, namely the sentence containing the white wording and the sentence immediately preceding it, may be determined as the white section. If the preset number is 3 and the preset positions are 'before' and 'after', the sentence containing the white wording, the immediately preceding sentence, and the immediately following sentence may be determined as the white section. A problem section is any region outside the white sections and may be regarded as an unnecessary region that does not correspond to any white keyword highly relevant to the medical practice.
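The two ways of deriving a white section described above (a preset time range around the wording, or a preset number of sentences around it) might look like the sketch below. The function names, the `margin` parameter, and the representation of sentences as `(start, end)` tuples are assumptions made for illustration; the specification does not prescribe them.

```python
def time_window_section(word_start, word_end, margin, duration):
    """White section = the wording's span widened by `margin` seconds on each side,
    clipped to the bounds of the audio."""
    return (max(0.0, word_start - margin), min(duration, word_end + margin))


def sentence_section(sentences, hit, count=1, positions=()):
    """White section built from whole sentences.

    sentences: list of (start, end) tuples in time order.
    hit:       index of the sentence containing the white wording.
    count:     preset number of sentences to keep (including the hit sentence).
    positions: any of "before" and "after", the preset positions.
    """
    first = last = hit
    extra = count - 1  # sentences to add beyond the hit sentence
    if "before" in positions and extra > 0 and first > 0:
        first -= 1
        extra -= 1
    if "after" in positions and extra > 0 and last < len(sentences) - 1:
        last += 1
        extra -= 1
    return (sentences[first][0], sentences[last][1])


sentences = [(0.0, 2.0), (2.0, 5.0), (5.0, 9.0), (9.0, 12.0)]
two_before = sentence_section(sentences, hit=2, count=2, positions=("before",))
three_around = sentence_section(sentences, hit=2, count=3, positions=("before", "after"))
```

With the example data, `two_before` spans the hit sentence and the one before it, and `three_around` adds the following sentence as well, mirroring the two worked examples in the text.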

Accordingly, the server 190 according to an embodiment may mute the audio in the problem sections, that is, everything in the audio of the entire video except the determined white sections. The server 190 may perform the muting in various ways, which are described with reference to FIG. 4.
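As a rough sketch of this muting step, the problem sections can be computed as the complement of the white sections and the corresponding samples set to zero. The helper names, the list-of-samples audio representation, and zeroing as the muting method are all assumptions for illustration; the specification leaves the muting method open.

```python
def merge(sections):
    """Merge overlapping or touching (start, end) sections."""
    merged = []
    for start, end in sorted(sections):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def problem_sections(white_sections, duration):
    """Everything in [0, duration] that is not covered by a white section."""
    result, cursor = [], 0.0
    for start, end in merge(white_sections):
        if start > cursor:
            result.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        result.append((cursor, duration))
    return result


def mute(samples, sample_rate, sections):
    """Return a copy of `samples` with the given sections set to silence (0)."""
    out = list(samples)
    for start, end in sections:
        lo = max(0, int(start * sample_rate))
        hi = min(len(out), int(end * sample_rate))
        for i in range(lo, hi):
            out[i] = 0
    return out


# Ten one-second "samples" at 1 Hz keep the example easy to read.
problem = problem_sections([(2, 4), (3, 6)], duration=10)
muted = mute([1] * 10, sample_rate=1, sections=problem)
```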

FIG. 4 is a flowchart showing each step in which the device 100 and/or the server 190 according to an embodiment performs voice processing.

Referring to FIG. 4, as described above, in step S411 the server 190 may obtain white keywords that include medical terms, and in step S413 the server 190 may perform voice processing so that the audio is muted for all sections other than the white wordings corresponding to those white keywords and the white sections.

Referring again to FIG. 4, in step S421 the server 190 may obtain black keywords that have low relevance to the medical practice. A black keyword may be a keyword unrelated to medical practice or medical terminology. For example, a preset algorithm may designate every keyword other than a medical term as a black keyword, or subjective, private keywords unrelated to medical practice, such as "pretty" or "handsome", may be designated as black keywords. In step S423, the server 190 may determine, among the plurality of wordings included in the audio, a black wording corresponding to a black keyword. As described for the white wording in step S330, a black wording may be a wording, among the plurality of wordings in the audio of the video, that contains a previously obtained black keyword or is identical to it. In step S425, the server 190 may perform voice processing so that the black section corresponding to the black wording is muted. As described for the white section in step S340, the server 190 may obtain the time point at which a black wording occurs in the audio of the video and determine a preset time range before and after that time point as the black section. The black section may also include a section corresponding to the sentence that contains the black wording. For example, the server 190 may obtain the position at which a black wording occurs in the audio of the video and determine the sentence containing the black wording as the black section. Accordingly, the server 190 according to an embodiment may mute the audio in the black sections determined within the audio of the entire video.
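Steps S421 to S425 could be sketched as below, using the time-range variant of the black section. The tuple layout `(text, start, end)`, the `margin` parameter, and the function name are hypothetical; only the example keywords "pretty" and "handsome" come from the text.

```python
def black_sections(wordings, black_keywords, margin, duration):
    """Return merged (start, end) sections to mute around each black wording.

    wordings: list of (text, start, end) tuples from a transcript (assumed input).
    margin:   preset time range, in seconds, added before and after each hit.
    """
    hits = sorted(
        (max(0.0, start - margin), min(duration, end + margin))
        for text, start, end in wordings
        if any(keyword in text for keyword in black_keywords)
    )
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            # Neighboring hits whose margins overlap become one black section.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


transcript = [("pretty", 3.0, 3.5), ("handsome", 4.0, 4.5), ("clamp", 6.0, 6.5)]
to_mute = black_sections(transcript, ["pretty", "handsome"], margin=0.5, duration=20.0)
```

Unlike the white-keyword path, only the returned sections are silenced here; the rest of the audio is left as it is.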

Referring to FIG. 4, in step S431 the server 190 may determine, for a plurality of time points, the medical practice contribution of each of a plurality of people appearing in the target video. For example, the contribution may be determined according to where each person is positioned. In surgery there may be a preset position for the operating surgeon, a preset position for the assistant doctor, a preset position for the anesthetist, a preset position for the nurse, and so on, and the contribution may be determined according to which of these positions each person occupies. The server 190 may assign the highest rank to the person at the operating surgeon's preset position and the second rank to the person at the assistant doctor's preset position. Likewise, the server 190 may assign the third rank to the person at the anesthetist's preset position and the fourth rank to the person at the nurse's preset position. The preset medical staff (for example, operating surgeon, assistant doctor, anesthetist, nurse) and the preset positions may be flexibly determined and changed by a user or an administrator. In one embodiment, the server 190 may provide the user account with a frame of the video in which all of the medical staff appear, and the contribution of each staff member may be determined by the user's selection input. For example, the user may select each of the people in turn and enter a rank for each, thereby determining each staff member's rank by contribution. In step S433, the server 190 may perform voice processing so that audio obtained from a person whose contribution is below a preset level is muted. For example, the server 190 may set the rank corresponding to the preset level to rank 3. The server 190 may then mute the audio obtained from the nurse, whose contribution corresponds to rank 4. The rank corresponding to the preset level may also be flexibly determined or changed by a user or an administrator.
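The rank-based muting of steps S431 and S433 can be illustrated with a small sketch. It assumes that each utterance has already been attributed to a role (for example by matching a speaker to a preset position), which the specification does not detail; `ROLE_RANK`, `sections_to_mute_by_rank`, and the role labels are hypothetical names, while the rank order and the threshold of rank 3 follow the example in the text.

```python
# Rank 1 is the highest contribution, following the example ordering in the text.
ROLE_RANK = {"surgeon": 1, "assistant": 2, "anesthetist": 3, "nurse": 4}


def sections_to_mute_by_rank(utterances, level_rank=3, role_rank=ROLE_RANK):
    """Return the (start, end) sections spoken by roles ranked below `level_rank`.

    utterances: list of (role, start, end) tuples (assumed to come from a prior
                speaker-attribution step that is outside this sketch).
    A numerically larger rank means a lower contribution.
    """
    return [
        (start, end)
        for role, start, end in utterances
        if role_rank[role] > level_rank
    ]


utterances = [("surgeon", 0, 3), ("nurse", 3, 5), ("anesthetist", 5, 6)]
muted_by_rank = sections_to_mute_by_rank(utterances)
```

Because the mapping and the threshold are plain parameters, a user or administrator could change either one, as the text allows.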

The server 190 according to an embodiment may determine a security level for voice processing. Based on the security level, the server 190 may mute the audio of at least one of the following: all sections other than the white sections, the black sections corresponding to black wordings, and the black video sections determined through video analysis. A black video section may differ from a black section. Surgical tools and situations can be identified through video analysis using an artificial-intelligence-based algorithm, and the video sections corresponding to non-key sections, that is, sections other than the key sections of the target video, may be determined as black video sections. The security level may be determined in advance by a user or an administrator. For example, the server 190 may obtain, from the user account, information on the security level of the surgery video. The security level may be 1 (high), 2 (medium), or 3 (low), and the administrator or user may enter the security level for each target video in advance. In one embodiment, when the security level of the obtained target video is 1 (high), the server 190 may judge the surgery video to be of very high importance, with a strong need to improve stability, and may therefore mute the audio of all sections other than the white sections. When the security level is 2 (medium), the server 190 may judge the surgery video to be less important than at level 1 (high) but still of some importance, and may therefore mute the audio of the black sections and the black video sections. When the security level is 3 (low), the server 190 may judge that removing only a minimum of audio is sufficient, and may therefore mute only the audio of the black sections. Because the server 190 determines the mute section 510 differently according to the security level, the user can be provided with audio from highly stable audio sections.
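The three-level policy described above amounts to a simple dispatch, sketched below. The function name and the list-of-tuples section format are assumptions; the three input lists are taken as already computed by earlier steps.

```python
def mute_sections_for_level(level, non_white, black, black_video):
    """Choose which sections to mute for security level 1 (high), 2 (medium), or 3 (low).

    non_white:   sections outside every white section.
    black:       sections around black wordings.
    black_video: sections flagged by video analysis.
    """
    if level == 1:
        return list(non_white)
    if level == 2:
        # Union of both kinds of black section; a fuller implementation
        # might also merge intervals that overlap.
        return sorted(set(black) | set(black_video))
    if level == 3:
        return list(black)
    raise ValueError(f"unknown security level: {level}")


non_white = [(0, 5), (9, 20)]
black = [(6, 7)]
black_video = [(7, 8)]
```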

FIG. 5 is a diagram illustrating an example in which the device 100 and/or the server 190 according to an embodiment determines the mute section 510.

Referring to FIG. 5, (a) of FIG. 5 shows the mute section 510 determined from the white wordings 513, 523, and 533 and the white sections 511, 521, and 531, and (b) of FIG. 5 shows the mute section 510 determined from the black wordings 515, 525, and 535 and the black sections 512, 522, and 532. As shown in (a) of FIG. 5, in one embodiment the server 190 may obtain the points in the audio of the entire target video at which the white wordings 513, 523, and 533 corresponding to the white keywords obtained as described above occur, and may determine the white sections 511, 521, and 531 corresponding to those points. The server 190 may then determine everything in the audio of the entire target video other than the white sections 511, 521, and 531 as the mute section 510, and may perform voice processing so that the audio in the determined mute section 510 is muted. As shown in (b) of FIG. 5, in one embodiment the mute section 510 may instead be determined by a method different from that of (a) of FIG. 5. For example, the server 190 may obtain the points in the audio of the entire target video at which the black wordings 515, 525, and 535 corresponding to the black keywords obtained as described above occur, and may determine the black sections 512, 522, and 532 corresponding to those points. The server 190 may then determine the black sections 512, 522, and 532 in the audio of the entire target video as the mute section 510, and may perform voice processing so that the audio in the determined mute section 510 is muted.

In addition, when the black sections 512, 522, and 532 or a black video section overlap the white sections 511, 521, and 531, the server 190 according to an embodiment may mute the audio of the overlapping section based on the security level. The server 190 according to an embodiment may also mute the audio of the overlapping section when such an overlap occurs. This is described with reference to FIG. 6.

FIG. 6 is a diagram illustrating an example in which the device 100 and/or the server 190 according to an embodiment determines an overlapping section as a mute section according to the security level.

Referring to FIG. 6, in one embodiment the audio of the entire video may contain sections in which the black sections 512 and 522 or the black video sections 611 and 621 overlap the white sections 511, 521, and 531. Because such an overlapping section includes a portion corresponding to the black sections 512 and 522 or the black video sections 611 and 621, it is likely to be an unnecessary audio section, and the server 190 may therefore determine the overlapping section as the mute section 510.

In one embodiment, the security level of the target video used for muting overlapping sections may be 1 (high), 2 (medium), 3 (medium-low), 4 (low), or 5 (lowest). The administrator or user may enter the security level for each target video in advance. The server 190 may determine the mute section 510 differently according to the security level, which reflects the importance of the surgery video. To show the overlaps between sections, FIG. 6 omits the fact that all sections other than the white sections 511, 521, and 531 form the mute section 510. Referring to FIG. 6, when the security level of the obtained target video is 1 (high), the server 190 may judge the surgery video to be of very high importance and may therefore determine every section in which the white sections 511, 521, and 531 overlap the black sections 512, 522, and 532 or the black video sections 611 and 621 as the mute section 510. Accordingly, every such overlapping section, together with all sections other than the white sections 511, 521, and 531, may be determined as the mute section 510. Because the black sections 512, 522, and 532 are unnecessary sections identified from the audio, they are likely to matter more in determining the mute section 510 than the video-based black video sections 611 and 621. When the security level is 2 (medium), the server 190 may therefore determine only the sections in which the white sections 511, 521, and 531 overlap the black sections 512, 522, and 532 as the mute section 510. In this case, sections in which the white sections 511, 521, and 531 overlap the black video sections 611 and 621 may remain unmuted. Accordingly, the sections in which the white sections 511, 521, and 531 overlap the black sections 512, 522, and 532, together with all sections other than the white sections 511, 521, and 531, may be determined as the mute section 510. When the security level is 3 (medium-low), the server 190 may determine only the sections in which the white sections 511, 521, and 531 overlap the black video sections 611 and 621 as the mute section 510. In this case, sections in which the white sections 511, 521, and 531 overlap the black sections 512, 522, and 532 may remain unmuted. Accordingly, the sections in which the white sections 511, 521, and 531 overlap the black video sections 611 and 621, together with all sections other than the white sections 511, 521, and 531, may be determined as the mute section 510. Further, the entire target video may contain sections in which all three kinds of section overlap, and when the security level is 4 (low), the server 190 may determine the sections in which the white sections 511, 521, and 531, the black sections 512, 522, and 532, and the black video sections 611 and 621 all overlap as the mute section 510. Accordingly, the sections in which all three overlap, together with all sections other than the white sections 511, 521, and 531, may be determined as the mute section 510. When the security level is 5 (lowest), the server 190 may disregard any overlap between the white sections 511, 521, and 531 and the black sections 512, 522, and 532 or the black video sections 611 and 621, and may determine all sections other than the white sections 511, 521, and 531 as the mute section 510, so that all audio in the white sections 511, 521, and 531 remains unmuted. Accordingly, when sections overlap, the server 190 determines which overlapping sections belong to the mute section 510 according to the security level, and can thereby provide audio sections with higher accuracy.
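The five-level handling of overlapping sections can be sketched with interval intersections. The sketch returns only the portions inside white sections that would additionally be muted at each level; muting of everything outside the white sections is assumed to happen separately. The names `intersect` and `overlap_mute` and the example intervals are illustrative, not taken from the specification.

```python
def intersect(a, b):
    """Pairwise intersection of two lists of (start, end) sections."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo < hi:
                out.append((lo, hi))
    return sorted(out)


def overlap_mute(level, white, black, black_video):
    """Portions of the white sections to mute for security levels 1 to 5."""
    if level == 1:   # any overlap with a black or black video section
        return sorted(intersect(white, black) + intersect(white, black_video))
    if level == 2:   # only overlaps with audio-based black sections
        return intersect(white, black)
    if level == 3:   # only overlaps with video-based black video sections
        return intersect(white, black_video)
    if level == 4:   # only where all three kinds of section overlap
        return intersect(intersect(white, black), black_video)
    if level == 5:   # overlaps are ignored; white sections stay unmuted
        return []
    raise ValueError(f"unknown security level: {level}")


white = [(0, 10), (20, 30)]
black = [(8, 12), (22, 24)]
black_video = [(9, 11), (28, 35)]
```

At level 1 the returned sections may overlap each other (here `(8, 10)` and `(9, 10)`); that is harmless for muting, though a fuller implementation might merge them.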

The server 190 according to an embodiment may provide index results for the white wordings 513, 523, and 533. This is described with reference to FIG. 7.

FIG. 7 is a diagram illustrating an example in which the device 100 and/or the server 190 according to an embodiment provides index results.

Referring to FIG. 7, the server 190 may provide a screen that displays a plurality of white wordings in the white information tables (731, 732, 733, 734, 735, 736, 737, 738) of the target video, and may provide the positions of the white section areas (711, 712, 713, 714, 715, 716, 717, 718) that respectively correspond to the white information tables (731, 732, 733, 734, 735, 736, 737, 738), so that the user can check them intuitively. For example, the user may select a desired white wording from among the plurality of white wordings, and the server 190 may provide the position of the white section area (711, 712, 713, 714, 715, 716, 717, 718) of the section that contains the selected white wording, so that the user can immediately check the video and voice of that area. Because the server 190 provides the index results, it can directly provide the position of the white section area corresponding to the white wording that the user wants to check, which may improve user convenience. In addition, the white wordings displayed in the white information tables (731, 732, 733, 734, 735, 736, 737, 738) may be determined according to the type of surgery and the plurality of white wordings included in the voice of the target video, and may also be set by an administrator or the user.
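The lookup described above, from a selected white wording to the positions of its white section areas, can be sketched as a small in-memory index. This is only an illustration under assumptions: the `WhiteSection` record, the `build_index` and `seek_positions` helpers, and the sample wordings and timestamps are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WhiteSection:
    """Hypothetical record: one white section of the target video."""
    wording: str      # white wording detected in the voice
    start_s: float    # section start, seconds from the beginning of the video
    end_s: float      # section end, seconds


def build_index(sections):
    """Map each white wording to its (start, end) sections in playback order."""
    index = defaultdict(list)
    for section in sorted(sections, key=lambda s: s.start_s):
        index[section.wording].append((section.start_s, section.end_s))
    return dict(index)


def seek_positions(index, wording):
    """Return the start times a player could jump to for the selected wording."""
    return [start for start, _ in index.get(wording, [])]


# Illustrative data only (assumed, not from the disclosure).
sections = [
    WhiteSection("suture", 125.0, 131.5),
    WhiteSection("incision", 42.0, 47.0),
    WhiteSection("suture", 310.0, 318.0),
]
index = build_index(sections)
```

With this sample data, `seek_positions(index, "suture")` returns the two start times of the sections containing that wording, and an unknown wording yields an empty list.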

According to an embodiment, the server 190 may store the white wording corresponding to the white keyword and convert it into text. The server 190 may store the text obtained by this conversion in the memory 240 or the like.

According to an embodiment, the server 190 may determine whether the white keyword corresponds to a preset surgery. If the white keyword corresponds to the preset surgery, the server 190 may provide index results for the white wording that has been converted into text. For example, the server 190 may determine the stage to which the white keyword corresponds (e.g., an anesthesia stage, a laparotomy stage, an operating stage) and obtain index information for that stage. The server 190 may merge the index information into the target video and provide the updated target video. The process of adding index information and updating the target video may proceed automatically, and a user who obtains the updated target video can use the index information to easily reach the playback time corresponding to the white keyword.
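One way to picture the stage-based indexing described above is a keyword-to-stage table whose hits are merged into the video's metadata as chapter markers. The sketch below is a minimal illustration under assumptions: the `STAGE_KEYWORDS` table, the keyword examples, the `chapters` metadata field, and all function names are hypothetical, and the disclosure does not specify how stages are matched or how index information is stored.

```python
# Hypothetical mapping from surgical stage to white keywords (assumed values).
STAGE_KEYWORDS = {
    "anesthesia": {"propofol", "intubation"},
    "laparotomy": {"incision", "retractor"},
    "operation": {"suture", "clamp"},
}


def stage_for(keyword):
    """Return the stage a white keyword belongs to, or None if it matches none."""
    for stage, keywords in STAGE_KEYWORDS.items():
        if keyword in keywords:
            return stage
    return None


def build_stage_index(hits):
    """From (keyword, start_s) hits, keep the earliest playback time per stage."""
    first_seen = {}
    for keyword, start_s in hits:
        stage = stage_for(keyword)
        if stage is None:
            continue  # keyword does not correspond to a preset stage
        if stage not in first_seen or start_s < first_seen[stage]:
            first_seen[stage] = start_s
    return sorted(first_seen.items(), key=lambda item: item[1])


def merge_index(video_meta, stage_index):
    """Return a copy of the video metadata with the index attached as chapters."""
    updated = dict(video_meta)
    updated["chapters"] = [{"stage": s, "start_s": t} for s, t in stage_index]
    return updated


# Illustrative hits only (assumed, not from the disclosure).
hits = [("suture", 900.0), ("incision", 300.0), ("intubation", 30.0),
        ("clamp", 850.0), ("lunch", 500.0)]
stage_index = build_stage_index(hits)
updated_meta = merge_index({"title": "target video"}, stage_index)
```

Keeping the index as metadata beside the video, rather than re-encoding the video itself, is a design choice of this sketch; a player can then jump to `start_s` for the stage the user selects.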

According to an embodiment, the meaningful voice regions of a surgical video are arranged in time series and provided, so the user hears only the voice from which unnecessary voice signals have been filtered out, which may improve user satisfaction. Because private or subjective speech in the surgical video is filtered out, efficiency may improve and the risk of intruding on the personal conversations of doctors or nurses may be reduced. In addition, because voice regions containing medical terms are identified and only the voice of those regions is provided, voice sections can be determined with high accuracy, and because voice regions containing private or subjective terms are identified and muted, voice in unnecessary regions can be filtered out. In addition, by distinguishing and providing only the voices of the medical staff participating in the surgery, voice that is highly relevant to the surgery can be provided with high accuracy. Because index results are provided so that sections associated with particular medical terms can be presented according to the user's selection, the user can check the video and voice of the region of interest, which may improve user satisfaction. In addition, because the range of voice provided can be adjusted by muting the voice in different ways according to the security level of the video input by the user, the user can be provided with a mute region and range adjusted to the importance and security of the video, which may improve satisfaction.
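The idea of muting differently depending on a security level can be illustrated with simple interval arithmetic over the timeline. The sketch below is hedged: the three level names and what each level mutes are assumptions chosen for illustration, since the disclosure only states that the muted range varies with the security level. Sections are represented as `(start, end)` pairs in seconds.

```python
def merge(intervals):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals, total):
    """Return the parts of [0, total] not covered by the intervals."""
    gaps, cursor = [], 0.0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        gaps.append((cursor, total))
    return gaps


def intersect(a, b):
    """Return the overlap of two interval lists."""
    a, b = merge(a), merge(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def mute_sections(level, total, white, black, black_video):
    """Hypothetical policy: which sections to mute at each assumed level."""
    if level == "high":    # mute everything non-white, plus black even over white
        return merge(complement(white, total) + black + black_video)
    if level == "medium":  # mute black sections, including where they overlap white
        return merge(black + black_video)
    if level == "low":     # mute black wording only where it does not overlap white
        return intersect(black, complement(white, total))
    raise ValueError(f"unknown security level: {level}")


# Illustrative sections only (assumed, not from the disclosure).
white, black, black_video = [(10, 30), (50, 70)], [(25, 35)], [(60, 65)]
```

In this sketch the handling of a black section that overlaps a white section is decided by the level: the overlap is muted at the "high" and "medium" levels and left audible at "low".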

Various embodiments of the present disclosure may be implemented as software including one or more instructions stored in a storage medium (e.g., a memory) readable by a machine (e.g., a display device or a computer). For example, a processor of the machine (e.g., the processor 220) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one called instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" only means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between a case where data is stored semi-permanently in the storage medium and a case where data is stored temporarily.

According to an embodiment, the methods according to the various embodiments disclosed herein may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least part of the computer program product may be at least temporarily stored in, or temporarily created in, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Although the present invention has been described with reference to the illustrated drawings, it is not limited to the disclosed embodiments and drawings, and a person of ordinary skill in the technical field related to the present embodiments will understand that it may be implemented in modified forms without departing from the essential characteristics of the above description. The disclosed methods should therefore be considered from an explanatory rather than a restrictive perspective. Even where an operational effect of a configuration of the present invention is not explicitly described in the description of the embodiments, an effect predictable from that configuration should also be recognized. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: device 190: server
210: receiving unit 220: processor
230: output unit 240: memory
511, 521, 531: white section 510: mute section
513, 523, 533: white wording
512, 522, 532: black section 515, 525, 535: black wording
611, 621: black video section
711, 712, 713, 714, 715, 716, 717, 718: white section area
731, 732, 733, 734, 735, 736, 737, 738: white information table

Claims

A method of processing a voice included in a video, the method comprising:
obtaining a target video that is related to a medical practice and includes a voice;
obtaining a white keyword highly relevant to the medical practice;
determining, among a plurality of wordings included in the voice, a white wording corresponding to the white keyword; and
performing voice processing such that a problem section, other than a white section corresponding to the white wording, among all sections of the voice is muted.

The method of claim 1, wherein
the white keyword includes a medical term, and
the performing of the voice processing comprises
muting the voice for all sections other than the white section among all sections of the voice.

The method of claim 1, wherein
the white section includes a section corresponding to a sentence that includes the white wording.

The method of claim 1, further comprising:
obtaining a black keyword with low relevance to the medical practice; and
determining, among the plurality of wordings included in the voice, a black wording corresponding to the black keyword,
wherein the performing of the voice processing comprises
performing the voice processing such that a black section corresponding to the black wording is muted.

The method of claim 1, further comprising:
determining, for a plurality of viewpoints, a contribution to the medical practice of each of a plurality of persons included in the target video; and
performing voice processing to mute the voice obtained from a person whose contribution to the medical practice is less than a preset level.

The method of claim 1, further comprising:
storing the white wording corresponding to the white keyword and converting it into text;
determining whether the white keyword corresponds to a preset surgery; and
if the white keyword corresponds to the preset surgery, providing an index result for the white wording converted into text.

The method of claim 1, further comprising:
obtaining a black keyword with low relevance to the medical practice;
determining, among the plurality of wordings included in the voice, a black wording corresponding to the black keyword; and
determining a security level for the voice processing,
wherein the performing of the voice processing comprises
muting the voice, based on the security level, for at least one of: all sections other than the white section, a black section corresponding to the black wording, and a black video section determined through video analysis.

The method of claim 7, wherein,
when the black section or the black video section overlaps the white section, the voice for the overlapping section is muted based on the security level.

A server that processes a voice included in a video, the server comprising:
a receiving unit that obtains a target video that is related to a medical practice and includes a voice; and
a processor that obtains a white keyword highly relevant to the medical practice, determines, among a plurality of wordings included in the voice, a white wording corresponding to the white keyword, and performs voice processing such that a problem section, other than a white section corresponding to the white wording, among all sections of the voice is muted.

The server of claim 9, wherein
the white keyword includes a medical term, and
the processor
mutes the voice for all sections other than the white section among all sections of the voice.

A non-transitory computer-readable recording medium on which a program for implementing the method of any one of claims 1 to 8 is recorded.