KR20220133894A

KR20220133894A - Systems and methods for analysis and determination of relationships from various data sources

Info

Publication number: KR20220133894A
Application number: KR1020227026311A
Authority: KR
Inventors: 존 형 리; 제임스 존슨 가드너; 저스틴 에드워즈; 그레고리 알렉산더 보생저; 데이비드 앤서니 스크립카; 레이첼 에이. 와그너-카이저
Original assignee: 케이피엠지 엘엘피
Priority date: 2019-12-30
Filing date: 2020-12-22
Publication date: 2022-10-05
Also published as: EP4085353A1; JP2023509437A; EP4085353A4; WO2021138163A1; AU2020418514A1; CA3163394A1

Abstract

The present invention relates to computer-implemented systems and methods for analyzing data from a variety of data sources. Embodiments of the systems and methods further provide for generating responses to specific questions based on the analyzed data, the generating comprising: retrieving relevant documents associated with the analyzed data; determining which information should be reported from which of the retrieved relevant documents; and providing the response based on the determination and a graph schema associated with the relevant documents.

Description

Systems and methods for analysis and determination of relationships from various data sources

[0001] This application is a continuation-in-part of, and claims the benefit of the filing date of, U.S. Patent Application Serial No. 16/159,088, filed on October 12, 2018, which claims the benefit of the filing date of U.S. Provisional Patent Application Serial No. 62/572,266, filed on October 13, 2017, and which is incorporated herein by reference in its entirety.

[0002] The present invention relates to systems and methods for analyzing data from various data sources and generating responses to specific questions based on the analyzed data.

[0003] The digitization of labor continues to progress as advances in machine learning, natural language processing, data analytics, mobile computing, and cloud computing are used in various combinations to replace certain processes and functions. Because solutions can be designed, tested, and implemented at relatively low cost, basic process automation can be implemented without significant IT investment. Enhanced process automation includes more advanced technologies that can use data to support machine learning elements. Machine learning tools can be used to discover naturally occurring patterns in data and to predict outcomes. Natural language processing tools are used to analyze text in context and extract desired information.

[0004] However, such digital tools are generally found in a variety of formats and coding languages, so they are difficult to integrate and are often not customized. As a result, such systems cannot provide automated solutions or answers to specific questions that require analysis and processing of various types of input data, for example, structured data, semi-structured data, unstructured data, images, and voice. For example, such systems currently cannot efficiently handle questions such as "Which of these 500 contracts do not comply with new banking regulation XYZ?"

[0005] Therefore, it would be desirable to have a system and method that can overcome the foregoing shortcomings of known systems and that can analyze documents, communications, text files, websites, and other structured and unstructured input files, applying automated and customized analysis to generate output in the form of answers to specific questions and other supporting information.

[0006] According to one embodiment, the present invention relates to a computer-implemented system and method for analyzing data from various data sources. The method may include: receiving, as inputs, data from various data sources; converting the data received from each of the various data sources into a common data structure; identifying keywords in the received data; generating sentence or word embeddings based on a document corpus; receiving a selection of one or more labels based on the generated sentence or word embeddings; adding the selected one or more labels to a model; training the model on the common data structure based on a configuration file; and generating a result in response to a user query based on the model, wherein the generating includes: retrieving relevant documents from the received data; determining which information should be reported from which of the retrieved relevant documents; and providing the result based on the determination and a graph schema associated with the relevant documents.

[0007] The present invention also relates to a computer-implemented system for analyzing data from various data sources.

[0008] An exemplary document management workflow seamlessly integrates the key tasks of document ingestion, prediction, consolidation, and analysis. The workflow allows users to answer specific questions about documents (e.g., contracts) and to build a knowledge base by modeling relationships with other documents. In particular, each step (e.g., ingestion, prediction, consolidation, and analysis) is integrated into an end-to-end workflow that the user can configure with minimal effort or changes. Each step builds on the previous steps to enable information to be analyzed and extracted from the documents. By comparison, other document management frameworks generally require a significant amount of "glue code" (e.g., code custom-written for a particular project) to bring the entire workflow together. With the present invention, in contrast, users can configure each step without rewriting code, so the exemplary process can be readily reused across a variety of projects.

[0009] Further, according to an embodiment, the exemplary workflow can handle various types of document analysis problems, for example, by mapping the process to a specific problem or use case. The problems may arise in a variety of domains, such as clause/regulatory compliance, procurement contracts, commercial leakage, contract risk analysis, and the like. The exemplary framework is also flexible, allowing users to customize business logic rules, post-processing, and quality assessment tasks and to tailor them to the business use case and the specific needs of a particular user. In other words, the exemplary process is fitted to the document analysis problem, rather than attempting to fit the problem to a standard, inflexible framework. In addition, each part of the exemplary workflow (e.g., document processing, feature generation, model architecture, quality assessment, post-processing, and contract consolidation) may be associated with a default configuration, which can cover many specific questions. These default configurations can nevertheless be easily modified to handle new or unique questions. The Lume data structure persists data and metadata throughout the exemplary process, thereby enabling a unified learning model. Because the process is fully integrated, documents (e.g., contracts) and their resulting documents can be processed to extract knowledge about the content within a document and to answer specific questions. The exemplary process can also resolve information across multiple documents using a graph-based reasoning framework. For example, a business logic layer may allow a subject matter expert to specify how document families are to be combined, and the graph-based reasoning framework can specify how conflicting clauses are handled. Inferences may be made at the document family level or at the individual document level.
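The idea of resolving conflicting clauses across a document family can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the family is modeled as a small graph in which each amendment points to the document it amends, and the business rule "the most recently effective document that states the clause wins" is a hypothetical rule a subject matter expert might specify. The names `Document`, `family_of`, and `resolve_clause` are invented for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Document:
    doc_id: str
    effective: date
    amends: Optional[str] = None               # graph edge to the amended document
    clauses: dict = field(default_factory=dict)


def family_of(root_id, documents):
    """Collect the root document and everything that transitively amends it."""
    family = [d for d in documents if d.doc_id == root_id]
    frontier = [root_id]
    while frontier:
        parent = frontier.pop()
        children = [d for d in documents if d.amends == parent]
        family.extend(children)
        frontier.extend(c.doc_id for c in children)
    return family


def resolve_clause(root_id, clause, documents):
    """Assumed rule: the latest effective document that states the clause wins."""
    candidates = [d for d in family_of(root_id, documents) if clause in d.clauses]
    if not candidates:
        return None
    winner = max(candidates, key=lambda d: d.effective)
    return winner.doc_id, winner.clauses[clause]


docs = [
    Document("master", date(2018, 1, 1),
             clauses={"termination_notice_days": 30, "governing_law": "NY"}),
    Document("amend-1", date(2019, 6, 1), amends="master",
             clauses={"termination_notice_days": 60}),
    Document("amend-2", date(2020, 3, 1), amends="amend-1",
             clauses={"termination_notice_days": 90}),
]

# Family-level inference: which notice period currently applies?
print(resolve_clause("master", "termination_notice_days", docs))
# Clause stated only in the master agreement falls through to the master.
print(resolve_clause("master", "governing_law", docs))
```

A different rule (for example, "the master agreement always prevails for governing law") would only change `resolve_clause`, which is the sense in which the combination logic is left to the business logic layer.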

[0010] The present invention also provides interchangeable model architectures that can be switched in order to find the optimal model framework for extracting a particular clause with minimal human interaction. Framework-specific language may be included in default configurations or custom configurations. Framework-specific features may also be made available through a knowledge base. Highly effective default options for specific problems minimize the configuration required of the user. In addition, the interchangeable model architectures provide support for sequence labeling, classification, and deep learning models, which can be swapped via custom configuration files that can be used by non-experts.
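One way to picture swapping model architectures through a configuration file is a registry that maps an architecture name to a class. The sketch below is only an illustration: the two toy classifiers, the `MODEL_REGISTRY` dictionary, the `build_model` function, and the configuration keys `architecture` and `params` are all assumptions made for this example and do not come from the disclosure.

```python
from collections import Counter


class KeywordClassifier:
    """Toy classifier: predicts True if any configured keyword appears."""

    def __init__(self, keywords):
        self.keywords = {k.lower() for k in keywords}

    def predict(self, text):
        lowered = text.lower()
        return any(k in lowered for k in self.keywords)


class MajorityClassifier:
    """Toy baseline: always predicts the most common training label."""

    def __init__(self, training_labels):
        self.majority = Counter(training_labels).most_common(1)[0][0]

    def predict(self, text):
        return self.majority


# Hypothetical registry: architecture name in the config -> model class.
MODEL_REGISTRY = {"keyword": KeywordClassifier, "majority": MajorityClassifier}


def build_model(config):
    """Instantiate whichever architecture the configuration names."""
    model_class = MODEL_REGISTRY[config["architecture"]]
    return model_class(**config.get("params", {}))


# Switching architectures means editing the config, not the calling code.
keyword_config = {"architecture": "keyword",
                  "params": {"keywords": ["terminate", "termination"]}}
baseline_config = {"architecture": "majority",
                   "params": {"training_labels": [False, False, True]}}

clause = "Either party may terminate this agreement on 30 days' notice."
for config in (keyword_config, baseline_config):
    print(config["architecture"], build_model(config).predict(clause))
```

In practice the dictionaries would be loaded from a file, which is what would let a non-expert change the model without touching code.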

[0011] Further, according to an embodiment, subject matter expertise can be encoded into the overall solution using the present invention. For example, the present invention can digitize the completion of complex manual tasks in order to enhance machine learning output. Post-processing may be applied to clean up or reformat high-confidence answers based on client specifications. Post-processing can also draw on subject matter expertise to generate downstream answers to questions that depend on multiple pieces of information from a document. In addition, quality assessment steps are added to ensure that high-confidence answers conform to the client's specifications.
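As a rough sketch of post-processing followed by a quality check, the example below reformats an extracted date answer and rejects answers that fail validation. Everything specific here is an assumption for illustration: the client specification is taken to be ISO 8601 dates, the 0.8 confidence threshold is arbitrary, and the names `normalize_date`, `post_process`, and the status strings are hypothetical.

```python
from datetime import datetime

# Date layouts the extractor is assumed to encounter in contracts.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y", "%Y-%m-%d")


def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def post_process(answer, min_confidence=0.8):
    """Reformat a high-confidence answer and apply a simple quality check.

    Returns (value, status), where status is one of
    'accepted', 'needs_review', or 'failed_quality_check'.
    """
    if answer["confidence"] < min_confidence:
        return answer["text"], "needs_review"          # left for manual review
    normalized = normalize_date(answer["text"])
    if normalized is None:
        return answer["text"], "failed_quality_check"  # not a date at all
    return normalized, "accepted"


answers = [
    {"text": "October 12, 2018", "confidence": 0.93},
    {"text": "12/31/2020", "confidence": 0.50},
    {"text": "upon notice", "confidence": 0.90},
]
for answer in answers:
    print(post_process(answer))
```

The month-name layouts rely on an English locale being active for `strptime`.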

[0012] Further, according to an embodiment, the present invention also provides for the development of rich, high-quality training and testing datasets. For example, the present invention provides curated subject matter expertise in the labeling of data. The present invention also uses search, text similarity, and clustering techniques to obtain representative, diverse labeled datasets that are more efficient and effective for producing high-performing models. The datasets may also be integrated with information from a framework-specific knowledge base. The present invention further provides for the generation of custom word embeddings to better represent a particular domain. At least one of a unified learning model or an active learning model may be used to label specific document information. Finally, particular models and results of the exemplary framework may be stored on a third-party storage device.

[0013] These and other advantages will be described more fully in the detailed description that follows.

[0014] To facilitate a fuller understanding of the present invention, reference is now made to the accompanying drawings. The drawings should not be construed as limiting the invention, but are intended only to illustrate different aspects and embodiments of the invention.
[0015] FIG. 1 is a functional block diagram of an analysis system according to an exemplary embodiment of the present invention.
[0016] FIG. 2 is a diagram of the architecture of an analysis system according to an exemplary embodiment of the present invention.
[0017] FIG. 3 is a representation of a standard data format for a converted file, referred to herein as a Lume, according to an exemplary embodiment of the present invention.
[0018] FIG. 4A is a diagram depicting an example of a Lume structure and exemplary levels according to an exemplary embodiment of the present invention.
[0019] FIG. 4B illustrates a larger view of the document with metadata depicted in FIG. 4A.
[0020] FIG. 5 is a diagram depicting a process of creating a Lume from a Microsoft Word document according to an exemplary embodiment of the present invention.
[0021] FIG. 6 is a diagram depicting a process of creating a dataset from a directory of Microsoft Word and text files according to an exemplary embodiment of the present invention.
[0022] FIG. 7 is a flowchart for an analysis system according to an exemplary embodiment of the present invention.
[0023] FIG. 8 illustrates an example of a document to be ingested and analyzed by an analysis system according to an exemplary embodiment of the present invention.
[0024] FIG. 9 is an example of expressions presented as expression strings shown in a table according to an exemplary embodiment of the present invention.
[0025] FIG. 10 is an example of output from an intelligent domain engine in the form of predicted answers according to an exemplary embodiment of the present invention.
[0026] FIG. 11 is an example of output from an intelligent domain engine in the form of support and justification for answers according to an exemplary embodiment of the present invention.
[0027] FIG. 12 is a system diagram of an analysis system according to an exemplary embodiment of the present invention.
[0028] FIG. 13 is a flowchart for an analysis system according to an exemplary embodiment of the present invention.
[0029] FIG. 14 is a flowchart of the annotation step depicted in FIG. 13 according to an exemplary embodiment of the present invention.
[0030] FIG. 15A is an architecture diagram for the active learning step depicted in FIG. 13 according to an exemplary embodiment of the present invention.
[0031] FIG. 15B is a workflow diagram for the active learning step depicted in FIG. 13 according to an exemplary embodiment of the present invention.
[0032] FIG. 16 is a diagram of the machine learning step depicted in FIG. 13 according to an exemplary embodiment of the present invention.
[0033] FIG. 17 is a diagram of the consolidation step depicted in FIG. 13 according to an exemplary embodiment of the present invention.
[0034] FIG. 18 is a diagram depicting graph schemas for representing multiple documents according to an exemplary embodiment of the present invention.

[0035] Exemplary embodiments of the invention will now be described in order to illustrate various features of the invention. The embodiments described herein are not intended to limit the scope of the invention, but rather to provide examples of the components, use, and operation of the invention.

[0036] According to one embodiment, the present invention relates to an automated system and method for the analysis of structured and unstructured data. The analysis system (sometimes referred to herein as the "system") may include a portfolio of artificial intelligence capabilities, including artificial intelligence domain expertise and related technology components. The system may include foundational capabilities such as document ingestion and optical character recognition (OCR), for example, the ability to take documents and convert them into formats readable by a machine for analysis. According to a preferred embodiment, the system also includes machine learning components that give the system the ability to learn without being explicitly programmed (supervised and unsupervised); deep learning components that model high-level abstractions of data; and natural language processing (NLP) and generation, for example, the ability to understand human speech or text and to generate text or speech.

[0037] The system may also be designed to ingest and process various types of input data, including structured data (e.g., data organized into columns and rows, such as transactional system data and Microsoft Excel files); semi-structured data (e.g., text that is not stored in a recognized data structure but that still contains some type of tabs or formatting, such as forms); unstructured data (e.g., text not stored in a recognized data structure, such as contracts, tweets, and policy documents); and images and voice (e.g., photographs or other visual depictions of physical objects, and human voice data).

[0038] The system may be deployed to ingest, understand, and analyze the documents, communications, and websites that make up the rapidly growing body of structured and unstructured data. According to one embodiment, the system may be designed to: (a) read transcripts, tax filings, communications, financial reports, and similar documents and input files; (b) extract information and capture the information into structured files; (c) assess the information in the context of policies, rules, regulations, and/or business objectives; and (d) answer questions, generate insights, and identify patterns and anomalies in the information. The system can capture and store subject matter expertise; ingest, mine, and classify documents using natural language processing (NLP); incorporate advanced machine learning and artificial intelligence methods; and draw on collaborative, iterative refinement with advisory and client stakeholders.

[0039] Examples of questions that the system can answer include: which documents comply with a certain policy or regulation, which assets are most at risk, which rights warrant intervention, which customers are most or least likely to suffer losses, which clients will have a growing or shrinking wallet and market share, and which documents are undergoing a change in trend or sentiment. Examples of policies or rules that the system can analyze include, to name a few, new regulations, accounting standards, profitability targets, identification of accretive versus dilutive projects, credit risk assessment, asset selection, portfolio rebalancing, and settlement outcomes. Examples of documents that the system can analyze include legal contracts, loan documents, securities prospectuses, company financial filings, derivatives confirmations and master agreements, insurance policies, insurance claim notes, customer service transcripts, and email exchanges.

[0040] FIG. 1 is a functional block diagram of a system for automated analysis of structured and unstructured data according to an exemplary embodiment of the present invention. As shown in FIG. 1, the system integrates various data sources, domain knowledge, and human interaction, in addition to algorithms that ingest and structure content. The system includes a scanning component 10 for ingesting a plurality of documents 5, such as contracts, loan documents, and/or text files, and for extracting relevant data 6. During the ingestion process, the system may incorporate OCR technology to convert images (e.g., PDF images) into searchable characters and may incorporate NLP pre-processing to convert the scanned images into raw documents 11 and essential content 12. In addition, an appropriate ingestion approach will be used to convert and preserve document metadata and formatting information. In many cases, the input unstructured data will reside in a number of documents that together form a corpus 15 of documents stored in a dataset.

[0041] The example of FIG. 1 depicts a "regulatory rule set" implemented in a particular business context. One example of a regulatory rule set is new or amended financial regulations, where a financial institution or company needs to confirm that its contracts comply with the new regulations. Manual review of the contracts to assess compliance with the new regulations is one alternative, but that approach can require a significant time commitment and substantial cost for experts to review the contracts. Alternatively, the system may be configured to read the contracts, extract information and capture it into structured files, assess the information in the context of the amended regulations and/or business objectives, answer questions, generate insights, and identify patterns and anomalies in the contracts. Accordingly, exemplary embodiments of the present invention can automate the analysis of complex documents, which can provide the advantages of enabling 100% coverage rather than traditional sampling approaches, reducing the development time and cost required to generate insights, allowing humans to achieve and manage precise consistency, drawing on the knowledge and expertise of subject matter experts (SMEs), and automatically generating audit logs that describe how the data was processed.

[0042] Referring to FIG. 1, the regulatory rule set is used by subject matter experts in the manual review and is also translated into relevant semantics 21 and a decision strategy 22 for the machine review. The semantics 21 include domain knowledge realized as an ontology or knowledge base composed of entities, relationships, and facts. The decision strategy 22 consists of business rules that are applied to the relevant semantics 21 to answer specific questions. This includes document-level assessments (e.g., compliant or non-compliant), feature-level extraction (e.g., termination dates, key entities), inferred facts (e.g., inferences that draw on the extracted facts and the ontology), or risk identification (e.g., identifying portions of a document that require further scrutiny). The machine learning review 25a analyzes directional features 26a, such as specified contract terms, dates, entities, and facts, and undertakes an automated document analysis assessment 27a using an intelligent domain engine (sometimes referred to herein as the "IDE"). The machine learning review 25a supports a machine compliance determination 28a by providing a confidence score. In parallel, a manual review 25b of selected documents, performed for example by a subject matter expert, analyzes directional features 26b and undertakes a document analysis assessment 27b and a manual compliance determination 28b for a sample of the contracts. The parallel manual and machine assessments are used to determine an accuracy and confidence score 29, which is used as feedback 30 for the manual review and the machine review. The feedback 30 allows the machine review to be refined, so that each iteration can provide improved accuracy in the automated analysis and a corresponding increase in the confidence score. Active learning methods are used to reduce the number of iterations required to achieve a given accuracy.
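The feedback loop between the parallel reviews can be sketched in a few lines. This is an illustration under assumptions rather than the disclosed method: accuracy is taken to be simple agreement between machine and manual determinations on the manually reviewed sample, and the active learning step is assumed to be uncertainty sampling (send the lowest-confidence unreviewed contracts to the expert next). The function names and the sample data are invented.

```python
def accuracy(machine, manual):
    """Share of manually reviewed documents where the machine agrees."""
    shared = machine.keys() & manual.keys()
    if not shared:
        return None
    agree = sum(machine[doc_id][0] == manual[doc_id] for doc_id in shared)
    return agree / len(shared)


def next_for_review(machine, manual, k):
    """Uncertainty sampling (assumed strategy): pick the k unreviewed
    documents whose machine determination has the lowest confidence."""
    unreviewed = [doc_id for doc_id in machine if doc_id not in manual]
    return sorted(unreviewed, key=lambda doc_id: machine[doc_id][1])[:k]


# Machine determinations: doc_id -> (determination, confidence score).
machine = {
    "c1": ("compliant", 0.95),
    "c2": ("non-compliant", 0.55),
    "c3": ("compliant", 0.62),
    "c4": ("compliant", 0.91),
    "c5": ("non-compliant", 0.51),
}
# Manual determinations on the sample reviewed so far.
manual = {"c1": "compliant", "c2": "compliant"}

print("agreement on sample:", accuracy(machine, manual))
print("review next:", next_for_review(machine, manual, k=2))
```

Each round, the newly reviewed contracts are added to `manual`, the model is retrained, and the agreement is recomputed, which is how targeted sampling can cut the number of iterations compared with reviewing contracts at random.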

[0043] Referring to FIG. 2, the architecture of a system according to an exemplary embodiment of the present invention is depicted. As noted above, the system can support information extraction and data analysis for structured and unstructured data. Input data 210 may take the form of various files or information of different types and formats, such as documents, text, video, audio, tables, and databases. As shown in FIG. 2, the data to be analyzed may be input to a core document management system 220.

[0044] According to a preferred embodiment of the present invention, the input data 210 is converted into a common data format 230, referred to as "Lume" in FIG. 2. Lume may preferably be the common format for all components and data storage. As shown in FIG. 2, the core document management system includes a document conversion system 240 (for converting documents into the Lume format 230) and a document and corpus repository 220. The document conversion system provides utilities for extracting document data and metadata and storing them in a format 240 that is used to perform natural language processing. The standardized Lume format 230 facilitates the processing and analysis of data in Lumes, because multiple components can easily be applied to a Lume and can use upstream information for enhanced processing. In one application, a processing workflow may chain components together to identify sentences, tokens, and other document structures; to identify entities; and to annotate against a taxonomy or ontology. The intelligent domain engine 251 can then use this information to generate derived and inferred features. Each of these components takes a Lume 240 as input and produces a Lume 240 as output, and metadata may be further inserted into the Lume. Other examples of components include different engines, natural language processing (NLP) components 255, indexing components, and other types of components (e.g., optical character recognition (OCR) 252, machine learning 253, and image processing 254).
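The chaining described above, in which each component takes a Lume as input and returns a Lume as output, can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout, the function names (`split_sentences`, `tokenize`, `run_pipeline`), and the `start`/`end` attributes are assumptions made for the example.

```python
import re

def split_sentences(lume):
    """Hypothetical component: adds one 'sentence' element per sentence (naive regex)."""
    for m in re.finditer(r"[^.!?]+[.!?]", lume["data"]):
        lume["elements"].append({
            "id": f"s{len(lume['elements'])}",
            "type": "sentence",
            "attributes": {"start": m.start(), "end": m.end()},
        })
    return lume

def tokenize(lume):
    """Hypothetical component: uses upstream 'sentence' elements to add 'token' elements."""
    sentences = [e for e in lume["elements"] if e["type"] == "sentence"]
    for s in sentences:
        base = s["attributes"]["start"]
        span = lume["data"][base:s["attributes"]["end"]]
        for m in re.finditer(r"\w+", span):
            lume["elements"].append({
                "id": f"t{len(lume['elements'])}",
                "type": "token",
                "attributes": {"start": base + m.start(), "end": base + m.end(),
                               "sentence": s["id"]},
            })
    return lume

def run_pipeline(lume, components):
    """Applies components in order; every stage reads a Lume and returns a Lume."""
    for component in components:
        lume = component(lume)
    return lume

lume = {"name": "note.txt", "data": "Lume is a format. It stores annotations.", "elements": []}
result = run_pipeline(lume, [split_sentences, tokenize])
tokens = [e for e in result["elements"] if e["type"] == "token"]
sentences = [e for e in result["elements"] if e["type"] == "sentence"]
```

Because every stage shares the same input and output type, a new component can be added to the list without writing a format conversion, and the tokenizer can rely on the sentence elements produced upstream.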

[0045] The components 250 read Lumes 240 and generate Lume elements. The Lume elements are then stored in a stand-off annotation format (depicted by the database 220, the parent class definition of the base data format 230, and the specific instances of formats in the application-specific data formats 240). As an example, the NLP component 255 processes a Lume 240 and adds Lume elements that represent human-language-specific constructs in the underlying data, including word tokens, parts of speech, semantic role labels, named entities, coreference phrases, and so on. These elements can be indexed so that users can quickly search a set of Lumes 240, an individual Lume 240, or Lume elements through a query language.

[0046] The Lume technology is described further below with reference to FIGS. 3 to 6.

[0047] FIG. 2 also illustrates that multiple machine learning (ML) components 253 may be integrated into the system. For example, the system may include an ML conversion component, a classification component, a clustering component, and a deep learning component. The ML conversion component converts the underlying Lume representations into machine-readable vectors for fast analytic processing. The classification component maps a given set of inputs to a set of learned outputs (categorical or numeric) based on initial training and configuration. The clustering component creates groups of vectors based on a predetermined similarity metric. The deep learning component is a specific type of machine learning component 253 that uses a multi-layer network representation of nodes and connections to learn outputs (categorical or numeric).

[0048] FIG. 2 illustrates that the system may include multiple user interfaces 270 that allow various types of users to interact with the system. The IDE manager 273 lets users modify, delete, and add expressions in the system. The model manager 274 lets a user select the machine learning models to run in a pipeline. The search interface 272 (i.e., data exploration) lets users find data loaded into the platform. The document and corpus annotator 271 (i.e., annotation manager) and editors let users manually create and modify annotations on a Lume and group Lumes into corpora for training and testing the system. The visual workflow interfaces 275 (i.e., the workbench) provide a visual means of building workflows and can be used to generate histograms and other statistical views of the data stored in the platform.

[0049] FIG. 3 illustrates the attributes and features of a Lume according to an exemplary embodiment of the present invention. As shown in FIG. 3, "name" is a string consisting of the unqualified name of the document. "Data" is a string or binary representation of the document (e.g., serialized data representing the original data). "Elements" is an array of Lume elements.

[0050] As shown in FIG. 3, each Lume element includes an element ID and an element type. According to a preferred embodiment of the present invention, only the element ID and element type are required to define and create a Lume element. The element ID is a string containing a unique identifier for the element. The element type is a string identifying the type of the Lume element. Examples of Lume element types include parts of speech (POS), such as noun, verb, and adjective; and named-entity recognition (NER) types, such as person, place, or organization. File path and file type information may also be stored as elements. The file path is a string containing the full source file path of the document. The file type is a string containing the file type of the original document.

[0051] Although not required, a Lume element may also include one or more attributes. An attribute is an object made up of key-value pairs, for example {"name":"Wilbur", "age":27}. This yields a format that is simple yet powerful and leaves developers flexibility. According to an exemplary embodiment of the present invention, only the element ID and type are required because this gives developers the flexibility to store information about a Lume in elements while still guaranteeing that every element is accessible by ID or type. This flexibility lets users decide, based on their domain expertise, how to store relationships and hierarchies among elements. For example, elements may contain the information needed for complex linguistic structures, store relationships between elements, or reference other elements.
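The structure described in the preceding paragraphs (a Lume with a name, data, and an array of elements, where each element needs only an ID and a type and may carry optional key-value attributes) might be modeled as below. The class and method names are hypothetical and chosen only for illustration; the source does not disclose the actual Python classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class LumeElement:
    """Illustrative element: only the ID and type are required."""
    element_id: str
    element_type: str
    attributes: Dict[str, Any] = field(default_factory=dict)  # optional key-value pairs

@dataclass
class Lume:
    """Illustrative Lume: name, original data, and a flat list of elements."""
    name: str
    data: str
    elements: List[LumeElement] = field(default_factory=list)

    def by_id(self, element_id: str) -> Optional[LumeElement]:
        # Every element is reachable by its unique ID.
        return next((e for e in self.elements if e.element_id == element_id), None)

    def by_type(self, element_type: str) -> List[LumeElement]:
        # Every element is also reachable by its type.
        return [e for e in self.elements if e.element_type == element_type]

doc = Lume(name="memo.txt", data="Wilbur is 27.")
doc.elements.append(LumeElement("e1", "person", {"name": "Wilbur", "age": 27}))
doc.elements.append(LumeElement("e2", "file_type"))  # no attributes needed
```

Keeping the required fields to an ID and a type means a lookup by either one always works, while the attributes dictionary is free to hold whatever relationships or references a particular domain needs.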

[0052] According to an exemplary embodiment of the present invention, Lume elements are used to store annotations in a stand-off format. That is, the elements are not embedded in the text but are stored as annotations separate from the document text. According to this embodiment, the system can recover the original data without modification.

[0053] According to a preferred embodiment, Lume elements are not stored in a hierarchical relationship to other Lume elements, and document data and metadata are stored in a non-hierarchical manner. Most known formats other than Lume are hierarchical, which makes them difficult to manipulate and convert. Lume's non-hierarchical format gives easy access to every element of the document data or its metadata, at either the document level or the text level. In addition, the data structure can be edited, extended, or parsed through operations on elements, without resolving conflicts, managing a hierarchy, or performing other operations that the application may or may not need. According to this embodiment, because Lume is a stand-off annotation format, the system can preserve an exact copy of the original data and support overlapping annotations. The format also allows annotation of multiple media types, such as audio, image, and video.
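A small sketch may help show why a flat, stand-off format supports overlapping annotations while leaving the original data untouched. The use of character offsets (`start`, `end`) to anchor annotations is an assumption of this example, as are the sample text and annotation types.

```python
text = "New York University is in New York."

# Stand-off annotations: a flat list stored apart from the text.
# Two of them overlap ("New York University" and the "New York" inside it),
# which a strictly nested inline markup could not express without choosing a hierarchy.
annotations = [
    {"id": "a1", "type": "organization", "start": 0, "end": 19},
    {"id": "a2", "type": "place", "start": 0, "end": 8},
    {"id": "a3", "type": "place", "start": 26, "end": 34},
]

def covered_text(source, annotation):
    """Returns the slice of the source text that an annotation points at."""
    return source[annotation["start"]:annotation["end"]]

covered = [covered_text(text, a) for a in annotations]
```

Adding or deleting an entry in `annotations` never changes `text`, so the exact original is always available.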

[0054] Lume technology can provide a universal format for document data and metadata. Once a Lume has been created, it can be used by every tool in a natural language processing pipeline without writing format conversions to integrate the tools into the pipeline, because the basic conventions needed to pass data and metadata are established by the Lume format. The system provides utilities for extracting document data and metadata from a number of formats, including plain text and Microsoft Word. Format-specific parsers convert the data and metadata from these formats into a Lume and, correspondingly, write the modified Lume back to the format. The system can use Lume technology to store information about word families in preparation for natural language processing such as preprocessing and stemming. The system can also use Lume technology to store information about relationships in a document and about graph structures.

[0055] According to an exemplary embodiment of the present invention, the system includes components in addition to Lumes and Lume elements. In particular, the system may be configured to include datasets, Lume data frames, Ignite components, and element indexes. A dataset is a collection of Lume objects that has a unique identifier. Datasets are typically used to designate training and testing sets for machine learning, and they can also be used to perform bulk operations on many documents. A Lume data frame is a specialized matrix representation of a Lume. Many machine learning and numerical computation components in the system can use this optimized format. The system may also include Ignite components, which read Lume (or Lume corpus) data and return Lume (or Lume corpus) data, typically by processing existing Lume elements or the original source data and adding new Lume element objects. Element indexes are computer-object representations of sets of elements that are commonly used in Ignite to make retrieval of Lume data and metadata efficient. For example, some components may be optimized to work on character offsets, so an index on character offsets can speed up operations for those components.

[0056] According to an exemplary embodiment of the present invention, the main functions of the system include data representation, data modeling, discovery and composition, and service interoperability, described as follows.

[0057] Data representation: Lume is the common data format used to store and communicate analyses in the system. Lume takes a stand-off approach to data representation; for example, analysis results are stored as annotations independent of the original data. According to one embodiment, Lume is implemented in Python, has its computer-object representations as Python objects, and is serialized to JavaScript Object Notation (JSON) for inter-process communication. Lume may be designed for use with web-based specifications such as JSON, Swagger (YAML), and RESTful interfaces, and to interface with the Python ecosystem, but components written in Java and other languages can also be implemented and supported.
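The JSON serialization mentioned above could look roughly like the following round trip. The source states only that Lumes are Python objects serialized to JSON; the field layout, the `to_json`/`from_json` helper names, and the choice of base64 for binary data are assumptions of this sketch.

```python
import base64
import json

def to_json(lume):
    """Serializes a Lume-like dict; binary data is base64-encoded (an assumed convention)."""
    payload = dict(lume)
    if isinstance(payload["data"], bytes):
        payload["data"] = base64.b64encode(payload["data"]).decode("ascii")
        payload["encoding"] = "base64"
    return json.dumps(payload)

def from_json(wire):
    """Restores the dict produced by to_json, decoding binary data if it was encoded."""
    payload = json.loads(wire)
    if payload.pop("encoding", None) == "base64":
        payload["data"] = base64.b64decode(payload["data"])
    return payload

text_lume = {
    "name": "memo.txt",
    "data": "Wilbur is 27.",
    "elements": [{"id": "e1", "type": "person",
                  "attributes": {"name": "Wilbur", "age": 27}}],
}
binary_lume = {"name": "scan.png", "data": b"\x89PNG\x00\x01", "elements": []}

text_roundtrip = from_json(to_json(text_lume))
binary_roundtrip = from_json(to_json(binary_lume))
```

A plain JSON wire format is what would let a component written in Java or another language consume the same payload that a Python component emits.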

[0058] Data modeling: Lume may be designed to be simple and to enforce only basic requirements on users of the system. Interpretation and business logic are left to the users of the system rather than requiring declarative representations of both data and processes. The system may be designed to leave modeling informal and to leave the details of implementation to the processing components. This keeps the Lume specification very simple and allows it to be extended for specific applications without interfering with other applications. For example, when Lume search is important, Lume is integrated with a module that builds indexes on top of the Lume structure. When working with the Document Object Model (DOM) is important, a DOM parser stores additional information in the Lume, in the form of Lume elements and attributes, and uses this information to convert back to the DOM model.

[0059] Discovery and composition: Lume may also have an additional design feature related to the provenance of analytic processes. System workflows may require provenance information to promote repeatability and the discovery of components. This provenance information is stored in the Lume and can be enforced through provenance-enforcing workflows. For example, it allows each output Lume to be inspected to ensure that the correct processing steps were completed. At the validation stage, it provides a means of tracing the origin of a Lume element that produced correct or incorrect metadata. It can also be tracked to ensure that every input is received as an output.
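One way to picture provenance stored in the Lume is a wrapper that records which component ran and which elements it produced, plus a check that the required steps were completed. The source does not describe how provenance is recorded, so the `provenance` list, the `produced_by` attribute, and the function names here are all hypothetical.

```python
def with_provenance(name, component):
    """Wraps a component so the Lume records that it ran and what it produced."""
    def wrapped(lume):
        before = len(lume["elements"])
        lume = component(lume)
        for element in lume["elements"][before:]:
            # Tag each newly added element with the component that created it.
            element.setdefault("attributes", {})["produced_by"] = name
        lume.setdefault("provenance", []).append(name)
        return lume
    return wrapped

def missing_steps(lume, required):
    """Lists required processing steps that the Lume has no record of."""
    done = lume.get("provenance", [])
    return [step for step in required if step not in done]

def tag_title(lume):
    """Hypothetical component that adds a single 'title' element."""
    lume["elements"].append({"id": "t1", "type": "title"})
    return lume

lume = {"name": "a.txt", "data": "Report", "elements": []}
lume = with_provenance("title_tagger", tag_title)(lume)
not_done = missing_steps(lume, ["title_tagger", "ner"])
```

With a record like this, a validation step can trace an incorrect element back to the component that produced it, and a workflow can refuse to proceed when `missing_steps` is not empty.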

[0060] Service interoperability: Services provided by the system may require Swagger (YAML markup language) specifications according to one embodiment of the present invention. Many assumptions may exist regarding the business logic, order of operations, and other data interpretation used to implement a system component. Interoperable components may be identified through analysis of example workflows rather than through input and output specifications. In the system, a component may simply operate on a Lume and, in case of an error, return the correct error code and record appropriate logging information.

[0061] FIG. 4A illustrates an example of a Lume structure and the initial conversion of different types of files to Lume. As shown in FIG. 4A, a dataset 410 refers to a body of files or documents of different types. These documents may initially be in different formats, such as Adobe Portable Document Format (PDF), unstructured text files, Microsoft Word files, and HTML files.

[0062] FIG. 4A also illustrates examples of elements defined for a Lume. For example, a first element 411 may correspond to the principal investigator, including contact information; a second element may correspond to the protocol manager, including contact information (412); a third element may correspond to the contract research organization (CRO), including contact information (413); a fourth element may correspond to the research and development company (414); and a fifth element 415 may correspond to a confidentiality notice for the document. FIG. 4B illustrates a larger view of the document with the metadata depicted in FIG. 4A.

[0063] FIG. 4A also shows example levels of element types. For example, the system may provide functionality that lets a user identify individual paragraphs, tokens, or entities, each of which can be extracted from a Lume.

[0064] FIG. 5 provides further detail on an example of creating a Lume from a Microsoft Word document. As shown in FIG. 5, the first step, step 501, is to initialize the original document. Initialization involves storing the original data in a Lume object. The second step, step 502, is to parse the document into elements in the Lume format. This step may include a loop 502a in which elements are created corresponding to the metadata from the source document. This is performed by document-specific components that ingest a particular format. Specifically, during ingestion (i) the original file is opened, (ii) the DOCX format is decompressed into an XML file, and (iii) the XML file is read into a data structure for parsing. Parsing separates the document's data from its metadata, stores the data in the Lume's "data" field, and stores the metadata in Lume elements. The result is output as LumeText. Examples of stored metadata are author, page, paragraph, and font information.
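The ingestion steps above (open the file, decompress the DOCX container into XML, read the XML into a data structure, then separate data from metadata) can be approximated with the Python standard library. This is a simplified sketch under stated assumptions: `docx_to_lume` and `make_sample_docx` are hypothetical names, only paragraph and author metadata are extracted, and the sample file is a minimal hand-built archive rather than a full Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DC = "{http://purl.org/dc/elements/1.1/}"

def docx_to_lume(name, raw):
    """Unzips DOCX bytes, parses the XML, and splits text (data) from metadata (elements)."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:        # (i)/(ii) open and decompress
        body = ET.fromstring(archive.read("word/document.xml"))  # (iii) read XML into a tree
        has_core = "docProps/core.xml" in archive.namelist()
        core = ET.fromstring(archive.read("docProps/core.xml")) if has_core else None
    lume = {"name": name, "data": "", "elements": []}
    parts, offset = [], 0
    for i, para in enumerate(body.iter(W + "p")):
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        lume["elements"].append({"id": f"p{i}", "type": "paragraph",
                                 "attributes": {"start": offset, "end": offset + len(text)}})
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline that joins paragraphs
    lume["data"] = "\n".join(parts)
    creator = core.find(DC + "creator") if core is not None else None
    if creator is not None:
        lume["elements"].append({"id": "author", "type": "author",
                                 "attributes": {"value": creator.text}})
    return lume

def make_sample_docx():
    """Builds a minimal DOCX-like archive in memory so the sketch runs without a file."""
    doc = ('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
           '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
           '<w:p><w:r><w:t>World</w:t></w:r></w:p></w:body></w:document>')
    core = ('<cp:coreProperties '
            'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
            'xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:creator>Ann</dc:creator>'
            '</cp:coreProperties>')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", doc)
        archive.writestr("docProps/core.xml", core)
    return buffer.getvalue()

lume = docx_to_lume("sample.docx", make_sample_docx())
```

Page and font information, which the text also lists as stored metadata, would be handled the same way by reading further parts of the XML; they are omitted here to keep the sketch short.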

[0065] At the end of the process shown in FIG. 5, the input document has been converted into a Lume, and the desired elements have been created and stored.

[0066] FIG. 6 illustrates an example of applying the functionality of FIG. 5 to a corpus of documents. The first step in FIG. 6, step 601, involves initializing the dataset. The subsequent steps in FIG. 6 involve applying the processes shown in FIG. 5 to each document in the dataset. As the Lumes in the dataset are converted into the Lume format in step 602, the results are stored in the dataset. The conversion includes creation of the Lume data structure (i.e., loop 602b), conversion of format-specific metadata into Lume elements (i.e., step 602a), and any additional annotation required, such as semantic annotation, natural language processing, generation of domain-specific features, or vectorization into a quantitative fingerprint. More specifically, in step 601 the dataset documents are identified from a URI, and Lumes containing the file data are then passed to 602. Next, at 602b, each Lume is passed to the appropriate parser, which creates an appropriate data structure for parsing. At 602a, the parser works through the document, storing the data in the Lume's "data" field and the metadata in Lume elements. The result is output as LumeText.

[0067] FIG. 7 is a process diagram illustrating an example of a process for analyzing structured and unstructured data according to an exemplary embodiment of the present invention. In step 710, documents such as text, Microsoft Word, and/or Adobe PDF documents are ingested into the system. The documents are then converted into the Lume format in step 712, as described above. An OCR process may be used in step 714 to convert image files into characters. In step 716, the documents are collected into a dataset. In step 718, the system identifies and annotates structural Lume elements (see, e.g., FIG. 6). Once the documents have been converted into the Lume format and the Lume elements have been created, natural language processing (NLP) routines or components can be applied to the Lume-format information in step 720.

[0068] In step 722, a user of the system creates and inputs an ontology comprising a list of entities. According to one example, the ontology may describe people and the businesses that employ them. Such an ontology may be useful, for example, for extracting people and businesses from the documents in the platform. Alternatively, the ontology may describe a company's various products, the categories to which the products belong, and any dependencies among the products. Step 724 involves entity resolution and semantic annotation. Entity resolution determines whether entities referenced in the data are in fact the same real-world entity. It is accomplished using the extracted data, the ontologies, and additional machine learning models. Semantic annotation relates phrases in the data to the formally defined concepts in the ontologies. In the business-employee example above, occurrences of the words "John Doe" are identified and linked to the employee John Doe in the ontology. This allows downstream components to use additional information about John Doe, for example his title and function in the company.
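The "John Doe" example can be illustrated with a toy semantic annotator that links text mentions to an ontology entry. The source describes resolution using extracted data, ontologies, and machine learning models; this sketch substitutes exact alias matching, which is far simpler, and the ontology layout, entity ID, aliases, title, and employer are invented for the example.

```python
import re

# Hypothetical ontology: one employee with aliases and extra facts for downstream use.
ontology = {
    "emp-001": {
        "label": "John Doe",
        "aliases": ["John Doe", "J. Doe"],
        "title": "Controller",
        "employer": "Acme Corp",
    }
}

def annotate(text, ontology):
    """Links each alias occurrence in the text to its ontology entity (exact matching only)."""
    links = []
    for entity_id, entry in ontology.items():
        for alias in entry["aliases"]:
            for match in re.finditer(re.escape(alias), text):
                links.append({"type": "ontology_link", "entity": entity_id,
                              "start": match.start(), "end": match.end()})
    return sorted(links, key=lambda link: link["start"])

text = "John Doe signed the lease. Later, J. Doe approved the payment."
links = annotate(text, ontology)
# Downstream components can follow the link to reach facts not present in the text.
titles = [ontology[link["entity"]]["title"] for link in links]
```

Both surface forms resolve to the same entity ID, which is the effect entity resolution aims for, and the link gives later stages access to the person's title and employer.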

[0069] In step 726, a user of the system creates expressions to be applied to the documents stored in the dataset. The expressions may be, for example, comma-separated-value (CSV) files that specify patterns to search for or other distinguishing features of documents. Expressions can capture the expertise and know-how of subject matter experts. For example, an expression may identify various specific words, and relationships between words or patterns, that identify a particular contract clause or provisions in a tax document. These expressions are used to search for and identify particular aspects, clauses, or other identifying features of a document. An expression may also use a machine learning operator, a pre-trained sequence labeling component, or an algorithmic parser that acts as one of the operators in the IDE.
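To make the idea of CSV-encoded expressions concrete, the sketch below loads a two-column CSV of questions and search patterns and applies them to a document. The actual expression language is not specified in this passage, so the column names (`question`, `pattern`), the use of regular expressions, and the sample clauses are assumptions.

```python
import csv
import io
import re

# Hypothetical expression file, as a subject matter expert might maintain in a spreadsheet.
EXPRESSIONS_CSV = r"""question,pattern
indemnity_clause,[Ii]ndemnif\w+
governing_law,laws of [A-Z]\w+
"""

def load_expressions(csv_text):
    """Reads (question, compiled pattern) pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["question"], re.compile(row["pattern"])) for row in reader]

def apply_expressions(expressions, text):
    """Returns, per question, the matched answer and the span that supports it."""
    results = {}
    for question, pattern in expressions:
        match = pattern.search(text)
        if match:
            results[question] = {"answer": match.group(0),
                                 "start": match.start(), "end": match.end()}
    return results

document = ("The Tenant shall indemnify the Landlord. "
            "This lease is governed by the laws of Delaware.")
results = apply_expressions(load_expressions(EXPRESSIONS_CSV), document)
```

Returning the matched span alongside each answer mirrors the idea that an output should come with the text that supports it.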

[0070] In step 728, the expressions are input to the IDE, which reads the expressions and applies them to the dataset. According to one embodiment, the output may include predicted answers together with support and justification for those answers. The IDE is described further below with reference to FIGS. 8 to 12.

[0071] In step 730, the output of the IDE can be used to engineer additional features. This step draws on previously created Lume elements and creates new Lume elements corresponding to the additional features. Feature engineering can be thought of abstractly as indicator functions over sets of Lume elements that create features associated with particular signals for learning and inference tasks. In the general case, feature engineering can create additional categorical or descriptive text features needed for sequence labeling or sequence learning tasks. For example, feature engineering may prepare features for custom entity tagging, identify relationships, or target a subset of elements for downstream learning.

[0072] In step 732, machine learning algorithms or routines are applied to produce results from the Lume elements created upstream. Sequence labeling or Bayesian network analysis may also be used in place of machine learning. This step produces machine-learning scores, or probabilistic information about the accuracy of previous annotations or about relationships between elements, or produces new annotations or classification metadata. The results are analyzed in step 734, in which they are presented to an analyst for review through a UI for inspecting annotations or through a workbench for performing further analysis on the results. In step 736, one or more iterations are performed to improve prediction accuracy. The steps of applying expressions (728), engineering features (730), applying machine learning (732), and reviewing results (734) can be repeated to improve accuracy. Once accuracy has improved to the desired level, the results can be stored in a database in step 738. Note that entity resolution and semantic analysis (724), feature engineering (730), and machine learning (734) will also be used within the intelligent domain engine, but are separated out for large-scale processing pipelines.

[0073] According to an exemplary embodiment of the present invention, the IDE comprises a platform for using natural language processing, custom-built annotation components, and manually encoded expressions to systematically classify and analyze a corpus of documents. The IDE can provide a platform for combining a firm's cognitive/AI capabilities with industry domain knowledge. Each document classification can be represented by a set of expressions, which may include the features to be used, the patterns of features to be identified, and reference location or scope information to focus the classification task. Expressions can be composed of, and operate on, the Lume elements and data contained in a Lume. The IDE is designed to systematically evaluate the expressions against each document in the corpus, producing the specified results as well as annotated text that supports the classification decisions. Although in this example the IDE is used for natural language processing and text mining, it is noted that the IDE framework applies to all Lume formats, such as images, audio, and video.

[0074] The IDE can provide a number of advantages. For example, in addition to answers to specific questions, the IDE can output annotated text supporting its classification decisions. The annotations can be used to audit results and provide transparency. Furthermore, training an accurate machine learning model typically requires a large number of labeled documents. Using the IDE to integrate domain knowledge with machine learning can greatly reduce the number of documents needed to train an accurate model, by exploiting expert-derived features. This is because machine learning problems involving unstructured data are typically over-determined, and the ability to select accurate and interpretable features typically requires more data than is available. In documents, for example, there may be tens of thousands of features, including the dictionary of words, orthographic features, document structures, syntactic features, and semantic features. In addition, according to an exemplary embodiment of the present invention, expressions can be created using a domain-specific language that can be written in no-code environments, such as a spreadsheet (CSV or XLSX) or the IDE user interface, so individuals who enter expressions, such as subject matter experts (SMEs), do not need computer coding skills. An SME can thereby create domain-relevant features that can be used in the machine learning training process. The IDE UI lets users modify, delete, and add expressions in the system and run the IDE to visualize the elements created. Expressions can also be designed to be interchangeable, and they can be created for reuse across use cases within an industry or problem set. Finally, the IDE can be designed to use the Lume format for storing and working with documents. This design allows annotations and metadata, in addition to the text features present in a document, to serve as inputs to expressions.

[0075] According to an exemplary embodiment of the present invention, a process for creating and using expressions comprises: (1) manually reviewing documents; (2) capturing patterns through expressions and creating custom-built code that may utilize machine learning or statistical extraction; (3) loading the expressions into the IDE and running the IDE; (4) building confusion matrices and accuracy statistics (i.e., comparing the current results against a set of unseen documents, which produces an estimate of how well the expressions will generalize and determines whether the system meets performance requirements); (5) iterating on and refining the previous steps; and (6) generating output, such as predicted answers and the sections that provide support and justification for those answers.
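Step (4) of the process above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the yes/no labels, and the five example answers are hypothetical, chosen only to show how predicted answers for unseen documents might be compared with reviewed answers.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) answer pairs over a held-out document set."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of documents whose predicted answer matches the reviewed answer."""
    if not actual:
        raise ValueError("no documents to evaluate")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical answers to one yes/no question for five unseen agreements.
reviewed = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes"]

matrix = confusion_matrix(reviewed, predicted)
score = accuracy(reviewed, predicted)
```

Here `matrix[("yes", "no")]` counts agreements where a reviewer answered "yes" and the expressions answered "no", which is the kind of cell an SME might inspect before refining expressions in step (5).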

[0076] According to one specific example, the IDE can be used to analyze documents such as investment management agreements or other legal documents in order to automatically determine answers to legal questions. For purposes of illustration, assume in this example that there are eight legal questions a company needs to answer with respect to 500 investment management agreements. An exemplary question may be "Does the contract require notification in connection with identified personnel changes?" FIG. 8 depicts examples of investment management agreement sections that are relevant to the legal question.

[0077] FIG. 9 illustrates examples of expressions according to an embodiment of the present invention. As shown in FIG. 9, the expressions can be specified in a table format (e.g., CSV) rather than in code. In the FIG. 9 example, each expression has a "name," which can be useful when referring to other expressions. The name can also be used by the output file to generate features. Each expression can also include a "scope," which focuses and limits where the expression is applied. The scope is itself evaluated as an expression, and its results are used to limit the scope of the parent expression. For example, a scope expression may refer to Lume elements (either specified in advance during conversion to the Lume format or generated by another expression), or it may be the result of an operator that identifies the appropriate clause in a contract. An expression also includes a "string" field, which contains the expression itself. The string field has a predetermined syntax. The string field can specify patterns to look for in documents, or logical operations. FIG. 9 shows examples of the string field.

[0078] An expression can also include a "condition" field, which is used to determine whether a particular expression should be evaluated. This is useful for enabling or disabling expressions for computational efficiency, or for implementing control logic that enables or disables certain types of processing.
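The four fields described above (name, scope, string, condition) can be sketched as a small table-driven evaluator in Python. This is an illustration under stated assumptions rather than the IDE's actual syntax: the "string" field is assumed here to be a regular expression, the "condition" field is assumed to name a boolean flag, and the three example rows are hypothetical.

```python
import csv
import io
import re

# Hypothetical expression table. The column names follow the fields described
# in the text; treating "string" as a regex is an assumption of this sketch.
TABLE = r"""name,scope,string,condition
notice_clause,,[^.]*\bnotif[^.]*\.,
personnel_change,notice_clause,key personnel|portfolio manager,
fee_terms,,management fee,check_fees
"""


def evaluate(rows, text, flags):
    """Evaluate expressions in table order; return {name: [matched text spans]}."""
    results = {}
    for row in rows:
        # "condition": skip the expression unless the named flag is enabled.
        if row["condition"] and not flags.get(row["condition"], False):
            continue
        # "scope": search only inside the matches of the referenced expression.
        regions = results.get(row["scope"], []) if row["scope"] else [text]
        pattern = re.compile(row["string"], re.IGNORECASE)
        results[row["name"]] = [
            m.group(0).strip() for region in regions for m in pattern.finditer(region)
        ]
    return results


rows = list(csv.DictReader(io.StringIO(TABLE)))
doc = (
    "The Manager shall notify the Client of any change in key personnel. "
    "The management fee is 1%."
)
found = evaluate(rows, doc, flags={"check_fees": False})
```

Because `personnel_change` is scoped to `notice_clause`, it only searches sentences that already mention notification, and `fee_terms` is never evaluated while its condition flag is off.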

[0079] Expressions can be used to search for patterns in documents, and an expression can encapsulate those patterns. Examples of such patterns include the various ways of expressing a notification requirement and personnel changes. For example, there are many terms for "personnel," such as "key persons," "investment team," "professional staff," "senior staff," "senior officers," "portfolio manager," "portfolio managers," "investment managers," "key decision makers," "key personnel," and "investment manager." In some cases, case sensitivity matters. For example, "investment manager" may refer to an employee, whereas "Investment Manager" may refer to the client's investment organization. In some cases, the order of the words (indicating a subject-object relationship) matters. For example, an investment manager notifying the client is not the same as the client notifying the investment manager. All of these types of patterns can be encapsulated in expressions. Subject matter experts (SMEs) can encapsulate in expressions their know-how for analyzing certain types of specialized documents.
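The case-sensitivity and word-order points above can be shown with a short Python sketch using regular expressions. The patterns are hypothetical, not the IDE's expression syntax, and the sketch assumes that the capitalized form "Investment Manager" is the defined term for the organization while the lowercase form refers to an individual.

```python
import re

# Capitalized defined term vs. lowercase generic term (case-sensitive on purpose).
ORGANIZATION = re.compile(r"\bInvestment Manager\b")
EMPLOYEE = re.compile(r"\binvestment manager\b")

# Word order encodes who notifies whom: subject ... notify ... object.
MANAGER_NOTIFIES_CLIENT = re.compile(
    r"\bInvestment Manager\b[^.]*\bnotif(?:y|ies|ied)\b[^.]*\bClient\b"
)


def classify(sentence):
    """Report which of the three hypothetical patterns a sentence matches."""
    return {
        "mentions_organization": bool(ORGANIZATION.search(sentence)),
        "mentions_employee": bool(EMPLOYEE.search(sentence)),
        "manager_notifies_client": bool(MANAGER_NOTIFIES_CLIENT.search(sentence)),
    }


a = classify("The Investment Manager shall notify the Client of any departure.")
b = classify("The Client shall notify the Investment Manager of any withdrawal.")
```

Both sentences contain the same words, but only the first satisfies the ordered pattern, so only `a["manager_notifies_client"]` is true.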

[0080] FIG. 10 illustrates an example of one form of output from the IDE: predicted answers. This output includes the answer to each question for each document. For example, as shown in FIG. 10, the output may comprise a table listing the filename of the input file and the answers to four questions that provide determinations about features of the contract. Depending on the embodiment, the IDE may output more questions or features.

[0081] FIG. 11 illustrates an example of another form of output from the IDE: support and justification for the answers. In FIG. 11, the user interface displays the actual contract language used by the IDE to support and justify a given answer. The actual contract language is presented so that the user can assess whether the IDE is correct. The system can utilize the information stored in the Lume elements to highlight the specific words of the text that form the basis for the answer provided by the IDE. In this way, the IDE allows a human user to easily verify whether an answer is correct. It also enables the user to understand any errors and to modify the expressions in order to correct them.

[0082] FIG. 12 is a system diagram of a system according to an exemplary embodiment of the present invention. As shown in FIG. 12, the system may include a server 120 and an associated database 122, together with the software and data used to run the system. The system may also include a scanner 126 used to scan original documents and ingest them into the system. The server 120 and database 122 may be used to store the ingested documents, as well as the IDE, the Lumes and Lume elements, and other software and data used by the system. A user 125, such as a subject matter expert (e.g., a tax professional), can access and use the server 120, scanner 126, and database 122 via a personal computing device 124, such as a laptop computer, desktop computer, or tablet computer.

[0083] The system can also be configured to allow one or more clients or other users to access the system. For example, as shown in FIG. 12, a client 135 may use a personal computing device 134 and a company server 130 to access the server 120 via a network 110. The client may also transmit client-specific data (e.g., a set of contracts to be analyzed) stored in a client database 132 to the system, to be incorporated into the dataset of documents that is analyzed by the server 120 and stored in the database 122. The server 120 shown in FIG. 12 can receive other documents, spreadsheets, PDF files, text files, audio files, video files, and other structured and unstructured data from other clients or users, represented generally by servers 140 and 150.

[0084] Also shown in FIG. 12 is the network 110. The network 110 may comprise any one or more of, for example, the Internet, an intranet, a Local Area Network (LAN), a Wide Area Network (WAN), an Ethernet connection, a WiFi network, a Global System for Mobile Communication (GSM) link, a cellular phone network, a Global Positioning System (GPS) link, a satellite communications network, or other network. Other computing devices, such as servers, desktop computers, laptop computers, and mobile computers, may be operated by different individuals or groups, for example, and may transmit data such as contracts or insurance policies to the server 120 and the database 122 via the network 110. In addition, cloud-based architectures, including containerized or microservices-based architectures, can also be used to deploy the system.

[0085] FIG. 13 is a flowchart for an analysis system according to an exemplary embodiment of the present invention. As depicted in the figure, the flowchart 1300 includes a document ingestion step 1310, a preprocessing step 1320, an annotation step 1330, an ML framework step 1340, a post-processing step 1350, and a multi-document consolidation step 1360. As a result of these steps, the flowchart 1300 can provide extracted document knowledge.

[0086] According to an embodiment, during step 1310, data is ingested (i.e., input) from a variety of data sources, e.g., machine-readable and/or non-machine-readable PDFs, Word documents, Excel spreadsheets, images, HTML, etc. In particular, the raw data from the various data sources is converted to, and stored in, the same Lume data structure, which provides consistency across the different data types.

[0087] Further, according to an embodiment, a number of operations are performed during the preprocessing step 1320 to enrich the downstream modeling steps. For example, where necessary, optical character recognition (OCR) can be performed to convert text from non-machine-readable PDFs or images into machine-readable text. Additional Lume elements can also be added to incorporate image-related features that can be utilized downstream. Natural language processing operations are also performed on the document text. For example, the words and sentences of the document text can be tokenized and/or lemmatized. Optional information such as part-of-speech tagging or named-entity recognition (NER) can also be included during this step to enrich the information available to subsequent modeling. Custom word embeddings can also be added to token elements, where the word embeddings are retrained on a domain-specific document set and added to the tokenized word elements and/or sentence elements. According to an embodiment, the word embeddings can be retrained with a number of documents, e.g., more than 50 documents. Further, according to an embodiment, the added word embeddings can streamline annotation and smooth over OCR errors in feature generation and modeling. In addition, in situations where documents are compiled into a single file (e.g., a master service agreement and its multiple amendments, which are commonly stored in a single PDF), it may be necessary to split the file into its component documents. In these cases, a heuristic or a trained model is utilized to split the file into its constituent parts. According to an embodiment, document splitting is useful in cases where consolidation logic is to be applied to sets of document families. In these situations, each document needs to be analyzed and considered separately in order to apply the logic properly to the set of documents. For example, assuming a master service agreement has three amendments, the information across these related documents, e.g., the payment terms of the contract, can be consolidated after preprocessing and model prediction have run. However, only one of the documents, e.g., the most recent amendment, may contain the most relevant information, such as the payment terms of the contract. Contract consolidation can therefore be used to apply logic across the set of documents and extract the most relevant information.
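The master-agreement example above can be sketched in Python as a simple consolidation rule. This is only one possible rule, assumed for illustration: take the value from the most recent document in the family that actually states the field. The class, field names, dates, and payment terms are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DocumentPrediction:
    """Per-document model output; all field names are hypothetical."""
    name: str
    effective: date
    payment_terms: Optional[str]  # None when the document is silent on the field


def consolidate_payment_terms(family):
    """Return (document name, value) from the most recent document stating the field."""
    for doc in sorted(family, key=lambda d: d.effective, reverse=True):
        if doc.payment_terms is not None:
            return doc.name, doc.payment_terms
    return None


family = [
    DocumentPrediction("master agreement", date(2015, 1, 1), "net 30"),
    DocumentPrediction("amendment 1", date(2016, 6, 1), None),
    DocumentPrediction("amendment 2", date(2017, 3, 15), "net 45"),
    DocumentPrediction("amendment 3", date(2018, 9, 30), None),
]
result = consolidate_payment_terms(family)
```

The rule only works if the compiled file has first been split into its component documents, since each document needs its own date and its own prediction.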

[0088] Further, according to an embodiment, human knowledge and expertise can be incorporated into the process 1300 during the annotation step 1330, in which SMEs can label specific information in a document. The labels may mark specific phrases and/or text to be extracted, or may label a particular clause or paragraph as a particular type, e.g., Type A, Type B, etc. According to an embodiment, this SME knowledge can be incorporated in a variety of ways, e.g., through a web-based or Excel-based user interface. These annotations can then be added directly to the Lume data structure.

[0089] FIG. 14 is a flowchart of the annotation step 1330. After preprocessing is complete, the Lume data structure can be annotated. The data in the Lume includes the text of the documents as well as elements that describe the corresponding words, sentences, etc. The information in the Lume is then utilized during the annotation step. In particular, annotations are added as elements that directly reference the data (e.g., text) contained in the Lume. As depicted in the figure, in step 1331, keywords/phrases and representative examples of the document language are identified. According to an embodiment, the identification can be performed by an SME via a user interface. The identified keywords/phrases and representative examples can be provided to a knowledge base 1334. The identified keywords/phrases and representative examples can also be used to calculate embeddings of the example sentences, as depicted in step 1332. Then, in step 1333, custom word embeddings are trained based on the calculated embeddings and the SME knowledge, and these can also be provided to the knowledge base 1334. As also depicted in the figure, active learning steps can be performed.

[0090] During active learning, a strategy is created to identify candidate data and streamline the data annotation and training set creation process. According to an embodiment, active learning utilizes word embeddings, sentence embeddings, and keywords to find likely candidates of text in a broader dataset. In particular, a set of logical keyword searches, as well as some examples of the target text (e.g., example sentences in which the target information appears), are input for the analysis. For example, when searching for candidates to annotate for the term of a contract, the keywords may include language such as "term," "period," "years," or "months." Sentence embeddings, such as that of "[t]he Agreement will last for a term of 10 years," can also be utilized to find similar contextual language. This particular active learning strategy narrows the search to candidates that are, with high probability, similar to the desired annotations even if they are not exact matches. A user can then review these results and use the candidate annotations to add labels directly to the Lume dataset of documents. Further, according to an embodiment, this active learning strategy is also useful for balancing the training set with rare information, e.g., rare fields. In addition, through active learning, a variety of annotations can be generated and a representative dataset developed in a streamlined manner and stored in the Lume along with other metadata. In this way, the annotations can be utilized together with the complementary information stored in the Lume.
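The keyword-plus-similarity search described above might be sketched as follows in Python. The sketch substitutes a toy bag-of-words vector for a trained sentence embedding, which is an assumption made only to keep the example self-contained; the function names, the threshold, and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def find_candidates(sentences, keywords, example, threshold=0.3):
    """Keep sentences that hit a keyword and resemble the example sentence."""
    target = embed(example)
    scored = []
    for sentence in sentences:
        tokens = embed(sentence)
        if not any(k in tokens for k in keywords):
            continue  # logical keyword search acts as a first filter
        score = cosine(tokens, target)
        if score >= threshold:
            scored.append((score, sentence))
    return [sentence for _, sentence in sorted(scored, reverse=True)]


candidates = find_candidates(
    sentences=[
        "This Agreement shall continue for a term of 5 years.",
        "The Manager shall receive a fee payable every 3 months.",
        "Notices shall be sent to the address below.",
    ],
    keywords={"term", "period", "years", "months"},
    example="The Agreement will last for a term of 10 years",
)
```

The second sentence hits the keyword "months" but is dissimilar to the example, so the similarity filter drops it; an SME would then confirm or reject the remaining candidates.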

[0091] According to an embodiment, as depicted in the figure, a particular active learning strategy (e.g., increasing data diversity, improving model informativeness, etc.) can be applied. For example, the similarity of sentence embeddings can be compared to a mean. Then, in step 1336, the user can review the results of the strategy, e.g., by confirming or rejecting particular labels. The results are then incorporated into the Lume metadata. The user can also refine the searches or annotations, or add new data as needed. Then, as depicted by step 1337, the confirmed labels are added to the model.

[0092] According to an embodiment, the exemplary framework combines both implicit knowledge transfer and explicit knowledge transfer in a complementary manner. For example, implicit knowledge transfer, such as feature engineering in the form of IDE expressions, is used to support explicit knowledge transfer, i.e., annotation via active learning. That is, IDE expressions can be used to give the active learning algorithm the ability to supply candidates for the SME to label/review. Further, according to an embodiment, in the process of reviewing the candidates, the engineered features are also updated/refined based on the SME's observations. This cycle (e.g., IDE expression features ("explicit") -> review of candidates ("implicit") -> refinement of features based on observations ("explicit") -> review of more candidates ("implicit")) is repeated until the model meets the expected performance.

[0093] FIGS. 15A and 15B illustrate the interaction between components in the active learning step depicted in FIG. 13. According to an embodiment, the active learning step can utilize a user interface 1410, an active learning application programming interface (API) 1420, a database 1430, a model management module 1440, an Ignite platform 1450, and a local platform 1460. The API 1420 communicates with the model management module 1440, which allows a user to run any number of experiments on a given dataset (e.g., changing hyperparameters or feature sets). The API 1420 also tracks performance metrics for the particular settings of each experiment. In addition, the API 1420 can interact with the Ignite platform (e.g., a cloud server running Ignite software to execute workflows) or the local platform (e.g., a local server or personal computing device running Ignite software to execute workflows) to interpret instructions for active learning. For example, if an SME is attempting to create a model that predicts the "supplier name" in a plurality of contracts, the SME can indicate to the model, e.g., via the user interface 1410, that the supplier name is commonly located somewhere around the words "by," "between," "agreement," "inc.," etc. According to an embodiment, the SME can provide this information to the model in the form of IDE expressions. The active learning strategy then uses the API 1420 to select the annotation candidates that best fit the description in the IDE expressions, e.g., automatic annotations ("auto-annotations"). These candidates can be reviewed by the SME using the user interface 1410, thereby providing the model with implicit knowledge about the "supplier name." For example, an initial model, e.g., Model 1 in FIG. 15B, can be trained on the reviewed examples (candidates manually confirmed by the user) and on additional auto-annotated examples from the active learning strategy. The model's performance can then be evaluated on a test set. According to an embodiment, the manually reviewed examples can be retained for future training, whereas the auto-annotated examples are not propagated through further model iterations. During this candidate review process, the SME can refine the IDE expressions based on the observed results (e.g., removing the word "by" and adding the word "company"). Once this refinement is complete, a Model 2 active learning strategy can be configured from the IDE expression refinements, which may be provided by the SME via the user interface 1410. Users can then manually review examples from this updated active learning strategy. As in the first iteration, the new model is trained on both the manually reviewed annotations (provided by the SME via the user interface 1410) and the auto-annotations (provided directly by the active learning prediction framework). This results in a new model version (e.g., from Model 1 to Model 2 in FIG. 15B), which is in turn utilized within the active learning prediction framework to generate new candidates for review based on these refinements. The cycle continues until the model has sufficient implicit and explicit knowledge to make predictions at an acceptable performance level.

[0094] According to an embodiment, after the SME annotations are incorporated into the Lume data structure, model training may begin with the ML framework 1340. According to an embodiment, the ML framework 1340 consists of several components that work together to train algorithms on, or apply algorithms to, Lume data structures. For example, the information extraction component 1349 serves as a layer that interacts with the machine learning component 1346. Further, according to an embodiment, users may create a configuration file 1341 that the information extraction component 1349 interprets before sending instructions to the machine learning component 1346. According to an embodiment, the instructions in the configuration file 1341 include the task type (e.g., training, validation, prediction, etc.), the algorithm type and package (e.g., regression algorithms such as scikit-learn (sklearn) logistic regression, recurrent algorithms such as Keras LSTM, etc.), and the features (e.g., custom features, word embeddings, etc.). The machine learning component 1346 acts on the information passed from the configuration file 1341 by executing training or prediction as instructed and/or by sending instructions to the regression or recurrent algorithm. The machine learning component 1346 may also apply any labeling techniques that may be needed, such as BIO labeling, sliding windows, etc., and may save or load trained models. According to an embodiment, the regression and recurrent algorithms receive data inputs from the machine learning component 1346, perform training or prediction as instructed via the configuration file 1341, and return the results (e.g., a trained model or predictions) to the machine learning component 1346. Further, according to an embodiment, the process builder 1345 may enable all of the above tasks by acting as an API for building and translating instructions, which may be provided in YAML format. For example, if the user wants to use different modeling packages for training and prediction, the user can provide the package and model type names in the YAML configuration to the framework 1347 of the process builder 1345. The user may also customize any default modeling algorithm using module 1348. Further, with the ML framework 1340, minimal changes, if any, to the YAML file are needed to change which feature engineering and model training steps are included or excluded. Differences in behavior across models are isolated in the configuration YAML file and are not mixed into the common code base. This allows the code base to remain "stable" while still giving users the flexibility to make targeted modifications, at any point and of any scope (e.g., fine-grained and/or coarse), to the workflow behavior of a particular model instance. Also, because these modifications reside in the configuration file 1341 (not in code), they can be delivered safely to the platform without installing additional code in the deployment. For example, the user may modify the model input to ignore punctuation or stop words, to add additional features such as word embeddings, or to determine whether a word is capitalized. These changes can be made by modifying the configuration YAML file instead of changing the source code. The configuration file can then take the referenced features and generate a feature matrix from the training dataset.
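By way of non-limiting illustration, the configuration-driven feature engineering described above may be sketched as follows. The sketch is an assumption for explanatory purposes only: the configuration is shown as the dictionary that a YAML loader would produce, and every key name, feature name, and function name (e.g., `CONFIG`, `FEATURES`, `build_feature_matrix`) is hypothetical and does not appear in the disclosure.

```python
import string

# Hypothetical configuration, shown as the dict a YAML loader would produce.
# All keys and values are illustrative assumptions, not the actual schema.
CONFIG = {
    "task": "train",
    "algorithm": {"package": "sklearn", "type": "logistic_regression"},
    "preprocessing": {"ignore_punctuation": True, "stop_words": ["the", "a", "of"]},
    "features": ["is_capitalized", "token_length"],
}

# Registry mapping feature names used in the config to feature functions.
FEATURES = {
    "is_capitalized": lambda tok: int(tok[:1].isupper()),
    "token_length": lambda tok: len(tok),
}


def build_feature_matrix(text, config):
    """Return (tokens, matrix): one row per kept token, one column per configured feature."""
    pre = config["preprocessing"]
    tokens = text.split()
    if pre.get("ignore_punctuation"):
        tokens = [t.strip(string.punctuation) for t in tokens]
    stop = set(pre.get("stop_words", []))
    tokens = [t for t in tokens if t and t.lower() not in stop]
    matrix = [[FEATURES[name](t) for name in config["features"]] for t in tokens]
    return tokens, matrix


tokens, matrix = build_feature_matrix("The term of the Agreement, renewed.", CONFIG)
```

In this sketch, adding or removing a feature requires only an edit to the `features` list of the configuration, so the common code remains unchanged.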

[0095] FIG. 16 is a diagram of the machine learning step depicted in FIG. 13, in accordance with an exemplary embodiment of the present invention. According to an embodiment, both the training of a model and prediction from an existing model are performed using the same configuration file, e.g., configuration file 1341. As depicted in the figure, during training mode, target truth labels may be extracted from a training dataset, such as a LumeDataset, and then provided to an initialized model. Features may also be extracted from the training dataset and then provided to the initialized model. The selected model architecture, e.g., a third-party modeling package 1440 (e.g., scikit-learn, Keras, etc.), then executes the model training step, and the trained model is stored in the database 1430. Then, during prediction mode, the trained model may be loaded from the database 1430 and run on the features extracted from the testing dataset, as well as on the feature matrix established from the configuration file, to predict results for the testing dataset. According to an embodiment, the training dataset is data used specifically to develop the model and not to test model performance; conversely, the testing dataset is used to test model performance and not to train the model. However, both datasets must be labeled.
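As a non-limiting sketch of the two modes described above, the following example trains and predicts from the same configuration. It rests on several assumptions made only for illustration: an in-memory dictionary (`MODEL_STORE`) stands in for the database 1430, a toy majority-label model stands in for a third-party modeling package, and all names are hypothetical.

```python
import json
from collections import Counter

MODEL_STORE = {}  # in-memory stand-in for a model database such as database 1430


class MajorityLabelModel:
    """Toy model that always predicts the most frequent training label."""

    def __init__(self, label=None):
        self.label = label

    def fit(self, features, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.label for _ in features]


def extract_features(dataset, config):
    """Build one feature row per document from the features named in the config."""
    extractors = {"text_length": lambda doc: len(doc["text"])}
    return [[extractors[name](doc) for name in config["features"]] for doc in dataset]


def run(config, dataset, mode):
    """Train and store a model, or load the stored model and predict."""
    features = extract_features(dataset, config)
    if mode == "train":
        labels = [doc["label"] for doc in dataset]  # target truth labels
        model = MajorityLabelModel().fit(features, labels)
        MODEL_STORE[config["model_name"]] = json.dumps({"label": model.label})
        return None
    params = json.loads(MODEL_STORE[config["model_name"]])
    return MajorityLabelModel(**params).predict(features)


config = {"model_name": "auto_renew", "features": ["text_length"]}
train_set = [
    {"text": "renews automatically", "label": "yes"},
    {"text": "renews each year", "label": "yes"},
    {"text": "no renewal", "label": "no"},
]
test_set = [{"text": "term renews", "label": "yes"}]

run(config, train_set, "train")
predictions = run(config, test_set, "predict")
```

Because both datasets are labeled, the predictions for the testing dataset can be compared against its labels to measure model performance.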

[0096] According to an embodiment, many of the questions that may be asked about documents involve explicit extraction of raw information from the text itself. However, for non-machine-readable documents, in which spelling errors are common and/or formatting is inconsistent, additional processing is needed. For example, dates may be written in documents in many different ways (e.g., 4/5/2010, 4.5.10, April 5, 2010, the 5th of April 2010, etc.), yet the information must still be formatted consistently when it is reported for analysis. Therefore, post-processing is required. In this regard, during the post-processing step 1350, the user may customize specific tasks and functions to perform on the model results. The post-processing step 1350 may also be used to impose certain business logic and conditions on the results of the models. For example, business logic may be imposed where one field depends on another: if the model predicts that a contract has no auto-renewal, there should be no result for the auto-renewal period. Accordingly, the post-processing step 1350 allows data to be provided in the format required by the user, and allows business logic to be imposed on the various model predictions when the results include interdependent fields.
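For illustration only, the date normalization and the auto-renewal rule described above may be sketched as follows. The list of accepted formats, the month-first reading of the numeric forms, the ISO 8601 output format, and the field names `auto_renew` and `auto_renew_period` are all assumptions made for this sketch.

```python
from datetime import datetime

# Assumed input formats; month-first ordering of numeric dates is an assumption.
DATE_FORMATS = ("%m/%d/%Y", "%m.%d.%y", "%B %d, %Y", "%d %B %Y")


def normalize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def apply_business_logic(result):
    """If no auto-renewal is predicted, drop any predicted auto-renewal period."""
    cleaned = dict(result)
    if not cleaned.get("auto_renew"):
        cleaned["auto_renew_period"] = None
    return cleaned


normalized = [normalize_date(d) for d in ("4/5/2010", "4.5.10", "April 5, 2010")]
checked = apply_business_logic({"auto_renew": False, "auto_renew_period": "12 months"})
```

Under these assumptions, each of the three date strings normalizes to the same value, and the inconsistent auto-renewal period is cleared.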

[0097] Further, according to an embodiment, during the consolidation step 1360, related documents are input, and business logic is then applied by the graph consolidation engine 1361 (see FIG. 17) to determine which information is to be reported from which document. For example, in the case of a master service agreement with multiple amendments, the information about the contract term should be derived from the most recent amendment. According to an embodiment, this logic may be coded into the graph consolidation engine 1361 by the user. Further, the consolidation operations may be implemented with a graph database 1370 (e.g., JanusGraph) in order to model the relationships between documents. For example, as depicted in FIG. 17, multiple documents 1362 and 1363 (or versions of the same document, i.e., "Document 1") may be input to the graph consolidation engine 1361 with updated or conflicting facts (e.g., facts A and B). For example, in document 1362, fact A = "true" and fact B = "1". In document 1363, on the other hand, fact A = "false" and fact B = "2". In this regard, to resolve the conflicts between documents 1362 and 1363, the graph consolidation engine 1361 uses other model outputs found in the documents, which can be retrieved from the graph database 1370. The graph consolidation engine 1361 can then provide a consolidated output 1364 that reflects the currently true facts for Document 1.
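One possible consolidation rule, offered purely as an assumption-laden sketch, is for facts from the document with the latest effective date to override earlier ones. The effective dates below are invented for illustration (the disclosure does not state which of documents 1362 and 1363 is more recent), and the function name `consolidate` is hypothetical.

```python
from datetime import date


def consolidate(documents):
    """Merge facts so that values from later effective dates override earlier ones."""
    consolidated = {}
    for doc in sorted(documents, key=lambda d: d["effective_date"]):
        consolidated.update(doc["facts"])
    return consolidated


# Hypothetical effective dates: document 1363 is assumed to be the later version.
documents = [
    {"id": 1363, "effective_date": date(2019, 6, 1), "facts": {"A": False, "B": 2}},
    {"id": 1362, "effective_date": date(2018, 1, 15), "facts": {"A": True, "B": 1}},
]

consolidated_output = consolidate(documents)
```

Here the effective date plays the role of the "other model output" used to resolve the conflict; a user could substitute different business logic without changing the input documents.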

[0098] FIG. 18 is a diagram depicting graph schemas for representing multiple documents in accordance with an exemplary embodiment of the present invention. For example, as depicted in the figure, documents 1366 (i.e., Doc 1, Doc 2, Doc 3, and Doc 4) may be represented by graph schema 1367 (i.e., graph schema A) or graph schema 1368 (i.e., graph schema B). According to an embodiment, the graph schemas 1367 and 1368 may be based on custom models for business cases defined by SMEs. The graph schemas 1367 and 1368 can be created via a configuration file, in which the SME can specify the information in a document that can be used to determine the connections between the documents 1366 in the graph. This graph model can then be loaded into a graph database, and all data loaded into the graph conforms to this graph model. Further, graph edges can be established automatically and dynamically based on the processed models. In this regard, with graph schema 1367, the documents 1366 are linked by a shared document ID, e.g., "Contract Family 1". With graph schema 1368, the Lumes are linked to a document root through their associated client names.
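The two schemas may be illustrated, under stated assumptions, by linking the same four documents through different shared fields. The field names `contract_family` and `client`, the client values, and the adjacency-dictionary representation are hypothetical and are used here in place of an actual graph database.

```python
from collections import defaultdict

# Hypothetical extracted fields for the four documents.
DOCS = [
    {"id": "Doc 1", "contract_family": "Contract Family 1", "client": "Client X"},
    {"id": "Doc 2", "contract_family": "Contract Family 1", "client": "Client X"},
    {"id": "Doc 3", "contract_family": "Contract Family 1", "client": "Client Y"},
    {"id": "Doc 4", "contract_family": "Contract Family 1", "client": "Client Y"},
]


def build_graph(docs, link_field):
    """Return hub node (the shared field value) -> list of linked document ids."""
    graph = defaultdict(list)
    for doc in docs:
        graph[doc[link_field]].append(doc["id"])
    return dict(graph)


schema_a = build_graph(DOCS, "contract_family")  # linked by shared document ID
schema_b = build_graph(DOCS, "client")           # linked by client name
```

Changing the linking field in the configuration yields a different graph over the same documents, which is the sense in which the edges are established dynamically in this sketch.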

[0099] Further, according to an embodiment, the exemplary framework can answer questions about a document family using graph queries tailored to the dynamic schema. For example, if the question is "find the renewal period, prioritizing amendments, newest first," the exemplary framework translates the question into a graph query and performs a traversal of the graph, finding only the amendments and sorting them by the "effective date model." The result is returned to the user together with a full explanation of how the consolidation was performed, and the user does not need to understand the underlying graph model. For example, the result may be: "Found X amendments, dated as follows. They have the following renewal periods: Y. The best answer is Z." Similarly, if the question is "lowest price in effect," the exemplary framework translates the question into a graph query and performs a traversal of the graph to find any document that has a price, and then finds the lowest price. In this regard, the result may be: "Found X documents with a price. The values are […]. The lowest value is Y."
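The two example questions may be sketched as follows. This is an illustrative assumption only: a plain Python traversal over a list of documents stands in for a graph query, and the document fields, values, and function names are hypothetical.

```python
from datetime import date

# Hypothetical document family with model outputs attached to each document.
FAMILY = [
    {"id": "MSA", "type": "master", "effective": date(2017, 1, 1),
     "renewal_period": "12 months", "price": 100.0},
    {"id": "Amendment 1", "type": "amendment", "effective": date(2018, 3, 1),
     "renewal_period": "24 months", "price": 90.0},
    {"id": "Amendment 2", "type": "amendment", "effective": date(2019, 7, 1),
     "renewal_period": "6 months", "price": None},
]


def renewal_period(family):
    """Find amendments only, newest first, and explain how the answer was chosen."""
    amendments = sorted(
        (d for d in family if d["type"] == "amendment"),
        key=lambda d: d["effective"],
        reverse=True,
    )
    dates = [d["effective"].isoformat() for d in amendments]
    periods = [d["renewal_period"] for d in amendments]
    best = periods[0] if periods else None
    explanation = (
        f"Found {len(amendments)} amendments, dated {dates}. "
        f"They have the following renewal periods: {periods}. Best answer: {best}."
    )
    return best, explanation


def lowest_price(family):
    """Find every document that has a price, then report the lowest one."""
    prices = [d["price"] for d in family if d["price"] is not None]
    lowest = min(prices) if prices else None
    explanation = (
        f"Found {len(prices)} documents with a price. "
        f"The values are {prices}. The lowest value is {lowest}."
    )
    return lowest, explanation


best_period, period_explanation = renewal_period(FAMILY)
best_price, price_explanation = lowest_price(FAMILY)
```

Returning the explanation alongside the answer mirrors the described behavior of reporting how the consolidation was performed without exposing the underlying graph model.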

[00100] Further, as depicted in FIG. 13, the exemplary framework, e.g., flow 1300, also includes a quality assessment (QA) component that implements QA checks after every step of the flow 1300 in order to enforce high quality and consistency. These checks may include (i) whether certain Lume elements were created and added to the Lume data structure as expected, (ii) whether all Lumes were successfully passed from step to step, and (iii) whether the correct attribute keys and counts were included at each step. In addition, users can configure and add their own custom quality assessment checks as needed.
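A minimal sketch of the three kinds of checks is given below. Because the internal layout of a Lume is not specified here, the sketch assumes a simple dictionary with an `elements` list, each element having a `type` and an `attributes` mapping; this layout and all function names are hypothetical.

```python
def check_expected_elements(lume, expected_types):
    """(i) Every expected element type is present in the Lume."""
    present = {el["type"] for el in lume["elements"]}
    return set(expected_types) <= present


def check_all_passed(ids_in, ids_out):
    """(ii) Every Lume that entered a step also left it."""
    return set(ids_in) == set(ids_out)


def check_attribute_keys(lume, element_type, required_keys, expected_count):
    """(iii) Elements of a type carry the required keys and occur the expected number of times."""
    matches = [el for el in lume["elements"] if el["type"] == element_type]
    keys_ok = all(set(required_keys) <= set(el["attributes"]) for el in matches)
    return keys_ok and len(matches) == expected_count


def failed_checks(results):
    """Return the names of the checks that did not pass."""
    return [name for name, passed in results.items() if not passed]


# Hypothetical Lume after a tokenization step.
lume = {
    "id": "lume-1",
    "elements": [
        {"type": "word", "attributes": {"text": "Agreement", "embedding": [0.1, 0.2]}},
        {"type": "word", "attributes": {"text": "term"}},
    ],
}

results = {
    "elements_created": check_expected_elements(lume, ["word"]),
    "all_lumes_passed": check_all_passed(["lume-1"], ["lume-1"]),
    "attribute_keys": check_attribute_keys(lume, "word", ["text", "embedding"], 2),
}
failures = failed_checks(results)
```

A user-defined check can be added by inserting another named entry into `results`, which corresponds to the custom quality assessment checks mentioned above.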

[00101] It will be appreciated by those skilled in the art that the various embodiments described herein are capable of broad utility and application. Accordingly, while the various embodiments are described herein in detail in relation to the exemplary embodiments, it is to be understood that this disclosure is illustrative and exemplary of the various embodiments and is made to provide an enabling disclosure. Accordingly, this disclosure is not intended to be construed as limiting the embodiments or as excluding any other such embodiments, adaptations, variations, modifications, and equivalent arrangements.

[00102] The foregoing descriptions provide examples of different configurations and features of embodiments of the invention. While certain nomenclature and types of applications/hardware are described, other names and application/hardware usage are possible, and the nomenclature is provided by way of non-limiting example only. Further, while particular embodiments are described, it should be appreciated that the features and functions of each embodiment may be combined in any combination within the capability of one of ordinary skill in the art. The figures provide additional exemplary details regarding the various embodiments.

[00103] Various exemplary methods are provided herein by way of example. The described methods may be executed or otherwise performed by one or a combination of various systems and modules.

[00104] The use of the term computer system in the present disclosure can relate to a single computer or to multiple computers. In various embodiments, the multiple computers can be networked. The networking can be any type of network, including, but not limited to, wired and wireless networks, a local-area network, a wide-area network, and the Internet.

[00105] According to exemplary embodiments, the system software may be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, a data processing apparatus. Implementations can include single or distributed processing of algorithms. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, or a combination of one or more of them. The term "processor" encompasses all apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, software code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[00106] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

[00107] A computer may encompass all apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[00108] The processes and logic flows described in this document can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and the apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

[00109] Computer-readable media suitable for storing computer program instructions and data can include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[00110] While the embodiments have been particularly shown and described within the framework of performing analysis, it will be appreciated that variations and modifications may be effected by a person skilled in the art without departing from the scope of the various embodiments. Furthermore, one skilled in the art will recognize that such processes and systems do not need to be restricted to the specific embodiments described herein. Other embodiments, combinations of the present embodiments, and their uses and advantages will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. The specification and examples should be considered exemplary.

Claims

A computer-implemented method for analyzing data from various data sources, comprising:
receiving data from the various data sources as inputs;
converting the received data from each of the various data sources into a common data structure;
identifying keywords in the received data;
generating sentence or word embeddings based on the identified keywords;
receiving a selection of one or more labels based on the generated sentence or word embeddings;
adding the selected one or more labels to a model;
training the model on the common data structure based on a configuration file; and
generating a result in response to a user question based on the model, wherein the generating comprises:
retrieving related documents from the received data;
determining which information from which of the retrieved related documents is to be reported; and
providing the result based on the determination and a graph schema associated with the related documents.

The method of claim 1,
wherein the various data sources include at least one of a machine-readable document, a non-machine-readable document, a spreadsheet, an image, and a Hypertext Markup Language file.

The method of claim 1, further comprising:
partitioning the received data into component documents, wherein the received data is partitioned based on one of a heuristic model and a trained model.

The method of claim 1, further comprising:
tokenizing at least one of word elements and sentence elements of the received data; and
adding default word embeddings to at least one of the tokenized word elements and sentence elements.

The method of claim 1,
wherein the configuration file includes instructions regarding at least one of a task type, an algorithm type, and features.

The method of claim 5,
wherein (i) the task type is one of training, validation, and prediction, (ii) the algorithm type is one of a regression algorithm and a recurrent algorithm, and (iii) the features include word embeddings.

The method of claim 1,
further comprising performing at least one quality assessment check.

The method of claim 1, further comprising:
receiving, via a user interface, at least one representation;
providing the at least one representation to the model;
selecting, using an application programming interface, annotation candidates associated with the at least one representation; and
training the model based on the selected annotation candidates.

The method of claim 1,
wherein, during training, target truth labels and features are extracted from a training dataset and then provided to the model.

A computer-implemented system for analyzing data from various data sources, comprising:
a processor configured to:
receive data from the various data sources as inputs;
transform the received data from each of the various data sources into a common data structure;
identify keywords in the received data;
generate word or sentence embeddings based on the identified keywords;
receive a selection of one or more labels based on the generated word or sentence embeddings;
add the selected one or more labels to a model;
train the model on the common data structure based on a configuration file; and
generate a result in response to a user question based on the model,
wherein the generating comprises:
retrieving related documents from the received data;
determining which information from which of the retrieved related documents is to be reported; and
providing the result based on the determination and a graph schema associated with the related documents.

The system of claim 10,
wherein the various data sources include at least one of a machine-readable document, a non-machine-readable document, a spreadsheet, an image, and a Hypertext Markup Language file.

The system of claim 10,
wherein the processor is further configured to partition the received data into component documents, wherein the received data is partitioned based on one of a heuristic model and a trained model.

The system of claim 10,
wherein the processor is further configured to:
tokenize at least one of word elements and sentence elements of the received data; and
add default word embeddings to at least one of the tokenized word elements and sentence elements.

The system of claim 10,
wherein the configuration file includes instructions regarding at least one of a task type, an algorithm type, and features.

The system of claim 14,
wherein (i) the task type is one of training, validation, and prediction, (ii) the algorithm type is one of a regression algorithm and a recurrent algorithm, and (iii) the features include word embeddings.

The system of claim 10,
wherein the processor is further configured to perform at least one quality assessment check.

The system of claim 10,
wherein the processor is further configured to:
receive, via a user interface, at least one representation;
provide the at least one representation to the model;
select, using an application programming interface, annotation candidates associated with the at least one representation; and
train the model based on the selected annotation candidates.

The system of claim 10,
wherein, during training, target truth labels and features are extracted from a training dataset and then provided to the model.

A computer-implemented system for analyzing data from various data sources, comprising:
an application programming interface; and
a processor,
wherein the processor is configured to generate a result in response to a user question based on a machine learning model, wherein the generating comprises:
retrieving related documents from the received data;
determining which information from which of the retrieved related documents is to be reported; and
providing the result based on the determination and a graph schema associated with the related documents,
wherein the machine learning model is trained on annotation candidates provided by the application programming interface.