WO2023204724A1 - Method for analyzing a legal document - Google Patents

Method for analyzing a legal document

Info

Publication number
WO2023204724A1
Authority
WO
WIPO (PCT)
Prior art keywords
documents
checked
document
fragment
model
Prior art date
Application number
PCT/RU2022/000134
Other languages
English (en)
Russian (ru)
Inventor
Виктор Борисович НАУМОВ
Денис Александрович САВЕЛЬЕВ
Original Assignee
Общество С Ограниченной Ответственностью "Дентонс Юроп" (Ооо "Дентонс Юроп")
Priority date
Filing date
Publication date
Application filed by Общество С Ограниченной Ответственностью "Дентонс Юроп" (Ооо "Дентонс Юроп")
Priority to PCT/RU2022/000134
Publication of WO2023204724A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/418 Document matching, e.g. of document images

Definitions

  • The invention relates to digital data processing methods designed for specific functions that perform complex mathematical operations for automatic text analysis, and more particularly to a method for checking a textual legal document for compliance with the requirements of applicable law.
  • The proposed invention relates to the field of information technology, namely to a method for automated semantic comparison of texts in natural language.
  • A peculiarity of legal documents is that their style differs from literary language and from other types of text documents. This difference shows, among other things, in the absence of interrogative and exclamatory sentences and of dialogues, in the clear structuring of the text into elements that carry independent meaning, and in the use of monotonous terms and phrases, their repetition, and identical syntactic constructions. These make it difficult to find differences and similarities between texts, especially if the same legal construction is described in different ways.
  • Patents known to the Applicant that are aimed at operating on legal documents, for example [1-4], as well as the analogues and prototype considered and many other patents, do not take this feature into account.
  • An invention is also known [6] in which a functional structure is determined for each section of a segmented text and, in each functional structure, triads characterizing predicate terms are found based on linearization transfer rules. The following features are then extracted from each section of the text: named entity, identity by referent, lexical entry, semantic-structural relation.
  • The drawback of this analogue is that it is focused only on search tasks.
  • The closest to the claimed invention is a method for automated semantic comparison of texts in natural language [22]. This method consists in the following: representing the two compared texts in digital form, indexing these texts as elementary units (words, word forms, set phrases), identifying the frequencies of occurrence of the elementary units,
  • The degree of intersection of the semantic networks of the two compared texts is a value characterizing the semantic similarity of these texts.
  • The disadvantage of the prototype is the impossibility of making the right legal decision, i.e. To .
  • The technical problem is to optimize the method for analyzing legal texts for compliance with applicable rules of law by comparing legal texts effectively, introducing prepared sample documents into the analysis, and using a prepared language model.
  • The technical result is expanded functionality, achieved by providing information about the similarity or difference between the checked legal documents and a prepared standard document that complies with the applicable law; faster analysis of legal documents with more accurate results from the perspective of the applicable law; and a reduction in the total time a specialist needs to analyze legal documents when solving legal and other tasks.
  • Fig. 1 is a flow diagram of a method for analyzing a legal document to check it for compliance with the requirements of applicable law according to the invention.
  • The method is carried out in the following steps:
  • The operations are carried out with intermediate results stored, for example, in random access memory or in a database.
  • The proposed method is carried out as follows, taking the checking of sales contracts as an example.
  • The incoming (checked) documents are compared with one or more pre-developed and/or selected samples.
  • Input documents are legal documents, for example contracts of a certain type, in particular a purchase and sale agreement.
  • A sample document is a previously prepared and/or selected sample agreement with the relevant subject matter and the relevant mutual rights and obligations of the parties.
  • Another example is statements, for instance a subject's written consent to the processing of personal data.
  • Incoming (checked) documents may be various consents to the processing of personal data from different subjects. They are subject to legal verification and must comply with the applicable law; they may be formulated in different ways, with different structures and functional elements, and are stored digitally on remote or local servers.
  • The verification document may be a model consent to the processing of personal data, formulated by experts in the relevant field. Checking fragments of the document being checked against the verification document makes it possible to assess whether the checked document complies with the applicable law, by identifying discrepancies for further expert assessment.
  • The documents being checked need not have been created according to the specified sample.
  • Their developers may be different, unrelated persons, and the purpose of the documents may vary. The documents may differ in numbering and/or breakdown into structural elements, and in volume and content.
  • Similar structural elements of a document may be worded differently (using synonyms and/or various phrases, sentences and other verbal constructions).
  • The system implements the specified method by performing the following steps:
  • Machine learning algorithms based on a fixed-length feature vector can be used to create a vector model of a language or language domain.
  • Fixed-length features can be represented using the approach known as “bag of words”.
  • A paragraph vector representation can also be used: an unsupervised learning algorithm that creates fixed-length vector representations from variable-length text fragments such as sentences, paragraphs and documents.
  • Each paragraph is mapped to a unique vector represented by a column in the matrix D, and each word is also mapped to a unique vector represented by a column in the matrix W.
  • The paragraph vector and the word vectors are averaged or concatenated to predict the next word in a given context.
  • The contexts considered are of a fixed length and are obtained by sampling from a “sliding window” over the paragraph.
  • The paragraph vector is shared across all contexts generated from the same paragraph, but not across different paragraphs.
  • Paragraph vectors can be used as paragraph features (instead of or in addition to the bag-of-words representation mentioned above) in well-known machine learning algorithms such as logistic regression, support vector machines or K-means.
  • The block diagram of the proposed method is shown in Fig. 1.
  • The method is implemented by loading text into computer memory, or by using a local database and/or remote access to a database.
  • A computer program that implements the described actions is used. Visualization can be carried out by known methods on the screen of a desktop, laptop or mobile device, locally or over computer networks, including the Internet.
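
The fixed-length "bag of words" representation mentioned in the description can be illustrated with a minimal sketch. It is not the applicant's implementation: the function names (`tokenize`, `build_vocabulary`, `bag_of_words`) and the simple regular-expression tokenizer are assumptions made for illustration. The point it shows is that fragments of any length map to vectors of the same length, one position per vocabulary word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(fragments):
    """Assign a fixed index to every distinct token found in the fragments."""
    tokens = sorted({tok for frag in fragments for tok in tokenize(frag)})
    return {tok: index for index, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector whose length equals the vocabulary size.

    Tokens that are not in the vocabulary are ignored, so every fragment,
    short or long, is represented by a vector of the same fixed length.
    """
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector
```

Word order is discarded in this representation, which is one reason the description also mentions paragraph vectors as an alternative or additional feature.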

Abstract

The invention relates to methods for digital data processing. The method comprises the following steps: creating sample documents in digital form or converting sample documents into digital form; placing the documents to be checked in computer memory; preparing a vector language model on a corpus of texts whose vocabulary is general or specific to the subject matter of the sample documents; splitting the checked and sample documents into structural units; creating a vector model of each fragment of each checked and sample text document; comparing the vector model of each fragment of the checked document with the vector model of each fragment of the sample document; determining a measure of similarity between each fragment of the checked document and each fragment of the sample document; and presenting a field of similarity measure values so that an expert can make a decision.
PCT/RU2022/000134 2022-04-20 2022-04-20 Method for analyzing a legal document WO2023204724A1 (fr)
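
The comparison steps listed in the abstract (splitting documents into fragments, building a vector for each fragment, and comparing every checked fragment with every sample fragment) can be sketched as follows. This is a hedged illustration only. The source does not name a similarity measure or a splitting rule, so cosine similarity, term-count vectors, and splitting on blank lines are assumptions, as are all function names.

```python
import math
import re
from collections import Counter


def split_into_fragments(document):
    """Split a document into fragments on blank lines (assumed structural unit)."""
    return [part.strip() for part in re.split(r"\n\s*\n", document) if part.strip()]


def vectorize(fragment):
    """Represent a fragment as sparse term counts (stand-in for a vector model)."""
    return Counter(re.findall(r"\w+", fragment.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors (assumed measure)."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_field(checked_document, sample_document):
    """Return a matrix: rows are checked fragments, columns are sample fragments."""
    checked = [vectorize(f) for f in split_into_fragments(checked_document)]
    sample = [vectorize(f) for f in split_into_fragments(sample_document)]
    return [[cosine_similarity(c, s) for s in sample] for c in checked]
```

A row whose values are all low points to a checked fragment with no counterpart in the sample, and a column whose values are all low points to a sample clause that the checked document appears to lack. Either case is the kind of discrepancy the description leaves to expert assessment.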

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2022/000134 WO2023204724A1 (fr) 2022-04-20 2022-04-20 Method for analyzing a legal document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2022/000134 WO2023204724A1 (fr) 2022-04-20 2022-04-20 Method for analyzing a legal document

Publications (1)

Publication Number Publication Date
WO2023204724A1 (fr) 2023-10-26

Family

ID=88420220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2022/000134 WO2023204724A1 (fr) 2022-04-20 2022-04-20 Method for analyzing a legal document

Country Status (1)

Country Link
WO (1) WO2023204724A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
RU2538303C1 (ru) * 2013-08-07 2015-01-10 Александр Александрович Харламов Method for automated semantic comparison of texts in natural language
RU2597163C2 (ru) * 2014-11-06 2016-09-10 Общество с ограниченной ответственностью "Аби Девелопмент" Comparing documents using a trusted source
CN106776503A (zh) * 2016-12-22 2017-05-31 东软集团股份有限公司 Method and device for determining text semantic similarity
RU2643438C2 (ru) * 2013-12-25 2018-02-01 Общество с ограниченной ответственностью "Аби Продакшн" Detection of linguistic ambiguity in a text
RU2722571C1 (ru) * 2017-05-27 2020-06-01 Чайна Юниверсити Оф Майнинг Энд Текнолоджи Method for recognizing named entities in network text based on probability disambiguation in a neural network

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22938670

Country of ref document: EP

Kind code of ref document: A1