WO2020100018A1 - System and method for artificial intelligence-based proof reader for documents
- Publication number: WO2020100018A1 (PCT/IB2019/059690)
- Authority: WO (WIPO, PCT)
- Prior art keywords: sentences, positive, negative, machine learning, text
Classifications
- G06F40/205: Handling natural language data; Natural language analysis; Parsing
- G06F40/30: Handling natural language data; Semantic analysis
Definitions
- Embodiments of the present disclosure relate to proofreading documents, and more particularly to a system and method for an artificial intelligence-based proof reader for documents.
- Research reports are usually prepared by a research analyst, and sometimes contain errors, especially errors which might jeopardise the integrity of a company or wrongly influence the readers.
- News agencies nowadays suffer from the circulation of fake news and inflammatory news, which may disrupt public law and order.
- A system for an artificial intelligence-based proof reader for documents includes a machine learning module, which includes a machine learning classifier and is configured to receive a digital document.
- The machine learning classifier is also configured to identify at least one of one or more positive sentences and one or more negative sentences present in the digital document.
- The system also includes a shallow parser module configured to receive the one or more negative sentences from the machine learning module.
- The shallow parser is also configured to apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences.
- The shallow parser module is also configured to filter the one or more positive texts corresponding to a set of predefined patterns.
- The shallow parser module is also configured to highlight the filtered one or more positive texts.
- A method for an artificial intelligence-based proof reader for documents includes receiving, by a machine learning module, a digital document.
- The method also includes identifying, by the machine learning module, at least one of one or more positive sentences and one or more negative sentences present in the digital document.
- The method also includes receiving, by a shallow parser, the one or more negative sentences from the machine learning module.
- The method also includes applying, by the shallow parser, a set of predetermined rules to the one or more negative sentences to extract the one or more positive texts in the one or more negative sentences.
- The method also includes filtering, by the shallow parser, the validated one or more positive texts corresponding to a set of predefined patterns.
- The method also includes highlighting, by the shallow parser, the filtered one or more positive texts.
- FIG. 1 illustrates a block diagram of a system for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure.
- FIG. 2 is a schematic representation of an exemplary embodiment of the system for artificial intelligence-based proof reader for documents of FIG. 1 in accordance with an embodiment of the present disclosure.
- FIG. 3 illustrates a flowchart representing the steps involved in a method for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure.
- Embodiments of the present disclosure relate to a system for an artificial intelligence-based proof reader for documents.
- The system includes a machine learning module including a machine learning classifier and configured to receive a digital document.
- The machine learning classifier is also configured to identify at least one of one or more positive sentences and one or more negative sentences present in the digital document.
- The system also includes a shallow parser module configured to receive the one or more negative sentences from the machine learning module.
- The shallow parser is also configured to apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences.
- The shallow parser module is also configured to filter the one or more positive texts corresponding to a set of predefined patterns.
- The shallow parser module is also configured to highlight the filtered one or more positive texts.
- FIG. 1 is a block diagram of a system (100) for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure.
- The system (100) includes a machine learning module (110), wherein the machine learning module (110) includes a machine learning classifier (120).
- The machine learning classifier (120) may be based on supervised learning.
- The machine learning classifier (120) may include a binary classifier.
- The binary classifier may include a logistic regression classifier, a support vector machine (SVM) classifier, a neural network, a k-nearest neighbour (kNN) classifier or a Naive Bayes classifier.
- The machine learning classifier (120) is configured to receive a digital document.
- The machine learning classifier (120) evaluates the received digital document based on a historical report.
- The received digital document may include a financial document or a research report.
- The financial document may be prepared by research analysts or financial brokers.
- The historical report, in an offline environment, is created manually by research analysts and reviewed by supervisory analysts.
- The historical report includes review tracker changes along with a published document.
- The review tracker changes, along with the published document, provide the required sample data for training the binary classifier to identify positive and negative parts of sentences.
- The sample data obtained from the corrected text in the review tracker changes is categorized as positive samples, whereas the sample data obtained from the published text is categorized as negative samples.
- The positive and negative samples of the historical reports become the labels, or target values, used in prediction.
- The received digital document is evaluated against the labels of the historical reports through the generation of a machine learning model.
- The machine learning classifier (120) is also configured to identify at least one of one or more positive sentences and one or more negative sentences present in the received digital document.
- The sentences of the received digital document, or the dataset, are trained and classified using the generated machine learning model by splitting the dataset into a training set and a testing set.
- The training set is the subset used to train the generated model, and the testing set is the subset used to test the generated model.
- The training set and the testing set are created by splitting the dataset based on a predefined split ratio.
- After classification, the sentences of the received digital document are categorised into the one or more positive sentences and the one or more negative sentences.
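- The offline training and classification flow described above can be illustrated with a minimal sketch. The sketch below assumes scikit-learn with a TF-IDF representation of sentences feeding a linear SVM as the binary classifier; the example sentences, labels and split ratio are illustrative assumptions rather than part of the disclosure.

```python
# Minimal sketch of the offline training and classification step (assumed
# stack: scikit-learn; the sample sentences and labels are illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Positive samples: sentences flagged in the review tracker changes (label 1).
# Negative samples: sentences taken from the published document (label 0).
positive_sentences = ["The price of oil will go up.",
                      "Company A and Company B will merge."]
negative_sentences = ["We expect the price of oil to go up.",
                      "We believe the companies may merge this year."]

sentences = positive_sentences + negative_sentences
labels = [1] * len(positive_sentences) + [0] * len(negative_sentences)

# Split the dataset into a training set and a testing set (predefined ratio).
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.3, random_state=42, stratify=labels)

# Binary classifier: TF-IDF features feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)

# Classify the sentences of a newly received digital document.
print(model.predict(["The benchmark will go up by 10%."]))  # e.g. [1] -> positive
```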
- The one or more positive sentences include a promissory text, a political text, an inflated text, a fact without a source, a conflicting text and a speculative text.
- The one or more positive texts may represent an incorrect text.
- The one or more negative sentences include a non-conflicting text and a non-speculative text.
- The one or more positive sentences and the one or more negative sentences which are classified by the machine learning classifier (120) are evaluated by using a precision and a recall technique.
- Precision is defined as the fraction of the retrieved samples that are relevant.
- The precision is evaluated as the number of true positive samples divided by the number of true positive samples plus the number of false positive samples.
- Recall is defined as the fraction of the relevant samples that are retrieved.
- The recall is evaluated as the number of true positive samples divided by the number of true positive samples plus the number of false negative samples.
- The true positive samples are the positive samples which are correctly classified as positive.
- The false positive samples are the samples for which a test result wrongly indicates that a particular condition or attribute is present.
- The false negative samples are the samples for which the test result wrongly indicates that the particular condition or attribute is absent.
- The machine learning model is configured to have a lower review threshold, so that no incorrect text is missed.
- The machine learning model will hence have a lower precision value but a higher recall value.
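- As a concrete illustration of the precision and recall definitions above, the hedged sketch below computes both metrics from true positive, false positive and false negative counts; the counts themselves are hypothetical.

```python
# Hedged sketch: precision and recall computed from hypothetical counts of
# true positives (TP), false positives (FP) and false negatives (FN).
def precision(tp: int, fp: int) -> float:
    # Fraction of retrieved (flagged) samples that are actually relevant.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of relevant samples that were actually retrieved (flagged).
    return tp / (tp + fn)

# Hypothetical counts for a model tuned with a low review threshold.
tp, fp, fn = 45, 30, 5
print(f"precision = {precision(tp, fp):.2f}")  # 0.60 -> lower precision
print(f"recall    = {recall(tp, fn):.2f}")     # 0.90 -> higher recall
```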
- The system (100) also includes a shallow parser module (130) configured to receive the one or more negative sentences from the machine learning module (110).
- The shallow parser module (130) identifies text based on language construct.
- The shallow parser module (130) analyses a sentence, identifies constituent parts of the sentence such as nouns, verbs, adjectives, determiners, modals or the like, and then links such parts of the sentence to units with discrete grammatical meanings such as noun groups or phrases or verb groups.
- The shallow parser module (130) is also configured to apply a set of predetermined rules, for further validation, to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences.
- The set of predetermined rules may include one or more chunking rules.
- The one or more chunking rules may be configured to identify the one or more positive texts and the one or more negative texts.
- A chunking rule may be defined for identifying the one or more positive texts from a sentence such as 'The price of oil will go up'.
- The chunking rule for identifying the one or more positive texts may include a rule represented as (<DT>?<NN.*><IN><NN.*><MD><VB><RP>?), wherein DT represents a determiner, NN represents a noun which may be singular or plural, IN represents a preposition, MD represents a modal, VB represents a verb and RP represents a particle.
- Another chunking rule may be defined for identifying the one or more positive texts from a sentence such as 'Company A and Company B will merge and the benchmark to go up by 10%'.
- The chunking rule for identifying the one or more positive texts may include a rule represented as (<NN.*>*<MD><RB>?<VB>), wherein NN represents a noun which may be singular or plural, MD represents a modal, RB represents an adverb and VB represents a verb.
- A further chunking rule may be defined for identifying the one or more positive texts from a sentence such as 'The market is overly optimistic on its future growth'.
- The chunking rule may include a pattern represented as (<NN><VBZ><JJ><IN>), wherein NN represents a noun, VBZ represents a verb in the third person singular present, JJ represents an adjective and IN represents a preposition.
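- The chunking rules quoted above follow the grammar syntax of an NLTK-style regular-expression chunk parser over Penn Treebank part-of-speech tags. The sketch below is an assumed illustration using NLTK's RegexpParser; it reproduces the three rules as one cascaded grammar and applies it to the first example sentence.

```python
# Hedged sketch: the chunking rules above expressed as an NLTK RegexpParser
# grammar (assumes nltk with the 'punkt' and POS-tagger data downloaded).
import nltk

grammar = r"""
PositiveText:
    {<DT>?<NN.*><IN><NN.*><MD><VB><RP>?}   # 'The price of oil will go up'
    {<NN.*>*<MD><RB>?<VB>}                 # '... will merge ... to go up by 10%'
    {<NN><VBZ><JJ><IN>}                    # 'The market is overly optimistic on ...'
"""
chunker = nltk.RegexpParser(grammar)

sentence = "The price of oil will go up"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # [('The', 'DT'), ('price', 'NN'), ...]
tree = chunker.parse(tagged)

# Collect the spans matched by the chunk rules (the candidate positive texts).
for subtree in tree.subtrees(filter=lambda t: t.label() == "PositiveText"):
    print(" ".join(word for word, tag in subtree.leaves()))
```

- With a standard Penn Treebank tagging of the example sentence, the first rule matches and the printed chunk contains the modal 'will', which is the positive text the disclosure singles out.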
- The shallow parser module (130) is also configured to filter the one or more positive texts corresponding to a set of predefined patterns.
- Upon matching the sentence with the corresponding chunking rule, the shallow parser module (130) filters the one or more positive texts, for further validation, corresponding to the set of predefined patterns.
- The set of predefined patterns may include at least one of one or more permitted patterns and at least one of one or more non-permitted patterns.
- The at least one of one or more permitted patterns may include phrases such as 'we expect', 'we believe', 'maintain', 'may' or 'could'.
- The at least one of one or more non-permitted patterns may include phrases such as 'will' or 'should'.
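- A minimal sketch of this filtering step follows; the permitted and non-permitted phrase lists reproduce the examples above, while the function name and the simple substring matching are illustrative assumptions.

```python
# Hedged sketch: filter extracted positive texts against the predefined
# patterns. The phrase lists reproduce the examples above; the function name
# and substring matching are illustrative assumptions.
PERMITTED = ["we expect", "we believe", "maintain", "may", "could"]
NON_PERMITTED = ["will", "should"]

def filter_positive_texts(texts):
    """Keep texts that contain a non-permitted phrase and no permitted phrase."""
    flagged = []
    for text in texts:
        lowered = text.lower()
        if any(p in lowered for p in NON_PERMITTED) and not any(p in lowered for p in PERMITTED):
            flagged.append(text)
    return flagged

print(filter_positive_texts(["The price of oil will go up",
                             "We expect the price of oil will go up"]))
# -> ['The price of oil will go up']
```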
- The shallow parser module (130) is also configured to highlight the filtered one or more positive texts.
- The filtered one or more positive texts are the texts with issues.
- The texts with issues may include the conflicting text or the speculative text.
- FIG. 2 is a schematic representation of an exemplary embodiment (200) of the system (100) for artificial intelligence-based proof reader for documents of FIG. 1 in accordance with an embodiment of the present disclosure.
- One or more modules of the system for artificial intelligence-based proof reader for documents of FIG. 2 are substantially similar to the one or more modules of the system for artificial intelligence-based proof reader for documents of FIG. 1.
- The system of FIG. 2 follows a hybrid approach which includes a machine learning approach as well as a shallow parser for identifying conflicting texts.
- An offline process includes creating a machine learning model based on a historical report. The historical report, in an offline environment, is created manually by research analysts and reviewed by supervisory analysts.
- The historical report includes review tracker changes along with a published document.
- The review tracker changes, along with the published document, provide the required sample data for training the binary classifier to identify positive and negative parts of sentences.
- The sample data obtained from the text flagged in the review tracker changes is categorized as positive samples, whereas the sample data obtained from the published text is categorized as negative samples.
- The positive and negative samples of the historical reports become the labels, or target values, used in prediction.
- A shallow parser rules database is created manually in the offline process to receive the one or more negative sentences.
- The received one or more negative sentences are further validated against a manually created set of permitted phrases and a set of non-permitted phrases.
- A real-time process of the hybrid approach includes receiving and reading a digital document by a machine learning module.
- The received digital document, for example a financial document of an organisation, is read and evaluated by a binary machine learning classifier, such as a support vector machine (SVM) classifier, based on the historical report.
- The SVM classifier classifies one or more sentences of the financial document into one or more positive sentences or one or more negative sentences.
- The sentences of the financial document, which form the dataset for the machine learning model, are split into a training set and a testing set based on a predefined split ratio.
- The split ratio considered for the classification is 70:30, wherein 70 percent of the dataset is the training set and 30 percent of the dataset is the testing set.
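- A hedged sketch of this real-time classification step follows. It assumes the TF-IDF plus linear SVM pipeline `model` from the earlier training sketch and uses NLTK sentence tokenisation; the example document is illustrative.

```python
# Hedged sketch: real-time classification of a received document's sentences,
# reusing the `model` pipeline from the earlier training sketch (assumption).
import nltk

document = ("We believe margins may improve next quarter. "
            "The price of oil will go up.")

sentences = nltk.sent_tokenize(document)   # read and split the received document
predictions = model.predict(sentences)     # 1 = positive sentence, 0 = negative sentence

positive = [s for s, p in zip(sentences, predictions) if p == 1]
negative = [s for s, p in zip(sentences, predictions) if p == 0]

# The negative sentences are passed on to the shallow parser for rule-based checks.
print("to shallow parser:", negative)
```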
- The one or more positive sentences include a promissory text, a political text, an inflated text, a fact without a source, a conflicting text or a speculative text.
- The one or more negative sentences include a non-conflicting text and a non-speculative text.
- The one or more positive sentences and the one or more negative sentences are, after the classification, evaluated by using a precision and a recall technique.
- The precision here calculates the exactness or accuracy of the samples of the dataset which are predicted correctly.
- The precision calculates how many of the selected samples were correctly predicted.
- The recall calculates how many of the samples that should have been selected were actually selected. For example, the recall identifies the number of relevant samples which are retrieved correctly.
- The machine learning model needs to have a lower precision value but a higher recall value in order to ensure that no review content or incorrect text is missed and goes into the published document. The shallow parser, by contrast, has a higher precision value and a lower recall value.
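- One assumed way to realise this recall-over-precision preference is to lower the classifier's decision threshold, as sketched below; the `model` pipeline comes from the earlier training sketch and the threshold value is purely illustrative.

```python
# Hedged sketch: trade precision for recall by lowering the review threshold.
# Assumes `model` is the TF-IDF + LinearSVC pipeline from the earlier sketch;
# the threshold value of -0.25 is purely illustrative.
REVIEW_THRESHOLD = -0.25   # the default SVM decision boundary would be 0.0

def flag_for_review(sentences):
    """Flag sentences as positive (needing review) at a deliberately low threshold."""
    scores = model.decision_function(sentences)
    return [s for s, score in zip(sentences, scores) if score > REVIEW_THRESHOLD]

print(flag_for_review(["The benchmark should go up by 10%."]))
# Borderline sentences are now included, raising recall at the cost of precision.
```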
- The one or more negative sentences, after the classification, are passed to the shallow parser for identifying text based on language construct.
- The shallow parser analyses the one or more sentences and parses the sentences based on part-of-speech (POS) tagging.
- The shallow parser analyses a sentence, identifies constituent parts of the sentence such as nouns, verbs, adjectives, determiners, modals or the like, and then links such parts of the sentence to units with discrete grammatical meanings such as noun groups or phrases or verb groups.
- A negative sentence may be 'The price of oil will go up'.
- The shallow parser also applies a set of chunking rules on the abovementioned negative sentence for further validation to extract one or more positive texts.
- The set of chunking rules includes the chunking rule for identifying the one or more positive texts and the chunking rule for identifying the one or more negative texts.
- The chunking rule for such a sentence may be defined as (<DT>?<NN.*><IN><NN.*><MD><VB><RP>?), wherein DT represents a determiner, NN represents a noun which may be singular or plural, MD represents a modal, VB represents a verb and RP represents a particle. So, here the positive text is 'will'.
- The positive text is then again filtered by the shallow parser, for further validation, corresponding to a set of predefined patterns.
- The set of predefined patterns may include at least one of one or more permitted patterns and at least one of one or more non-permitted patterns.
- The at least one of one or more permitted patterns may include phrases such as 'we expect', 'we believe', 'maintain', 'may' or 'could'.
- The at least one of one or more non-permitted patterns may include phrases such as 'will' or 'should'.
- When the shallow parser filters the positive text corresponding to the set of predefined patterns, the positive text, for example 'will', gets highlighted and one or more suggestions for permitting the sentence may be shown, such as 'We expect, the price of oil will go up'.
- Here, 'we expect' is the permitted pattern and the corresponding sentence becomes negative, that is, a sentence without issues or a non-conflicting text, which when passed to the shallow parser again will be discarded.
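- A short sketch of the highlighting and suggestion behaviour follows; the markup used for highlighting and the mapping from non-permitted phrases to suggested permitted prefixes are illustrative assumptions, not part of the disclosure.

```python
# Hedged sketch: highlight the flagged positive text and suggest a permitted
# rephrasing. The highlight markup and the suggestion mapping are assumptions.
SUGGESTED_PREFIX = {"will": "We expect", "should": "We believe"}

def highlight_and_suggest(sentence: str, positive_text: str):
    highlighted = sentence.replace(positive_text, f"**{positive_text}**")
    prefix = SUGGESTED_PREFIX.get(positive_text.lower())
    suggestion = f"{prefix}, {sentence[0].lower()}{sentence[1:]}" if prefix else None
    return highlighted, suggestion

highlighted, suggestion = highlight_and_suggest("The price of oil will go up", "will")
print(highlighted)   # The price of oil **will** go up
print(suggestion)    # We expect, the price of oil will go up
```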
- FIG. 3 illustrates a flowchart representing the steps involved in a method (300) for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure.
- The method (300) includes receiving, by a machine learning module, a digital document in step 310.
- The method (300) includes receiving, by the machine learning module, the digital document such as a financial report or a research report.
- Upon receiving the digital document, the machine learning module is configured to evaluate the received digital document based on a historical report.
- The method (300) also includes identifying, by the machine learning module, at least one of one or more positive sentences and one or more negative sentences present in the digital document in step 320.
- Identifying, by the machine learning module, the at least one of one or more positive sentences and the one or more negative sentences may include identifying or classifying, by a machine learning classifier, the at least one of one or more positive sentences or the at least one of one or more negative sentences.
- Classification or identification of the at least one of one or more positive sentences and the at least one of one or more negative sentences of the received digital document includes training one or more sentences of the received digital document, or a dataset, and splitting the dataset into a training set and a testing set based on a split ratio, by a generated machine learning model.
- The one or more positive sentences include a promissory text, a political text, an inflated text, a fact without a source, a conflicting text and a speculative text.
- The one or more positive texts may represent an incorrect text.
- The one or more negative sentences include a non-conflicting text and a non-speculative text.
- The one or more positive sentences and the one or more negative sentences which are classified by the machine learning classifier are evaluated by using a precision and a recall technique.
- The precision is the fraction of relevant instances among the retrieved instances.
- The method (300) also includes receiving, by a shallow parser, the one or more negative sentences from the machine learning module in step 330.
- Receiving, by the shallow parser, the one or more negative sentences includes obtaining the one or more negative sentences from the machine learning module and identifying text based on language construct.
- Identifying the text may include analysing a sentence, identifying constituent parts of the sentence such as nouns, verbs, adjectives, determiners, modals or the like, and linking such parts of the sentence to units with discrete grammatical meanings such as noun groups or phrases or verb groups.
- The method (300) also includes applying, by the shallow parser, a set of predetermined rules to the one or more negative sentences to extract the one or more positive texts in the one or more negative sentences in step 340.
- Applying, by the shallow parser, the set of predetermined rules to the one or more negative sentences to extract the one or more positive texts may include applying the set of predetermined rules such as one or more chunking rules.
- The one or more chunking rules may be configured to identify the one or more positive texts and the one or more negative texts.
- The method (300) also includes filtering, by the shallow parser, the validated one or more positive texts corresponding to a set of predefined patterns in step 350.
- Filtering, by the shallow parser, the validated one or more positive texts corresponding to the set of predefined patterns may include filtering the one or more positive texts for further validation corresponding to the set of predetermined patterns, such as at least one of one or more permitted patterns or at least one of one or more non-permitted patterns.
- The at least one of one or more permitted patterns may include phrases such as 'we expect', 'we believe', 'maintain', 'may' or 'could'.
- The at least one of one or more non-permitted patterns may include phrases such as 'will' or 'should'.
- The method (300) also includes highlighting, by the shallow parser, the filtered one or more positive texts in step 360.
- Highlighting, by the shallow parser, the filtered one or more positive texts may include highlighting the texts with issues.
- The texts with the issues may include the conflicting text or the speculative text.
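- Putting steps 310 to 360 together, the hedged sketch below chains the helpers defined in the earlier sketches (`model`, `chunker`, `filter_positive_texts`, `highlight_and_suggest`) into a single routine; all of those names are illustrative assumptions.

```python
# Hedged sketch: the overall method (steps 310-360) chained into one routine,
# reusing `model`, `chunker`, `filter_positive_texts` and `highlight_and_suggest`
# from the earlier sketches (all illustrative assumptions).
import nltk

def proof_read(document: str):
    sentences = nltk.sent_tokenize(document)                    # step 310: receive/read
    predictions = model.predict(sentences)                      # step 320: classify
    negatives = [s for s, p in zip(sentences, predictions) if p == 0]

    findings = []
    for sentence in negatives:                                  # step 330: to shallow parser
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)                            # step 340: chunking rules
        modals = [word
                  for st in tree.subtrees(filter=lambda t: t.label() == "PositiveText")
                  for word, tag in st.leaves() if tag == "MD"]
        for text in filter_positive_texts(modals):              # step 350: filter patterns
            findings.append(highlight_and_suggest(sentence, text))  # step 360: highlight
    return findings

# Example call (output depends on the toy model trained earlier):
# proof_read("We believe margins may improve. Oil prices will likely rise.")
```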
- Various embodiments of the present disclosure enable identifying conflicting text within the digital document, which could otherwise lead to potential future litigation from institutional investors and retail clients, by using a hybrid approach of machine learning as well as a shallow parser.
- The presently disclosed system provides a solution which allows the digital document to be verified for issues based on the entire context rather than relying on pattern matching techniques alone.
Abstract
The present invention relates to a system for an artificial intelligence-based proof reader for documents. The system includes a machine learning module including a machine learning classifier and configured to receive a digital document and to identify one or more positive sentences and/or one or more negative sentences present in the digital document. The system also includes a shallow parser module configured to receive the one or more negative sentences from the machine learning module. The shallow parser is also configured to apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences. The shallow parser module is also configured to filter the one or more positive texts corresponding to a set of predefined patterns. The shallow parser module is also configured to highlight the filtered one or more positive texts.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN201841043008 | 2018-11-15 | ||
IN201841043008 | 2018-11-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020100018A1 (fr) | 2020-05-22 |
Family
ID=70730412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2019/059690 (WO2020100018A1, fr) | System and method for artificial intelligence-based proof reader for documents | 2018-11-15 | 2019-11-12 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020100018A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12021367B2 (en) | 2021-05-03 | 2024-06-25 | Via Labs, Inc. | Protection circuit and hub chip |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130246048A1 (en) * | 2012-03-19 | 2013-09-19 | Fujitsu Limited | Text proofreading apparatus and text proofreading method |
US20180300315A1 (en) * | 2017-04-14 | 2018-10-18 | Novabase Business Solutions, S.A. | Systems and methods for document processing using machine learning |
- 2019-11-12: PCT application PCT/IB2019/059690 filed (WO2020100018A1, French); status: active, Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230222366A1 (en) | Systems and methods for semantic analysis based on knowledge graph | |
US9645988B1 (en) | System and method for identifying passages in electronic documents | |
Glaser et al. | Classifying semantic types of legal sentences: Portability of machine learning models | |
KR20210086817A (ko) | 인공지능 기반의 투자지표 결정방법 및 그 시스템 | |
US20230028664A1 (en) | System and method for automatically tagging documents | |
Asogwa et al. | Hate speech classification using SVM and naive BAYES | |
CN112464670A (zh) | 识别方法、识别模型的训练方法、装置、设备、存储介质 | |
CN118013963B (zh) | 敏感词的识别和替换方法及其装置 | |
US20230016925A1 (en) | System and Method for Electronic Chat Production | |
Petrou et al. | A Multiple change-point detection framework on linguistic characteristics of real versus fake news articles | |
US11989677B2 (en) | Framework for early warning of domain-specific events | |
Tian et al. | Machine learning in internet financial risk management: A systematic literature review | |
WO2020100018A1 (fr) | Système et procédé pour correcteur de textes basé sur l'intelligence artificielle pour des documents | |
Bhadani et al. | Mining financial risk events from news and assessing their impact on stocks | |
Agrawal et al. | Hierarchical model for goal guided summarization of annual financial reports | |
Kamaruddin et al. | A text mining system for deviation detection in financial documents | |
Heidari et al. | Financial footnote analysis: developing a text mining approach | |
Pilankar et al. | Detecting violation of human rights via social media | |
Chaib et al. | Improved multi-label medical text classification using features cooperation | |
Oswal | Identifying and categorizing offensive language in social media | |
Yeom et al. | study of machine-learning classifier and feature set selection for intent classification of Korean tweets about food safety | |
Asooja et al. | Using semantic frames for automatic annotation of regulatory texts | |
Bhatti et al. | Benchmarking Performance of Document Level Classification and Topic Modeling | |
Spalenza et al. | Using ner+ ml to automatically detect fake news | |
Kos et al. | Classification in a Skewed Online Trade Fraud Complaint Corpus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19885203; Country of ref document: EP; Kind code of ref document: A1 |