CN118070783A - Text intelligent correction method, system and equipment based on large language model - Google Patents
- Publication number
- CN118070783A
- Authority
- CN
- China
- Prior art keywords
- text
- entity
- language model
- user
- large language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an intelligent text proofreading method, system, and device based on a large language model, relating to the technical field of text detection. The method comprises the following steps: acquiring an input text and segmenting it; converting the segmentation result into vectors using a vectorization model, and querying a vector database with those vectors to retrieve related fact text; extracting entities from the input text; combining the entities pairwise, constructing prompts based on the input text and the fact text, calling a large language model to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples; for the multi-tuples, calling the large language model to generate questions, yielding a question set; and locating erroneous entities and sentences based on the fact text and the question set, calling the large language model to correct the errors, and displaying the corresponding factual information to the user. The invention thereby addresses text quality problems.
Description
Technical Field
The invention relates to the technical field of text detection, and in particular to an intelligent text proofreading method, system, and device based on a large language model.
Background
With the development of Internet technology and the growing informatization of society, the volume of text data has grown explosively. Owing to its rich information content and wide range of applications, text has become an important medium through which people acquire knowledge, express views, and convey information. However, when people write and edit text, various errors, such as misspelled characters, wrong words, and poor word choice, may occur for reasons including the writer's language proficiency, cognitive ability, and the limitations of the input device. These errors not only degrade the quality of the text but also prevent readers from accurately understanding and effectively using the information it carries. Faced with a large volume of erroneous text, traditional manual proofreading cannot keep up with ever-growing demand: the workload is huge, and the process is time-consuming, labor-intensive, and inefficient.
Therefore, researching and designing a computer system that can automatically detect and correct text errors has great theoretical significance and practical value. With the development of artificial intelligence and deep learning, text error detection has become a technology with machine learning at its core, used mainly for in-depth analysis and error correction of text. In modern society, large-scale text processing has become a necessity, and much of this text contains grammatical and spelling errors, so a powerful text error detection technique is needed to improve the quality and accuracy of text information. In essence, text error detection can be regarded as a natural language processing (NLP) problem that uses machine learning and deep learning techniques to predict, detect, and correct errors in different application scenarios. The relevant technical background includes, but is not limited to, machine learning (e.g., decision trees, random forests, logistic regression, and support vector machines), deep learning (e.g., neural networks, long short-term memory (LSTM) networks, convolutional neural networks (CNNs), and variational autoencoders (VAEs)), and various natural language processing techniques (including, but not limited to, language models, word segmentation, word sense disambiguation, and syntactic analysis).
In recent years, with the rapid development of deep learning and machine learning, text error detection has also made significant progress. For example, Transformer-based models (e.g., BERT and GPT) perform well in text error detection. Moreover, some models, such as the pre-trained model BERT, can take sentence context into account and perform fine-grained detection and correction. Despite these advances, text error detection still faces challenges, such as the diversity of error types, the scarcity of high-quality annotated data, and the detection of domain-specific errors. In general, as text information becomes increasingly complex and voluminous, text error detection remains a problem worthy of in-depth research.
Disclosure of Invention
In view of the needs and shortcomings of the prior art, the invention provides an intelligent text proofreading method, system, and device based on a large language model, in order to improve text quality, enhance the effectiveness of information communication, reduce the workload of manual proofreading, and improve the efficiency of text processing.
In a first aspect, the invention provides an intelligent text proofreading method based on a large language model. The technical solution adopted to solve the above technical problem is as follows:
An intelligent text proofreading method based on a large language model comprises the following steps:
acquiring a user input text, and segmenting the user input text;
converting the segmentation result into vectors using a vectorization model, and querying a vector database with the resulting vectors to retrieve related fact text;
performing entity extraction on the user input text to obtain an entity set;
combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
for the multi-tuples, calling the large language model to generate questions, yielding a question set;
constructing prompts based on the fact text and the question set, using the prompts to locate erroneous entities and sentences, calling the large language model to correct the errors, and displaying the corresponding factual information to the user.
Optionally, the NLTK toolkit is used to segment the user input text, yielding paragraph-level text and then sentence-level text, after which the sentence-level text is normalized.
Optionally, a named entity recognition model is used to perform entity extraction on the user input text to obtain the entity set.
Further optionally, a named entity recognition model is used to perform entity extraction on the user input text, and a useless-entity library is then used to filter out useless entities, yielding the entity set.
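The optional segmentation and normalization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the text names NLTK, whose `sent_tokenize` function could replace the sentence splitter, but a naive regular-expression splitter is used here so that the example runs without downloading tokenizer data. The function names (`split_paragraphs`, `split_sentences`, `normalize`, `preprocess`) and the choice of NFKC folding plus whitespace collapsing as the normalization are assumptions made for illustration.

```python
import re
import unicodedata


def split_paragraphs(text: str) -> list[str]:
    """Paragraph-level text: split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Sentence-level text: naive split after ., !, ? or their CJK forms.

    Assumption: a real system would use a trained splitter such as NLTK's;
    this regex also splits inside decimals like "3.14".
    """
    parts = re.split(r"(?<=[.!?。！？])\s*", paragraph)
    return [s for s in parts if s]


def normalize(sentence: str) -> str:
    """Assumed normalization: NFKC-fold width variants, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", sentence)
    return re.sub(r"\s+", " ", folded).strip()


def preprocess(text: str) -> list[str]:
    """User input text -> paragraphs -> sentences -> normalized sentences."""
    return [
        normalize(sentence)
        for paragraph in split_paragraphs(text)
        for sentence in split_sentences(paragraph)
    ]
```

Calling `preprocess` on a two-paragraph input returns a flat list of normalized sentences, which is the form the later retrieval step would consume.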
In a second aspect, the invention provides an intelligent text proofreading system based on a large language model. The technical solution adopted to solve the above technical problem is as follows:
An intelligent text proofreading system based on a large language model, comprising:
a text preprocessing module, used for acquiring a user input text and segmenting and normalizing the user input text;
a text retrieval module, used for converting sentence-level text into vectors using a vectorization model and querying a vector database to retrieve related fact text;
an entity extraction module, used for performing entity extraction on the user input text to obtain an entity set;
an entity relationship construction module, used for combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
a question generation module, used for calling the large language model to generate questions for the multi-tuples, yielding a question set;
an error detection and correction module, used for constructing prompts based on the fact text and the question set, using the prompts to locate erroneous entities and sentences, calling the large language model to correct the errors, and displaying the corresponding factual information to the user.
Optionally, the text preprocessing module uses the NLTK toolkit to segment the user input text, yielding paragraph-level text and then sentence-level text, and then normalizes the sentence-level text.
Optionally, the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text to obtain the entity set.
Further optionally, the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text and the fact text, and then uses a useless-entity library to filter out useless entities, yielding the entity set.
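The text retrieval module described above can be illustrated with a small, self-contained sketch. The text does not name a vectorization model or a vector database, so both are replaced here by assumed stand-ins: `embed` is a toy bag-of-words vectorizer and `InMemoryVectorStore` is a hypothetical in-memory index ranked by cosine similarity. A real deployment would presumably substitute a learned embedding model and a dedicated vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for the vectorization model: a lowercase bag of words."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for the vector database of fact text."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str]] = []

    def add(self, fact: str) -> None:
        """Vectorize a fact and store it alongside its text."""
        self._rows.append((embed(fact), fact))

    def search(self, sentence: str, k: int = 1) -> list[str]:
        """Return the k stored facts most similar to the sentence."""
        query = embed(sentence)
        ranked = sorted(
            self._rows, key=lambda row: cosine(query, row[0]), reverse=True
        )
        return [fact for _, fact in ranked[:k]]
```

Each normalized sentence from the preprocessing module would be passed to `search`, and the returned facts would serve as the fact text used by the later modules.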
In a third aspect, the invention provides a computing device. The technical solution adopted to solve the above technical problem is as follows:
A computing device comprising a memory and a processor, the memory storing executable code, wherein the processor, when executing the executable code, implements the method of the first aspect.
Compared with the prior art, the intelligent text proofreading method, system, and device based on a large language model have the following beneficial effects:
(1) The invention addresses text quality problems without requiring large amounts of manually collected data, and offers simplicity, efficiency, easy maintenance, and broad applicability;
(2) The invention provides an efficient, low-cost method of error detection and correction that gives users convenient and accurate text error detection, is readily extensible, can be applied to different fields and scenarios, and offers low training cost, easy maintenance, and high accuracy.
Drawings
Fig. 1 is a block diagram of a second embodiment of the present invention.
Detailed Description
To make the technical solution, the technical problems to be solved, and the technical effects of the invention clearer, the technical solution is described clearly and completely below with reference to specific embodiments.
Embodiment one:
Referring to Fig. 1, this embodiment provides an intelligent text proofreading method based on a large language model, comprising the following steps:
S1, acquiring a user input text, segmenting the user input text with the NLTK toolkit to obtain paragraph-level text and then sentence-level text, and then normalizing the sentence-level text;
S2, converting the normalized sentence-level text into vectors using a vectorization model, and querying a vector database with the resulting vectors to retrieve related fact text;
S3, extracting entities from the user input text using a named entity recognition (NER) model, and filtering out useless entities using a useless-entity library to obtain an entity set;
S4, combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model (LLM) to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct quintuples;
S5, for the quintuples, calling the LLM to generate questions, yielding a question set;
S6, constructing prompts based on the fact text and the question set, using the prompts to locate erroneous entities and sentences, calling the LLM to correct the errors, and displaying the corresponding factual information to the user.
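The entity filtering, pairwise combination, and tuple construction in the steps above can be sketched as follows. Several parts are assumptions made for illustration. The LLM relation judgment is replaced by a caller-supplied function, `judge_relation`, that returns a relation label or `None`. The text does not spell out the matching rule that turns triples into quintuples, so the sketch assumes two triples are chained when the tail of one equals the head of the other. All function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Iterable, Optional

Triple = tuple[str, str, str]                # (head, relation, tail)
Quintuple = tuple[str, str, str, str, str]   # (head, rel1, mid, rel2, tail)


def filter_entities(entities: Iterable[str], useless: set[str]) -> list[str]:
    """Drop entities found in the useless-entity library, and duplicates."""
    kept: list[str] = []
    seen: set[str] = set()
    for entity in entities:
        if entity.lower() in useless or entity in seen:
            continue
        seen.add(entity)
        kept.append(entity)
    return kept


def build_triples(
    entities: list[str],
    judge_relation: Callable[[str, str], Optional[str]],
) -> list[Triple]:
    """Traverse every unordered entity pair and keep pairs with a relation.

    judge_relation stands in for the LLM call, which in the described
    method is prompted with the user input text and the fact text.
    """
    triples: list[Triple] = []
    for head, tail in combinations(entities, 2):
        relation = judge_relation(head, tail)
        if relation is not None:
            triples.append((head, relation, tail))
    return triples


def chain_quintuples(triples: list[Triple]) -> list[Quintuple]:
    """Assumed rule: join two triples when one's tail is the other's head."""
    return [
        (h1, r1, t1, r2, t2)
        for (h1, r1, t1) in triples
        for (h2, r2, t2) in triples
        if t1 == h2
    ]
```

Because `combinations` visits each pair once in list order, a stand-in judge that cares about direction would need to check both orderings itself; the sketch leaves that to the caller.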
Embodiment two:
Referring to Fig. 1, this embodiment provides an intelligent text proofreading system based on a large language model, comprising:
a text preprocessing module, used for acquiring a user input text, segmenting the user input text with the NLTK toolkit to obtain paragraph-level text and then sentence-level text, and then normalizing the sentence-level text;
a text retrieval module, used for converting the normalized sentence-level text into vectors using a vectorization model and querying a vector database to retrieve related fact text;
an entity extraction module, used for extracting entities from the user input text using a named entity recognition (NER) model, and filtering out useless entities using a useless-entity library to obtain an entity set;
an entity relationship construction module, used for combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model (LLM) to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct quintuples;
a question generation module, used for calling the LLM to generate questions for the quintuples, yielding a question set;
an error detection and correction module, used for constructing prompts based on the fact text and the question set, using the prompts to locate erroneous entities and sentences, calling the LLM to correct the errors, and displaying the corresponding factual information to the user.
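The last two modules listed above both work by building prompts and calling the LLM. A minimal sketch of that flow is given below, under clear assumptions: the prompt wording is invented for illustration (the text does not disclose the actual prompts), the LLM is any callable that maps a prompt string to a reply string, and the names `build_question_prompt`, `build_check_prompt`, and `proofread` are hypothetical.

```python
from typing import Callable, Sequence

Quintuple = tuple[str, str, str, str, str]
LLM = Callable[[str], str]  # assumed interface: prompt in, reply text out


def build_question_prompt(quintuple: Quintuple) -> str:
    """Prompt asking the LLM to turn one quintuple into a question."""
    head, rel1, mid, rel2, tail = quintuple
    return (
        "Write one question that checks whether this chain of facts is true.\n"
        f"Chain: {head} -[{rel1}]-> {mid} -[{rel2}]-> {tail}"
    )


def build_check_prompt(
    sentence: str, facts: Sequence[str], questions: Sequence[str]
) -> str:
    """Prompt combining fact text and the question set for error detection."""
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    question_block = "\n".join(f"- {question}" for question in questions)
    return (
        "Using only the facts below, answer each question, name any entity "
        "in the sentence that contradicts the facts, and give a corrected "
        "sentence.\n"
        f"Facts:\n{fact_block}\n"
        f"Questions:\n{question_block}\n"
        f"Sentence: {sentence}"
    )


def proofread(
    sentence: str,
    facts: Sequence[str],
    quintuples: Sequence[Quintuple],
    llm: LLM,
) -> str:
    """Generate one question per quintuple, then ask for a correction."""
    questions = [llm(build_question_prompt(q)) for q in quintuples]
    return llm(build_check_prompt(sentence, facts, questions))
```

Keeping the LLM behind a plain callable makes the flow easy to exercise with a canned stub before wiring in a real model client.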
In a third aspect, an embodiment of the invention provides a computing device comprising a memory and a processor, the memory storing executable code, wherein the processor, when executing the executable code, implements the method of embodiment one.
It will be appreciated that, for the explanation, specific implementation, beneficial effects, and examples relating to the computing device provided by this embodiment, reference may be made to the corresponding parts of the method provided in the first aspect, which are not repeated here.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the corresponding parts of the method embodiments.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium.
The foregoing embodiments illustrate the general principles of the invention in further detail and are not to be construed as limiting its scope; any modifications, equivalents, improvements, and the like made on the basis of the teachings of the invention are intended to fall within its scope.
Claims (9)
1. An intelligent text proofreading method based on a large language model, characterized by comprising the following steps:
acquiring a user input text, and segmenting the user input text;
converting the segmentation result into vectors using a vectorization model, and querying a vector database with the resulting vectors to retrieve related fact text;
performing entity extraction on the user input text to obtain an entity set;
combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
for the multi-tuples, calling the large language model to generate questions, yielding a question set;
constructing prompts based on the fact text and the question set, using the prompts to locate erroneous entities and sentences, calling the large language model to correct the errors, and displaying the corresponding factual information to the user.
2. The intelligent text proofreading method based on a large language model according to claim 1, wherein the NLTK toolkit is used to segment the user input text, yielding paragraph-level text and then sentence-level text, after which the sentence-level text is normalized.
3. The intelligent text proofreading method based on a large language model according to claim 1, wherein a named entity recognition model is used to perform entity extraction on the user input text to obtain the entity set.
4. The intelligent text proofreading method based on a large language model according to claim 3, wherein a named entity recognition model is used to perform entity extraction on the user input text, and a useless-entity library is used to filter out useless entities, yielding the entity set.
5. An intelligent text proofreading system based on a large language model, comprising:
a text preprocessing module, used for acquiring a user input text and segmenting and normalizing the user input text;
a text retrieval module, used for converting sentence-level text into vectors using a vectorization model and querying a vector database to retrieve related fact text;
an entity extraction module, used for performing entity extraction on the user input text to obtain an entity set;
an entity relationship construction module, used for combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relationship between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
a question generation module, used for calling the large language model to generate questions for the multi-tuples, yielding a question set;
an error detection and correction module, used for constructing prompts based on the fact text and the question set, using the prompts to locate erroneous entities and sentences, calling the large language model to correct the errors, and displaying the corresponding factual information to the user.
6. The intelligent text proofreading system based on a large language model according to claim 5, wherein the text preprocessing module uses the NLTK toolkit to segment the user input text, yielding paragraph-level text and then sentence-level text, and then normalizes the sentence-level text.
7. The intelligent text proofreading system based on a large language model according to claim 5, wherein the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text to obtain the entity set.
8. The intelligent text proofreading system based on a large language model according to claim 7, wherein the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text and the fact text, and then uses a useless-entity library to filter out useless entities, yielding the entity set.
9. A computing device comprising a memory and a processor, the memory storing executable code, wherein the processor, when executing the executable code, performs the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410182006.9A CN118070783A (en) | 2024-02-19 | 2024-02-19 | Text intelligent correction method, system and equipment based on large language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410182006.9A CN118070783A (en) | 2024-02-19 | 2024-02-19 | Text intelligent correction method, system and equipment based on large language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118070783A (en) | 2024-05-24 |
Family
ID=91096603
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410182006.9A (Pending) CN118070783A (en) | 2024-02-19 | 2024-02-19 | Text intelligent correction method, system and equipment based on large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118070783A (en) |
- 2024-02-19: CN application CN202410182006.9A (publication CN118070783A), status: active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lei et al. | Re-examining the Role of Schema Linking in Text-to-SQL | |
CN110096570B (en) | Intention identification method and device applied to intelligent customer service robot | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN110276071B (en) | Text matching method and device, computer equipment and storage medium | |
CN110765277B (en) | Knowledge-graph-based mobile terminal online equipment fault diagnosis method | |
CN112069826A (en) | Vertical domain entity disambiguation method fusing topic model and convolutional neural network | |
CN110427612A (en) | Based on multilingual entity disambiguation method, device, equipment and storage medium | |
CN115357719A (en) | Power audit text classification method and device based on improved BERT model | |
CN113593661A (en) | Clinical term standardization method, device, electronic equipment and storage medium | |
CN112347339A (en) | Search result processing method and device | |
CN112183102A (en) | Named entity identification method based on attention mechanism and graph attention network | |
CN111831624A (en) | Data table creating method and device, computer equipment and storage medium | |
CN113742446A (en) | Knowledge graph question-answering method and system based on path sorting | |
CN118296120A (en) | Large-scale language model retrieval enhancement generation method for multi-mode multi-scale multi-channel recall | |
Zhang et al. | Refsql: A retrieval-augmentation framework for text-to-sql generation | |
CN112084788A (en) | Automatic marking method and system for implicit emotional tendency of image captions | |
CN106776590A (en) | A kind of method and system for obtaining entry translation | |
CN116680407A (en) | Knowledge graph construction method and device | |
Shen et al. | Evaluating Code Summarization with Improved Correlation with Human Assessment | |
CN116090450A (en) | Text processing method and computing device | |
CN114239555A (en) | Training method of keyword extraction model and related device | |
CN118070783A (en) | Text intelligent correction method, system and equipment based on large language model | |
CN114722821A (en) | Text matching method and device, storage medium and electronic equipment | |
CN114780700A (en) | Intelligent question-answering method, device, equipment and medium based on machine reading understanding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||