CN118070783A - Text intelligent correction method, system and equipment based on large language model - Google Patents

Text intelligent correction method, system and equipment based on large language model

Info

Publication number
CN118070783A
Authority
CN
China
Prior art keywords
text
entity
language model
user
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410182006.9A
Other languages
Chinese (zh)
Inventor
黄登蓉
郭冬升
张其来
张思嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2024-02-19
Filing date
2024-02-19
Publication date
2024-05-24
Application filed by Shandong Inspur Science Research Institute Co Ltd
Priority to CN202410182006.9A
Publication of CN118070783A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent text proofreading method, system and device based on a large language model, and relates to the technical field of text detection. The method comprises the following steps: acquiring an input text and segmenting it; converting the segmentation result into vectors by using a vectorization model, and searching a vector database with the resulting vectors to obtain related fact text; extracting entities from the input text; combining the entities pairwise, constructing prompts based on the input text and the fact text, calling a large language model to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples; for the multi-tuples, calling a large language model to generate questions, obtaining a question set; and locating erroneous entities and sentences based on the fact text and the question set, calling a large language model to correct the errors, and displaying the corresponding factual information to the user. The invention effectively addresses the problem of text quality.

Description

Text intelligent correction method, system and equipment based on large language model
Technical Field
The invention relates to the technical field of text detection, and in particular to an intelligent text proofreading method, system and device based on a large language model.
Background
With the development of internet technology and the increasing informatization of society, the volume of text data has grown explosively. With its rich information content and wide range of applications, text has become an important medium through which people acquire knowledge, express views and convey information. However, for reasons such as an individual's language proficiency, cognitive ability and the limitations of input devices, various errors may occur when people write and edit text, such as wrongly written characters, incorrect words and inappropriate word choice. These errors not only lower the quality of the text but also prevent readers from understanding and using the information accurately. Faced with large volumes of erroneous text, traditional manual proofreading cannot keep up with ever-growing demand: the workload is huge, and the process is time-consuming, labor-intensive and inefficient.
Therefore, researching and designing a computer system that can automatically detect and correct text errors has significant theoretical and practical value. With the development of artificial intelligence and deep learning, text error detection has become a technology centered on machine learning, used mainly for in-depth analysis and correction of text. In modern society, large-scale text processing has become a necessity, and much of this text contains grammatical and spelling errors, so a powerful text error detection technique is needed to improve the quality and accuracy of text information. In essence, text error detection can be regarded as a natural language processing (NLP) problem that uses machine learning and deep learning to predict, detect and correct errors in different application scenarios. The relevant techniques include, but are not limited to, machine learning (e.g., decision trees, random forests, logistic regression and support vector machines), deep learning (e.g., neural networks, long short-term memory networks (LSTM), convolutional neural networks (CNN) and variational autoencoders (VAE)), and various natural language processing techniques (including, but not limited to, language models, word segmentation, word sense disambiguation and syntactic analysis).
In recent years, with the rapid development of deep learning and machine learning, text error detection has also made significant progress. For example, Transformer-based models (e.g., BERT and GPT) perform well in text error detection. Moreover, some models, such as the pre-trained model BERT, can take sentence context into account and perform fine-grained detection and correction. Despite this progress, text error detection still faces challenges such as the diversity of error types, the scarcity of high-quality annotated data, and the detection of domain-specific errors. In general, as text information grows increasingly complex and voluminous, text error detection remains a problem worthy of in-depth research.
Disclosure of Invention
In view of the needs and shortcomings of the prior art, the invention provides an intelligent text proofreading method, system and device based on a large language model, which are used to improve text quality, enhance the effectiveness of information communication, reduce the workload of manual proofreading and improve the efficiency of text processing.
In a first aspect, the invention provides an intelligent text proofreading method based on a large language model, and the technical scheme adopted to solve the above technical problems is as follows:
An intelligent text proofreading method based on a large language model comprises the following steps:
acquiring a user input text, and segmenting the user input text;
converting the segmentation result into vectors by using a vectorization model, and searching a vector database with the resulting vectors to obtain related fact text;
performing entity extraction on the user input text to obtain an entity set;
combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
for the multi-tuples, calling a large language model to generate questions, obtaining a question set;
constructing prompts based on the fact text and the question set, locating erroneous entities and sentences by using the prompts, calling a large language model to correct the errors, and displaying the corresponding factual information to the user.
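The retrieval step above (vectorize the segmented text, then query a vector database for related fact text) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the patent names neither a vectorization model nor a vector database, so `embed` is a toy bag-of-words stand-in and `InMemoryVectorStore` is a hypothetical in-memory substitute that ranks stored fact texts by cosine similarity.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for the vectorization model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical substitute for a vector database holding fact texts."""

    def __init__(self, fact_texts):
        self._items = [(embed(t), t) for t in fact_texts]

    def search(self, query, top_k=2):
        """Return the top_k stored fact texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


store = InMemoryVectorStore([
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "The Great Wall is in China.",
])
related = store.search("The Eiffel Tower is located in Berlin.", top_k=1)
```

In a real deployment the toy `embed` would be replaced by a learned sentence-embedding model and the list scan by an approximate nearest-neighbor index; the interface (text in, ranked fact texts out) stays the same.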
Optionally, the NLTK toolkit is used to segment the user input text, obtaining paragraph-level text and then sentence-level text in turn, after which the sentence-level text is normalized.
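As an illustration of this optional step, the sketch below splits text into paragraphs, then into sentences with NLTK's `sent_tokenize`, and finally normalizes each sentence. The patent does not specify how paragraphs are delimited or what normalization involves, so the blank-line paragraph split, the Unicode NFKC normalization and the whitespace collapsing are assumptions. The regex fallback is also an addition, used only when NLTK or its sentence-tokenizer data is not installed.

```python
import re
import unicodedata


def split_paragraphs(text):
    """Paragraph-level split; assumes paragraphs are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Sentence-level split with NLTK, falling back to a simple regex."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(paragraph)
    except (ImportError, LookupError):
        # NLTK or its tokenizer data is unavailable: split after ., ! or ?
        return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def normalize(sentence):
    """Assumed normalization: Unicode NFKC plus whitespace collapsing."""
    sentence = unicodedata.normalize("NFKC", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def segment(text):
    """Return normalized sentence-level text, paragraph by paragraph."""
    return [normalize(s) for p in split_paragraphs(text) for s in split_sentences(p)]


sample = "Paris is the capital of  France. It lies on the Seine.\n\nBerlin is in Germany."
sentences = segment(sample)
```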
Optionally, a named entity recognition model is used to perform entity extraction on the user input text to obtain the entity set.
Further optionally, a named entity recognition model is used to perform entity extraction on the user input text, and a useless-entity library is then used to filter out useless entities, yielding the entity set.
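The extract-then-filter idea can be sketched as below. The patent does not identify the named entity recognition model or the contents of the useless-entity library, so both are placeholders here: `extract_entities` is a naive stand-in that treats runs of capitalized words as entities, and `USELESS_ENTITIES` is a hypothetical stop list.

```python
import re

# Hypothetical useless-entity library: matches that carry no checkable fact.
USELESS_ENTITIES = {"the", "it", "he", "she", "this", "that"}


def extract_entities(text):
    """Placeholder for a real NER model: runs of capitalized words."""
    return re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", text)


def build_entity_set(text, useless=USELESS_ENTITIES):
    """Extract entities, drop useless ones, and de-duplicate while keeping order."""
    seen, entities = set(), []
    for entity in extract_entities(text):
        key = entity.lower()
        if key in useless or key in seen:
            continue
        seen.add(key)
        entities.append(entity)
    return entities


entity_set = build_entity_set("Marie Curie was born in Warsaw. She won the Nobel Prize.")
```

Keeping the filter separate from the recognizer means the library can be extended for a new domain without retraining or replacing the model.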
In a second aspect, the invention provides an intelligent text proofreading system based on a large language model, and the technical scheme adopted to solve the above technical problems is as follows:
An intelligent text proofreading system based on a large language model comprises:
a text preprocessing module, used for acquiring a user input text and segmenting and normalizing the user input text;
a text retrieval module, used for converting the sentence-level text into vectors by using a vectorization model and searching a vector database to retrieve related fact text;
an entity extraction module, used for performing entity extraction on the user input text to obtain an entity set;
an entity relation construction module, used for combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
a question generation module, used for calling a large language model to generate questions for the multi-tuples, obtaining a question set;
an error detection and correction module, used for constructing prompts based on the fact text and the question set, locating erroneous entities and sentences by using the prompts, calling the large language model to correct the errors, and displaying the corresponding factual information to the user.
Optionally, the text preprocessing module uses the NLTK toolkit to segment the user input text, obtaining paragraph-level text and then sentence-level text in turn, and then normalizes the sentence-level text.
Optionally, the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text to obtain the entity set.
Further optionally, the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text and the fact text, and then uses a useless-entity library to filter out useless entities, yielding the entity set.
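One way to picture how the six modules fit together is a small container that wires them in the order described. This is only an architectural sketch: the class name `ProofreadingSystem`, the field names and the call signatures are hypothetical, and each module is passed in as a plain callable so that any concrete model or database can be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ProofreadingSystem:
    """Hypothetical wiring of the six modules; each field is one module."""
    preprocess: Callable[[str], List[str]]
    retrieve: Callable[[List[str]], List[str]]
    extract_entities: Callable[[str], List[str]]
    build_tuples: Callable[[List[str], str, List[str]], List[Tuple[str, ...]]]
    generate_questions: Callable[[List[Tuple[str, ...]]], List[str]]
    detect_and_correct: Callable[[str, List[str], List[str]], str]

    def run(self, user_text: str) -> str:
        sentences = self.preprocess(user_text)            # segmentation + normalization
        facts = self.retrieve(sentences)                  # vector search for fact text
        entities = self.extract_entities(user_text)       # entity set
        tuples = self.build_tuples(entities, user_text, facts)  # triples -> multi-tuples
        questions = self.generate_questions(tuples)       # question set
        return self.detect_and_correct(user_text, facts, questions)


# Trivial stand-ins for each module, used only to show the data flow.
system = ProofreadingSystem(
    preprocess=lambda text: [text],
    retrieve=lambda sentences: ["Warsaw is the capital of Poland."],
    extract_entities=lambda text: ["Warsaw", "Germany"],
    build_tuples=lambda ents, text, facts: [("Warsaw", "capital of", "Germany")],
    generate_questions=lambda tuples: ["Which country is Warsaw the capital of?"],
    detect_and_correct=lambda text, facts, qs: text.replace("Germany", "Poland"),
)
corrected = system.run("Warsaw is the capital of Germany.")
```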
In a third aspect, the invention provides a computing device, and the technical scheme adopted to solve the above technical problems is as follows:
A computing device comprises a memory and a processor, wherein executable code is stored in the memory, and the processor implements the method of the first aspect when executing the executable code.
Compared with the prior art, the intelligent text proofreading method, system and device based on a large language model have the following beneficial effects:
(1) The invention effectively addresses the problem of text quality without requiring a large amount of data to be collected manually, and has advantages such as simplicity, high efficiency, easy maintenance and a wide range of application scenarios;
(2) The invention provides an efficient, low-cost error detection and correction method that offers users a convenient and accurate text error detection service; it is readily extensible, can be applied to different fields and scenarios, and has advantages such as low training cost, easy maintenance and high accuracy.
Drawings
Fig. 1 is a block diagram of a second embodiment of the present invention.
Detailed Description
To make the technical scheme, the technical problems to be solved and the technical effects of the invention clearer, the technical scheme is described clearly and completely below with reference to specific embodiments.
Embodiment one:
Referring to Fig. 1, this embodiment provides an intelligent text proofreading method based on a large language model, which includes the following steps:
S1, acquiring a user input text, segmenting the user input text by using the NLTK toolkit to obtain paragraph-level text and then sentence-level text in turn, and then normalizing the sentence-level text;
S2, converting the normalized sentence-level text into vectors by using a vectorization model, and searching a vector database with the resulting vectors to obtain related fact text;
S3, performing entity extraction on the user input text by using a named entity recognition (NER) model, and filtering out useless entities by using a useless-entity library to obtain an entity set;
S4, combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model (LLM) to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct quintuples;
S5, for the quintuples, calling the LLM to generate questions, obtaining a question set;
S6, constructing prompts based on the fact text and the question set, locating erroneous entities and sentences by using the prompts, calling the LLM to correct the errors, and displaying the corresponding factual information to the user.
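The entity-relation step of this embodiment (pairwise combination, relation judgment by the LLM, then building quintuples from triples) can be sketched as follows. Several parts are assumptions, since the patent gives neither the prompt wording nor the matching rule: the prompt template is invented for illustration, the `llm` argument is any callable mapping a prompt string to a reply string, and the rule shown chains two triples that share an entity, turning (A, r1, B) and (B, r2, C) into the quintuple (A, r1, B, r2, C). `stub_llm` is a canned stand-in for a real model.

```python
from itertools import combinations


def build_relation_prompt(e1, e2, user_text, fact_text):
    """Hypothetical prompt asking the model for the relation between two entities."""
    return (
        f"User text: {user_text}\n"
        f"Fact text: {fact_text}\n"
        f"What is the relation between '{e1}' and '{e2}'? "
        "Answer with a short phrase, or 'none' if they are unrelated."
    )


def extract_triples(entities, user_text, fact_text, llm):
    """Ask the LLM about every unordered entity pair; keep pairs with a relation."""
    triples = []
    for e1, e2 in combinations(entities, 2):
        relation = llm(build_relation_prompt(e1, e2, user_text, fact_text)).strip()
        if relation and relation.lower() != "none":
            triples.append((e1, relation, e2))
    return triples


def chain_quintuples(triples):
    """Assumed matching rule: join triples whose tail entity equals another's head."""
    quintuples = []
    for head, rel1, middle in triples:
        for middle2, rel2, tail in triples:
            if middle == middle2 and tail != head:
                quintuples.append((head, rel1, middle, rel2, tail))
    return quintuples


def stub_llm(prompt):
    """Canned replies standing in for a real large language model."""
    if "'Marie Curie' and 'Warsaw'" in prompt:
        return "born in"
    if "'Warsaw' and 'Poland'" in prompt:
        return "capital of"
    return "none"


entities = ["Marie Curie", "Warsaw", "Poland"]
triples = extract_triples(
    entities,
    "Marie Curie was born in Warsaw, the capital of Poland.",
    "Warsaw is the capital of Poland.",
    stub_llm,
)
quintuples = chain_quintuples(triples)
```

Note that pairwise traversal issues one LLM call per entity pair, so the number of calls grows quadratically with the size of the entity set; filtering useless entities beforehand keeps this manageable.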
Embodiment two:
Referring to Fig. 1, this embodiment provides an intelligent text proofreading system based on a large language model, which includes:
a text preprocessing module, used for acquiring a user input text, segmenting the user input text by using the NLTK toolkit to obtain paragraph-level text and then sentence-level text in turn, and then normalizing the sentence-level text;
a text retrieval module, used for converting the normalized sentence-level text into vectors by using a vectorization model and searching a vector database to retrieve related fact text;
an entity extraction module, used for performing entity extraction on the user input text by using a named entity recognition (NER) model, and filtering out useless entities by using a useless-entity library to obtain an entity set;
an entity relation construction module, used for combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model (LLM) to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct quintuples;
a question generation module, used for calling the LLM to generate questions for the quintuples, obtaining a question set;
an error detection and correction module, used for constructing prompts based on the fact text and the question set, locating erroneous entities and sentences by using the prompts, calling the LLM to correct the errors, and displaying the corresponding factual information to the user.
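The last two modules both reduce to prompt construction around an LLM call. The sketch below shows one possible shape for each. The prompt wording, the function names and the canned `stub_llm` reply are all illustrative assumptions, as the patent does not disclose its prompt templates; `llm` is any callable that maps a prompt string to a reply string.

```python
def build_question_prompt(quintuple):
    """Hypothetical prompt that turns one quintuple into a verification question."""
    head, rel1, middle, rel2, tail = quintuple
    return (
        "Write one question that checks the following chain of facts.\n"
        f"Chain: {head} -[{rel1}]-> {middle} -[{rel2}]-> {tail}\n"
        "Question:"
    )


def generate_questions(quintuples, llm):
    """Question generation module: one LLM call per quintuple."""
    return [llm(build_question_prompt(q)).strip() for q in quintuples]


def build_check_prompt(user_text, fact_texts, questions):
    """Hypothetical prompt for locating and correcting errors against the facts."""
    facts = "\n".join(f"- {fact}" for fact in fact_texts)
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Answer each question using only the fact text, then compare the answers "
        "with the user text.\n"
        f"Fact text:\n{facts}\n"
        f"Questions:\n{numbered}\n"
        f"User text: {user_text}\n"
        "List every erroneous entity and sentence, give a corrected version, "
        "and quote the supporting fact."
    )


def stub_llm(prompt):
    """Canned reply standing in for a real large language model."""
    return " Where was Marie Curie born, and which country is that city in? "


quintuple = ("Marie Curie", "born in", "Warsaw", "capital of", "Poland")
questions = generate_questions([quintuple], stub_llm)
check_prompt = build_check_prompt(
    "Marie Curie was born in Warsaw, the capital of Germany.",
    ["Warsaw is the capital of Poland."],
    questions,
)
```

Asking the model to quote the supporting fact is what lets the system display the corresponding factual information to the user alongside the correction.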
In a third aspect, an embodiment of the invention provides a computing device comprising a memory and a processor, wherein executable code is stored in the memory, and the processor implements the method of embodiment one when executing the executable code.
It may be appreciated that, for the explanation, specific implementation, beneficial effects and examples of the computing device provided by this embodiment, reference may be made to the corresponding parts of the method provided in the first aspect, which are not repeated here.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant parts of the method embodiment may be consulted.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium.
The foregoing embodiments illustrate the general principles of the invention in further detail and are not to be construed as limiting its scope; any modifications, equivalents, improvements and the like made on the basis of the teachings of the invention fall within its scope of protection.

Claims (9)

1. An intelligent text proofreading method based on a large language model, characterized by comprising the following steps:
acquiring a user input text, and segmenting the user input text;
converting the segmentation result into vectors by using a vectorization model, and searching a vector database with the resulting vectors to obtain related fact text;
performing entity extraction on the user input text to obtain an entity set;
combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
for the multi-tuples, calling a large language model to generate questions, obtaining a question set;
constructing prompts based on the fact text and the question set, locating erroneous entities and sentences by using the prompts, calling a large language model to correct the errors, and displaying the corresponding factual information to the user.
2. The intelligent text proofreading method based on a large language model according to claim 1, wherein the NLTK toolkit is used to segment the user input text, obtaining paragraph-level text and then sentence-level text in turn, after which the sentence-level text is normalized.
3. The intelligent text proofreading method based on a large language model according to claim 1, wherein a named entity recognition model is used to perform entity extraction on the user input text to obtain the entity set.
4. The intelligent text proofreading method based on a large language model according to claim 3, wherein a named entity recognition model is used to perform entity extraction on the user input text, and a useless-entity library is used to filter out useless entities to obtain the entity set.
5. An intelligent text proofreading system based on a large language model, comprising:
a text preprocessing module, used for acquiring a user input text and segmenting and normalizing the user input text;
a text retrieval module, used for converting the sentence-level text into vectors by using a vectorization model and searching a vector database to retrieve related fact text;
an entity extraction module, used for performing entity extraction on the user input text to obtain an entity set;
an entity relation construction module, used for combining the entities in the entity set pairwise, constructing prompts based on the user input text and the fact text, calling a large language model to judge the relation between each pair of entities and construct triples, and then traversing the triples in sequence by rule matching to construct multi-tuples;
a question generation module, used for calling a large language model to generate questions for the multi-tuples, obtaining a question set;
an error detection and correction module, used for constructing prompts based on the fact text and the question set, locating erroneous entities and sentences by using the prompts, calling the large language model to correct the errors, and displaying the corresponding factual information to the user.
6. The intelligent text proofreading system based on a large language model according to claim 5, wherein the text preprocessing module uses the NLTK toolkit to segment the user input text, obtaining paragraph-level text and then sentence-level text in turn, and then normalizes the sentence-level text.
7. The intelligent text proofreading system based on a large language model according to claim 5, wherein the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text to obtain the entity set.
8. The intelligent text proofreading system based on a large language model according to claim 7, wherein the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text and the fact text, and then uses a useless-entity library to filter out useless entities to obtain the entity set.
9. A computing device comprising a memory and a processor, wherein executable code is stored in the memory, and the processor performs the method of any one of claims 1-4 when executing the executable code.
CN202410182006.9A 2024-02-19 2024-02-19 Text intelligent correction method, system and equipment based on large language model Pending CN118070783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410182006.9A CN118070783A (en) 2024-02-19 2024-02-19 Text intelligent correction method, system and equipment based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410182006.9A CN118070783A (en) 2024-02-19 2024-02-19 Text intelligent correction method, system and equipment based on large language model

Publications (1)

Publication Number Publication Date
CN118070783A (en) 2024-05-24

Family

ID=91096603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410182006.9A Pending CN118070783A (en) 2024-02-19 2024-02-19 Text intelligent correction method, system and equipment based on large language model

Country Status (1)

Country Link
CN (1) CN118070783A (en)

Similar Documents

Publication Publication Date Title
Lei et al. Re-examining the Role of Schema Linking in Text-to-SQL
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN110276071B (en) Text matching method and device, computer equipment and storage medium
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN112069826A (en) Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN110427612A (en) Based on multilingual entity disambiguation method, device, equipment and storage medium
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN113593661A (en) Clinical term standardization method, device, electronic equipment and storage medium
CN112347339A (en) Search result processing method and device
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN113742446A (en) Knowledge graph question-answering method and system based on path sorting
CN118296120A (en) Large-scale language model retrieval enhancement generation method for multi-mode multi-scale multi-channel recall
Zhang et al. Refsql: A retrieval-augmentation framework for text-to-sql generation
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN106776590A (en) A kind of method and system for obtaining entry translation
CN116680407A (en) Knowledge graph construction method and device
Shen et al. Evaluating Code Summarization with Improved Correlation with Human Assessment
CN116090450A (en) Text processing method and computing device
CN114239555A (en) Training method of keyword extraction model and related device
CN118070783A (en) Text intelligent correction method, system and equipment based on large language model
CN114722821A (en) Text matching method and device, storage medium and electronic equipment
CN114780700A (en) Intelligent question-answering method, device, equipment and medium based on machine reading understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination