CN118070783A

CN118070783A - A text intelligent proofreading method, system and device based on large language model

Info

Publication number: CN118070783A
Application number: CN202410182006.9A
Authority: CN
Inventors: 黄登蓉; 郭冬升; 张其来; 张思嘉
Original assignee: Shandong Inspur Science Research Institute Co Ltd
Current assignee: Shandong Inspur Science Research Institute Co Ltd
Priority date: 2024-02-19
Filing date: 2024-02-19
Publication date: 2024-05-24

Abstract

The present invention discloses a text intelligent proofreading method, system and device based on a large language model, and relates to the technical field of text detection. The method comprises: obtaining input text, segmenting the text; converting the segmentation result into a vector using a vectorization model, searching the conversion result using a vector database, and obtaining relevant factual text; extracting entities from the input text; traversing and combining entities in pairs, and constructing relevant prompts based on the input text and the factual text, then calling the large language model to judge the entity relationship, constructing triples, and then traversing and combining the triples in turn according to the rule matching method to construct multi-tuples; for the multi-tuples, calling the large language model to construct questions, and obtaining a question set; locating wrong entities and sentences based on the factual text and the question set, and calling the large language model to correct errors, and displaying the corresponding real information to the user. The present invention can solve the text quality problem well.

Description

A text intelligent proofreading method, system and device based on large language model

Technical Field

The present invention relates to the technical field of text detection, and in particular to a text intelligent proofreading method, system and device based on a large language model.

Background Art

With the development of Internet technology and the increasing informatization of society, the volume of text data has grown explosively. With its rich information content and wide range of applications, text has become an important form through which people acquire knowledge, express opinions and transmit information. However, for various reasons, such as a writer's language proficiency, cognitive ability, and the limitations of input devices, people may introduce errors when writing and editing text, such as wrong characters, wrong words, and grammatical errors. These errors not only lower the quality of the text but also hinder accurate understanding and effective use of the information it carries. Faced with a large volume of erroneous text, traditional manual proofreading cannot keep up with the growing demand: the workload is huge, and the process is time-consuming, labor-intensive and inefficient.

Therefore, studying and designing a computer system that can automatically detect and correct text errors has great theoretical significance and practical value. With the development of artificial intelligence and deep learning, text error detection has become a technology centered on machine learning, used mainly for in-depth analysis and correction of text. In modern society, large-scale text processing has become a necessity, and much of this text contains grammatical and spelling errors. A powerful text error detection technology is therefore needed to improve the quality and accuracy of text information. In essence, text error detection can be regarded as a natural language processing (NLP) problem that uses machine learning and deep learning techniques to predict, detect and correct errors in different application scenarios. The relevant technical background includes, but is not limited to, machine learning (such as decision trees, random forests, logistic regression, and support vector machines), deep learning (such as neural networks, long short-term memory networks (LSTM), convolutional neural networks (CNN), and variational autoencoders (VAE)), and various natural language processing techniques (including but not limited to language models, word segmentation, word sense disambiguation, and syntactic analysis).

In recent years, with the rapid development of deep learning and machine learning, text error detection has also made significant progress. For example, Transformer models (such as BERT and GPT) perform well in text error detection. Moreover, some models, such as the pre-trained model BERT, can already take sentence context into account and perform fine-grained detection and correction. Despite this progress, text error detection still faces challenges such as the diversity of error types, the scarcity of high-quality annotated data, and the detection of domain-specific errors. In general, given increasingly complex and massive text information, text error detection remains a problem worthy of in-depth study.

Summary of the Invention

In view of the needs and shortcomings of current technology, the present invention provides a text intelligent proofreading method, system and device based on a large language model, so as to improve text quality and the effectiveness of information exchange while reducing the workload of manual proofreading and improving the efficiency of text processing.

In a first aspect, the present invention provides a text intelligent proofreading method based on a large language model. The technical solution adopted to solve the above technical problem is as follows:

A text intelligent proofreading method based on a large language model, comprising the following steps:

obtaining user input text and segmenting the user input text;

converting the segmentation results into vectors using a vectorization model, and searching a vector database with the conversion results to obtain relevant factual text;

performing entity extraction on the user input text to obtain an entity set;

traversing and combining the entities in the entity set in pairs, constructing relevant prompts based on the user input text and the factual text, then calling the large language model to judge the entity relationships and construct triples, and then traversing and combining the triples in turn according to a rule matching method to construct multi-tuples;

for the multi-tuples, calling the large language model to construct questions to obtain a question set;

constructing prompts based on the factual text and the question set, using the prompts to locate erroneous entities and sentences, calling the large language model to correct the errors, and displaying the corresponding true information to the user.
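The retrieval step in the list above (vectorize each segment, then search a vector database for related factual text) can be illustrated with a minimal sketch. The patent does not name a vectorization model or a vector database, so everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, and `InMemoryVectorStore` is a hypothetical in-memory substitute for a vector database, ranked by cosine similarity.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a vectorization model: bag-of-words term counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database holding factual text."""

    def __init__(self, facts):
        # Store each fact next to its precomputed vector.
        self._rows = [(fact, embed(fact)) for fact in facts]

    def search(self, query, top_k=1):
        """Return the top_k facts most similar to the query sentence."""
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [fact for fact, _ in ranked[:top_k]]


store = InMemoryVectorStore([
    "Jinan is the capital of Shandong Province.",
    "The Yellow River flows into the Bohai Sea.",
])
hits = store.search("Qingdao is the capital of Shandong Province.")
```

Here the erroneous input sentence still retrieves the fact about Jinan, because the two sentences share most of their terms; that retrieved fact is what the later steps would compare the input against.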

Optionally, the NLTK tool is used to segment the user input text, obtaining paragraph-level text and then sentence-level text, after which the sentence-level text is normalized.
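A minimal sketch of this two-level segmentation followed by normalization is given below. The patent names NLTK, whose `nltk.tokenize.sent_tokenize` could take the place of `split_sentences`; the sketch uses a simple regular-expression splitter instead so that it runs without tokenizer data. The splitting rule and the normalization choices (Unicode NFKC folding and whitespace collapsing) are assumptions, since the patent does not say what normalization involves.

```python
import re
import unicodedata


def split_paragraphs(text):
    """Split on blank lines to obtain paragraph-level text."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive sentence splitter: break after ., !, ? or their Chinese forms.

    This is a stand-in for a real tokenizer and will mis-split
    abbreviations and decimal numbers.
    """
    parts = re.split(r"(?<=[.!?。！？])\s*", paragraph)
    return [s for s in parts if s]


def normalize(sentence):
    """Assumed normalization: NFKC folding plus whitespace collapsing."""
    folded = unicodedata.normalize("NFKC", sentence)
    return re.sub(r"\s+", " ", folded).strip()


def segment(text):
    """Return one list of normalized sentences per paragraph."""
    return [
        [normalize(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]


result = segment(
    "Text  data is growing.  Errors are common!\n\nManual proofreading is slow."
)
```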

Optionally, a named entity recognition model is used to perform entity extraction on the user input text to obtain the entity set.

Further optionally, a named entity recognition model is used to perform entity extraction on the user input text, and a useless entity library is used to filter out useless entities to obtain the entity set.
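The extract-then-filter idea can be sketched as follows. The patent does not identify a specific named entity recognition model or the contents of the useless entity library, so both are hypothetical here: `toy_ner` simply treats runs of capitalized words as entities, and `USELESS_ENTITIES` is an illustrative set.

```python
import re

# Hypothetical useless-entity library; the real contents are not given in the patent.
USELESS_ENTITIES = {"The", "However", "Monday"}


def toy_ner(text):
    """Stand-in for an NER model: runs of capitalized words count as entities."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def extract_entities(text, useless=USELESS_ENTITIES):
    """Extract entities, drop useless ones, and de-duplicate in first-seen order."""
    seen = set()
    entities = []
    for entity in toy_ner(text):
        if entity in useless or entity in seen:
            continue
        seen.add(entity)
        entities.append(entity)
    return entities


entities = extract_entities(
    "However, Jinan is the capital of Shandong Province. Jinan is large."
)
```

Filtering before the pairwise combination step matters for cost: with n entities there are n(n-1)/2 pairs, each of which leads to a large language model call.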

In a second aspect, the present invention provides a text intelligent proofreading system based on a large language model. The technical solution adopted to solve the above technical problem is as follows:

A text intelligent proofreading system based on a large language model, comprising:

a text preprocessing module, configured to obtain user input text and to segment and normalize the user input text;

a text retrieval module, configured to convert sentence-level text into vectors using a vectorization model and to search a vector database to retrieve relevant factual text;

an entity extraction module, configured to perform entity extraction on the user input text to obtain an entity set;

an entity relationship construction module, configured to traverse and combine the entities in the entity set in pairs, then construct relevant prompts based on the user input text and the factual text, call the large language model to judge the entity relationships and construct triples, and then traverse and combine the triples in turn according to a rule matching method to construct multi-tuples;

a question generation module, configured to call the large language model to construct questions for the multi-tuples to obtain a question set;

an error detection and correction module, configured to construct prompts based on the factual text and the question set, use the prompts to locate erroneous entities and sentences, call the large language model to correct the errors, and display the corresponding true information to the user.
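The last two modules listed above can be sketched as prompt builders wrapped around a large language model call. The prompt wording, the five-element shape of the multi-tuple (taken from the five-tuples of the embodiments), the JSON reply format, and the canned `question_llm` and `checker_llm` functions are all assumptions made for illustration; a real system would call an actual model and would need to handle malformed replies.

```python
import json


def build_question_prompt(multi_tuple):
    """Turn a five-element tuple into a question-writing prompt."""
    head, rel1, middle, rel2, tail = multi_tuple
    return (
        "Write one question that checks whether this chain of statements is true: "
        f"{head} {rel1} {middle}; {middle} {rel2} {tail}."
    )


def generate_questions(multi_tuples, llm):
    """Call the model once per multi-tuple to obtain the question set."""
    return [llm(build_question_prompt(t)).strip() for t in multi_tuples]


def build_check_prompt(questions, fact_text, sentences):
    """Combine facts, numbered sentences and questions into one checking prompt."""
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    listed = "\n".join(f"- {q}" for q in questions)
    return (
        f"Reference facts:\n{fact_text}\n\n"
        f"Sentences:\n{numbered}\n\n"
        f"Questions:\n{listed}\n\n"
        "Answer each question from the reference facts only. Return a JSON list of "
        'objects {"sentence_index": int, "wrong_entity": str, "correction": str} '
        "for every sentence that contradicts the facts."
    )


def detect_and_correct(questions, fact_text, sentences, llm):
    """Locate erroneous sentences and entities and attach the supporting facts."""
    findings = json.loads(llm(build_check_prompt(questions, fact_text, sentences)))
    return [
        {
            "sentence": sentences[item["sentence_index"]],
            "wrong_entity": item["wrong_entity"],
            "correction": item["correction"],
            "evidence": fact_text,  # the true information shown to the user
        }
        for item in findings
    ]


# Canned stand-ins for a real large language model (illustration only).
def question_llm(prompt):
    return "Where is Acme located?"


def checker_llm(prompt):
    return '[{"sentence_index": 0, "wrong_entity": "Berlin", "correction": "Munich"}]'


questions = generate_questions(
    [("Alice", "works at", "Acme", "is located in", "Berlin")], question_llm
)
report = detect_and_correct(
    questions, "Acme is located in Munich.", ["Acme is located in Berlin."], checker_llm
)
```

Asking the model for a structured reply with a sentence index is one way to make the located error machine-readable, so the erroneous sentence, the suggested correction and the retrieved evidence can be displayed side by side.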

Optionally, the text preprocessing module uses the NLTK tool to segment the user input text, obtaining paragraph-level text and then sentence-level text, and then normalizes the sentence-level text.

Optionally, the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text to obtain the entity set.

Further optionally, the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text and the factual text, and then uses a useless entity library to filter out useless entities to obtain the entity set.

In a third aspect, the present invention provides a computer device. The technical solution adopted to solve the above technical problem is as follows:

A computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method described in the first aspect.

Compared with the prior art, the text intelligent proofreading method, system and device based on a large language model of the present invention have the following beneficial effects:

(1) The present invention addresses text quality problems well without requiring manual collection of large amounts of data, and has the advantages of being simple and efficient, easy to maintain, and applicable to a wide range of scenarios;

(2) The present invention realizes an efficient and low-cost error detection and correction method that provides users with a convenient and accurate text error detection service. It also has good scalability, can be applied to different fields and scenarios, and has the advantages of low training cost, simple maintenance, and high accuracy.

Brief Description of the Drawings

FIG. 1 is a module connection diagram of Embodiment 2 of the present invention.

Detailed Description

To make the technical solution, the technical problem solved and the technical effect of the present invention clearer, the technical solution of the present invention is described clearly and completely below in conjunction with specific embodiments.

Embodiment 1:

Referring to FIG. 1, this embodiment proposes a text intelligent proofreading method based on a large language model, which includes the following steps:

S1. Obtain the user input text, use the NLTK tool to segment it into paragraph-level text and then sentence-level text, and then normalize the sentence-level text.

S2. Use the vectorization model to convert the normalized sentence-level text into vectors, and search the vector database with the conversion results to obtain relevant factual text.

S3. Use a named entity recognition (NER) model to perform entity extraction on the user input text, and use the useless entity library to filter out useless entities, obtaining an entity set;

S4. Traverse and combine the entities in the entity set in pairs, construct relevant prompts based on the user input text and the factual text, then call the large language model (LLM) to judge the entity relationships and construct triples, and then traverse and combine the triples in turn according to a rule matching method to construct five-tuples;

For the five-tuples, call the large language model (LLM) to construct questions, obtaining a question set;

Construct prompts based on the factual text and the question set, use the prompts to locate erroneous entities and sentences, call the large language model (LLM) to correct the errors, and display the corresponding true information to the user.
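Step S4 above can be sketched as follows: pair up entities, ask a model for the relation of each pair, keep the resulting triples, then merge triples into five-tuples. The patent does not spell out the rule matching method, so the chaining rule used here is an assumption: two triples (a, r1, b) and (b, r2, c) that share the middle entity are joined into (a, r1, b, r2, c). The prompt wording and the canned `fake_llm` are likewise illustrative stand-ins for a real large language model.

```python
from itertools import combinations


def build_relation_prompt(e1, e2, user_text, fact_text):
    """Prompt asking for the relation between two entities (wording is assumed)."""
    return (
        f"Input text: {user_text}\n"
        f"Reference facts: {fact_text}\n"
        f"What is the relation between '{e1}' and '{e2}' in the input text? "
        "Answer with a short relation phrase, or NONE."
    )


def build_triples(entities, user_text, fact_text, llm):
    """Traverse all entity pairs and keep (head, relation, tail) for related pairs."""
    triples = []
    for e1, e2 in combinations(entities, 2):
        relation = llm(build_relation_prompt(e1, e2, user_text, fact_text)).strip()
        if relation and relation != "NONE":
            triples.append((e1, relation, e2))
    return triples


def build_five_tuples(triples):
    """Assumed rule: chain (a, r1, b) with (b, r2, c) into (a, r1, b, r2, c)."""
    return [
        (h1, r1, t1, r2, t2)
        for (h1, r1, t1) in triples
        for (h2, r2, t2) in triples
        if t1 == h2 and h1 != t2  # shared middle entity, no trivial loop
    ]


def fake_llm(prompt):
    """Canned stand-in for a real large language model (illustration only)."""
    if "'Alice' and 'Acme'" in prompt:
        return "works at"
    if "'Acme' and 'Berlin'" in prompt:
        return "is located in"
    return "NONE"


text = "Alice works at Acme, which is located in Berlin."
triples = build_triples(["Alice", "Acme", "Berlin"], text, "", fake_llm)
five_tuples = build_five_tuples(triples)
```

Under this assumed rule, a five-tuple captures a two-hop chain of statements, which gives the question generation step more context than a single triple would.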

Embodiment 2:

Referring to FIG. 1, this embodiment proposes a text intelligent proofreading system based on a large language model, which includes:

a text preprocessing module, configured to obtain the user input text, use the NLTK tool to segment it into paragraph-level text and then sentence-level text, and then normalize the sentence-level text;

a text retrieval module, configured to convert the normalized sentence-level text into vectors using a vectorization model and to search the vector database to retrieve relevant factual text;

an entity extraction module, configured to perform entity extraction on the user input text using a named entity recognition (NER) model and to filter out useless entities using the useless entity library, obtaining an entity set;

an entity relationship construction module, configured to traverse and combine the entities in the entity set in pairs, then construct relevant prompts based on the user input text and the factual text, call the large language model (LLM) to judge the entity relationships and construct triples, and then traverse and combine the triples in turn according to a rule matching method to construct five-tuples;

a question generation module, configured to call the large language model (LLM) to construct questions for the five-tuples, obtaining a question set;

an error detection and correction module, configured to construct prompts based on the factual text and the question set, use the prompts to locate erroneous entities and sentences, call the large language model (LLM) to correct the errors, and display the corresponding true information to the user.
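One way to picture how the six modules above connect is a small container that holds each module as a callable and passes data between them in order. This is only a sketch under assumptions: the class name `ProofreadingSystem`, the callable signatures, and the data passed between modules are hypothetical, and the trivial lambdas in the usage example stand in for real module implementations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ProofreadingSystem:
    """Hypothetical wiring of the six modules; each field is one module."""

    preprocess: Callable[[str], List[str]]                       # text -> sentences
    retrieve: Callable[[List[str]], str]                         # sentences -> facts
    extract_entities: Callable[[str], List[str]]                 # text -> entity set
    build_tuples: Callable[[List[str], str, str], List[Tuple]]   # -> five-tuples
    generate_questions: Callable[[List[Tuple]], List[str]]       # -> question set
    detect_and_correct: Callable[[List[str], str, List[str]], List[dict]]

    def run(self, user_text: str) -> List[dict]:
        """Run the modules in the order the embodiment lists them."""
        sentences = self.preprocess(user_text)
        facts = self.retrieve(sentences)
        entities = self.extract_entities(user_text)
        tuples = self.build_tuples(entities, user_text, facts)
        questions = self.generate_questions(tuples)
        return self.detect_and_correct(questions, facts, sentences)


# Usage with placeholder modules (illustration only).
system = ProofreadingSystem(
    preprocess=lambda text: [text.strip()],
    retrieve=lambda sentences: "Acme is located in Munich.",
    extract_entities=lambda text: ["Acme", "Berlin"],
    build_tuples=lambda ents, text, facts: [tuple(ents)],
    generate_questions=lambda tuples: ["Where is Acme located?"],
    detect_and_correct=lambda qs, facts, sents: [
        {"sentence": sents[0], "evidence": facts}
    ],
)
result = system.run(" Acme is located in Berlin. ")
```

Keeping each module behind a plain callable means a different segmenter, embedding model, or large language model can be swapped in without touching the rest of the pipeline.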

In a third aspect, an embodiment of this specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method of Embodiment 1.

It can be understood that, for the explanation, specific implementation, beneficial effects and examples of the computing device provided in the embodiment of the present invention, reference may be made to the corresponding parts of the method provided in the first aspect, which are not repeated here.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiment is basically similar to the method embodiment, its description is relatively brief; for related details, refer to the corresponding description of the method embodiment.

Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made on the basis of the technical solution of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A text intelligent proofreading method based on a large language model, characterized by comprising the following steps:

acquiring user input text, and segmenting the user input text;

converting the segmentation result into vectors using a vectorization model, and searching a vector database with the conversion result to obtain relevant factual text;

performing entity extraction on the user input text to obtain an entity set;

traversing and combining the entities in the entity set in pairs, constructing relevant prompts based on the user input text and the factual text, then calling a large language model to judge the entity relationships and construct triples, and then traversing and combining the triples in turn according to a rule matching method to construct multi-tuples;

for the multi-tuples, calling the large language model to construct questions to obtain a question set;

constructing prompts based on the factual text and the question set, locating erroneous entities and sentences using the prompts, calling the large language model to correct the errors, and displaying the corresponding true information to the user.

2. The text intelligent proofreading method based on a large language model according to claim 1, wherein the NLTK tool is used to segment the user input text, obtaining paragraph-level text and then sentence-level text, and the sentence-level text is then normalized.

3. The text intelligent proofreading method based on a large language model according to claim 1, wherein a named entity recognition model is used to perform entity extraction on the user input text to obtain the entity set.

4. The text intelligent proofreading method based on a large language model according to claim 3, wherein a named entity recognition model is used to perform entity extraction on the user input text, and useless entities are filtered out using a useless entity library to obtain the entity set.

5. A text intelligent proofreading system based on a large language model, comprising:

a text preprocessing module, configured to acquire user input text and to segment and normalize the user input text;

a text retrieval module, configured to convert sentence-level text into vectors using a vectorization model, and to search a vector database to retrieve relevant factual text;

an entity extraction module, configured to perform entity extraction on the user input text to obtain an entity set;

an entity relationship construction module, configured to traverse and combine the entities in the entity set in pairs, then construct relevant prompts based on the user input text and the factual text, call a large language model to judge the entity relationships and construct triples, and then traverse and combine the triples in turn according to a rule matching method to construct multi-tuples;

a question generation module, configured to call the large language model to construct questions for the multi-tuples to obtain a question set;

an error detection and correction module, configured to construct prompts based on the factual text and the question set, locate erroneous entities and sentences using the prompts, call the large language model to correct the errors, and display the corresponding true information to the user.

6. The text intelligent proofreading system based on a large language model according to claim 5, wherein the text preprocessing module uses the NLTK tool to segment the user input text, obtaining paragraph-level text and then sentence-level text, and then normalizes the sentence-level text.

7. The text intelligent proofreading system based on a large language model according to claim 5, wherein the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text to obtain the entity set.

8. The text intelligent proofreading system based on a large language model according to claim 7, wherein the entity extraction module uses a named entity recognition model to perform entity extraction on the user input text and the factual text, and then uses a useless entity library to filter out useless entities to obtain the entity set.

9. A computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, performs the method of any one of claims 1-4.