CN116821285A - Text processing method, device, equipment and medium based on artificial intelligence - Google Patents
- Publication number
- CN116821285A (application number CN202310846702.0A)
- Authority
- CN
- China
- Prior art keywords
- text information
- model
- text
- word
- gptmate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a text processing method, device, equipment and medium based on artificial intelligence. The method acquires first text information input by a user through a preset MyGPTmate model; performs vectorization processing on the first text information by using a GPTmate engine on the MyGPTmate model, so that the GPTmate engine performs word vector decomposition on the first text information based on the OpenAI word embedding sub-model and generates a plurality of word vectors matched with the first text information; carries out cosine similarity calculation on the word vectors so as to determine, from among them, the word vectors whose similarity values meet a preset threshold and generate second text information; and executes a corresponding user service process based on the second text information, wherein the user service process comprises question answering, data analysis, text-to-image drawing, file export, and retrieval. In this way the user can train on real-time data at extremely low cost and obtain a dedicated customized model, thereby providing more personalized services for the user's daily work, study, and the like.
Description
Technical Field
The present invention relates to the field of text data models, and in particular, to a text processing method, apparatus, device, and medium based on artificial intelligence.
Background
My GPTmate is a software system that can run on a variety of operating systems, including Windows, Linux, and macOS. On each operating system, My GPTmate needs a corresponding version of the Python interpreter and the relevant dependency libraries and software packages, such as NLP libraries, PyTorch, and Transformers. In addition, My GPTmate may also incorporate other tools and frameworks, such as Docker, Kubernetes, and the JVM, to implement distributed training, deployment, and management functions.
The GPT (Generative Pre-trained Transformer) model is an unsupervised language model developed by the OpenAI team that improves the performance of natural language processing tasks through large-scale text pre-training. The Transformer is a neural network architecture based on an attention mechanism; it is used for processing sequence data and is widely applied in the field of natural language processing. However, the training cost of a GPT model is enormous, and ordinary users cannot bear the high cost of training a dedicated GPT model.
Disclosure of Invention
The invention mainly aims to provide a text processing method, device, equipment and medium based on artificial intelligence, which enable a user to train on real-time data at extremely low cost and obtain a dedicated customized model, thereby providing more personalized services for the user's daily work, study, and the like.
In order to achieve the above object, the present invention provides a text processing method based on artificial intelligence, comprising the steps of:
acquiring first text information input by a user through a preset MyGPTmate model;
performing vectorization processing on the first text information by using a GPTmate engine on the MyGPTmate model, so that the GPTmate engine performs word vector decomposition on the first text information based on an OpenAI word embedding sub-model and generates a plurality of word vectors matched with the first text information;
carrying out cosine similarity calculation on the plurality of word vectors so as to determine, from among them, word vectors whose similarity values meet a preset threshold and generate second text information;
and executing a corresponding user service process based on the second text information, wherein the user service process comprises question answering, data analysis, text-to-image drawing, file export, and retrieval.
Further, before the step of obtaining the first text information input by the user through the preset MyGPTmate model, the method includes:
identifying a locally preset GPT model, wherein the GPT model is generated by a local knowledge base;
performing autoregressive training on the GPT model;
performing sequence-data deep learning on the autoregressively trained GPT model by adopting a Transformer architecture;
and packaging the GPT model subjected to sequence-data deep learning through an Embeddings model and natural language processing technology to obtain a GPTmate engine, and constructing the GPTmate engine on the MyGPTmate model.
Further, the cosine similarity calculation algorithm comprises:
cosine_similarity(A, B) = dot_product(A, B) / (norm(A) * norm(B))
wherein A and B are two word vectors, dot_product(A, B) is the dot product of A and B, and norm(A) and norm(B) are the Euclidean lengths of A and B respectively. The result value lies between -1 and 1: the closer the value is to 1, the closer the directions of the two word vectors; the closer the value is to -1, the more opposite their directions; and a value close to 0 indicates that the two word vectors are nearly orthogonal, indicating no similarity.
Further, when the user service process is a question and answer, executing a corresponding user service process based on the second text information, including:
converting the question in the second text information into a question vector through the GPTmate engine;
tuning the second text information by using a local knowledge base through a GPTmate engine;
linking the tuned second text information to the corresponding Internet corpus through a web crawler, and linking the second text information tuned by the GPTmate engine to the local corpus of the local knowledge base;
and generating answer information corresponding to the second text information through the MyGPTmate model in combination with the Internet corpus and the local corpus.
Further, before the step of performing vectorization processing on the first text information by using a GPTmate engine on the MyGPTmate model, the method includes:
performing word segmentation on the first text information by using an open-source word segmentation tool, wherein the open-source word segmentation tool includes, but is not limited to, jieba or HanLP;
performing part-of-speech tagging on the first text information after word segmentation;
removing stop words from the first text information after part-of-speech tagging;
performing interference word removal on the first text information from which the stop word is removed;
and carrying out label substitution on the first text information after the interference words are removed, so as to obtain first text information that the MyGPTmate model can readily understand.
Further, when the user service process is question answering/retrieval/data analysis, executing a corresponding user service process based on the second text information, including:
and carrying out dual weighted matching processing on the second text information by adopting Elasticsearch word segmentation technology, wherein dual weighted matching searches with both vector similarity and the GPTmate engine to obtain a dual weighted search score, and takes the top N results in descending order of score to optimize the second text information.
Further, the MyGPTmate model includes:
a user management module for providing conventional user management authentication capability;
the corpus management module is used for linking the internet corpus and the local corpus;
the question-answering module is used for supporting chat question-answering of long text context memory;
the text-to-image module, used for generating images through chat;
and the GPTmate engine module is used for vectorizing and storing the first text information.
The invention also provides a text processing device based on artificial intelligence, which comprises:
the acquisition unit is used for acquiring first text information input by a user through a preset MyGPTmate model;
the engine unit is used for carrying out vectorization processing on the first text information by utilizing a GPTmate engine on the MyGPTmate model, so that the GPTmate engine performs word vector decomposition on the first text information based on an OpenAI word embedding sub-model and generates a plurality of word vectors matched with the first text information;
the computing unit is used for carrying out cosine similarity computation on a plurality of word vectors so as to determine the word vector with the similarity value meeting a preset threshold value from the plurality of word vectors and generate second text information;
and the service unit is used for executing a corresponding user service process based on the second text information, wherein the user service process comprises question answering, data analysis, text-to-image drawing, file export, and retrieval.
The invention also provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the text processing method based on artificial intelligence.
The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the above-described artificial intelligence based text processing methods.
The text processing method, device, equipment and medium based on artificial intelligence provided by the invention have the following beneficial effects:
language understanding: the My GPTrate provided by the invention can understand natural language texts and extract key information and semantic content in the natural language texts. In the tasks of voice recognition, emotion analysis, text classification and the like, the method combines the user preference and performs better than the traditional algorithm.
Language generation: My GPTmate can generate new text conforming to grammatical and semantic rules from the input text, such as conversations, articles, and mails. In fields such as dialog systems, automatic summary generation, and text creation, it can also bring significant improvements.
Multilingual interaction: My GPTmate can support communication and translation between multiple languages, so that users can communicate and cooperate across different language environments. For multinational enterprises, international organizations, and the like, it can bring a more convenient and efficient communication experience.
Strong language understanding and generation capability: My GPTmate is based on a GPT model and a Transformer architecture, has strong natural language understanding and generation capability, and can adapt to different scenarios and requirements.
Good scalability: My GPTmate can adopt distributed training and efficient model compression technology to achieve model scalability and faster operation.
Wide range of application scenarios: My GPTmate can be applied to a variety of natural language processing tasks, such as text classification, machine translation, and dialog generation.
Strong customizability: My GPTmate can be adjusted and optimized according to different application scenarios and requirements, including using domain-specific pre-training data or adopting different model structures or algorithms. My GPTmate supports training a user-specific GPT model in real time by importing corpus/PDF/text data and the like, thereby making up for gaps in the data coverage of the base GPT model and better serving users' personalized question-answering needs.
Drawings
FIG. 1 is a schematic diagram of steps of an artificial intelligence based text processing method in accordance with an embodiment of the present invention;
FIG. 2 is an overview schematic diagram of text processing by MyGPTmate in an artificial intelligence based text processing method in accordance with an embodiment of the invention;
FIG. 3 is a diagram of the answer tuning flow for a user service process in an artificial intelligence based text processing method in an embodiment of the present invention;
FIG. 4 is a text preprocessing flow chart of an artificial intelligence based text processing method in an embodiment of the invention;
FIG. 5 is a flow diagram of the compensated search with dual weighted matching in an artificial intelligence based text processing method in an embodiment of the invention;
FIG. 6 is a diagram of the overall MyGPTmate model application for an artificial intelligence based text processing method in accordance with an embodiment of the present invention;
FIG. 7 is a block diagram of an artificial intelligence based text processing device in accordance with an embodiment of the present invention;
fig. 8 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, the text processing method based on artificial intelligence provided by the invention comprises the following steps:
S1, acquiring first text information input by a user through a preset MyGPTmate model;
S2, carrying out vectorization processing on the first text information by utilizing a GPTmate engine on the MyGPTmate model, so that the GPTmate engine performs word vector decomposition on the first text information based on an OpenAI word embedding sub-model and generates a plurality of word vectors matched with the first text information;
S3, carrying out cosine similarity calculation on the plurality of word vectors to determine, from among them, word vectors with similarity values meeting a preset threshold and generate second text information;
and S4, executing a corresponding user service process based on the second text information, wherein the user service process comprises question answering, data analysis, text-to-image drawing, file export, and retrieval.
In a specific implementation,
the MyGPTmate model employs an OPEN-AI based word embedding (empeddings) sub-model to process word embedding for the first text information, where word embedding is a method of representing text in which each word or phrase is mapped to a vector in a high-dimensional space. These vectors capture the semantic and grammatical relations between words. In this high-dimensional space, semantically similar words are mapped to mutually close locations. Word embedding may be used as input for other natural language processing tasks such as text classification, named entity recognition, emotion analysis, etc. In these tasks, the input text is first converted into word embeddings and then input into the model for training. Whereas word-embedded training word embeddings are typically learned from large amounts of text data in an unsupervised manner. These models attempt to learn the relationships between words and their contexts and then encode these relationships into a high-dimensional vector. Dimension of the embedded vector the dimension of the embedded vector is generally configurable and may be adjusted according to the particular application and computing resources. Higher dimensions may capture more complex word relationships, but may increase computational complexity and model size.
Text embedding of OpenAI measures the relatedness of text strings. Embedding is typically used for:
search (results are ranked by relevance to the query string),
clustering (text strings are grouped by similarity),
recommendation (items with related text strings are recommended),
anomaly detection (outliers with little relatedness are identified),
diversity measurement (similarity distributions are analyzed),
and classification (text strings are classified by their most similar label).
Word embedding is often used in natural language processing to represent words or phrases in text, which are typically mapped into a high-dimensional vector space. In this space, words with similar senses tend to be mapped to nearby locations. The similarity of two word vectors is then measured using the cosine similarity (Cosine Similarity) algorithm.
Cosine similarity is a measure of the directional similarity of two vectors, and is calculated by dividing the dot product of two vectors by the product of the Euclidean lengths of the two vectors. The specific formula is as follows:
cosine_similarity(A, B) = dot_product(A, B) / (norm(A) * norm(B))
In this formula, A and B are two vectors, dot_product(A, B) is the dot product of A and B, and norm(A) and norm(B) are the Euclidean lengths of A and B, respectively. The resulting value lies between -1 and 1: the closer the value is to 1, the more similar the directions of the two vectors; the closer the value is to -1, the more opposite their directions; and a value close to 0 indicates that the two vectors are nearly orthogonal, that is, there is little similarity between them.
In word embedding, cosine similarity is widely used to calculate the similarity of two word vectors. For example, one can use cosine similarity to find the word most similar to a given word, or to find the most similar pair of words in a set of word vectors.
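As an illustration of the formula and the most-similar-word lookup described above, here is a minimal self-contained sketch. The vocabulary and its 3-dimensional vectors are invented for illustration only; real word embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vocab):
    """Return the vocabulary word whose vector is closest in direction to `query`."""
    return max(vocab, key=lambda w: cosine_similarity(query, vocab[w]))

# Toy 3-dimensional "embeddings" (illustrative values, not real model output).
vocab = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vocab["king"], vocab["queen"]))  # close to 1: similar direction
print(most_similar([0.88, 0.8, 0.12], vocab))
```

Identical directions score 1, opposite directions score -1, and orthogonal vectors score 0, matching the formula's description above.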
In one embodiment, before the step of obtaining the first text information input by the user through the preset MyGPTmate model, the method includes:
identifying a locally preset GPT model, wherein the GPT model is generated by a local knowledge base;
performing autoregressive training on the GPT model;
performing sequence data deep learning on the GPT model subjected to autoregressive training by adopting a transducer architecture;
and packaging the GPT model subjected to the deep learning of the sequence data through an EMBeddings model and a natural language processing technology to obtain a GPTmate engine and constructing the GPTmate engine on the MyGPTmate model.
Referring to fig. 2, My GPTmate is based on a GPT model. The GPT model is trained in an autoregressive manner, so that it can perform word-by-word prediction on an input text sequence and thereby achieve language generation and understanding. The GPT model adopts the Transformer architecture, a deep learning model mainly used for processing sequence data. A pre-training mechanism is adopted during production: the pre-training process lets the model learn the basic grammar of the language, common-sense information, emotional coloring, and other knowledge, after which a small amount of labeled data can be used to fine-tune the model for a specific task (such as machine translation). On top of the GPT model, My GPTmate adopts natural language processing (NLP) technology and vectorization technology based on an Embeddings model to encapsulate the gptMate Q & V engine, providing the user with a question-answering system, text-to-image generation, and a self-built chat knowledge base based on a personal knowledge base, while also enriching the model's generated language by crawling network information into the chat system. The gptMate Q & V engine is the innovative core capability of the platform: on the basis of the GPT model, it redefines the capabilities of the question-answering system and of vectorized retrieval by introducing vectorization technology, compensating for the shortcomings of the GPT large language model in long-text question answering.
In one embodiment, when the user service process is a question and answer, the step of executing the corresponding user service process based on the second text information includes:
converting the question in the second text information into a question vector through the GPTmate engine;
tuning the second text information by using a local knowledge base through a GPTmate engine;
linking the tuned second text information to the corresponding Internet corpus through a web crawler, and linking the second text information tuned by the GPTmate engine to the local corpus of the local knowledge base;
and generating answer information corresponding to the second text information through the MyGPTmate model in combination with the Internet corpus and the local corpus.
Referring to fig. 3, My GPTmate encapsulates the word embedding model in an upper layer; through the document text vectorization technology of the GPTmate Q & V engine, knowledge of similar content is retrieved from the local knowledge base during AI chat question answering and submitted to the large language model in combination with word embedding technology, thereby achieving the long-text processing capability that the GPT model alone cannot provide.
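The retrieve-then-submit flow described above can be sketched as follows. Bag-of-words count vectors stand in for the engine's document vectorization, and the knowledge-base chunks, `top_k` value, and prompt template are illustrative assumptions, not the patent's actual implementation:

```python
from collections import Counter
import math

def bow_vector(text, vocabulary):
    """Toy stand-in for the word-embedding sub-model: a bag-of-words count vector."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(question, knowledge_base, top_k=2):
    """Return the top_k knowledge-base chunks most similar to the question."""
    vocabulary = sorted({w for doc in knowledge_base + [question]
                         for w in doc.lower().split()})
    q_vec = bow_vector(question, vocabulary)
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine(q_vec, bow_vector(doc, vocabulary)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, knowledge_base):
    """Splice the retrieved chunks into the prompt submitted to the language model."""
    context = "\n".join(retrieve(question, knowledge_base))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

kb = [
    "GPTmate vectorizes documents for retrieval",
    "cosine similarity compares word vectors",
    "the weather is sunny today",
]
print(build_prompt("how does GPTmate use cosine similarity on vectors", kb))
```

Only the chunks nearest the question reach the prompt, which is how a long local knowledge base is squeezed into the limited context of the large language model.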
In one embodiment, before the step of vectorizing the first text information with a GPTmate engine on the MyGPTmate model, the method further comprises:
performing word segmentation on the first text information by using an open-source word segmentation tool, wherein the open-source word segmentation tool includes, but is not limited to, jieba or HanLP;
performing part-of-speech tagging on the first text information after word segmentation;
removing stop words from the first text information after part-of-speech tagging;
performing interference word removal on the first text information from which the stop word is removed;
and carrying out label substitution on the first text information after the interference words are removed, so as to obtain first text information that the MyGPTmate model can readily understand.
Referring to fig. 4, which shows the NLP-based text preprocessing technology: My GPTmate needs to preprocess the text/document before word embedding, a process generally called text denoising, so that prediction-related information better matches the word vector model and machine hallucination is reduced. My GPTmate uses various NLP technologies, such as word segmentation, named entity recognition, and semantic role labeling, to process and analyze the text, which helps it better understand and generate natural language.
Specifically, the text information is acquired, sentences are divided into individual words by using open-source tools such as jieba and HanLP, and word segmentation processing is performed. Word segmentation is a key step in Chinese text preprocessing: because Chinese text does not have word boundaries (spaces) as obvious as English, special word segmentation tools such as jieba and HanLP are needed to divide sentences into individual words. Part-of-speech tagging with HanLP is an optional step that tags each word with its grammatical role in the sentence (noun, verb, adjective, etc.), which can be useful in tasks such as entity recognition or relation extraction. Stop-word removal is similar to English, but Chinese also has words that occur at high frequency yet carry little information, and these are normally removed in GptMate. Interference-word removal handles words that would interfere with text generation during training. Common punctuation marks are removed, and punctuation that the GPT model understands better is added; during implementation, GPTmate adds its own text preprocessing logic by means of domain-specific pre-training data, fine-tuning, and the like, so as to meet the natural language processing tasks of different scenarios and requirements.
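A minimal sketch of the denoising pipeline described above (segmentation, stop-word removal, interference-word removal, label substitution). A crude regex split stands in for jieba/HanLP segmentation, and the stop-word, interference-word, and label tables are illustrative assumptions:

```python
import re

STOP_WORDS = {"the", "a", "of", "is"}   # illustrative; real stop lists are much larger
INTERFERENCE = {"uh", "um"}             # illustrative interference words
LABEL_MAP = {"c.n.": "CN"}              # illustrative label substitution table

def preprocess(text):
    """Denoise text before word embedding: segment, drop stop and interference
    words, strip punctuation, and substitute labels (the regex split is a crude
    stand-in for a jieba/HanLP tokenizer)."""
    tokens = re.findall(r"[\w.]+", text.lower())          # crude segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    tokens = [t for t in tokens if t not in INTERFERENCE] # interference removal
    tokens = [LABEL_MAP.get(t, t) for t in tokens]        # label substitution
    return tokens

print(preprocess("The patent, uh, is a text of c.n. origin"))
# → ['patent', 'text', 'CN', 'origin']
```

Each stage mirrors one step of the claimed preprocessing order; in a real Chinese-text deployment the first line would call a proper segmenter instead of a regex.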
In one embodiment, when the user service process is question-answer/search/data analysis, the step of executing the corresponding user service process based on the second text information includes:
and carrying out dual weighted matching processing on the second text information by adopting Elasticsearch word segmentation technology, wherein dual weighted matching searches with both vector similarity and the GPTmate engine to obtain a dual weighted search score, and takes the top N results in descending order of score to optimize the second text information.
Referring to fig. 5, a cosine word-embedding matching algorithm was mentioned above for a first pass of text matching, but vector (word-embedding) matching only captures similarity. For scenarios demanding higher precision, such as the legal and education industries, vector matching must be combined with search-engine technology. The innovation of dual-weighted matching of text database content is that, while matching vectors, it also draws on the word segmentation and analyzer capabilities of Elasticsearch, yielding a more accurate text retrieval and matching algorithm. Combined with the flow above, conventional vector search matching alone is not accurate enough, and the resulting search score is not ideal. We therefore combine the vector match with a compensating search to obtain a composite score, and then take the top 4 by composite score; this is the key to dual-weighted matching. Consider a user question such as "What does Article 180 of the criminal law say?" In criminal law, Article 180 may cover many topics, such as the crime and its conviction; if vector matching alone is used (vector cosine similarity emphasizes the viewpoint under discussion), the matching effect for short, precise keywords is not ideal. Therefore, vector matching plus a search engine is adopted to obtain a dual-weighted search score, and the top N results are taken from the highest score downward to optimize the effect.
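The dual-weighted scoring described above can be sketched as a weighted sum of two scores. This is an assumption-laden illustration: the keyword score here is a toy term-overlap ratio standing in for an Elasticsearch relevance score, the 50/50 weights and the function names are hypothetical, and the document structure `(doc_id, doc_vec, doc_terms)` is invented for the sketch.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dual_weighted_top_n(query_vec, query_terms, docs, n=4, w_vec=0.5, w_kw=0.5):
    """Combine a vector-similarity score with a keyword (search-engine style)
    score and return the top-N document ids by the weighted composite score.

    `docs` is a list of (doc_id, doc_vec, doc_terms) tuples; in the patent's
    setting the keyword score would come from Elasticsearch's analyzer.
    """
    scored = []
    for doc_id, doc_vec, doc_terms in docs:
        vec_score = cosine(query_vec, doc_vec)
        kw_score = len(set(query_terms) & set(doc_terms)) / max(len(query_terms), 1)
        scored.append((w_vec * vec_score + w_kw * kw_score, doc_id))
    scored.sort(reverse=True)          # highest composite score first
    return [doc_id for _, doc_id in scored[:n]]
```

Taking the top 4 by composite score, as the text describes, corresponds to calling the function with `n=4`.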
In one embodiment, referring to fig. 6, the MyGPTmate model includes:
a user management module for providing conventional user management authentication capability;
the corpus management module is used for linking the internet corpus and the local corpus;
the question-answering module is used for supporting chat question-answering of long text context memory;
the draft image module is used for generating images in a chat mode;
a GPTmate engine module (GPTmate Q & V engine) for vectorizing and storing the first text information.
Referring to fig. 7, which is a block diagram of an artificial-intelligence-based text processing device according to the present invention, the device includes:
the acquiring unit 1 is used for acquiring first text information input by a user through a preset MyGPTmate model;
the engine unit 2 is used for carrying out vectorization processing on the first text information by utilizing a GPTmate engine on the MyGPTmate model so as to carry out word vector decomposition on the first text information by utilizing the GPTmate engine based on a word embedding sub-model of OPEN-AI and generate a plurality of word vectors matched with the first text information;
the calculating unit 3 is used for performing cosine similarity calculation on a plurality of word vectors so as to determine the word vector with the similarity value meeting a preset threshold value from the plurality of word vectors and generate second text information;
and a service unit 4, configured to execute a corresponding user service process based on the second text information, where the user service process includes question answering, data analysis, document mapping, file derivation, and retrieval.
In this embodiment, for specific implementation of each unit in the above embodiment of the apparatus, please refer to the description in the above embodiment of the method, and no further description is given here.
Referring to fig. 8, a computer device is further provided in an embodiment of the present invention, where the computer device may be a server and its internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, a display screen, an input device, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The database of the computer device is used to store the corresponding data in this embodiment. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the above method:
S1, acquiring first text information input by a user through a preset MyGPTmate model;
s2, carrying out vectorization processing on the first text information by utilizing a GPTmate engine on the MyGPTmate model so as to carry out word vector decomposition on the first text information by utilizing the GPTmate engine based on a word embedding sub-model of OPEN-AI and generate a plurality of word vectors matched with the first text information;
s3, carrying out cosine similarity calculation on a plurality of word vectors to determine word vectors with similarity values meeting a preset threshold value from the plurality of word vectors and generate second text information;
and S4, executing a corresponding user service process based on the second text information, wherein the user service process includes question answering, data analysis, document mapping, file derivation, and retrieval.
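Step S3 above can be sketched in a few lines. This is a minimal illustration assuming one interpretation of the step, namely that word vectors are kept when their cosine similarity to a query vector meets the preset threshold; the function names and the "keep if ≥ threshold" selection rule are assumptions, not taken from the patent.

```python
import math

def cosine_similarity(A, B):
    """cosine_similarity(A, B) = dot_product(A, B) / (norm(A) * norm(B)),
    matching the formula used in step S3; the result lies in [-1, 1]."""
    dot_product = sum(a * b for a, b in zip(A, B))
    norm_a = math.sqrt(sum(a * a for a in A))
    norm_b = math.sqrt(sum(b * b for b in B))
    return dot_product / (norm_a * norm_b)

def filter_by_threshold(query_vec, word_vectors, threshold):
    """Step S3 sketch: keep the word vectors whose similarity to the query
    vector meets the preset threshold (an assumed selection rule)."""
    return [v for v in word_vectors if cosine_similarity(query_vec, v) >= threshold]
```

For example, with a threshold of 0.9, `filter_by_threshold([1, 0], [[1, 0], [0, 1]], 0.9)` keeps only the vector pointing in the same direction as the query.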
It will be appreciated by those skilled in the art that the architecture shown in fig. 8 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.
An embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above method. It is understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a nonvolatile readable storage medium.
In summary, first text information input by the user is obtained through a preset MyGPTmate model; vectorization processing is performed on the first text information by using a GPTmate engine on the MyGPTmate model, so as to perform word vector decomposition on the first text information with the GPTmate engine based on an OPEN-AI word embedding sub-model and generate a plurality of word vectors matched with the first text information; cosine similarity calculation is carried out on the plurality of word vectors so as to determine, from among them, the word vectors whose similarity values meet a preset threshold and generate second text information; and a corresponding user service process is executed based on the second text information, where the user service process includes question answering, data analysis, document mapping, file derivation, and retrieval. In this way, users can train on real-time data at extremely low cost and obtain their own customized, specialized models, thereby providing more personalized services for their daily work, study, and so on.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium provided by the present invention and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and drawings of the present invention or direct or indirect application in other related technical fields are included in the scope of the present invention.
Claims (10)
1. A text processing method based on artificial intelligence, comprising the steps of:
acquiring first text information input by a user through a preset MyGPTmate model;
performing vectorization processing on the first text information by using a GPTmate engine on the MyGPTmate model, so as to perform word vector decomposition on the first text information by using the GPTmate engine based on an OPEN-AI word embedding sub-model, and generating a plurality of word vectors matched with the first text information;
cosine similarity calculation is carried out on a plurality of word vectors so as to determine word vectors with similarity values meeting a preset threshold value from the plurality of word vectors and generate second text information;
and executing a corresponding user service process based on the second text information, wherein the user service process includes question answering, data analysis, document mapping, file derivation, and retrieval.
2. The artificial intelligence based text processing method according to claim 1, wherein before the step of acquiring the first text information input by the user through the preset MyGPTmate model, the method comprises:
identifying a locally preset GPT model, wherein the GPT model is generated by a local knowledge base;
performing autoregressive training on the GPT model;
performing sequence data deep learning on the GPT model subjected to autoregressive training by adopting a Transformer architecture;
and packaging the GPT model subjected to the sequence data deep learning through an Embeddings model and natural language processing technology to obtain a GPTmate engine, and constructing the GPTmate engine on the MyGPTmate model.
3. The artificial intelligence based text processing method of claim 1, wherein the algorithm for cosine similarity calculation comprises:
cosine_similarity(A, B) = dot_product(A, B) / (norm(A) * norm(B))
wherein A and B are two word vectors, dot_product(A, B) is the dot product of A and B, and norm(A) and norm(B) are the Euclidean lengths of A and B, respectively; the result lies between -1 and 1: the closer the value is to 1, the closer the directions of the two word vectors; the closer the value is to -1, the more opposite their directions; and a value close to 0 indicates that the two word vectors are orthogonal, i.e. not similar.
4. The artificial intelligence based text processing method according to claim 1, wherein the step of executing the corresponding user service process based on the second text information when the user service process is a question-answer, comprises:
identifying the GPTmate question vector of the second text information;
tuning the second text information by using a local knowledge base through a GPTmate engine;
linking the second text information tuned by the website crawler to the corresponding Internet corpus, and linking the second text information tuned by the GPTmate engine to the local corpus of the local knowledge base;
and generating answer information corresponding to the second text information by the MyGPTmate model in combination with the Internet corpus and the local corpus.
5. The artificial intelligence based text processing method of claim 1, wherein prior to the step of vectorizing the first text information using a GPTmate engine on the MyGPTmate model, comprising:
performing word segmentation on the first text information by using an open-source word segmentation tool, wherein the open-source word segmentation tool includes, but is not limited to, jieba or HanLP;
performing part-of-speech tagging on the first text information after word segmentation;
removing stop words from the first text information after part-of-speech tagging;
performing interference word removal on the first text information from which the stop word is removed;
and carrying out label substitution on the first text information after the interference words are removed so as to obtain the first text information which is convenient for understanding of the MyGPTmate model.
6. The artificial intelligence based text processing method of claim 1, wherein when the user service process is question-answer/search/data analysis, the step of executing the corresponding user service process based on the second text information comprises:
and performing dual-weighted matching processing on the second text information by adopting the Elasticsearch word segmentation technique, wherein dual-weighted matching searches with both vector similarity and the GPTmate engine to obtain a dual-weighted search score, and the top N results are taken from the highest score downward to optimize the second text information.
7. The artificial intelligence based text processing method of claim 1, wherein the MyGPTmate model comprises:
a user management module for providing conventional user management authentication capability;
the corpus management module is used for linking the internet corpus and the local corpus;
the question-answering module is used for supporting chat question-answering of long text context memory;
the draft image module is used for generating images in a chat mode;
and the GPTmate engine module is used for vectorizing and storing the first text information.
8. An artificial intelligence based text processing apparatus comprising:
the acquisition unit is used for acquiring first text information input by a user through a preset MyGPTmate model;
the engine unit is used for carrying out vectorization processing on the first text information by utilizing a GPTmate engine on the MyGPTmate model so as to carry out word vector decomposition on the first text information by utilizing the GPTmate engine based on a word embedding sub-model of OPEN-AI and generate a plurality of word vectors matched with the first text information;
the computing unit is used for carrying out cosine similarity computation on a plurality of word vectors so as to determine the word vector with the similarity value meeting a preset threshold value from the plurality of word vectors and generate second text information;
and the service unit is used for executing a corresponding user service process based on the second text information, wherein the user service process includes question answering, data analysis, document mapping, file derivation, and retrieval.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the artificial intelligence based text processing method of any one of claims 1 to 7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the artificial intelligence based text processing method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310846702.0A CN116821285A (en) | 2023-07-11 | 2023-07-11 | Text processing method, device, equipment and medium based on artificial intelligence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116821285A true CN116821285A (en) | 2023-09-29 |
Family
ID=88116620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310846702.0A Withdrawn CN116821285A (en) | 2023-07-11 | 2023-07-11 | Text processing method, device, equipment and medium based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116821285A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117591631A (en) * | 2023-11-23 | 2024-02-23 | 知学云(北京)科技股份有限公司 | Elastic search text vectorization search system based on AI PaaS platform |
CN117743548A (en) * | 2023-12-21 | 2024-03-22 | 北京新数科技有限公司 | Large-model-based local knowledge base intelligent question-answering method, system, equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20230929 |