CN116933757A - Document generation method and system applying language artificial intelligence - Google Patents

Document generation method and system applying language artificial intelligence

Info

Publication number
CN116933757A
Authority
CN
China
Prior art keywords
semantic
text
adjacent
triplet
lemmas
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311187668.7A
Other languages
Chinese (zh)
Other versions
CN116933757B (en)
Inventor
蓝建敏
池沐霖
李观春
徐泳坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd
Priority to CN202311187668.7A
Publication of CN116933757A
Application granted
Publication of CN116933757B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a document generation method and system applying language artificial intelligence. An information extraction model performs relation extraction on a text document to obtain a plurality of triplets that form a triplet set; the text document is used to fine-tune a pre-training language model to obtain a generation model; the generation model complements a template document to obtain a complement text; a semantic condensation reaction is carried out on the triplet set according to the complement text to obtain a text reaction coefficient; and the complement text is condensed according to the text reaction coefficient, so that the safety and quality of text generation are better ensured.

Description

Document generation method and system applying language artificial intelligence
Technical Field
The application belongs to the field of processing optimization, and particularly relates to a document generation method and system applying language artificial intelligence.
Background
Applying language artificial intelligence to generate documents means using natural language processing, machine learning, deep learning, and related technologies so that a computer system automatically generates documents that meet grammatical, logical, and semantic requirements. The technology has broad application prospects in fields such as law, public service, medical care, and finance. Natural language processing (NLP) technology can be used for tasks such as lexical analysis, syntactic analysis, and semantic understanding, and models can be trained on large amounts of text data; nevertheless, the prior art still faces challenges in understanding complex knowledge and context. When generating a long document, the model may produce logic errors, inconsistencies, or content that lacks context. For documents in a specific field, obtaining a large amount of high-quality training data is itself a challenge, and a lack of domain-specific data may cause the generated results to deviate from expectations. Generated documents may involve plagiarism, or may contain inappropriate, illegal, or biased content, so appropriate supervision mechanisms and algorithms are needed to ensure the reliability and compliance of the documents. In addition, the data set used by the model may carry sample bias and tendency, which can make the generated documents biased, discriminatory, or unfair; special care is needed to avoid such problems when generating documents whose topics bear on information monitoring. The patent document with publication number CN113868391A provides a legal document generation method based on a knowledge graph; although it can determine the target judgment result corresponding to a case to be processed from the case knowledge graph, it has difficulty controlling the generation of inappropriate or biased content.
The patent document with publication number CN113420143A provides a document abstract generation method; although it can perform contextual semantic analysis on a target text based on document entity elements to obtain context semantic vectors of those elements, it has difficulty capturing preset multi-hop knowledge relations and avoiding sample bias and tendency.
Disclosure of Invention
The application aims to provide a document generation method and a document generation system applying language artificial intelligence, which are used for solving one or more technical problems in the prior art and at least providing a beneficial selection or creation condition.
The application provides a document generation method and system applying language artificial intelligence. An information extraction model performs relation extraction on a text document to obtain a plurality of triplets that form a triplet set; the text document is used to fine-tune a pre-training language model to obtain a generation model; the generation model complements a template document to obtain a complement text; a semantic condensation reaction is carried out on the triplet set according to the complement text to obtain a text reaction coefficient; and the complement text is condensed according to the text reaction coefficient, so that the safety and quality of text generation are better ensured.
To achieve the above object, according to an aspect of the present application, there is provided a document generation method applying language artificial intelligence, the method comprising the steps of:
inputting a text document;
using an information extraction model to extract the relation of the text document to obtain a plurality of triples to form a triplet set;
inputting a template document;
using the generation model to complement the template document to obtain a complement text;
carrying out semantic condensation reaction on the triplet set according to the complement text to obtain a text reaction coefficient;
condensing the complement text according to the text reaction coefficient.
Further, the text document entered is string data representing one or more articles.
Further, the information extraction model is an information extraction model based on a pre-training language model, and the generation model is a generation model obtained by performing fine-tuning training on the pre-training language model according to the text document;
in some embodiments, to save training costs, the information extraction model may be implemented by zero-shot information extraction through chat with ChatGPT; in other embodiments, to ensure data security and independence, a Chinese information extraction framework built on BERT (e.g., Bert-NER) may be used.
Further, each triplet in the triplet set is a three-element array of character strings, every character string in a triplet comes from the input text document, and the triplets in the triplet set are mutually distinct. A triplet has the form (Subject, Predicate, Object), wherein the Subject at the head is the head entity, the Object at the end is the tail entity, and the Predicate in the middle is the entity relation between the two entities; Subject, Predicate, and Object are all character strings.
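The triplet structure described above can be illustrated with a minimal Python sketch. The names `Triple` and `build_triplet_set` are hypothetical and do not come from the application, and mutual distinctness is read here as "no duplicate triplets", which is an assumption. The candidate list stands in for the output of an information extraction model.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (Subject, Predicate, Object) triplet; every field is a string."""
    subject: str    # head entity
    predicate: str  # entity relation
    object: str     # tail entity


def build_triplet_set(candidates, source_text):
    """Keep triplets whose strings all occur in the source text, without duplicates."""
    seen = set()
    result = []
    for cand in candidates:
        triple = Triple(*cand)
        if triple in seen:
            continue  # assumed reading of "mutually distinct": drop repeats
        if all(part in source_text for part in triple):
            seen.add(triple)
            result.append(triple)
    return result


text = "The committee approved the budget. The mayor chairs the committee."
raw = [
    ("committee", "approved", "budget"),
    ("mayor", "chairs", "committee"),
    ("committee", "approved", "budget"),  # duplicate, dropped
    ("mayor", "vetoed", "budget"),        # "vetoed" is not in the text, dropped
]
triplet_set = build_triplet_set(raw, text)
```

A list is used instead of a `set` so that sequence numbers of triplets stay stable, which the later steps rely on.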
Further, the template document is a text containing a plurality of gap filling positions, and the complement text is composed of a plurality of different lemmas (a lemma denotes a token, which is of character string type), each lemma corresponding to one gap filling position. The gap filling positions are not directly connected to one another and are separated by interval characters. Two gap filling positions that have only interval characters and no other gap filling position between them are called adjacent gap filling positions, and the lemmas corresponding to adjacent gap filling positions are called adjacent lemmas.
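As a rough illustration of gap filling positions, interval characters, and adjacent lemmas, the following Python sketch assumes that each gap is written as the placeholder `[MASK]` in the template; the placeholder, the function names, and the (left lemma, interval characters, right lemma) ordering are assumptions made for this example only.

```python
GAP = "[MASK]"  # assumed placeholder for one gap filling position


def adjacent_pairs(template, lemmas):
    """Return (left lemma, interval characters, right lemma) for each pair of adjacent gaps."""
    segments = template.split(GAP)
    if len(segments) - 1 != len(lemmas):
        raise ValueError("need exactly one lemma per gap filling position")
    # segments[1:-1] are the interval characters between consecutive gaps
    intervals = segments[1:-1]
    return [(lemmas[i], intervals[i], lemmas[i + 1]) for i in range(len(intervals))]


def fill(template, lemmas):
    """Substitute the lemmas into the gaps, in order, to form the complement text."""
    segments = template.split(GAP)
    out = [segments[0]]
    for lemma, segment in zip(lemmas, segments[1:]):
        out.append(lemma)
        out.append(segment)
    return "".join(out)


template = "The [MASK] approved the [MASK] on [MASK]."
lemmas = ["committee", "budget", "Monday"]
pairs = adjacent_pairs(template, lemmas)
completed = fill(template, lemmas)
```

With three gaps there are two pairs of adjacent gap filling positions, so `pairs` holds two entries.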
Further, the method for using the generation model to complement the template document to obtain the complement text is: using the masking mechanism of the pre-training language model, the generation model complements the template document to obtain the complement text.
Further, according to the complement text, carrying out semantic condensation reaction on the triplet set, and obtaining a text reaction coefficient by the following method:
creating a semantic embedding function, wherein the semantic embedding function converts a character string input into the semantic embedding function into a semantic vector with a fixed dimension size for output;
the number of dimensions of the semantic vectors is k, the sequence number of each dimension in a semantic vector is v, v ∈ [1, k], and the semantic similarity between semantic vectors is a value between 0 and 1;
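A minimal sketch of a semantic embedding function and a similarity in the range 0 to 1 is given below. It is a toy stand-in, not the embedding used by the application: the hashed character-bigram vector, the choice k = 16, and the use of cosine similarity are all assumptions made so that the example runs with the standard library only.

```python
import math
import zlib

K = 16  # number of dimensions k (illustrative value)


def embed(text, k=K):
    """Toy stand-in: hash character bigrams into a fixed-size count vector."""
    vec = [0.0] * k
    padded = f" {text} "
    for a, b in zip(padded, padded[1:]):
        vec[zlib.crc32((a + b).encode("utf-8")) % k] += 1.0
    return vec


def similarity(u, v):
    """Cosine similarity; for non-negative vectors it lies in [0, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 0.0
    return min(1.0, dot / norm)  # guard against rounding slightly above 1


same = similarity(embed("budget"), embed("budget"))
other = similarity(embed("budget"), embed("mayor"))
```

Because the toy vectors hold counts, no component is negative and the cosine never drops below 0; an embedding with signed components would need a different mapping into the 0 to 1 range.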
for two adjacent lemmas, acquiring the interval characters between them; the three-element array formed by the two adjacent lemmas and the interval characters between them is called an adjacent tuple;
taking a set formed by all adjacent tuples as an adjacent tuple set;
in each adjacent tuple, converting the two lemmas and the interval characters into semantic vectors through the semantic embedding function, calculating the semantic similarity between the semantic vector of each of the two lemmas and the semantic vector of the interval characters, multiplying the two semantic similarities and taking the square root, taking the value of the square root as the deviation weight of the adjacent tuple, and multiplying the value of each dimension of the semantic vector of the interval characters by the deviation weight to obtain a relation correction vector,
recording the semantic similarities between the semantic vectors of the two lemmas and the semantic vector of the interval characters as y1 and y2, the semantic vector of the interval characters as Gvec, the value of the dimension with sequence number v in Gvec as Gvec[v], and the relation correction vector as Malec, then:
$$\mathrm{Malec}[v] = \mathrm{Gvec}[v] \times \sqrt{y_1 \times y_2}, \quad v \in [1, k],$$
in Malec, the calculation Gvec[v] × √(y1 × y2) of each dimension can be carried out in parallel, unlike high-complexity calculations that require matrix decomposition of the semantic vectors; this helps accelerate the calculation with distributed computing equipment, relieves the problem of the long running time of large-scale pre-training models, and allows documents to be generated on a large scale;
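The deviation weight and relation correction vector can be sketched in a few lines of Python, reading the description as Malec[v] = Gvec[v] × sqrt(y1 × y2). The function name and the sample numbers are illustrative assumptions; y1 and y2 are passed in directly rather than computed from a particular embedding.

```python
import math


def relation_correction_vector(gvec, y1, y2):
    """Scale every dimension of Gvec by the deviation weight sqrt(y1 * y2)."""
    if not (0.0 <= y1 <= 1.0 and 0.0 <= y2 <= 1.0):
        raise ValueError("similarities are expected in [0, 1]")
    weight = math.sqrt(y1 * y2)  # deviation weight of the adjacent tuple
    # Each output component depends only on its own input component,
    # so the map could be split across workers without coordination.
    return [g * weight for g in gvec]


gvec = [0.5, 2.0, 0.0, 4.0]
malec = relation_correction_vector(gvec, y1=0.81, y2=0.25)
```

Here the deviation weight is about 0.45, so each component of `malec` is roughly 0.45 times the matching component of `gvec`.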
in the triplet set, converting the head entity, entity relation, and tail entity of each triplet into semantic vectors through the semantic embedding function, recording the semantic vector of the head entity in the triplet as Subvec, the semantic vector of the tail entity in the triplet as Objvec, and the semantic vector of the entity relation in the triplet as Relvec,
calculating the semantic similarity of Subvec and Relvec as SmR, and calculating the semantic similarity of Objvec and Relvec as OmR,
calculating a semantic transition value of the triplet, wherein the semantic transition value has a plurality of scores, the number of scores is consistent with the number of dimensions of a semantic vector, and the sequence numbers of the scores are consistent with the sequence numbers of the dimensions of the semantic vector; the semantic transition value is recorded as Benec, the score with sequence number v in Benec is Benec[v], and the calculation formula of Benec[v] is:
it should be noted that the semantic transition value Benec should not be regarded as a vector: the order of its scores is not ordered and fixed in the way the dimensions of a semantic vector are. In the embodiment provided by the application, one state of the semantic transition value is chosen for ease of calculation, namely the number of scores equals the number of dimensions of the semantic vector and the scores are indexed by the sequence numbers of the dimensions of the semantic vector. In other cases the number of scores may differ; preferably it is greater than or equal to the number of dimensions of the semantic vector, and the scores may also be unordered, so that the posterior probabilities of the entity nodes of the knowledge graph across a plurality of hop paths can be fully represented. The posterior probability of the transfer connection between scores is extracted by dividing the combination of the dimension components of the head entity and the tail entity by their respective semantic similarities with the entity relation;
wherein Subvec[v] represents the value of the dimension with sequence number v in Subvec, and Relvec[v] represents the value of the dimension with sequence number v in Relvec.
In the triplet set, each triplet has a corresponding text reaction coefficient for each adjacent tuple, and each adjacent tuple has a corresponding text reaction coefficient for each triplet;
for each adjacent tuple, the text reaction coefficient of the adjacent tuple for each triplet in the triplet set is calculated as follows:
the number of triplets in the triplet set is recorded as n, the sequence number of a triplet in the triplet set is recorded as i, and the triplet with sequence number i in the triplet set is recorded as Triple(i),
the number of adjacent tuples in the adjacent tuple set is recorded as m, the sequence number of an adjacent tuple in the adjacent tuple set is recorded as j, and the adjacent tuple with sequence number j in the adjacent tuple set is recorded as Tokens(j),
for Tokens(j), the corresponding relation correction vector is calculated as Malec(j), the value of the dimension with sequence number v in Malec(j) being Malec(j)[v]; for each Triple(i), the corresponding semantic transition value is calculated as Benec(i), the score with sequence number v in Benec(i) being Benec(i)[v],
here, the sequence number of an element in a set is denoted with parentheses (), and a dimension, component, or score is denoted with square brackets [];
to distinguish the traversal of v in Benec(i)[v] from that in Malec(j)[v], the symbol v is replaced with v1 in Malec(j) to give Malec(j)[v1], and with v2 in Benec(i) to give Benec(i)[v2]; only the symbol is replaced, and v1 and v2 still range over the original interval [1, k], so that the sequence numbers of the dimensions in Malec(j) and of the scores in Benec(i) are enumerated independently of each other, thereby realizing a double nested loop, and the text reaction coefficient Condes(j, i) of Tokens(j) for Triple(i) is calculated as:
simplifying the denominator of the formula gives:
In the prior art, calculations between vectors or tensors are generally carried out dimension by dimension, which is a single-hop calculation on corresponding dimensions. However, triplets in a knowledge base have multi-hop relations, and the gap filling positions in the corresponding template document also have multi-hop relations, so the corresponding-dimension calculation of the prior art is not suited to multi-hop relations. The double nested loop measures the posterior probability of the connection between gap fillings in the template document together with the mathematical characteristics of the paths by which entities in triplets reach a plurality of entities through entity relations, which helps measure the multi-hop relations of the complement text in the template document.
Further, according to the text reaction coefficient, the method for condensing the complement text comprises the following steps:
condensing the complement text refers to narrowing, over mass data built from various template documents, the alignment between the large-scale adjacent tuples and the triplets in the knowledge base from a large-scale alignment to a small-scale one. In some embodiments, the document generation method applying language artificial intelligence is applied to document generation for a public service institution, whose database contains a large number of meeting record texts, speech texts, and the like, in which the semantics a word expresses in its background text (context) are very likely to differ from its general semantics (general common sense). Automatically generated documents are often used in the public service field, where applying general-semantics methods is very likely to cause adverse effects, so the complement text needs to be condensed first to ensure the safety and quality of text generation;
for each adjacent tuple, collecting the text reaction coefficients of the adjacent tuple for all triplets, and selecting the triplet whose text reaction coefficient is the minimum among them as the closest triplet;
calculating the semantic similarity between each of the two adjacent lemmas in the adjacent tuple and the interval characters between them, and calculating the semantic similarity between each of the head entity and the tail entity in the closest triplet and the entity relation in the closest triplet; if the semantic similarity between a lemma and the interval characters is lower than the semantic similarities between the head entity and the entity relation and between the tail entity and the entity relation in the closest triplet, the lemma is judged to be a lemma to be condensed. Specifically: first, the value R1 of the semantic similarity between the head entity and the entity relation in the closest triplet and the value R2 of the semantic similarity between the tail entity and the entity relation in the closest triplet are calculated; the adjacent tuple likewise contains two lemmas and the interval characters between them, so the values S1 and S2 of the semantic similarity between each of the two lemmas and the interval characters are calculated; the values are then compared to judge whether S1 is smaller than both R1 and R2 and S2 is smaller than both R1 and R2, and if so, the two lemmas are taken as lemmas to be condensed;
the closest triplet is taken as the condensation triplet (this step effectively screens out gap filling positions that carry risk, avoids sample bias and tendency, and better ensures the safety and quality of text generation);
storing and outputting the lemmas to be condensed and the condensation triplet;
marking the lemmas to be condensed, displaying the condensation triplet, and offering the condensation triplet as an alternative replacement.
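The selection of the closest triplet and the comparison of S1, S2 against R1, R2 can be sketched as follows. The text reaction coefficients are supplied as made-up numbers, since this sketch does not compute Condes(j, i) itself; the function names are hypothetical, and ties are broken by taking the first minimum, which is an assumption.

```python
def closest_triplet_index(coefficients):
    """Index i of the triplet whose text reaction coefficient is smallest."""
    if not coefficients:
        raise ValueError("triplet set is empty")
    return min(range(len(coefficients)), key=coefficients.__getitem__)


def lemmas_to_condense(s1, s2, r1, r2):
    """True when both lemma similarities fall below both triplet similarities."""
    return s1 < r1 and s1 < r2 and s2 < r1 and s2 < r2


# Coefficients of one adjacent tuple against three triplets (made-up values).
condes_j = [0.42, 0.17, 0.63]
i_star = closest_triplet_index(condes_j)

flag_risky = lemmas_to_condense(s1=0.30, s2=0.25, r1=0.55, r2=0.60)
flag_safe = lemmas_to_condense(s1=0.70, s2=0.25, r1=0.55, r2=0.60)
```

In the first call both lemmas sit below both thresholds and would be marked for condensation; in the second, S1 exceeds R1, so the pair is left alone.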
The application also provides a document generation system applying language artificial intelligence, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the document generation method applying language artificial intelligence when executing the computer program. The system can run on computing devices such as a desktop computer, a notebook computer, a palmtop computer, or a cloud data center, and the runnable system can include, but is not limited to, a processor, a memory, and a server cluster. The processor executes the computer program to run in the following units of the system:
the triplet set unit is used for extracting the relation of the text document by using an information extraction model to obtain a plurality of triples to form a triplet set;
the text completion unit is used for complementing the template document by using the generation model to obtain a complement text;
the text reaction coefficient unit is used for carrying out semantic condensation reaction on the triplet set according to the complement text to obtain a text reaction coefficient;
and the text condensation unit is used for condensing the complement text according to the text reaction coefficient.
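One way to picture how the four units hand data to one another is the Python sketch below. The class name, the callable signatures, and the stub units are all assumptions made for illustration; in particular, passing the triplets to the condensation unit is a design choice of this sketch, and the stubs are not real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]
Coefficients = List[List[float]]  # row j = adjacent tuple, column i = triplet


@dataclass
class DocumentGenerationSystem:
    """Wires the four units together; each unit is supplied by the caller."""
    triplet_set_unit: Callable[[str], List[Triplet]]
    text_completion_unit: Callable[[str], str]
    reaction_coefficient_unit: Callable[[str, List[Triplet]], Coefficients]
    text_condensation_unit: Callable[[str, Coefficients, List[Triplet]], Dict]

    def run(self, text_document: str, template_document: str) -> Dict:
        triplets = self.triplet_set_unit(text_document)
        complement = self.text_completion_unit(template_document)
        coeffs = self.reaction_coefficient_unit(complement, triplets)
        return self.text_condensation_unit(complement, coeffs, triplets)


# Stub units, purely to show the data flow.
system = DocumentGenerationSystem(
    triplet_set_unit=lambda text: [("mayor", "chairs", "committee")],
    text_completion_unit=lambda tpl: tpl.replace("[MASK]", "committee"),
    reaction_coefficient_unit=lambda comp, ts: [[0.0] * len(ts)],
    text_condensation_unit=lambda comp, c, ts: {"text": comp, "condensation_triplets": ts},
)
result = system.run("The mayor chairs the committee.", "The [MASK] met today.")
```

Keeping each unit as an injected callable lets an extraction model or generation model be swapped without touching the pipeline order.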
The beneficial effects of the application are as follows: the application provides a document generation method and system applying language artificial intelligence, in which an information extraction model performs relation extraction on a text document to obtain a plurality of triplets that form a triplet set; the text document is used to fine-tune a pre-training language model to obtain a generation model; the generation model complements a template document to obtain a complement text; a semantic condensation reaction is carried out on the triplet set according to the complement text to obtain a text reaction coefficient; and the complement text is condensed according to the text reaction coefficient, so that the safety and quality of text generation are better ensured.
Drawings
The above and other features of the present application will become more apparent from the detailed description of the embodiments thereof given in conjunction with the accompanying drawings, in which like reference characters designate like or similar elements, and it is apparent that the drawings in the following description are merely some examples of the present application, and other drawings may be obtained from these drawings without inventive effort to those of ordinary skill in the art, in which:
FIG. 1 is a flow chart of a method for generating a document using language artificial intelligence;
FIG. 2 is a system architecture diagram of a document generation system employing language artificial intelligence.
Detailed Description
The conception, specific structure, and technical effects produced by the present application will be clearly and completely described below with reference to the embodiments and the drawings to fully understand the objects, aspects, and effects of the present application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
In the description of the present application, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. "First" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the technical features indicated, or their order of precedence.
FIG. 1 is a flowchart of a document generation method applying language artificial intelligence according to the present application; a document generation method and system applying language artificial intelligence according to an embodiment of the present application are described below with reference to FIG. 1.
The application provides a document generation method applying language artificial intelligence, which specifically comprises the following steps:
using an information extraction model to extract the relation of the text document to obtain a plurality of triples to form a triplet set;
using the generation model to complement the template document to obtain a complement text;
carrying out semantic condensation reaction on the triplet set according to the complement text to obtain a text reaction coefficient;
condensing the complement text according to the text reaction coefficient.
Further, the text document entered is string data representing one or more articles.
Further, the information extraction model is an information extraction model based on a pre-training language model, and the generation model is a generation model obtained by performing fine-tuning training on the pre-training language model according to the text document;
in some embodiments, to save training costs, the information extraction model may be implemented by zero-shot information extraction through chat with ChatGPT ([1] Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, "Zero-Shot Information Extraction via Chatting with ChatGPT", arXiv, 2023); in other embodiments, to ensure data security and independence, a Chinese information extraction framework built on BERT (e.g., Bert-NER) may be used.
Further, each triplet in the triplet set is a three-element array of character strings, every character string in a triplet comes from the input text document, and the triplets in the triplet set are mutually distinct. A triplet has the form (Subject, Predicate, Object), wherein the Subject at the head is the head entity, the Object at the end is the tail entity, and the Predicate in the middle is the entity relation between the two entities; Subject, Predicate, and Object are all character strings.
Further, the template document is a text containing a plurality of gap filling positions, and the complement text is composed of a plurality of different lemmas (a lemma denotes a token, which is of character string type), each lemma corresponding to one gap filling position. The gap filling positions are not directly connected to one another and are separated by interval characters. Two gap filling positions that have only interval characters and no other gap filling position between them are called adjacent gap filling positions, and the lemmas corresponding to adjacent gap filling positions are called adjacent lemmas.
Further, the method for using the generation model to complement the template document to obtain the complement text is: using the masking mechanism of the pre-training language model, the generation model complements the template document to obtain the complement text.
Further, according to the complement text, carrying out semantic condensation reaction on the triplet set, and obtaining a text reaction coefficient by the following method:
creating a semantic embedding function, wherein the semantic embedding function converts a character string input into the semantic embedding function into a semantic vector with a fixed dimension size for output;
the number of dimensions of the semantic vectors is k, the sequence number of each dimension in a semantic vector is v, v ∈ [1, k], and the semantic similarity between semantic vectors is a value between 0 and 1;
for two adjacent lemmas, acquiring the interval characters between them; the three-element array formed by the two adjacent lemmas and the interval characters between them is called an adjacent tuple;
taking a set formed by all adjacent tuples as an adjacent tuple set;
in each adjacent word group, converting two words and the words at intervals into semantic vectors respectively through the semantic embedding function, calculating the semantic similarity of the semantic vectors of the two words and the semantic vectors of the words at intervals respectively, multiplying the semantic vectors of the two words and the semantic similarity of the semantic vectors of the words at intervals and taking square roots, taking the numerical value of the square roots as the deviation weight of the adjacent word groups, multiplying the numerical value of each dimension of the semantic vectors of the words at intervals by the deviation weight to obtain a relation correction vector,
recording semantic similarity y1 and y2 between semantic vectors of two words and semantic vectors of the words at intervals, wherein the semantic vectors of the words at intervals are Gvec, the numerical value of the dimension with the sequence number v in Gvec is Gvec [ v ], the relation correction vector is Male,
in the Malec, the numerical calculation of each dimension Gvec v (y1×y2) can be parallel, which is different from the high-complexity calculation of the semantic vector to be subjected to matrix decomposition, so that the method is beneficial to accelerating the calculation process by using the distributed computing equipment, relieves the problem caused by long running time of a large-scale pre-training model, and can be used for generating a document on a large scale;
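The construction of adjacent tuples and of the relation correction vector described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `[MASK]` gap marker, the character-count `toy_embed` function and the clamped cosine `similarity` are hypothetical stand-ins for the gap-filling positions, the semantic embedding function and the semantic similarity, none of which the text specifies concretely.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
AdjacentTuple = Tuple[str, str, str]  # (left lemma, intervening characters, right lemma)


def adjacent_tuples(template: str, lemmas: Sequence[str],
                    mask: str = "[MASK]") -> List[AdjacentTuple]:
    """Pair each lemma with the next one and the template text between their gaps."""
    parts = template.split(mask)  # parts[i] is the text that precedes gap i
    if len(parts) - 1 != len(lemmas):
        raise ValueError("need exactly one lemma per gap-filling position")
    # parts[1:-1] are the character runs between consecutive gaps.
    return [(lemmas[i], parts[i + 1], lemmas[i + 1])
            for i in range(len(lemmas) - 1)]


def toy_embed(text: str, k: int = 8) -> Vector:
    """Hypothetical embedding: character counts folded into k dimensions."""
    vec = [0.0] * k
    for ch in text:
        vec[ord(ch) % k] += 1.0
    return vec


def similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity clamped to [0, 1] (the clamping is an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return max(0.0, min(1.0, dot / norm))


def relation_correction_vector(tup: AdjacentTuple,
                               embed: Callable[[str], Vector] = toy_embed) -> Vector:
    """Malec[v] = Gvec[v] * sqrt(y1 * y2) for every dimension v."""
    left, gap, right = tup
    gvec = embed(gap)
    y1 = similarity(embed(left), gvec)
    y2 = similarity(embed(right), gvec)
    weight = math.sqrt(y1 * y2)  # deviation weight of the adjacent tuple
    # Each dimension is scaled independently, so this loop parallelizes trivially.
    return [g * weight for g in gvec]


if __name__ == "__main__":
    template = "The [MASK] approved the [MASK] on [MASK]."
    for tup in adjacent_tuples(template, ["committee", "budget", "Monday"]):
        print(tup, relation_correction_vector(tup))
```

Because the deviation weight is a single scalar per adjacent tuple, every dimension of Malec depends only on the matching dimension of Gvec, which is what makes the per-dimension computation independent.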
in the triplet set, the head entity, entity relation and tail entity of each triplet are each converted into a semantic vector through the semantic embedding function; the semantic vector of the head entity in the triplet is denoted Subvec, the semantic vector of the tail entity in the triplet is denoted Objvec, and the semantic vector of the entity relation in the triplet is denoted Relvec,
calculating the semantic similarity of Subvec and Relvec as SmR, and calculating the semantic similarity of Objvec and Relvec as OmR,
calculating a semantic transition value of the triplet, wherein the semantic transition value has a plurality of scores, the number of scores of the semantic transition value is consistent with the number of dimensions of a semantic vector, the sequence numbers of the scores of the semantic transition value are consistent with the sequence numbers of the dimensions of the semantic vector, the semantic transition value is Benec, the score with sequence number v in Benec is Benec[v], and the calculation formula of Benec[v] is:
it should be noted that the semantic transition value Benec should not be regarded as a vector: the order of its scores is not ordered and fixed in the way the dimensions of a semantic vector are. In the embodiment provided by the application, one form of the semantic transition value is selected for convenience of calculation, namely the number of scores equals the number of dimensions of the semantic vector and the scores are indexed by the dimension sequence numbers of the semantic vector. The number of scores may also differ from this; preferably it is greater than or equal to the number of dimensions of the semantic vector, and the scores may also be unordered, so that the posterior probability of an entity node of the knowledge graph over a plurality of jump paths can be fully represented. Each score of the semantic transition value extracts the posterior probability of the transfer connection between the head entity and the tail entity by combining the dimension components of the two sides with their respective semantic similarities to the entity relation;
wherein Subvec[v] represents the value of the dimension with sequence number v in Subvec, and Relvec[v] represents the value of the dimension with sequence number v in Relvec.
In the triplet set, each triplet has a corresponding text reaction coefficient for each adjacent tuple, and each adjacent tuple likewise has a corresponding text reaction coefficient for each triplet;
for each adjacent tuple, the text reaction coefficient of that adjacent tuple with respect to each triplet in the triplet set is calculated as follows:
the number of triples in the triplet set is denoted n, the sequence number of a triplet in the triplet set is denoted i, and the triplet with sequence number i in the triplet set is denoted Triple(i),
the number of adjacent tuples in the adjacent tuple set is denoted m, the sequence number of an adjacent tuple in the adjacent tuple set is denoted j, and the adjacent tuple with sequence number j in the adjacent tuple set is denoted Tokens(j),
for Tokens(j), the corresponding relation correction vector is calculated as Malec(j), wherein the value of the dimension with sequence number v in Malec(j) is Malec(j)[v]; the semantic transition value Benec(i) corresponding to each Triple(i) is calculated, wherein the score with sequence number v in Benec(i) is Benec(i)[v],
here, the sequence numbers of elements in a set are written in parentheses (), and dimensions, components, scores and the like are written in square brackets [];
to distinguish the traversal of v in Malec(j)[v] from that in Benec(i)[v], the symbol v is replaced by v1 when traversing Malec(j), giving Malec(j)[v1], and by v2 when traversing Benec(i), giving Benec(i)[v2]; only the symbol changes, and v1 and v2 each still range over the original interval [1, k], so that the sequence numbers of the dimensions in Malec(j) and the sequence numbers of the scores in Benec(i) are enumerated independently of each other, realizing a double nested loop; the text reaction coefficient Condes(j, i) of Tokens(j) with respect to Triple(i) is calculated as:
simplifying the denominator of the formula gives:
In the prior art, calculation between vectors or tensors is generally carried out over corresponding dimensions, which amounts to a single-hop calculation per dimension. However, triples in a knowledge base have multi-hop relations, and the gap-filling positions in the corresponding template document also have multi-hop relations, so dimension-wise calculation is not suited to them. The double nested loop in the text reaction coefficient calculation serves precisely to measure the posterior probability of a jump connection between each gap-filling position in the template document and the mathematical characteristics of the paths that lead, through entity relations, from the entities of the triples to a plurality of other entities, which helps measure the multi-hop relations of the complement text within the template document.
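The difference between a dimension-aligned computation and the double nested loop can be illustrated with a short sketch. The Condes(j, i) formula itself is not reproduced in this text, so the sketch shows only the enumeration pattern: `pair_term` is a hypothetical placeholder for whatever is computed per (v1, v2) pair, and aggregating by summation is likewise an assumption made for illustration.

```python
from typing import Callable, Sequence

PairTerm = Callable[[float, float], float]


def aligned_sum(malec: Sequence[float], benec: Sequence[float],
                pair_term: PairTerm) -> float:
    """Single-hop style: only matching indices (v, v) are combined, k terms in total."""
    return sum(pair_term(m, b) for m, b in zip(malec, benec))


def nested_pair_sum(malec: Sequence[float], benec: Sequence[float],
                    pair_term: PairTerm) -> float:
    """Double nested loop: v1 and v2 are enumerated independently, k * k terms in total."""
    total = 0.0
    for m_v1 in malec:       # v1 ranges over the dimensions of Malec(j)
        for b_v2 in benec:   # v2 ranges over the scores of Benec(i)
            total += pair_term(m_v1, b_v2)
    return total


if __name__ == "__main__":
    def product(a: float, b: float) -> float:
        return a * b

    malec_j = [1.0, 2.0]
    benec_i = [3.0, 4.0]
    print(aligned_sum(malec_j, benec_i, product))
    print(nested_pair_sum(malec_j, benec_i, product))
```

The nested form visits every combination of a Malec dimension with a Benec score, which is why the order of the Benec scores does not need to line up with the dimension order of the semantic vector.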
Further, the method for condensing the complement text according to the text reaction coefficient is as follows:
condensing the complement text means narrowing the alignment between the large number of adjacent tuples, derived from mass data and various template documents, and the triples in the knowledge base from a large-scale alignment to a small-scale one. In some embodiments, the document generation method applying language artificial intelligence is applied to document generation for a public service institution, whose database contains a large number of conference record texts, speech texts and the like. In such texts, the meaning a word takes in its background text (context) is very likely to differ from its general, common-sense meaning. Because automatically generated documents are often used in the public service field, where relying on general semantics is very likely to affect the result, the complement text needs to be condensed first to ensure the safety and quality of the generated text;
for each adjacent tuple, collecting the text reaction coefficients of all triples with respect to that adjacent tuple, and selecting the triplet whose text reaction coefficient is the minimum among them as the closest triplet;
calculating the semantic similarity between each of the two adjacent lemmas in the adjacent tuple and the characters between them, and calculating the semantic similarity between each of the head entity and the tail entity of the closest triplet and the entity relation of the closest triplet; if the adjacent tuple contains a lemma whose semantic similarity to the intervening characters is lower than both the semantic similarity between the head entity and the entity relation and the semantic similarity between the tail entity and the entity relation of the closest triplet, judging that lemma to be a lemma to be condensed and taking the closest triplet as the condensation triplet; this step effectively screens out gap-filling positions that carry risk, avoids sample bias and tendency, and better ensures the safety and quality of the generated text;
storing and outputting the lemmas to be condensed and the condensation triplets;
marking the lemmas to be condensed and displaying the condensation triplet as an optional replacement.
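The selection logic of the condensation steps above can be sketched as follows, assuming the text reaction coefficients and the semantic similarities have already been computed. The function names and the list-based data layout are hypothetical choices made for this illustration; only the two rules (take the triplet with the minimum coefficient, then flag lemmas whose similarity is lower than both triplet similarities) come from the description.

```python
from typing import List, Sequence, Tuple

SimPair = Tuple[float, float]


def closest_triple_index(coeffs: Sequence[float]) -> int:
    """Index i of the triplet whose text reaction coefficient is the minimum."""
    return min(range(len(coeffs)), key=lambda i: coeffs[i])


def lemmas_to_condense(lemma_sims: SimPair, triple_sims: SimPair) -> List[int]:
    """Positions (0 = left lemma, 1 = right lemma) to be condensed.

    lemma_sims:  similarity of each lemma to the intervening characters.
    triple_sims: (head-to-relation, tail-to-relation) similarity of the closest triplet.
    A lemma is flagged when its similarity is lower than both triplet similarities,
    which is the same as being lower than their minimum.
    """
    threshold = min(triple_sims)
    return [pos for pos, sim in enumerate(lemma_sims) if sim < threshold]


def condense(coeffs: Sequence[Sequence[float]],
             lemma_sims: Sequence[SimPair],
             triple_sims: Sequence[SimPair]) -> List[Tuple[int, List[int], int]]:
    """Return (adjacent tuple index j, flagged lemma positions, condensation triplet index i)."""
    results = []
    for j, row in enumerate(coeffs):  # row[i] is the coefficient of tuple j for triplet i
        i = closest_triple_index(row)
        flagged = lemmas_to_condense(lemma_sims[j], triple_sims[i])
        if flagged:
            results.append((j, flagged, i))
    return results


if __name__ == "__main__":
    coeffs = [[0.9, 0.2], [0.1, 0.7]]
    lemma_sims = [(0.3, 0.8), (0.9, 0.95)]
    triple_sims = [(0.6, 0.7), (0.5, 0.9)]
    print(condense(coeffs, lemma_sims, triple_sims))
```

Each returned entry identifies a gap-filling position that would be marked for review together with the triplet that would be displayed as its optional replacement.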
The document generation system applying language artificial intelligence runs on any computing device among a desktop computer, a notebook computer, a palmtop computer or a cloud data center. The computing device comprises a processor, a memory, and a computer program stored in the memory and running on the processor; when executing the computer program, the processor implements the steps of the document generation method applying language artificial intelligence. The system on which it runs may include, but is not limited to, a processor, a memory and a server cluster.
As shown in fig. 2, a document generation system applying language artificial intelligence according to an embodiment of the present application includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above embodiment of the document generation method applying language artificial intelligence when executing the computer program, and the computer program runs in the following units of the system:
the triplet set unit is used for extracting the relation of the text document by using an information extraction model to obtain a plurality of triples to form a triplet set;
the text completion unit is used for completing the template document by using the generation model to obtain a complement text;
the text reaction coefficient unit is used for carrying out semantic condensation reaction on the triplet set according to the completed text to obtain a text reaction coefficient;
and the text condensation unit is used for condensing the complement text according to the text reaction coefficient.
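The four units listed above can be wired together as in the following skeleton. It is only a sketch of the data flow: the class name, the callable signatures and the choice to pass the triplet set to the condensation unit as well are assumptions made for illustration, and each unit is left as a pluggable function rather than a concrete model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, entity relation, tail entity)


@dataclass
class DocumentGenerationSystem:
    # Triplet set unit: text document -> triplet set.
    extract_triples: Callable[[str], List[Triple]]
    # Text completion unit: template document -> complement text.
    complete_template: Callable[[str], Any]
    # Text reaction coefficient unit: (complement text, triplet set) -> coefficients.
    reaction_coefficients: Callable[[Any, List[Triple]], Any]
    # Text condensation unit: (complement text, triplet set, coefficients) -> result.
    condense: Callable[[Any, List[Triple], Any], Any]

    def run(self, text_document: str, template_document: str) -> Any:
        """Run the four units in the order given in the description."""
        triples = self.extract_triples(text_document)
        completed = self.complete_template(template_document)
        coeffs = self.reaction_coefficients(completed, triples)
        return self.condense(completed, triples, coeffs)


if __name__ == "__main__":
    # Stub units that only demonstrate the call order.
    system = DocumentGenerationSystem(
        extract_triples=lambda text: [("committee", "approved", "budget")],
        complete_template=lambda template: template.replace("[MASK]", "budget"),
        reaction_coefficients=lambda completed, triples: [[0.0] * len(triples)],
        condense=lambda completed, triples, coeffs: (completed, triples, coeffs),
    )
    print(system.run("The committee approved the budget.", "The [MASK] was approved."))
```

Keeping the units as injected callables makes it possible to swap the information extraction model or the generation model without touching the orchestration.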
Preferably, any variable in the present application that is not explicitly defined may be a manually set threshold. Preferably, for numerical calculation between physical quantities with different units, dimensionless processing and normalization may be applied in order to better capture the linear or probabilistic relation between their numerical distributions, converting the values so that the numerical relations between the different physical quantities are unified.
The document generation system applying language artificial intelligence can run on computing devices such as a desktop computer, a notebook computer, a palmtop computer or a cloud data center. The document generation system applying language artificial intelligence includes, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the above is merely an example of a document generation method and system applying language artificial intelligence and does not limit them; the system may include more or fewer components than the example, may combine certain components, or may use different components. For example, the document generation system applying language artificial intelligence may further include input and output devices, network access devices, buses and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the document generation system applying language artificial intelligence and uses various interfaces and lines to connect the various parts of the entire system.
The memory may be used to store the computer program and/or modules, and the processor implements the functions of the document generation method and system applying language artificial intelligence by running or executing the computer program and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the handset, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or other volatile solid-state storage device.
The application provides a document generation method and system applying language artificial intelligence: an information extraction model is used to extract relations from a text document, yielding a plurality of triples that form a triplet set; the text document is used to fine-tune a pre-trained language model, yielding a generation model; the generation model is used to complete the template document, yielding a complement text; a semantic condensation reaction is performed on the triplet set according to the complement text, yielding text reaction coefficients; and the complement text is condensed according to the text reaction coefficients, thereby better ensuring the safety and quality of the generated text. When template documents are completed using the method of the application, the F1-score rises to 0.85, from 0.67 without the method.
Although the present application has been described in considerable detail and with particularity with respect to several embodiments, it is not intended to be limited to any such detail or embodiment or to any particular embodiment; rather, it is to be construed so as to effectively cover the intended scope of the application. Furthermore, the foregoing description presents the application in terms of embodiments foreseen by the inventors for the purpose of providing a useful description; insubstantial modifications of the application that are not presently foreseen may nonetheless represent equivalents of it.

Claims (9)

1. A document generation method employing language artificial intelligence, the method comprising the steps of:
inputting a text document;
using an information extraction model to extract the relation of the text document to obtain a plurality of triples to form a triplet set;
inputting a template document;
using a generation model to complement the template document to obtain a complement text;
carrying out semantic condensation reaction on the triplet set according to the complement text to obtain a text reaction coefficient;
condensing the complement text according to the text reaction coefficient.
2. The method for document generation using language artificial intelligence of claim 1, wherein the text document input is character string data representing one or more articles.
3. The document generation method using language artificial intelligence according to claim 1, wherein the information extraction model is an information extraction model based on a pre-training language model, and the generation model is a generation model obtained by performing fine-tuning training on the pre-training language model according to the text document.
4. The method for generating a document using language artificial intelligence according to claim 1, wherein each triplet in the triplet set is a three-element array composed of character strings, the character strings in the triples all belong to the input text document, and the triples in the triplet set are mutually distinct.
5. The method for generating a document using language artificial intelligence according to claim 1, wherein the template document is a text containing a plurality of gap-filling positions; the complement text consists of a plurality of lemmas, each lemma corresponding to one gap-filling position; the gap-filling positions are not contiguous but are separated by intervening characters; two gap-filling positions that have only intervening characters, and no other gap-filling position, between them are called adjacent gap-filling positions; and the lemmas corresponding to adjacent gap-filling positions are adjacent lemmas.
6. The document generation method of claim 1, wherein the method for using the generation model to complete the template document and obtain the complement text is as follows: the masking mechanism of the pre-trained language model is used so that the generation model fills in the template document, yielding the complement text.
7. The method for generating a document using linguistic artificial intelligence according to claim 5, wherein the method for performing semantic condensation reaction on the triplet set according to the complement text to obtain the text reaction coefficient comprises the steps of:
creating a semantic embedding function, wherein the semantic embedding function converts a character string input into the semantic embedding function into a semantic vector with a fixed dimension size for output; the number of dimensions of the semantic vector is k;
for two adjacent lemmas, acquiring the characters between them; the three-element array formed by the two adjacent lemmas and the characters between them is called an adjacent tuple;
taking the set formed by all adjacent tuples as the adjacent tuple set;
in each adjacent tuple, converting the two lemmas and the intervening characters into semantic vectors through the semantic embedding function, calculating the semantic similarity between the semantic vector of each of the two lemmas and the semantic vector of the intervening characters, multiplying the two semantic similarities and taking the square root, taking the value of the square root as the deviation weight of the adjacent tuple, and multiplying the value of each dimension of the semantic vector of the intervening characters by the deviation weight to obtain a relation correction vector,
denoting the semantic similarities between the semantic vectors of the two lemmas and the semantic vector of the intervening characters as y1 and y2, the semantic vector of the intervening characters as Gvec, and the value of the dimension with sequence number v in Gvec as Gvec[v], the relation correction vector is Malec, with $\text{Malec}[v] = \text{Gvec}[v] \times \sqrt{y1 \times y2}$,
in the triplet set, the head entity, entity relation and tail entity of each triplet are each converted into a semantic vector through the semantic embedding function; the semantic vector of the head entity in the triplet is denoted Subvec, the semantic vector of the tail entity in the triplet is denoted Objvec, and the semantic vector of the entity relation in the triplet is denoted Relvec,
calculating the semantic similarity of Subvec and Relvec as SmR, and calculating the semantic similarity of Objvec and Relvec as OmR,
calculating a semantic transition value of the triplet, wherein the semantic transition value has a plurality of scores, the number of scores of the semantic transition value is consistent with the number of dimensions of a semantic vector, the sequence number of scores of the semantic transition value is consistent with the sequence number of dimensions of the semantic vector, the semantic transition value is Benec, the score with the sequence number v in Benec is Benec [ v ], and the calculation formula of Benec [ v ] is:
wherein Subvec[v] represents the value of the dimension with sequence number v in Subvec, and Relvec[v] represents the value of the dimension with sequence number v in Relvec;
in the triplet set, each triplet has a corresponding text reaction coefficient for each adjacent tuple, and each adjacent tuple likewise has a corresponding text reaction coefficient for each triplet;
for each adjacent tuple, the text reaction coefficient of that adjacent tuple with respect to each triplet in the triplet set is calculated as follows:
the number of triples in the triplet set is denoted n, the sequence number of a triplet in the triplet set is denoted i, and the triplet with sequence number i in the triplet set is denoted Triple(i),
the number of adjacent tuples in the adjacent tuple set is denoted m, the sequence number of an adjacent tuple in the adjacent tuple set is denoted j, and the adjacent tuple with sequence number j in the adjacent tuple set is denoted Tokens(j),
for Tokens(j), the corresponding relation correction vector is calculated as Malec(j), wherein the value of the dimension with sequence number v in Malec(j) is Malec(j)[v]; the semantic transition value Benec(i) corresponding to each Triple(i) is calculated, wherein the score with sequence number v in Benec(i) is Benec(i)[v]; to distinguish the traversal of v in Malec(j)[v] from that in Benec(i)[v], the symbol v is replaced by v1 when traversing Malec(j), giving Malec(j)[v1], and by v2 when traversing Benec(i), giving Benec(i)[v2]; only the symbol changes, and v1 and v2 each still range over the original interval [1, k], so that the sequence numbers of the dimensions in Malec(j) and the sequence numbers of the scores in Benec(i) are enumerated independently of each other, realizing a double nested loop; the text reaction coefficient Condes(j, i) of Tokens(j) with respect to Triple(i) is calculated as:
thereby obtaining, for each adjacent tuple, the text reaction coefficient corresponding to each triplet.
8. The document generation method using language artificial intelligence according to claim 6 or 7, wherein the method for condensing the complement text according to the text reaction coefficient is as follows:
for each adjacent tuple, collecting the text reaction coefficients of all triples with respect to that adjacent tuple, and selecting the triplet whose text reaction coefficient is the minimum among them as the closest triplet;
calculating the semantic similarity between each of the two adjacent lemmas in the adjacent tuple and the characters between them, and calculating the semantic similarity between each of the head entity and the tail entity of the closest triplet and the entity relation of the closest triplet; if the adjacent tuple contains a lemma whose semantic similarity to the intervening characters is simultaneously lower than the semantic similarity between the head entity and the entity relation and the semantic similarity between the tail entity and the entity relation of the closest triplet, judging that lemma to be a lemma to be condensed and taking the closest triplet as the condensation triplet;
storing and outputting the lemmas to be condensed and the condensation triplets;
marking the lemmas to be condensed, and displaying the condensation triplet for replacement.
9. A document generation system employing language artificial intelligence, the document generation system employing language artificial intelligence operating in any computing device of a desktop computer, a notebook computer, or a cloud data center, the computing device comprising: a processor, a memory and a computer program stored in the memory and running on the processor, which processor, when executing the computer program, implements the steps of a document generation method employing language artificial intelligence as claimed in any one of claims 1 to 7.
CN202311187668.7A 2023-09-15 2023-09-15 Document generation method and system applying language artificial intelligence Active CN116933757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311187668.7A CN116933757B (en) 2023-09-15 2023-09-15 Document generation method and system applying language artificial intelligence

Publications (2)

Publication Number Publication Date
CN116933757A true CN116933757A (en) 2023-10-24
CN116933757B CN116933757B (en) 2023-12-29

Family

ID=88380897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311187668.7A Active CN116933757B (en) 2023-09-15 2023-09-15 Document generation method and system applying language artificial intelligence

Country Status (1)

Country Link
CN (1) CN116933757B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150019589A1 (en) * 2013-07-15 2015-01-15 Sinuhé Arroyo Template-driven structured query generation
CN110956026A (en) * 2019-11-28 2020-04-03 北京华宇元典信息服务有限公司 Legal document generation method and device and electronic equipment
CN114281966A (en) * 2021-11-29 2022-04-05 科大讯飞华南人工智能研究院(广州)有限公司 Question template generation method, question answering device and electronic equipment
CN114330281A (en) * 2022-03-08 2022-04-12 北京京东方技术开发有限公司 Training method of natural language processing model, text processing method and device
CN115457586A (en) * 2022-09-06 2022-12-09 云知声智能科技股份有限公司 Case information extraction method, device, equipment and storage medium
CN115525773A (en) * 2022-10-10 2022-12-27 北京智源人工智能研究院 Training method and device of knowledge graph complement model
CN115687647A (en) * 2022-11-01 2023-02-03 法信公证云(厦门)科技有限公司 Notarization document generation method and device, electronic equipment and storage medium
CN116152843A (en) * 2022-11-22 2023-05-23 南京擎盾信息科技有限公司 Category identification method, device and storage medium for contract template to be filled-in content
US20230177363A1 (en) * 2021-12-03 2023-06-08 International Business Machines Corporation Generation of query templates for knowledge-graph based question answering system
CN116501875A (en) * 2023-04-28 2023-07-28 中电科大数据研究院有限公司 Document processing method and system based on natural language and knowledge graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
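Jinglei Li et al.: "Semantic-consistent learning for one-shot joint entity and relation extraction", Springer, pages 5964-5976 *
Hong Wenxing; Hu Zhiqiang; Weng Yang; Zhang Heng; Wang Zhu; Guo Zhixin: "Automatic construction of case knowledge graphs for judicial cases", Journal of Chinese Information Processing, no. 01, pages 39-49 *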

Also Published As

Publication number Publication date
CN116933757B (en) 2023-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant