CN112528674A - Text processing method, model training method, device, equipment and storage medium - Google Patents

Text processing method, model training method, device, equipment and storage medium

Info

Publication number
CN112528674A
Authority
CN
China
Prior art keywords
text
sample
processed
editing operation
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011479376.7A
Other languages
Chinese (zh)
Other versions
CN112528674B (en)
Inventor
汪硕芃
张荣升
黄诗磊
张聪
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202011479376.7A (granted as CN112528674B)
Publication of CN112528674A
Application granted
Publication of CN112528674B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text processing method, a model training method, a device, an apparatus, and a storage medium, and relates to the technical field of data processing. The text processing method comprises the following steps: acquiring a text to be processed; obtaining a text editing operation sequence corresponding to the text to be processed by adopting a pre-trained text processing model, wherein the text editing operation sequence is a sequence formed by the characters in the text to be processed and comprises an identifier of the editing operation required to be executed for each character, the text processing model is obtained by training with sample texts marked with text editing operation sequence labels, and each text editing operation sequence label is obtained according to the sample text and the labeled target text corresponding to the sample text; and obtaining a target text corresponding to the text to be processed according to the text editing operation sequence, wherein the target text comprises a compressed text or a rewritten text corresponding to the text to be processed. The target text obtained by this scheme has high readability.

Description

Text processing method, model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text processing method, a model training method, a device, an apparatus, and a storage medium.
Background
With the rapid development of internet and AI (Artificial Intelligence) technologies, fields such as social media contain a large number of textual commodity descriptions. How to compress the content of such texts into phrases that use fewer words and are easier to understand is one of the problems that remains to be solved in these fields. In the field of NLP (Natural Language Processing), text compression (also called text simplification) is well suited to this scenario: given a segment of text as input, the system is expected to output a shorter expression whose core semantics are unchanged and whose wording is more concise.
In the prior art, the text compression task is treated as a sequence labeling task: each character in a training sample is labeled, on a per-character basis, for model training. After training, the model outputs a sequence for a given user input that indicates which characters in the text should be retained and which should be deleted, and the compressed text is obtained by restoring the text according to this sequence.
However, the prior art lacks semantic control over the sentence as a whole, so the semantic representation of the resulting compressed text is unclear and the accuracy of the text compression result is poor.
Disclosure of Invention
The present application aims to provide a text processing method, a model training method, a text processing apparatus, a model training apparatus, an electronic device, and a storage medium, so as to solve the problem in the prior art that, because the editing operation for each character is judged on a per-character basis alone, semantic control over the whole sentence is lacking, and the resulting compressed text therefore has poor readability and unclear semantic expression.
In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a text processing method, including:
acquiring a text to be processed;
obtaining a text editing operation sequence corresponding to the text to be processed by adopting a text processing model obtained by pre-training according to the text to be processed, wherein the text editing operation sequence is a sequence formed by characters in the text to be processed, the sequence comprises an identifier of the editing operation required to be executed by each character, the text processing model is obtained by training with a sample text marked with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a labeled target text corresponding to the sample text;
and obtaining a target text corresponding to the text to be processed according to the text editing operation sequence, wherein the target text corresponding to the text to be processed comprises a compressed text or a rewritten text corresponding to the text to be processed.
Optionally, before the text editing operation sequence corresponding to the text to be processed is obtained by adopting a text processing model obtained by pre-training according to the text to be processed, the method further includes:
performing domain word replacement on the text to be processed by adopting a pre-constructed domain word dictionary to obtain a replaced text to be processed;
copying the replaced text to be processed, and splicing the copied text and the replaced text to be processed to obtain a preprocessed text to be processed;
the obtaining of the text editing operation sequence corresponding to the text to be processed by adopting a text processing model obtained by pre-training according to the text to be processed includes:
and obtaining a text editing operation sequence corresponding to the preprocessed to-be-processed text by adopting a text processing model obtained by pre-training according to the preprocessed to-be-processed text.
Optionally, the text processing model includes: an encoding layer and a decoding layer;
the obtaining of the text editing operation sequence corresponding to the preprocessed to-be-processed text by adopting a text processing model obtained by pre-training according to the preprocessed to-be-processed text comprises:
inputting the preprocessed to-be-processed text into the coding layer for semantic coding to obtain a semantic vector of the preprocessed to-be-processed text;
and inputting the semantic vector of the preprocessed text to be processed into the decoding layer for decoding to obtain a text editing operation sequence corresponding to the preprocessed text to be processed.
Optionally, the obtaining a target text corresponding to the text to be processed according to the text editing operation sequence includes:
and obtaining a target text corresponding to the text to be processed according to the identifier of the editing operation required to be executed by each character in the text editing operation sequence and the mapping relation between the identifier of the editing operation and the preset editing operation.
Optionally, the editing operation required to be performed for each character includes: a deletion operation, a retention operation, or a replacement operation.
In a second aspect, an embodiment of the present application provides a method for training a text processing model, including:
collecting a sample text data set, the sample text data set comprising a plurality of sample texts, each sample text being marked with a text editing operation sequence label, wherein the text editing operation sequence label is used for identifying the editing operation required to be executed by each character in the sample text, and is obtained according to the sample text and the labeled target text corresponding to the sample text;
preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set;
and training to obtain the text processing model by adopting the preprocessed sample text data set, wherein the text processing model is used for obtaining a text editing operation sequence corresponding to the text to be processed, the text editing operation sequence is a sequence formed by all characters in the text to be processed, and the sequence comprises an identifier of an editing operation required to be executed by each character.
Optionally, the preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set includes:
acquiring a domain word from each sample text in the sample text data set according to knowledge graph information and/or domain guide information, wherein the domain word is used for representing entity information;
constructing a corresponding relation between each sample text and each field word to form an initial field word data set;
obtaining a domain word dictionary according to the initial domain word data set;
and obtaining the preprocessed sample text data set according to the domain word dictionary.
Optionally, the obtaining a domain word dictionary according to the initial domain word data set includes:
training and acquiring a sequence labeling model according to the initial domain word data set, wherein the sequence labeling model is used for identifying domain words in a text;
inputting the sample text data set into the sequence labeling model, and identifying and acquiring the domain words contained in the sample text data set;
and obtaining the domain word dictionary according to the domain words contained in the sample text data set and the domain words contained in the initial domain word data set.
Optionally, the obtaining the preprocessed sample text data set according to the domain word dictionary includes:
according to the domain word dictionary, performing domain word replacement on each sample text in the sample text data set to obtain a replaced sample text data set;
copying each sample text in the replaced sample text data set to obtain a copied text corresponding to each sample text, and splicing the copied text with each sample text to obtain a plurality of preprocessed sample texts;
and obtaining the preprocessed sample text data set according to the preprocessed sample texts.
Optionally, before preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set, the method further includes:
acquiring an initial target text corresponding to each sample text in the sample text data set;
and performing domain word replacement on the initial target text according to the domain word dictionary to obtain the labeled target text corresponding to each sample text.
Optionally, the training to obtain the text processing model by using the preprocessed sample text data set includes:
obtaining a text editing operation sequence label marked by each sample text in the preprocessed sample text data set by adopting a text editing algorithm according to the preprocessed sample text data set and the marked target text corresponding to each sample text;
and training to obtain the text processing model according to the preprocessed sample text data set and the text editing operation sequence label marked by each sample text in the preprocessed sample text data set.
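The text editing algorithm that derives these labels is not pinned down above. Purely as an illustrative sketch, assuming a longest-common-subsequence style character alignment (here via Python's standard difflib, an assumption rather than the patented algorithm), the KEEP/DEL labels used in the examples later in this description could be derived like this:

    import difflib

    def edit_operation_labels(sample: str, target: str) -> list:
        # Align the sample text with its labeled target text and emit one
        # editing-operation identifier per character of the sample text.
        labels = []
        matcher = difflib.SequenceMatcher(None, sample, target)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                labels.extend(["KEEP"] * (i2 - i1))   # characters retained
            else:
                labels.extend(["DEL"] * (i2 - i1))    # characters dropped
            # "insert" spans satisfy i1 == i2 and so contribute no labels;
            # the copy-and-splice preprocessing described later is what
            # makes pure KEEP/DEL labeling sufficient for reordered words.
        return labels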
Optionally, the training and obtaining the text processing model according to the preprocessed sample text data set and the text editing operation sequence label of each sample text label in the preprocessed sample text data set includes:
inputting the preprocessed sample text data set into a pre-trained text processing model to obtain a training text editing operation sequence of the pre-trained text processing model;
determining the cross entropy of the pre-trained text processing model according to the training text editing operation sequence and the text editing operation sequence label;
and correcting the pre-trained text processing model according to the cross entropy to obtain the text processing model.
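As a hedged sketch of these three steps (model, loader, and the hyperparameters below are illustrative placeholders, not the patent's implementation), the cross-entropy correction could look like this in PyTorch:

    import torch
    import torch.nn as nn

    # Hypothetical setup: `model` maps token ids to per-character tag logits
    # of shape (batch, seq_len, num_tags); `loader` yields batches of
    # (token_ids, tag_labels) from the preprocessed sample text data set.
    criterion = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks padding
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

    for token_ids, tag_labels in loader:
        logits = model(token_ids)             # training text editing sequence
        loss = criterion(logits.view(-1, logits.size(-1)),
                         tag_labels.view(-1))             # cross entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # correct the pre-trained model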
In a third aspect, an embodiment of the present application provides a text processing apparatus, including: the device comprises an acquisition module and a processing module;
the acquisition module is used for acquiring a text to be processed;
the processing module is configured to obtain a text editing operation sequence corresponding to the to-be-processed text by using a text processing model obtained through pre-training according to the to-be-processed text, where the text editing operation sequence is a sequence formed by characters in the to-be-processed text, the sequence includes an identifier of an editing operation that needs to be executed by each character, the text processing model is obtained by using a sample text labeled with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a labeled target text corresponding to the sample text;
the obtaining module is further configured to obtain a target text corresponding to the text to be processed according to the text editing operation sequence, where the target text corresponding to the text to be processed includes a compressed text or a rewritten text corresponding to the text to be processed.
Optionally, the apparatus further comprises: a preprocessing module;
the preprocessing module is used for performing domain word replacement on the text to be processed by adopting a pre-constructed domain word dictionary to obtain the replaced text to be processed; and copying the replaced text to be processed, and splicing the copied text and the replaced text to be processed to obtain the preprocessed text to be processed.
The processing module is specifically configured to obtain, according to the preprocessed to-be-processed text, a text editing operation sequence corresponding to the preprocessed to-be-processed text by using a text processing model obtained through pre-training.
Optionally, the text processing model includes: an encoding layer and a decoding layer;
the processing module is specifically configured to input the preprocessed to-be-processed text into the coding layer for semantic coding, so as to obtain a semantic vector of the preprocessed to-be-processed text; and inputting the semantic vector of the preprocessed text to be processed into the decoding layer for decoding to obtain a text editing operation sequence corresponding to the preprocessed text to be processed.
Optionally, the obtaining module is specifically configured to obtain a target text corresponding to the text to be processed according to an identifier of an editing operation that needs to be executed by each character in the text editing operation sequence and a mapping relationship between the identifier of the editing operation and a preset editing operation.
Optionally, the editing operation required to be performed for each character includes: a deletion operation, a retention operation, or a replacement operation.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a text processing model, including: the device comprises an acquisition module, a preprocessing module and a training module;
the acquisition module is used for acquiring a sample text data set, the sample text data set comprising a plurality of sample texts, each sample text being marked with a text editing operation sequence label, wherein the text editing operation sequence label is used for identifying the editing operation required to be executed by each character in the sample text, and is obtained according to the sample text and the labeled target text corresponding to the sample text;
the preprocessing module is used for preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set;
the training module is configured to train and acquire the text processing model by using the preprocessed sample text data set, where the text processing model is used to acquire a text editing operation sequence corresponding to a text to be processed, the text editing operation sequence is a sequence formed by characters in the text to be processed, and the sequence includes an identifier of an editing operation that needs to be executed by each character.
Optionally, the preprocessing module is specifically configured to obtain a domain word from each sample text in the sample text data set according to knowledge graph information and/or domain guidance information, where the domain word is used to represent entity information; construct a corresponding relation between each sample text and each domain word to form an initial domain word data set; obtain a domain word dictionary according to the initial domain word data set; and obtain the preprocessed sample text data set according to the domain word dictionary.
Optionally, the preprocessing module is specifically configured to train and acquire a sequence labeling model according to the initial domain word data set, where the sequence labeling model is used to identify domain words in a text; input the sample text data set into the sequence labeling model, and identify and acquire the domain words contained in the sample text data set; and obtain the domain word dictionary according to the domain words contained in the sample text data set and the domain words contained in the initial domain word data set.
Optionally, the preprocessing module is specifically configured to perform, according to the domain word dictionary, domain word replacement on each sample text in the sample text data set to obtain a replaced sample text data set; copying each sample text in the replaced sample text data set to obtain a copied text corresponding to each sample text, and splicing the copied text with each sample text to obtain a plurality of preprocessed sample texts; and obtaining the preprocessed sample text data set according to the preprocessed sample texts.
Optionally, the apparatus further comprises: an acquisition module;
the acquisition module is used for acquiring an initial target text corresponding to each sample text in the sample text data set;
and the preprocessing module is further used for performing domain word replacement on the initial target text according to the domain word dictionary to obtain the labeled target text corresponding to each sample text.
Optionally, the training module is specifically configured to obtain, by using a text editing algorithm, a text editing operation sequence tag marked by each sample text in the preprocessed sample text data set according to the preprocessed sample text data set and the labeled target text corresponding to each sample text; and training to obtain the text processing model according to the preprocessed sample text data set and the text editing operation sequence label marked by each sample text in the preprocessed sample text data set.
Optionally, the training module is specifically configured to input the preprocessed sample text data set into a pre-trained text processing model, so as to obtain a training text editing operation sequence of the pre-trained text processing model; determining the cross entropy of the pre-trained text processing model according to the training text editing operation sequence and the text editing operation sequence label; and correcting the pre-trained text processing model according to the cross entropy to obtain the text processing model.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor. When the electronic device operates, the processor and the storage medium communicate via the bus, and the processor executes the machine-readable instructions to perform the steps of the method provided in the first or second aspect.
In a sixth aspect, embodiments of the present application provide a storage medium having a computer program stored thereon, where the computer program, when executed by a processor, performs the steps of the method provided in the first or second aspect.
The beneficial effects of this application are as follows:
the application provides a text processing method, a model training method, a device, equipment and a storage medium. The text processing method comprises the following steps: acquiring a text to be processed; obtaining a text editing operation sequence corresponding to the text to be processed by adopting a text processing model obtained by pre-training according to the text to be processed, wherein the text editing operation sequence is a sequence formed by characters in the text to be processed and comprises an identifier of an editing operation required to be executed by each character, the text processing model is obtained by using a sample text marked with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a marked target text corresponding to the sample text; and obtaining a target text corresponding to the text to be processed according to the text editing operation sequence, wherein the target text corresponding to the text to be processed comprises a compressed text or a rewritten text corresponding to the text to be processed. In the scheme, the text to be processed is processed by pre-training the obtained text processing model, so as to obtain the text editing operation sequence corresponding to the text to be processed, wherein the text processing model is obtained by training the sample text marked with the text editing operation sequence label, the text editing operation sequence label is obtained by marking the target text corresponding to the sample text, and the marking target text corresponding to the sample text retains the semantic information of the sample text, so that the semantic information of the sample text is fully considered by the obtained text editing operation sequence label, the accuracy of the text editing operation sequence label is higher, and the text editing operation sequence corresponding to the text to be processed can be accurately obtained by processing the obtained text processing model trained on the basis of the text editing operation sequence label, so that the target file readability of the text to be processed obtained according to the text editing operation sequence is higher, the accuracy of the processing result of the text to be processed is higher.
Secondly, before the text to be processed is processed through the text processing model, the text to be processed can be preprocessed through a field word replacement and text copying processing mode, so that the field words can be completely reserved in a text editing operation sequence of the preprocessed text to be processed obtained through the text processing model, the readability of the obtained target text is improved, the learning difficulty of the model is greatly reduced when the preprocessed text to be processed is processed through the text processing model, and the processing efficiency is improved.
In addition, the semantic vector of the preprocessed text to be processed can be obtained by coding the preprocessed text to be processed through the coding layer, and the semantic vector can accurately express semantic information contained in the preprocessed text to be processed, so that the accuracy of the obtained text editing operation sequence can be higher based on the decoding of the semantic information, and the target text of the preprocessed text obtained based on the preprocessed text to be processed obtained by the text editing operation sequence has clearer semantics and higher readability.
The training method of the text processing model comprises the following steps: collecting a sample text data set, wherein the sample text data set comprises a plurality of sample texts, each marked with a text editing operation sequence label that identifies the editing operation required to be executed by each character in the sample text and that is obtained according to the sample text and the labeled target text corresponding to the sample text; preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set; and training with the preprocessed sample text data set to obtain a text processing model, wherein the text processing model is used for obtaining a text editing operation sequence corresponding to a text to be processed, the text editing operation sequence being a sequence formed by the characters in the text to be processed that includes an identifier of the editing operation required to be executed for each character. In this scheme, the text processing model is trained with a sample text data set marked with text editing operation sequence labels. Unlike the prior art, in which each character in a training sample is labeled purely on a per-character basis, the text editing operation sequence label in this scheme is computed from each sample text together with its corresponding labeled target text. Because the labeled target text is a more accurate target text obtained by annotating the sample text while keeping its semantic information, the resulting label fully considers the semantics of the sample text. A text processing model trained on such a data set can accurately predict the text editing operation sequence of a text to be processed, so that the target text restored from that sequence has high readability.
In addition, performing domain word replacement on the sample text data set enables the trained text processing model to handle the domain words in the text to be processed well, and performing text copying on the sample text data set means the model only needs to learn the retention and deletion operations, so the learning difficulty is relatively low and the processing efficiency of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a first flowchart of a text processing method according to an embodiment of the present application;
fig. 2 is a second schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 3 is a third schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 4 is a first flowchart illustrating a training method of a text processing model according to an embodiment of the present application;
fig. 5 is a flowchart illustrating a second method for training a text processing model according to an embodiment of the present application;
fig. 6 is a third schematic flowchart of a training method for a text processing model according to an embodiment of the present application;
fig. 7 is a fourth flowchart illustrating a training method of a text processing model according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a fifth method for training a text processing model according to an embodiment of the present application;
fig. 9 is a sixth schematic flowchart of a training method for a text processing model according to an embodiment of the present application;
fig. 10 is a seventh flowchart illustrating a training method of a text processing model according to an embodiment of the present application;
fig. 11 is a schematic diagram of a text processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a training apparatus for a text processing model according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments will be clearly and completely described below with reference to the drawings. It should be understood that the drawings in this application are for illustrative and descriptive purposes only and are not used to limit its scope of protection. Additionally, the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments; it should be understood that the operations of a flowchart may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to a flowchart or remove one or more operations from it.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
First, a description will be given of a related background art according to the present embodiment:
with the rapid development of the internet and AI (Artificial Intelligence) technology, many fields such as social media contain a large number of commodity descriptions of roughly 20 to 300 characters (referred to herein as short texts, to distinguish them from document-level long texts). How to compress the content of this large volume of short texts into phrases that use fewer words and are easier to understand is one of the problems to be solved in these fields. Furthermore, in the multi-turn dialog scenario of a chat robot, the problem often arises of how to rewrite the short text currently input by a user so that it better fuses the existing multi-turn historical dialog information and helps the robot respond to the user. In the field of NLP (Natural Language Processing), two types of tasks suit these scenarios: text summarization and text compression (text simplification). Text summarization is generally applied to document-level long texts, compressing them into short summary information. Text compression takes a segment of text as input, and the system is expected to output a shorter expression whose core semantics are unchanged and whose wording is more concise.
The text summarization technology has many corresponding achievements in industry, but most of them are applied to chapter-level long texts, and the effect on social media short texts is not good. The definition of the text compression task is similar to the requirement of compressing short texts and can be regarded as equivalent to it.
However, little of the large body of research on text compression focuses on small-scale data sets, and most models adopt the most common sequence-to-sequence (Seq2Seq) framework. The Seq2Seq framework uses a Markov assumption of order C, i.e., generating a new word y(i+1) depends only on the previous C words that have already been generated. Most Seq2Seq-based models in industry are trained on data sets of more than 200,000 examples. Directly migrating an existing mature model to a small-scale data set is not ideal: because the amount of training data is small, the generated decodings easily suffer from poor readability. In addition, since the data comes from social media, it often contains new internet words, and it is a difficult problem to ensure that these new words, or the words strongly related to the domain, from the original text appear intact in the model's prediction result.
For the text compression task, there are currently several possible solutions in industry:
Method one: use an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory) as the encoding end of a Seq2Seq framework to perform semantic extraction on the text, and use an RNN or LSTM at the decoding end with an autoregressive language model as the training target. During decoding, an attention layer jointly considers the semantic hidden-layer vector output by the encoding end and the vector obtained from the previous decoding output to produce the next decoding output, thereby generating a segment of text. In addition, in order to generate out-of-vocabulary (OOV) words that did not appear in the model training phase, a Copy mechanism is also used in the training phase, so that in the prediction phase the model can implicitly copy to its output some words that appear in the original text.
Method two: treat the text compression task as a sequence labeling task: each character in a training sample is labeled, on a per-character basis, for model training. After training, the model outputs a sequence for a given user input indicating which characters in the text should be retained and which should be deleted, and a compressed text is obtained by restoring the text according to this sequence.
However, the above two methods have the following respective disadvantages:
Method one: in a text compression task based on the Seq2Seq framework, every output step of the model requires learning over the size of the whole vocabulary, and the generation quality of the model depends strongly on the data. When the amount of labeled training data is large, short texts with high readability can be generated; but when the scale of the labeled training data is small, the quality of the generated text is very low. Moreover, even if a Copy mechanism is adopted during training, when faced with domain-related words or new social media expressions such as 'black people lift coffin', the generation process may retain only 'black people' or 'lift coffin' and cannot guarantee that the complete phrase 'black people lift coffin' is decoded and generated.
Method two: the sequence labeling approach lacks semantic control over the whole sentence, so in some cases the semantic representation is unclear. On the other hand, similar to Seq2Seq above, since whether a character needs to be retained is determined at the character level of the input text, it cannot be completely guaranteed that new internet words or domain-related words are retained intact.
Based on the above technical problems in the prior art, the text processing method provided by this application has the following inventive concept: a text editing operation sequence label corresponding to a sample text is obtained according to the collected sample text and the labeled target text corresponding to it. Because the labeled target text retains the semantic information of the sample text, the obtained text editing operation sequence label fully considers that semantic information and is therefore highly accurate, and a text processing model trained on such labels can accurately obtain the text editing operation sequence corresponding to a text to be processed, so that the target text restored according to that sequence has better readability and the text processing result has higher accuracy.
In addition, domain word replacement is performed on the collected training sample data to obtain preprocessed training sample data, so that when the trained text processing model predicts the text editing operation sequence of a text to be processed, the domain words can be completely retained, improving the integrity of the text processing result.
The method steps involved in the present application, and the advantageous effects produced thereby, will be described in detail below with specific examples.
Fig. 1 is a first flowchart of a text processing method according to an embodiment of the present application; the execution subject of the method may be a processing device such as a terminal device or a server, and as shown in fig. 1, the method may include:
s101, obtaining a text to be processed.
Optionally, the method of this application may be applied to scenarios involving commodity descriptions, such as social media, where text information such as a commodity description can be processed by the method to obtain a target text. It can also be applied to scenarios involving dialog interaction, such as intelligent dialog: in a robot dialog system, the method can process the dialog input by the user, so that the robot can accurately reply to the user according to the processed target text. Of course, the application scenario of this scheme is not limited to the above examples; in practice, the scheme may be applied to any scenario that involves processing of text.
The text to be processed may be text extracted from social media software, for example: commodity description information obtained from a shopping application, or audio/video introduction information obtained from an audio/video playing application. It may also be text information input by a user and acquired from a terminal device, where the terminal device may be a conversation robot, a vehicle-mounted terminal, a smartphone, and the like; or an information search text input by a user, text crawled from a web page, and so on. These will not be described in detail.
S102, according to a text to be processed, obtaining a text editing operation sequence corresponding to the text to be processed by adopting a text processing model obtained by pre-training, wherein the text editing operation sequence is a sequence formed by characters in the text to be processed and comprises an identifier of an editing operation required to be executed by each character, the text processing model is obtained by using a sample text marked with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a labeled target text corresponding to the sample text.
In this embodiment, the text processing model obtained by training may be applied to process the text to be processed, so as to obtain the text editing operation sequence corresponding to the text to be processed.
The text editing operation sequence contains the text to be processed; the difference is that each character in the text to be processed is annotated with an identifier of the editing operation corresponding to that character. The identifier of the editing operation corresponding to each character can be understood as the identifier of the editing operation that the character must undergo when the target text is obtained from the text to be processed. For example: if the text to be processed is "I really do want to rest" and the corresponding target text is "I want to rest", then the text editing operation sequence may be "KEEP(I) DEL(really) DEL(do) KEEP(want) KEEP(to) KEEP(rest)", where KEEP and DEL are both identifiers of editing operations, and different identifiers correspond to different meanings, for example: KEEP means retain and DEL means delete.
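To make this mapping concrete, here is a minimal sketch of restoring the target text from such a sequence (the list-of-pairs representation of the sequence is an illustrative assumption):

    def apply_edit_sequence(ops):
        # `ops` is a list of (identifier, token) pairs such as
        # [("KEEP", "I"), ("DEL", "really"), ...]; KEEP retains the token
        # and DEL removes it, per the mapping described above.
        out = []
        for identifier, token in ops:
            if identifier == "KEEP":
                out.append(token)
            elif identifier == "DEL":
                continue                      # deleted tokens are dropped
            else:
                raise ValueError("unknown editing operation: " + identifier)
        return " ".join(out)

    # apply_edit_sequence([("KEEP", "I"), ("DEL", "really"), ("DEL", "do"),
    #                      ("KEEP", "want"), ("KEEP", "to"), ("KEEP", "rest")])
    # returns "I want to rest"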
Optionally, the labeled target text corresponding to a sample text is obtained by annotating the sample text according to its semantic information while preserving that semantic information; the labeled target text does not change the original semantics of the sample text. Therefore, the text editing operation sequence label obtained based on the labeled target text fully considers the semantic information of the sample text, and the text editing operation sequence of the text to be processed obtained by the text processing model is more accurate.
S103, obtaining a target text corresponding to the text to be processed according to the text editing operation sequence, wherein the target text corresponding to the text to be processed comprises a compressed text or a rewritten text corresponding to the text to be processed.
Optionally, the editing operation to be executed for each character may be determined according to the identifier of the editing operation in the text editing operation sequence, and each character is then processed according to its editing operation to obtain the target text corresponding to the text to be processed. Because the obtained text editing operation sequence is highly accurate, the target text obtained from it has high readability and high accuracy.
In some embodiments, the target text may be a compressed text corresponding to the text to be processed, or may also be a rewritten text corresponding to the text to be processed, and corresponding processing is performed specifically according to the requirement of the actual application. For the same text to be processed, the corresponding compressed text or rewritten text may be different, and in step S102, the text editing operation sequence of the text to be processed obtained through the text processing model is also different, so that the target text corresponding to the text to be processed can be obtained based on the obtained text editing operation sequence.
In summary, the text processing method provided in this embodiment includes: acquiring a text to be processed; obtaining a text editing operation sequence corresponding to the text to be processed by adopting a pre-trained text processing model, wherein the text editing operation sequence is a sequence formed by the characters in the text to be processed and comprises an identifier of the editing operation required to be executed for each character, the text processing model is obtained by training with sample texts marked with text editing operation sequence labels, and each text editing operation sequence label is obtained according to the sample text and the labeled target text corresponding to the sample text; and obtaining a target text corresponding to the text to be processed according to the text editing operation sequence, wherein the target text comprises a compressed text or a rewritten text corresponding to the text to be processed. In this scheme, because the labeled target text corresponding to a sample text retains the semantic information of the sample text, the text editing operation sequence label derived from it fully considers that semantic information and is highly accurate; a text processing model trained on such labels can accurately obtain the text editing operation sequence corresponding to the text to be processed, so the target text obtained according to that sequence has high readability and the processing result has high accuracy.
Fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application; optionally, as shown in fig. 2, before the step S102, according to the text to be processed, obtaining a text editing operation sequence corresponding to the text to be processed by using a text processing model obtained by pre-training, the method of the present application may further include:
s201, performing field word replacement on the text to be processed by adopting a pre-constructed field word dictionary to obtain the replaced text to be processed.
In some embodiments, the obtained text to be processed may contain some domain words. A domain word can be understood as specific entity information appearing in the text, or a specific word related to the domain, such as: a person's name, a song title, a movie title, jargon of certain game playing styles, or a new internet word.
For the domain words appearing in the text to be processed, because domain words evolve and change quickly, a text processing model's handling of them tends to be unsatisfactory. In this embodiment, the domain words in the text to be processed may therefore be replaced to obtain a replaced text to be processed.
The domain words in the text to be processed can be replaced with preset identifiers. Optionally, the preset identifiers may be uniform placeholders, for example: PLACE_HOLD_NUM_1, PLACE_HOLD_NUM_2, PLACE_HOLD_NUM_3, etc., where the numbers distinguish different domain words.
For example: suppose the text to be processed is a social media comment containing three domain words (in the original Chinese example, a game-related tag, an anchor's name, and a game term). After domain word replacement, the replaced text to be processed takes the form "#PLACE_HOLD_NUM_1# ... PLACE_HOLD_NUM_2 ... PLACE_HOLD_NUM_3 ...", where PLACE_HOLD_NUM_1 represents the first domain word, PLACE_HOLD_NUM_2 represents the anchor's name, and PLACE_HOLD_NUM_3 represents the third domain word.
Of course, the preset identifier may be other types of symbols, and may be used to uniquely represent different domain words.
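A minimal sketch of this replacement step, assuming the pre-constructed domain word dictionary can be queried as a plain list of known domain words:

    def replace_domain_words(text, domain_words):
        # Replace each domain word in `text` with a unique placeholder such
        # as PLACE_HOLD_NUM_1 and remember the mapping so the words can be
        # restored after the model has produced its edit sequence.
        mapping = {}
        # Longest-first, so longer domain words are matched before any
        # shorter words they contain.
        for word in sorted(domain_words, key=len, reverse=True):
            if word in text:
                placeholder = "PLACE_HOLD_NUM_%d" % (len(mapping) + 1)
                text = text.replace(word, placeholder)
                mapping[placeholder] = word
        return text, mapping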
S202, copying the replaced text to be processed, and splicing the copied text and the replaced text to be processed to obtain the preprocessed text to be processed.
Optionally, for the replaced text to be processed, further performing a copy operation, that is, copying a copy of the replaced text to be processed, and splicing the copy of the text to be processed and the replaced text to be processed together by using a preset splicing character, so as to finally obtain the pre-processed text to be processed.
For example: for the replaced text to be processed above, after the copy operation, the preprocessed text to be processed takes the form "<replaced text> [MID] <replaced text>", i.e., the replaced text, followed by the splicing character, followed by an identical copy, where [MID] is the preset splicing character.
For example, suppose text compression is performed on a text to be processed such as "what is the abbreviation of demon hunter", which should be compressed to "demon hunter abbreviation". It can be seen that the word order of some words in the compressed text is reversed relative to the text to be processed.
If the text copying operation is not performed, an operation that replaces blank characters with "abbreviation" would be required when obtaining the compressed text through text editing operations. When the amount of data is large, this invisibly increases the content the model has to predict, and the learning result of the model may be unsatisfactory.
If the copy operation is performed, the text to be processed becomes: "what is the abbreviation of demon hunter [MID] what is the abbreviation of demon hunter", which can be handled perfectly by retaining only "demon hunter" in the first copy and "abbreviation" in the second copy and deleting the remaining characters. Therefore, when the text processing model processes the preprocessed text to be processed, the learning difficulty of the model is greatly reduced and the processing efficiency is improved.
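A sketch of the copy-and-splice step, using the [MID] splicing character from the example above:

    def copy_and_splice(replaced_text, sep="[MID]"):
        # Duplicate the replaced text and join the copy to the original with
        # the preset splicing character, so that word-order changes can be
        # realized purely through keep/delete decisions across the two copies.
        return replaced_text + " " + sep + " " + replaced_text

    # copy_and_splice("what is the abbreviation of demon hunter") returns
    # "what is the abbreviation of demon hunter [MID]
    #  what is the abbreviation of demon hunter"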
Optionally, in step S102, obtaining a text editing operation sequence corresponding to the text to be processed by using a text processing model obtained by pre-training according to the text to be processed, where the text editing operation sequence may include:
and S203, obtaining a text editing operation sequence corresponding to the preprocessed to-be-processed text by adopting a text processing model obtained by pre-training according to the preprocessed to-be-processed text.
Optionally, the preprocessed text to be processed is input into the text processing model, so that the text editing operation sequence corresponding to it can be obtained. The preset identifiers representing domain words are retained in the text editing operation sequence, so the domain words in the text to be processed are completely retained in the model's output; as a result, the target text obtained according to the text editing operation sequence is more readable, and no semantic misinterpretation or omission of important semantics occurs.
Fig. 3 is a schematic flowchart of a text processing method according to an embodiment of the present application; optionally, the text processing model in step S102 may include: an encoding layer and a decoding layer.
In step S203, obtaining a text editing operation sequence corresponding to the preprocessed to-be-processed text by using a text processing model obtained by pre-training according to the preprocessed to-be-processed text, which may include:
s301, inputting the preprocessed to-be-processed text into a coding layer for semantic coding to obtain a semantic vector of the preprocessed to-be-processed text.
In this embodiment, the coding layer may adopt a Bert (Bidirectional Encoder Representations from Transformers) model. Bert makes better use of large amounts of data to construct sentence vectors that are more consistent with semantic information. After the preprocessed text to be processed is input into Bert, it is mapped into a matrix vector, a sentence vector corresponding to the preprocessed text is output, and a hidden-layer coded representation of that sentence vector is produced, thereby obtaining the semantic vector of the preprocessed text to be processed.
S302, inputting the semantic vector of the preprocessed to-be-processed text into a decoding layer for decoding to obtain a text editing operation sequence corresponding to the preprocessed to-be-processed text.
In this embodiment, the decoding layer may use a CRF (Conditional Random Field), a conditional probability distribution model of an output sequence given an input sequence. It decodes the semantic vector produced by the coding layer to obtain, for each character in the preprocessed text to be processed, a probability distribution over editing operations, and outputs the text editing operation sequence corresponding to the preprocessed text according to that distribution.
Of course, the above is only an exemplary form of the coding layer and the decoding layer, and in practical applications, the coding layer and the decoding layer may be formed by other models, which is not specifically limited in this application.
By encoding the preprocessed text to be processed through the coding layer, a semantic vector is obtained that accurately expresses the semantic information contained in the text. Decoding based on this semantic information therefore yields a more accurate text editing operation sequence, and the target text obtained from that sequence is clearer in semantics and more readable.
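For concreteness, the following is a minimal sketch of such an encoding-decoding text processing model, assuming the HuggingFace transformers library for the Bert encoder and the pytorch-crf package for the CRF decoding layer; class and parameter names are illustrative, not the patent's own implementation:

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pytorch-crf package

class EditTagger(nn.Module):
    """Encoding layer (Bert) + decoding layer (CRF) for per-character
    edit-operation tagging, as described above."""

    def __init__(self, num_tags: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained)   # encoding layer
        self.emit = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)             # decoding layer

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(hidden)          # per-character tag scores
        mask = attention_mask.bool()
        if tags is not None:                   # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```

At inference time, `decode` returns the most likely editing-operation tag path for each input, which corresponds to the text editing operation sequence described above.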
Optionally, in step S103, obtaining a target text corresponding to the text to be processed according to the text editing operation sequence may include: and obtaining a target text corresponding to the text to be processed according to the identifier of the editing operation required to be executed by each character in the text editing operation sequence and the mapping relation between the identifier of the editing operation and the preset editing operation.
Optionally, the editing operations required to be performed by each character may include: delete operations, reserve operations, or replace operations.
Optionally, the identifier of an editing operation uniquely distinguishes different editing operations. In this embodiment, the identifier DEL may indicate a delete operation, KEEP a reserve operation, and KEEP| a replace operation. Of course, in practical applications, the editing operations that each character may require are not limited to the above; they may also include, for example, inserting a character while deleting another. The identifiers used to indicate the different editing operations may likewise take other forms and are not limited to these examples.
Assuming that the text to be processed is "what is the demon hunter abbreviated as", and that the text editing operation sequence of the preprocessed text obtained through the text processing model marks the characters of "demon hunter" in the first copy and the characters of "abbreviation" in the second copy with KEEP, and marks the splicing character [MID] and all remaining characters with DEL, then according to the mapping relationship between the identifiers of the editing operations and the preset editing operations, the target text corresponding to the text to be processed is "demon hunter abbreviation". As noted above, [MID] is the splicing character used to splice the copied text, and it may be deleted by default when the target text is obtained from the text editing operation sequence.
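A hedged sketch of this restoration step follows, using the DEL/KEEP identifiers above; the `KEEP|x` replacement-text convention is our assumption rather than a confirmed detail of the patent:

```python
def apply_edit_sequence(chars, ops, separator: str = "[MID]") -> str:
    """Rebuild the target text from per-character edit operations.
    'DEL' drops the character, 'KEEP' reserves it, and 'KEEP|x' (assumed
    notation) reserves the position but emits the replacement text x."""
    out = []
    for ch, op in zip(chars, ops):
        if op == "DEL":
            continue                        # delete operation
        if op == "KEEP":
            out.append(ch)                  # reserve operation
        elif op.startswith("KEEP|"):
            out.append(op.split("|", 1)[1])  # replace operation
    return "".join(out).replace(separator, "")  # splicing character is dropped
```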
In summary, the text processing method provided in this embodiment includes: acquiring a text to be processed; obtaining, from the text to be processed, a corresponding text editing operation sequence by using a pre-trained text processing model, where the sequence is formed by the characters in the text to be processed and includes an identifier of the editing operation to be executed for each character, the model is trained with sample texts labeled with text editing operation sequence labels, and each label is obtained from a sample text and its corresponding labeled target text; and obtaining the target text corresponding to the text to be processed according to the text editing operation sequence, the target text including a compressed text or a rewritten text. Because the labeled target text corresponding to a sample text retains the semantic information of that sample text, the text editing operation sequence labels derived from it fully consider this semantic information and are highly accurate; a text processing model trained on such labels can therefore accurately produce the text editing operation sequence of a text to be processed, so that the target text obtained from that sequence is more readable and the processing result is more accurate.

Secondly, before the text to be processed is handled by the text processing model, it can be preprocessed through domain word replacement and text copying, so that the domain words are completely preserved in the text editing operation sequence produced by the model, improving the readability of the obtained target text; at the same time, the learning difficulty of the model when processing the preprocessed text is greatly reduced and the processing efficiency is improved.

In addition, encoding the preprocessed text to be processed through the coding layer yields a semantic vector that accurately expresses the semantic information contained in the text; decoding based on this semantic information makes the obtained text editing operation sequence more accurate, so that the target text derived from the sequence is clearer in semantics and more readable.
As follows, a specific training process of the text processing model applied in the above embodiments will be explained through a plurality of embodiments.
Fig. 4 is a first flowchart illustrating a training method of a text processing model according to an embodiment of the present application; the execution subject of the method can also be a terminal device or a processing device such as a server or a computer. As shown in fig. 4, the training method of the text processing model may include:
s401, collecting a sample text data set, wherein the sample text data set comprises a plurality of sample texts, and each sample text is marked with: the text editing operation sequence labels are used for identifying the editing operation required to be executed by each character in the sample texts, and the text editing operation sequence labels are obtained according to the sample texts and the labeled target texts corresponding to the sample texts.
Alternatively, a crawler system may be relied on to collect sample texts from a large number of social media such as microblogs, posts, etc., or from dialog log data, and a sample text data set may be formed by a plurality of collected sample texts.
In this embodiment, the number of characters contained in each sample text is smaller than a preset number, so as to exclude long document-type texts; that is, each sample text is a short text. The preset number may be, for example, 300, and can be flexibly adjusted according to actual conditions.
Wherein each sample text is further labeled with: and the text editing operation sequence label is used for training and acquiring a text processing model according to the sample text data set marked with the text editing operation sequence label, so that the text processing model can process the text to be processed to obtain a text editing operation sequence corresponding to the text to be processed.
S402, preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set.
Similar to the application process of the model above, when training the text processing model, the collected sample text data set needs to be preprocessed to obtain the preprocessed sample text data set. The preprocessing may include: domain word replacement and text copying.
And S403, training by using the preprocessed sample text data set to obtain a text processing model, wherein the text processing model is used for obtaining a text editing operation sequence corresponding to the text to be processed, the text editing operation sequence is a sequence formed by characters in the text to be processed, and the sequence comprises an identifier of an editing operation required to be executed by each character.
Optionally, the obtained preprocessed sample text data set is input into a pre-trained text processing model to train to obtain the text processing model.
By the text editing operation sequence label marked by the sample text, the text processing model obtained by training can be accurately processed to obtain a text editing operation sequence corresponding to the text to be processed, so that the readability of the target text obtained based on the text editing operation sequence is better. Moreover, because the sample texts are preprocessed, the trained text processing model can better process the field words, and the processing efficiency of the model is improved.
In summary, the training method of the text processing model provided in this embodiment includes: collecting a sample text data set comprising a plurality of sample texts, each labeled with a text editing operation sequence label that identifies the editing operation to be executed for each character, the label being obtained from the sample text and its corresponding labeled target text; preprocessing each sample text to obtain a preprocessed sample text data set; and training a text processing model with the preprocessed data set, the model being used to obtain the text editing operation sequence corresponding to a text to be processed, that sequence being formed by the characters of the text and including an identifier of the editing operation for each character. In this scheme, the model is trained with a sample text data set labeled with text editing operation sequence labels. Unlike the prior art, in which each character of a training sample is labeled from simple character-level considerations, the labels here are computed from each sample text and its corresponding labeled target text. Because the labeled target text is a more accurate target obtained by annotating the sample text while retaining its semantic information, the resulting labels fully consider that semantic information; a text processing model trained on them can accurately predict the text editing operation sequence of a text to be processed, so that the target text restored from the sequence is highly readable.
Fig. 5 is a flowchart illustrating a second method for training a text processing model according to an embodiment of the present application; optionally, as shown in fig. 5, in the step S402, preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set, which may include:
s501, according to the knowledge graph information and/or the field guidance information, field words are obtained from each sample text in the sample text data set, and the field words are used for representing entity information.
It should be noted that a knowledge graph, also called a scientific knowledge map and known in the library and information science field as knowledge domain visualization or knowledge domain mapping, is a family of graphs displaying the development process and structural relationships of knowledge. It describes knowledge resources and their carriers using visualization technology, and mines, analyzes, constructs, draws and displays knowledge and the interrelations between its elements.

In this embodiment, relying on public knowledge graph information and domain expert guidance information, the collected sample text data set can be mined for domain words that already appear in the public knowledge graph, thereby obtaining the domain words included in the sample text data set.
S502, constructing the corresponding relation between each sample text and each field word to form an initial field word data set.
Optionally, each sample text in the sample text data set may include a field word, and then, a corresponding relationship between the obtained field word and the sample text corresponding to the field word may be constructed, so that an initial field word data set is formed according to the constructed corresponding relationship. The initial domain word data set may include a plurality of groups of key-value pairs, and each group of key-value pairs is used to record a correspondence between a domain word and a sample text.
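A minimal sketch of forming this initial domain word data set as key-value pairs follows; the function and variable names are illustrative assumptions:

```python
from collections import defaultdict

def build_initial_domain_word_dataset(samples, known_domain_words):
    """Record, as key-value pairs, which sample texts each known domain
    word (mined from knowledge-graph / expert guidance) appears in."""
    dataset = defaultdict(list)
    for text in samples:
        for word in known_domain_words:
            if word in text:
                dataset[word].append(text)  # one key-value group per match
    return dict(dataset)
```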
And S503, obtaining a domain word dictionary according to the initial domain word data set.
Considering that existing public knowledge graphs have limited timeliness, only the domain words already appearing in the knowledge graph can be mined from the sample text data set; that is, the domain words obtained so far are not complete. A sequence labeling model may therefore be further trained on the initial domain word data set, so that domain words can be recognized in the sample text data set by the trained model, supplementing those obtained from the knowledge graph information and/or domain guidance information and thereby yielding the domain word dictionary.
And S504, obtaining a preprocessed sample text data set according to the domain word dictionary.
Optionally, since the domain words are obtained from these two sources, the resulting domain word dictionary is relatively complete, so the preprocessed sample text data set obtained by preprocessing the sample texts against this dictionary is more accurate.

In this scheme, the preprocessing of the sample text data set may include: domain word replacement and text copying. In practical applications, other preprocessing methods can be added or substituted according to training requirements.
Fig. 6 is a third schematic flowchart of a training method for a text processing model according to an embodiment of the present application; alternatively, as shown in fig. 6, in step S503, obtaining a domain word dictionary according to the initial domain word data set may include:
s601, training and obtaining a sequence labeling model according to the initial field word data set, wherein the sequence labeling model is used for identifying field words in the text.
Sequence labeling is a major sentence-level task in the NLP (natural language processing) field: given a text sequence, it predicts the label to be annotated at each position. Here a single sequence labeling model suffices, taking the whole text as input and outputting the extracted domain words.

When training the sequence labeling model, the training data is the initial domain word data set formed above. Because the model generalizes, domain words beyond those seen in training may also be output in the model prediction stage.
S602, inputting the sample text data set into the sequence labeling model, and identifying and obtaining the field words contained in the sample text data set.
In some embodiments, the trained sequence tagging model may be used to identify the domain words included in the sample text data set to obtain a plurality of domain words. The recognized domain words may include the domain words obtained according to the knowledge graph information and/or the domain guidance information.
And S603, obtaining a domain word dictionary according to the domain words contained in the identified and obtained sample text data set and the domain words contained in the initial domain word data set.
Optionally, the domain words included in the initial domain word data set may be supplemented according to the domain words included in the sample text data set identified by the sequence tagging model. That is, the domain words identified by the model and the domain words in the initial domain word data set together form a domain word dictionary.
Where the domain words identified by the model and the domain words included in the initial domain word data set overlap, only one copy of each repeated domain word is retained.
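A one-function sketch of this merge-and-deduplicate step (names are illustrative):

```python
def merge_domain_words(model_predicted, initial_dataset_words):
    """Union the domain words recognized by the sequence labeling model
    with those in the initial domain word data set; duplicates are kept
    only once, yielding the final domain word dictionary."""
    return sorted(set(model_predicted) | set(initial_dataset_words))
```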
By the method, a relatively complete domain word dictionary can be obtained for preprocessing the sample text data set in the following embodiments.
Fig. 7 is a fourth flowchart illustrating a training method of a text processing model according to an embodiment of the present application; optionally, as shown in fig. 7, in the step S504, obtaining the preprocessed sample text data set according to the domain word dictionary may include:
and S701, performing field word replacement on each sample text in the sample text data set according to the field word dictionary to obtain a replaced sample text data set.
Optionally, domain word replacement is performed on each sample text in the sample text data set, which can be understood by referring to the example in step S201, and details are not repeated here.
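As a hedged sketch of such longest-match domain word replacement — the PLACE_HOLD_NUM_i placeholder format is an assumption consistent with the example above, not a confirmed detail of the patent:

```python
def replace_domain_words(text, domain_words):
    """Replace each domain word with a numbered placeholder and remember
    the mapping so the words can be restored after decoding. Longer words
    are replaced first to avoid partial matches."""
    mapping = {}
    for i, word in enumerate(sorted(domain_words, key=len, reverse=True), 1):
        if word in text:
            placeholder = f"PLACE_HOLD_NUM_{i}"
            text = text.replace(word, placeholder)
            mapping[placeholder] = word
    return text, mapping
```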
S702, copying each sample text in the replaced sample text data set to obtain a copied text corresponding to each sample text, and splicing the copied texts with each sample text to obtain a plurality of preprocessed sample texts.
Optionally, a copy operation is further required for each sample text in the replaced sample text data set obtained after the replacement of the domain word, and similarly, the example of step S202 may be referred to for understanding, and details thereof are not repeated here.
After the preprocessing of domain word replacement and text copying, a plurality of preprocessed sample texts can be obtained.
And S703, obtaining a preprocessed sample text data set according to the preprocessed sample texts.
Optionally, a plurality of preprocessed sample texts may be combined to obtain a preprocessed sample text data set. And taking the preprocessed sample text data set as a training sample of the text processing model to train the text processing model.
Fig. 8 is a schematic flowchart of a fifth method for training a text processing model according to an embodiment of the present application; optionally, as shown in fig. 8, before preprocessing each sample text in the sample text data set in step S402 to obtain a preprocessed sample text data set, the method of the present application may further include:
s801, obtaining an initial target text corresponding to each sample text in the sample text data set.
Optionally, the initial target text corresponding to each sample text is determined by the specific text processing task. For example, when the trained text processing model is used for text compression, the initial target text obtained for each sample text may be the compressed text corresponding to that sample text: a sample text such as "the #stoechite legend# and the #stoechite legend# stewardship Chinese uniform selection contest, the two are devils" may correspond to the initial target text "the #stoechite legend# stewardship Chinese uniform selection contest". When the trained text processing model is used for text rewriting, the initial target text obtained for each sample text may be the rewritten text corresponding to that sample text. For example, the sample text "user A: I feel very bored; user B: then do not chat with me; user A: I can do nothing else but this" corresponds to the initial target text "can do nothing else but chat with me".
And S802, replacing the field words of the initial target text according to the field word dictionary to obtain a labeled target text corresponding to each sample text.
Similarly, for the initial target text corresponding to each sample text, field word replacement is also required to obtain the labeled target text corresponding to each sample text.
The domain word dictionary used in this embodiment may be the one obtained in steps S601 to S603, or a domain word dictionary pre-constructed in another manner; this is not specifically limited in this application.
When the domain word is replaced for the initial target text corresponding to each sample text according to the domain word dictionary, the replacement method may be the same as that listed in the above embodiments, and details are not repeated here.
Fig. 9 is a sixth schematic flowchart of a training method for a text processing model according to an embodiment of the present application; optionally, as shown in fig. 9, in step S403, training to obtain a text processing model by using the preprocessed sample text data set may include:
and S901, obtaining a text editing operation sequence label marked by each sample text in the preprocessed sample text data set by adopting a text editing algorithm according to the preprocessed sample text data set and the labeled target text corresponding to each sample text.
In some embodiments, the text editing operation sequence label marked by each sample text in the sample text data set may be obtained according to the preprocessed sample text data set and the labeled target text corresponding to each sample text in the data set.
Optionally, a text editing algorithm may be adopted to automatically convert each sample text in the preprocessed sample text data set into a text editing operation sequence according to the corresponding labeled target text, so that the text editing operation sequence label of each sample text is determined according to the text editing operation sequence converted from each sample text, so as to label each sample text.
The text editing algorithm adopted in this embodiment is an existing, mature algorithm that is simply applied here; its specific principles are not described in detail.
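Since the patent treats the text editing algorithm as an existing mature algorithm without detailing it, the following is only a minimal greedy sketch, under the assumption that the labeled target text is a character subsequence of the (copied and spliced) sample text; real systems fall back to replace or insert operations when that assumption fails:

```python
def derive_edit_labels(source_chars, target_chars):
    """Greedy two-pointer alignment: label each source character KEEP if it
    matches the next unconsumed target character, otherwise DEL."""
    labels, t = [], 0
    for ch in source_chars:
        if t < len(target_chars) and ch == target_chars[t]:
            labels.append("KEEP")
            t += 1
        else:
            labels.append("DEL")
    if t != len(target_chars):
        raise ValueError("target is not a subsequence of source")
    return labels
```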
S902, training and obtaining a text processing model according to the preprocessed sample text data set and the text editing operation sequence label marked by each sample text in the preprocessed sample text data set.
The text editing operation sequence label of each sample text is obtained from the sample text and its corresponding labeled target text. Because the labeled target text is a more accurate target obtained while conforming to the semantic information of the sample text, the resulting label takes that semantic information into account. Training the text processing model with a sample text data set labeled in this way therefore yields a model that can accurately predict the text editing operation sequence of a text to be processed, so that the target text restored from the sequence is more readable.
Fig. 10 is a seventh flowchart illustrating a training method of a text processing model according to an embodiment of the present application; optionally, in step S902, training to obtain a text processing model according to the preprocessed sample text data set and the text editing operation sequence label of each sample text label in the preprocessed sample text data set, which may include:
s1001, inputting the preprocessed sample text data set into a pre-trained text processing model to obtain a training text editing operation sequence of the pre-trained text processing model.
Optionally, the pre-trained text processing model may predict and output a training text editing operation sequence corresponding to each sample text according to the input pre-processed sample text data set, where the training text editing operation sequence may also be understood as an actual prediction result of the model.
S1002, determining the cross entropy of the pre-trained text processing model according to the training text editing operation sequence and the text editing operation sequence label.
Optionally, the actual prediction result "training text editing operation sequence" output by the model may be compared with the expected result "text editing operation sequence tag" corresponding to each sample text, and the cross entropy of the training text editing operation sequence and the text editing operation sequence tag may be calculated.
Cross entropy is a loss function that measures the distance between two distributions: the smaller the cross entropy, the closer the two distributions, meaning the better the model has learned.
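A minimal sketch of such a token-level cross entropy between the model's predicted editing-operation distributions and the label sequence; the tensor shapes and names are assumptions:

```python
import torch
import torch.nn.functional as F

def sequence_cross_entropy(logits: torch.Tensor,
                           label_ids: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, num_tags) predicted edit-operation scores;
    label_ids: (batch, seq_len) expected tag ids; mask: (batch, seq_len)
    float mask that zeroes out padding positions."""
    per_token = F.cross_entropy(logits.transpose(1, 2), label_ids,
                                reduction="none")   # (batch, seq_len)
    return (per_token * mask).sum() / mask.sum()    # mean over real tokens
```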
S1003, correcting the pre-trained text processing model according to the cross entropy to obtain a text processing model.
Optionally, after each round of training it can be judged whether the cross entropy is smaller than a preset threshold; if not, iterative training continues to correct the model, until the cross entropy falls below the threshold, at which point training stops and the trained text processing model is obtained.
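A sketch of this threshold-stopped training loop, reusing the `sequence_cross_entropy` helper above and assuming the model returns per-token logits; the threshold and epoch cap are illustrative values, not the patent's:

```python
def train_until_converged(model, loader, optimizer,
                          threshold: float = 0.05, max_epochs: int = 50):
    """Run training rounds and stop once a round's average cross entropy
    drops below the preset threshold."""
    for _ in range(max_epochs):
        total, steps = 0.0, 0
        for input_ids, attention_mask, label_ids in loader:
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask)        # (B, L, num_tags)
            loss = sequence_cross_entropy(logits, label_ids,
                                          attention_mask.float())
            loss.backward()                                  # correct the model
            optimizer.step()
            total, steps = total + loss.item(), steps + 1
        if total / steps < threshold:                        # convergence test
            break
    return model
```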
Optionally, based on the trained text processing model, the text to be processed may be processed to obtain a text editing operation sequence corresponding to the text to be processed, so as to obtain a target text corresponding to the text to be processed according to the text editing operation sequence, and when an identifier for representing a field word exists in the target text, the field word may be restored according to the identifier to obtain a final target text.
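The final restoration of domain words can be sketched as the inverse of the replacement mapping built during preprocessing (see `replace_domain_words` above; the names are illustrative):

```python
def restore_domain_words(target_text: str, mapping: dict) -> str:
    """Replace every placeholder identifier left in the decoded target text
    with the original domain word recorded during preprocessing."""
    for placeholder, word in mapping.items():
        target_text = target_text.replace(placeholder, word)
    return target_text
```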
In summary, the training method of the text processing model provided in the embodiment of the present application includes: collecting a sample text data set, wherein the sample text data set comprises a plurality of sample texts, and each sample text is marked with: the text editing operation sequence labels are used for identifying the editing operation required to be executed by each character in the sample texts, and the text editing operation sequence labels are obtained according to the sample texts and the labeled target texts corresponding to the sample texts; preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set; training by adopting a preprocessed sample text data set to obtain a text processing model, wherein the text processing model is used for obtaining a text editing operation sequence corresponding to a text to be processed, the text editing operation sequence is a sequence formed by characters in the text to be processed, and the sequence comprises an identifier of an editing operation required to be executed by each character.
In this scheme, the text processing model is trained with a sample text data set labeled with text editing operation sequence labels. Unlike the prior art, in which each character of a training sample is labeled from simple character-level considerations, the labels here are computed from each sample text and its corresponding labeled target text. Because that labeled target text is a more accurate target obtained by annotating the sample text while retaining its semantic information, the resulting labels fully consider the semantic information of the sample text; a text processing model trained on them can accurately predict the text editing operation sequence of a text to be processed, so that the target text restored from that sequence is highly readable.
In addition, performing domain word replacement on the sample text data set enables the trained text processing model to handle domain words well, and performing text copying on the sample text data set means the model only needs to learn the reserve and delete operations, so the learning difficulty is relatively low and the processing efficiency of the model is improved.
The following describes apparatuses, devices, and storage media for executing the text processing method and the training method for a text processing model provided in the present application, and specific implementation processes and technical effects thereof are referred to above, and will not be described again below.
Fig. 11 is a schematic diagram of a text processing apparatus according to an embodiment of the present application, and as shown in fig. 11, the apparatus may include: an acquisition module 110 and a processing module 120;
an obtaining module 110, configured to obtain a text to be processed;
the processing module 120 is configured to obtain, according to a to-be-processed text, a text editing operation sequence corresponding to the to-be-processed text by using a text processing model obtained through pre-training, where the text editing operation sequence is a sequence formed by characters in the to-be-processed text, the sequence includes an identifier of an editing operation that needs to be executed by each character, the text processing model is obtained by using a sample text labeled with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a labeled target text corresponding to the sample text;
the obtaining module 110 is further configured to obtain a target text corresponding to the text to be processed according to the text editing operation sequence, where the target text corresponding to the text to be processed includes a compressed text or an rewritten text corresponding to the text to be processed.
Optionally, the apparatus further comprises: a preprocessing module;
the preprocessing module is used for replacing the field words of the text to be processed by adopting a pre-constructed field word dictionary to obtain the replaced text to be processed; and copying the replaced text to be processed, and splicing the copied text and the text to be processed to obtain the preprocessed text to be processed.
The processing module 120 is specifically configured to obtain, according to the preprocessed to-be-processed text, a text editing operation sequence corresponding to the preprocessed to-be-processed text by using a text processing model obtained through pre-training.
Optionally, the text processing model comprises: an encoding layer and a decoding layer;
the processing module 120 is specifically configured to input the preprocessed to-be-processed text into a coding layer for semantic coding, so as to obtain a semantic vector of the preprocessed to-be-processed text; and inputting the semantic vector of the preprocessed text to be processed into a decoding layer for decoding to obtain a text editing operation sequence corresponding to the preprocessed text to be processed.
Optionally, the obtaining module 110 is specifically configured to obtain a target text corresponding to the text to be processed according to the identifier of the editing operation that needs to be executed by each character in the text editing operation sequence and a mapping relationship between the identifier of the editing operation and a preset editing operation.
Optionally, the editing operations required to be performed by each character include: delete operations, reserve operations, or replace operations.
Fig. 12 is a schematic diagram of a training apparatus for a text processing model according to an embodiment of the present application, and as shown in fig. 12, the training apparatus may include: an acquisition module 210, a preprocessing module 220, and a training module 230;
an acquiring module 210, configured to acquire a sample text data set, where the sample text data set includes a plurality of sample texts, and each sample text is marked with: the text editing operation sequence labels are used for identifying the editing operation required to be executed by each character in the sample texts, and the text editing operation sequence labels are obtained according to the sample texts and the labeled target texts corresponding to the sample texts;
the preprocessing module 220 is configured to preprocess each sample text in the sample text data set to obtain a preprocessed sample text data set;
the training module 230 is configured to train to obtain a text processing model by using the preprocessed sample text data set, where the text processing model is used to obtain a text editing operation sequence corresponding to the text to be processed, the text editing operation sequence is a sequence formed by characters in the text to be processed, and the sequence includes an identifier of an editing operation that needs to be executed by each character.
Optionally, the preprocessing module 220 is specifically configured to obtain a domain word from each sample text in the sample text data set according to the knowledge graph information and/or the domain guidance information, where the domain word is used to represent entity information; constructing a corresponding relation between each sample text and each field word to form an initial field word data set; obtaining a domain word dictionary according to the initial domain word data set; and obtaining a preprocessed sample text data set according to the domain word dictionary.
Optionally, the preprocessing module 220 is specifically configured to train and obtain a sequence labeling model according to the initial domain word data set, where the sequence labeling model is used to identify a domain word in a text; inputting the sample text data set into a sequence labeling model, and identifying and acquiring field words contained in the sample text data set; and obtaining a domain word dictionary according to the domain words contained in the identified and obtained sample text data set and the domain words contained in the initial domain word data set.
Optionally, the preprocessing module 220 is specifically configured to perform domain word replacement on each sample text in the sample text data set according to the domain word dictionary to obtain a replaced sample text data set; copying each sample text in the replaced sample text data set to obtain a copied text corresponding to each sample text, and splicing the copied text with each sample text to obtain a plurality of preprocessed sample texts; and obtaining a preprocessed sample text data set according to the preprocessed sample texts.
Optionally, the apparatus further comprises: an acquisition module;
the acquisition module is used for acquiring an initial target text corresponding to each sample text in the sample text data set;
the preprocessing module 220 is further configured to perform domain word replacement on the initial target text according to the domain word dictionary to obtain a labeled target text corresponding to each sample text.
Optionally, the training module 230 is specifically configured to obtain, by using a text editing algorithm, a text editing operation sequence tag marked by each sample text in the preprocessed sample text data set according to the preprocessed sample text data set and the labeled target text corresponding to each sample text; and training to obtain a text processing model according to the preprocessed sample text data set and the text editing operation sequence label marked by each sample text in the preprocessed sample text data set.
Optionally, the training module 230 is specifically configured to input the preprocessed sample text data set into the pre-trained text processing model, so as to obtain a training text editing operation sequence of the pre-trained text processing model; determining the cross entropy of a pre-trained text processing model according to the training text editing operation sequence and the text editing operation sequence label; and correcting the pre-trained text processing model according to the cross entropy to obtain a text processing model.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
The modules may be connected or in communication with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, etc., or any combination thereof. The wireless connection may comprise a connection over a LAN, WAN, bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to corresponding processes in the method embodiments, and are not described in detail in this application.
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device may be a computing device with a data processing function.
The apparatus may include: a processor 801 and a memory 802.
The memory 802 is used for storing programs, and the processor 801 calls the programs stored in the memory 802 to execute the above-mentioned method embodiments. The specific implementation and technical effects are similar, and are not described herein again.
The memory 802 stores therein program code, which, when executed by the processor 801, causes the processor 801 to perform various steps of a text processing method or a training method of a text processing model according to various exemplary embodiments of the present application described in the above-mentioned "exemplary methods" section of the present specification.
The Processor 801 may be a general-purpose Processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware components, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
Memory 802, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory may include at least one type of storage medium, and may include, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), a charged Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 802 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function for storing program instructions and/or data.
Optionally, the present application also provides a program product, such as a computer readable storage medium, comprising a program which, when being executed by a processor, is adapted to carry out the above-mentioned method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to perform some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (16)

1. A method of text processing, comprising:
acquiring a text to be processed;
obtaining a text editing operation sequence corresponding to the text to be processed by adopting a text processing model obtained by pre-training according to the text to be processed, wherein the text editing operation sequence is a sequence formed by characters in the text to be processed, the sequence comprises an identifier of an editing operation required to be executed by each character, the text processing model is obtained by using a sample text marked with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a marked target text corresponding to the sample text;
and obtaining a target text corresponding to the text to be processed according to the text editing operation sequence, wherein the target text corresponding to the text to be processed comprises a compressed text or an rewritten text corresponding to the text to be processed.
2. The method according to claim 1, wherein before obtaining the text editing operation sequence corresponding to the text to be processed by using a text processing model obtained by pre-training according to the text to be processed, the method further comprises:
performing field word replacement on the text to be processed by adopting a pre-constructed field word dictionary to obtain a replaced text to be processed;
copying the replaced text to be processed, and splicing the copied text and the text to be processed to obtain a preprocessed text to be processed;
the obtaining of the text editing operation sequence corresponding to the text to be processed by adopting a text processing model obtained by pre-training according to the text to be processed includes:
and obtaining a text editing operation sequence corresponding to the preprocessed to-be-processed text by adopting a text processing model obtained by pre-training according to the preprocessed to-be-processed text.
3. The method of claim 2, wherein the text processing model comprises: an encoding layer and a decoding layer;
the obtaining of the text editing operation sequence corresponding to the preprocessed to-be-processed text by adopting a text processing model obtained by pre-training according to the preprocessed to-be-processed text comprises:
inputting the preprocessed to-be-processed text into the coding layer for semantic coding to obtain a semantic vector of the preprocessed to-be-processed text;
and inputting the semantic vector of the preprocessed text to be processed into the decoding layer for decoding to obtain a text editing operation sequence corresponding to the preprocessed text to be processed.
4. The method according to claim 3, wherein obtaining the target text corresponding to the text to be processed according to the text editing operation sequence comprises:
and obtaining a target text corresponding to the text to be processed according to the identifier of the editing operation required to be executed by each character in the text editing operation sequence and the mapping relation between the identifier of the editing operation and the preset editing operation.
5. The method according to any one of claims 1-4, wherein the editing operations required to be performed on each character include: delete operations, reserve operations, or replace operations.
6. A method for training a text processing model, comprising:
collecting a sample text data set, the sample text data set comprising a plurality of sample texts, each sample text labeled with: the text editing operation sequence labels are used for identifying the editing operation required to be executed by each character in the sample text, and the text editing operation sequence labels are obtained according to the sample text and the labeled target text corresponding to the sample text;
preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set;
and training to obtain the text processing model by adopting the preprocessed sample text data set, wherein the text processing model is used for obtaining a text editing operation sequence corresponding to the text to be processed, the text editing operation sequence is a sequence formed by all characters in the text to be processed, and the sequence comprises an identifier of an editing operation required to be executed by each character.
7. The method of claim 6, wherein the pre-processing each sample text in the sample text data set to obtain a pre-processed sample text data set comprises:
acquiring a domain word from each sample text in the sample text data set according to knowledge graph information and/or domain guide information, wherein the domain word is used for representing entity information;
constructing a corresponding relation between each sample text and each field word to form an initial field word data set;
obtaining a domain word dictionary according to the initial domain word data set;
and obtaining the preprocessed sample text data set according to the domain word dictionary.
8. The method of claim 7, wherein deriving a domain word dictionary from the initial domain word dataset comprises:
training and acquiring a sequence labeling model according to the initial field word data set, wherein the sequence labeling model is used for identifying field words in a text;
inputting the sample text data set into the sequence labeling model, and identifying and acquiring the field words contained in the sample text data set;
and obtaining the domain word dictionary according to the domain words contained in the sample text data set and the domain words contained in the initial domain word data set.
9. The method of claim 7, wherein obtaining the preprocessed sample text data set according to the domain word dictionary comprises:
according to the domain word dictionary, performing domain word replacement on each sample text in the sample text data set to obtain a replaced sample text data set;
copying each sample text in the replaced sample text data set to obtain a copied text corresponding to each sample text, and splicing the copied text with each sample text to obtain a plurality of preprocessed sample texts;
and obtaining the preprocessed sample text data set according to the preprocessed sample texts.
10. The method according to any of claims 6-9, wherein before preprocessing each of the sample texts in the sample text data set to obtain a preprocessed sample text data set, further comprising:
acquiring an initial target text corresponding to each sample text in the sample text data set;
and performing field word replacement on the initial target text according to a field word dictionary to obtain a labeled target text corresponding to each sample text.
11. The method of claim 10, wherein training the text processing model using the preprocessed sample text dataset comprises:
obtaining a text editing operation sequence label marked by each sample text in the preprocessed sample text data set by adopting a text editing algorithm according to the preprocessed sample text data set and the marked target text corresponding to each sample text;
and training to obtain the text processing model according to the preprocessed sample text data set and the text editing operation sequence label marked by each sample text in the preprocessed sample text data set.
12. The method of claim 11, wherein training the text processing model according to the preprocessed sample text data set and the text editing operation sequence label of each sample text label in the preprocessed sample text data set comprises:
inputting the preprocessed sample text data set into a pre-trained text processing model to obtain a training text editing operation sequence of the pre-trained text processing model;
determining the cross entropy of the pre-trained text processing model according to the training text editing operation sequence and the text editing operation sequence label;
and correcting the pre-trained text processing model according to the cross entropy to obtain the text processing model.
13. A text processing apparatus, comprising: the device comprises an acquisition module and a processing module;
the acquisition module is used for acquiring a text to be processed;
the processing module is configured to obtain a text editing operation sequence corresponding to the to-be-processed text by using a text processing model obtained through pre-training according to the to-be-processed text, where the text editing operation sequence is a sequence formed by characters in the to-be-processed text, the sequence includes an identifier of an editing operation that needs to be executed by each character, the text processing model is obtained by using a sample text labeled with a text editing operation sequence label, and the text editing operation sequence label is obtained according to the sample text and a labeled target text corresponding to the sample text;
the obtaining module is further configured to obtain a target text corresponding to the text to be processed according to the text editing operation sequence, where the target text corresponding to the text to be processed includes a compressed text or an rewritten text corresponding to the text to be processed.
14. An apparatus for training a text processing model, comprising: the device comprises an acquisition module, a preprocessing module and a training module;
the acquisition module is used for acquiring a sample text data set, the sample text data set comprises a plurality of sample texts, and each sample text is marked with: the text editing operation sequence labels are used for identifying the editing operation required to be executed by each character in the sample text, and the text editing operation sequence labels are obtained according to the sample text and the labeled target text corresponding to the sample text;
the preprocessing module is used for preprocessing each sample text in the sample text data set to obtain a preprocessed sample text data set;
the training module is configured to train and obtain the text processing model by using the preprocessed sample text data set, where the text processing model is used to obtain a text editing operation sequence corresponding to a text to be processed, the text editing operation sequence is a sequence formed by the characters in the text to be processed, and the sequence contains an identifier of the editing operation to be executed for each character.
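
The three modules of claim 14 compose into a simple pipeline. The class below is a hypothetical wiring sketch only, with the module behaviors passed in as callables rather than specified by the patent:

```python
class ModelTrainingApparatus:
    """Hypothetical composition of the acquisition, preprocessing and
    training modules of claim 14."""
    def __init__(self, acquire, preprocess, train):
        self.acquire = acquire        # returns a list of labeled sample texts
        self.preprocess = preprocess  # cleans one sample text
        self.train = train            # fits the edit-sequence tagger

    def run(self):
        dataset = self.acquire()
        dataset = [self.preprocess(sample) for sample in dataset]
        return self.train(dataset)    # the trained text processing model
```
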
15. An electronic device, comprising: a processor, a storage medium and a bus, wherein the storage medium stores program instructions executable by the processor; when the electronic device runs, the processor communicates with the storage medium via the bus, and the processor executes the program instructions to perform the steps of the method according to any one of claims 1 to 12.
16. A computer-readable storage medium, wherein the storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any one of claims 1 to 12.
CN202011479376.7A 2020-12-14 2020-12-14 Text processing method, model training method, device, equipment and storage medium Active CN112528674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011479376.7A CN112528674B (en) 2020-12-14 2020-12-14 Text processing method, model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112528674A (en) 2021-03-19
CN112528674B (en) 2023-06-30

Family

ID=75000189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011479376.7A Active CN112528674B (en) 2020-12-14 2020-12-14 Text processing method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112528674B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
HU0900432D0 (en) * 2009-07-10 2009-09-28 Metall Print Kft Procedure, system, computer program and computer program product for the compression of short messages
US20140303959A1 (en) * 2013-02-08 2014-10-09 Machine Zone, Inc. Systems and Methods for Multi-User Multi-Lingual Communications
US20180018303A1 (en) * 2016-07-15 2018-01-18 Sap Se Design time user interface with intelligent text reduction
US20180018302A1 (en) * 2016-07-15 2018-01-18 Sap Se Intelligent text reduction for graphical interface elements
US20190065446A1 (en) * 2017-08-22 2019-02-28 Microsoft Technology Licensing, Llc Reducing text length while preserving meaning
CN107705784A (en) * 2017-09-28 2018-02-16 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
US20190273707A1 (en) * 2017-12-29 2019-09-05 Titus Deac Brevity - codified messaging system and process with pre-composed messages made of prefabricated icons, and methods of use
CN109670035A (en) * 2018-12-03 2019-04-23 科大讯飞股份有限公司 A kind of text snippet generation method
US20200257757A1 (en) * 2019-02-07 2020-08-13 Adobe Inc. Machine Learning Techniques for Generating Document Summaries Targeted to Affective Tone
US20200302016A1 (en) * 2019-03-20 2020-09-24 Adobe Inc. Classifying Structural Features of a Digital Document by Feature Type using Machine Learning
CN111985229A (en) * 2019-05-21 2020-11-24 腾讯科技(深圳)有限公司 Sequence labeling method and device and computer equipment
CN110334186A (en) * 2019-07-08 2019-10-15 北京三快在线科技有限公司 Data query method, apparatus, computer equipment and computer readable storage medium
CN111026861A (en) * 2019-12-10 2020-04-17 腾讯科技(深圳)有限公司 Text abstract generation method, text abstract training method, text abstract generation device, text abstract training device, text abstract equipment and text abstract training medium
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ERIC MALMI ET AL.: "Encode, Tag, Realize: High-Precision Text Editing", arXiv:1909.01187v1 *
NASSAR I ET AL.: "Neural Versus Non-Neural Text Simplification: A Case Study", Australasian Language Technology *
LIU Zhen; CHEN Jing; ZHENG Jianbin; HUA Jinzhi; XIAO Linfeng: "Research on Chinese Short Text Aggregation Model", Journal of Software *
WANG Shuai; ZHAO Xiang; LI Bo; GE Bin; TANG Daquan: "TP-AS: A Two-Stage Automatic Summarization Method for Long Texts", Journal of Chinese Information Processing *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859121A (en) * 2023-01-29 2023-03-28 有米科技股份有限公司 Text processing model training method and device

Also Published As

Publication number Publication date
CN112528674B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN112613306B (en) Method, device, electronic equipment and storage medium for extracting entity relationship
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN113449528B (en) Address element extraction method and device, computer equipment and storage medium
CN111241209B (en) Method and device for generating information
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN112825114A (en) Semantic recognition method and device, electronic equipment and storage medium
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN114398480B (en) Financial public opinion subdivision aspect detection method and equipment based on key information extraction
CN113421551A (en) Voice recognition method and device, computer readable medium and electronic equipment
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN116306974A (en) Model training method and device of question-answering system, electronic equipment and storage medium
CN115798661A (en) Knowledge mining method and device in clinical medicine field
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN110852103A (en) Named entity identification method and device
CN114676705B (en) Dialogue relation processing method, computer and readable storage medium
CN112580368B (en) Method, device, equipment and storage medium for identifying intention sequence of conversation text
CN112528674B (en) Text processing method, model training method, device, equipment and storage medium
CN111858911A (en) Work order description information generation method and device, electronic equipment and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN114139610A (en) Traditional Chinese medicine clinical literature data structuring method and device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant