CN110442871A - Text information processing method, device and equipment
- Publication number: CN110442871A (application number CN201910720434.1A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/355—Class or cluster creation or modification
Abstract
Embodiments of the present invention provide a text information processing method, apparatus and device. The method comprises: obtaining first text information; performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information; obtaining character/word vectors corresponding to the first text information according to the first word class sequence; and processing the character/word vectors to obtain a task processing result corresponding to the first text information. The first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information; the character/word vectors comprise character vectors and/or word vectors. The method improves the accuracy of text task processing.
Description
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a text information processing method, a text information processing device and text information processing equipment.
Background
At present, machine learning algorithms are widely used in text processing tasks, which may include text classification tasks, information extraction tasks, sentiment analysis tasks, intelligent question-answering tasks, and the like.
In practical applications, before a text processing task is executed, text information is obtained, character vectors and/or word vectors corresponding to the text information are derived from the character/word co-occurrence information in the text information that is related to the text processing task, and the resulting vectors are processed to obtain a task processing result. However, character vectors and/or word vectors obtained in this way cannot accurately express the knowledge characteristics and semantic information of the text information, so the text processing result cannot be obtained accurately.
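To make that limitation concrete, the following is a minimal sketch (not from the patent; the corpus and all names are illustrative) of a purely co-occurrence-based representation of the kind the background describes: each word's vector records only which words it appears next to, with no word class or knowledge information.

```python
# Minimal illustrative sketch: vectors built only from co-occurrence counts.
# Such vectors capture which words appear together, but carry no word class
# or knowledge information, which is the weakness described above.
from collections import Counter
from itertools import combinations

corpus = [["china", "launched", "shenzhou"], ["china", "launched", "rockets"]]
vocab = sorted({w for sentence in corpus for w in sentence})
index = {w: i for i, w in enumerate(vocab)}

cooc = Counter()
for sentence in corpus:
    for a, b in combinations(sentence, 2):
        cooc[(index[a], index[b])] += 1
        cooc[(index[b], index[a])] += 1

# Row i is the crude "word vector" of vocab[i].
vectors = [[cooc[(i, j)] for j in range(len(vocab))] for i in range(len(vocab))]
print(vocab)       # ['china', 'launched', 'rockets', 'shenzhou']
print(vectors[0])  # co-occurrence counts of "china" with every vocabulary word
```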
Disclosure of Invention
Embodiments of the invention provide a text information processing method, a text information processing apparatus and text information processing equipment, which improve the accuracy of text task processing.
In a first aspect, an embodiment of the present invention provides a text information processing method, including:
acquiring first text information;
performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information;
wherein the first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information; and the character/word vectors comprise character vectors and/or word vectors.
In a possible implementation manner, the performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information includes:
performing word segmentation processing on the first text information to obtain a plurality of words;
acquiring the word class of each word;
and determining the first word class sequence according to the plurality of words and the word class of each word.
In a possible implementation manner, the performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information includes:
acquiring a task model corresponding to a current task;
and performing word class sequence labeling on the first text information through the task model corresponding to the current task to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information.
In a possible implementation manner, obtaining a task model corresponding to a current task includes:
obtaining a pre-training model, wherein the pre-training model is used for obtaining a word class sequence of text information and obtaining character/word vectors of the text information according to the word class sequence, the character/word vectors are used for indicating knowledge characteristics and semantic information of the text information, and the character/word vectors comprise character vectors and/or word vectors;
and training the pre-training model according to the current task to obtain the task model.
In one possible embodiment, the obtaining the pre-training model includes:
determining a training task, wherein the training task comprises a basic task and a word class sequence labeling task, the basic task comprises a character/word vector task, or the basic task comprises a character/word vector prediction task and a context prediction task, and the character/word vector prediction task comprises a character vector prediction task and/or a word vector prediction task;
and learning multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model, wherein each group of first samples comprises a first sample text and corresponding sample character/word vectors, and each group of second samples comprises a second sample text and a corresponding sample word class sequence.
In one possible embodiment, the plurality of sets of second samples are samples corresponding to a full data set.
In a possible implementation manner, the learning multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model includes:
performing joint training on a preset model according to the multiple groups of first samples, the multiple groups of second samples, the basic task and the word class sequence labeling task to obtain the pre-training model;
or,
training a preset model according to the multiple groups of first samples and the basic task to obtain a first model, and training the first model according to the multiple groups of second samples and the word class sequence labeling task to obtain the pre-training model.
In a possible implementation manner, the training the pre-training model according to the current task to obtain the task model includes:
acquiring multiple groups of third samples corresponding to the current task, wherein each group of third samples comprises a third sample text and a corresponding sample word class sequence;
training the pre-training model according to the multiple groups of third samples to obtain an updated pre-training model;
and training the updated pre-training model according to the current task to obtain the task model.
In a second aspect, an embodiment of the present invention provides a text information processing apparatus, including: a first obtaining module and a processing module, wherein,
the first obtaining module is used for obtaining first text information;
the processing module is used for performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information;
wherein the first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information; and the character/word vectors comprise character vectors and/or word vectors.
In a possible implementation, the processing module is specifically configured to:
perform word segmentation processing on the first text information to obtain a plurality of words;
acquire the word class of each word;
and determine the first word class sequence according to the plurality of words and the word class of each word.
In a possible embodiment, the apparatus further comprises a second obtaining module, wherein,
the second acquisition module is used for acquiring a task model corresponding to the current task;
the processing module is specifically configured to perform word class sequence labeling on the first text information through the task model corresponding to the current task to obtain a first word class sequence corresponding to the first text information, obtain character/word vectors corresponding to the first text information according to the first word class sequence, and process the character/word vectors to obtain a task processing result corresponding to the first text information.
In a possible implementation manner, the second obtaining module is specifically configured to:
obtain a pre-training model, wherein the pre-training model is used for obtaining a word class sequence of text information and obtaining character/word vectors of the text information according to the word class sequence, the character/word vectors are used for indicating knowledge characteristics and semantic information of the text information, and the character/word vectors comprise character vectors and/or word vectors;
and train the pre-training model according to the current task to obtain the task model.
In a possible implementation manner, the second obtaining module is specifically configured to:
determine a training task, wherein the training task comprises a basic task and a word class sequence labeling task, the basic task comprises a character/word vector task, or the basic task comprises a character/word vector prediction task and a context prediction task, and the character/word vector prediction task comprises a character vector prediction task and/or a word vector prediction task;
and learn multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model, wherein each group of first samples comprises a first sample text and corresponding sample character/word vectors, and each group of second samples comprises a second sample text and a corresponding sample word class sequence.
In one possible embodiment, the plurality of sets of second samples are samples corresponding to a full data set.
In a possible implementation manner, the second obtaining module is specifically configured to:
perform joint training on a preset model according to the multiple groups of first samples, the multiple groups of second samples, the basic task and the word class sequence labeling task to obtain the pre-training model;
or,
train a preset model according to the multiple groups of first samples and the basic task to obtain a first model, and train the first model according to the multiple groups of second samples and the word class sequence labeling task to obtain the pre-training model.
In a possible implementation manner, the second obtaining module is specifically configured to:
acquire multiple groups of third samples corresponding to the current task, wherein each group of third samples comprises a third sample text and a corresponding sample word class sequence;
train the pre-training model according to the multiple groups of third samples to obtain an updated pre-training model;
and train the updated pre-training model according to the current task to obtain the task model.
In a third aspect, an embodiment of the present invention provides a text information processing device, including: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the text information processing method according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored; when a processor executes the computer-executable instructions, the text information processing method according to any implementation of the first aspect is implemented.
According to the text information processing method, apparatus and device provided by the embodiments of the present application, first text information is obtained, word class sequence labeling is performed on the first text information to obtain a first word class sequence corresponding to the first text information, character/word vectors corresponding to the first text information are obtained according to the first word class sequence, and the character/word vectors are processed to obtain a task processing result corresponding to the first text information. In this process, the word class sequence expresses the knowledge characteristics and semantic information of the text information more richly and accurately, so the processing result of the text processing task can be obtained accurately according to the word class sequence, which improves the accuracy of text task processing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is an architecture diagram of text information processing provided by an embodiment of the present invention;
fig. 2 is a schematic flowchart of a text information processing method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for generating a task model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another text information processing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a text information processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is an architecture diagram of text information processing according to an embodiment of the present invention. Referring to Fig. 1, the task model is a text processing model used to process text information to obtain a task processing result. The task model corresponds to a text processing task. For example, the task model corresponding to a text classification task is a text classification model, which has a text classification function and can output the text class of text information. As another example, the task model corresponding to an intelligent question-answering task is an intelligent question-answering model, which has the function of determining answer information and can output the answer information corresponding to text information.
In the process of determining the task processing result, the task model first performs word class sequence labeling on the text information to obtain a word class sequence corresponding to the text information, processes the word class sequence to obtain character/word vectors (character vectors and/or word vectors), and processes these vectors to obtain the task processing result. In this process, the word class sequence expresses the knowledge characteristics and semantic information of the text information more richly and accurately, so the processing result of the text processing task can be obtained accurately according to the word class sequence.
The technical means shown in the present application will be described in detail below with reference to specific examples. It should be noted that the following embodiments may be combined with each other, and the description of the same or similar contents in different embodiments is not repeated.
Fig. 2 is a schematic flowchart of a text information processing method according to an embodiment of the present application. Referring to fig. 2, the method may include:
S201, acquiring first text information.
The execution subject of the embodiment of the invention is a terminal device or a text information processing apparatus provided in the terminal device. Optionally, the text information processing apparatus may be implemented by software, or by a combination of software and hardware.
Optionally, the first text information is the information from which the task processing result of a task is to be obtained, that is, the task processing result of the task needs to be obtained according to the first text information. The task may be a text classification task, an information extraction task, a sentiment analysis task, an intelligent question-answering task, or the like; the type of task is not specifically limited in this application.
Alternatively, the terminal device may receive text information input by a user and determine the text information input by the user as the first text information. Or, the terminal device may further receive voice information input by a user, convert the voice information into text information, and determine the text information obtained by conversion as the first text information.
S202, performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information.
The first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information.
Optionally, word segmentation processing may be performed on the first text information to obtain a plurality of words, the word class of each word is acquired, and the first word class sequence is determined according to the plurality of words and the word class of each word.
In practice, a word may have multiple meanings, and the meaning of the same word may differ across contexts. For example, the word translated here as "accounting" can mean doing the accounts in some contexts and settling scores (taking revenge) in others. Likewise, the word translated as "bundle" can mean a cloth-wrapped package holding objects in some contexts and a burden in others.
The word class of a word may indicate the specific meaning of the word in the first text information. Optionally, word classes may include: a time class, a region class, a person class, a name class, a scene event class, an article class, and the like.
Optionally, after the plurality of words are obtained, part-of-speech tagging and/or named entity recognition may be performed on them, and the word class of each word is obtained according to the part-of-speech tagging and/or named entity recognition result. Part-of-speech tagging obtains the part of speech of each word, which may include verbs, nouns, adjectives, adverbs, and so on. Named entity recognition identifies entities with specific meanings in the text information, mainly including person names, place names, organization names, dates, proper nouns, and the like.
For example, assume that the first text information is: "In 2003, China's space hero Yang Liwei took the Shenzhou V spaceship into space for the first time." Performing word class sequence labeling on the first text information yields the following first word class sequence:
"In 2003 [time class | year] , [punctuation] China [world region class | country] space hero [person class] Yang Liwei [person name] took [scene event class] Shenzhou V spaceship [article class | spacecraft] into [scene event class] space [cosmos class] for the first time [modifier | time] . [punctuation]"
In the above word class sequence, "2003" is a word and "time class | year" is the word class of the word "2003"; "China" is a word and [world region class | country] is the word class of the word "China".
It should be noted that, in practical applications, the word classes used to label words may come from a predefined general word class system. The system contains classes of different granularities, and the class is selected as needed during labeling. For example, "China" may be labeled with the coarse-grained "world region class" or the fine-grained "country". When the word class sequence of text information is obtained through a model, the labeling granularity used is consistent with the granularity used to label the samples during model training.
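As an illustration of S202, the following is a minimal sketch assuming a dictionary-based category system and a trivial whitespace segmenter; the segmenter, the category dictionary and all names are hypothetical stand-ins, not the patented implementation.

```python
# Minimal sketch of word class sequence labeling (S202), under assumptions:
# a real system would use a trained segmenter and a general category system.
from typing import List, Tuple

CATEGORY_DICT = {                      # hypothetical "coarse | fine" classes
    "2003": "time class | year",
    "China": "world region class | country",
    "Yang-Liwei": "person name",
    "spaceship": "article class | spacecraft",
}

def segment(text: str) -> List[str]:
    # Stand-in for real word segmentation; assumes space-separated words.
    return text.split()

def word_class_sequence(text: str) -> List[Tuple[str, str]]:
    # Pair each word with its word class to form the word class sequence.
    return [(w, CATEGORY_DICT.get(w, "unknown")) for w in segment(text)]

print(word_class_sequence("China Yang-Liwei spaceship 2003"))
# [('China', 'world region class | country'), ('Yang-Liwei', 'person name'),
#  ('spaceship', 'article class | spacecraft'), ('2003', 'time class | year')]
```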
S203, obtaining character/word vectors corresponding to the first text information according to the first word class sequence.
The character/word vectors include character vectors and/or word vectors.
A character vector is the vector of a single character; a word vector is the vector of a word, where one word comprises at least two characters.
The character/word vectors can express the knowledge characteristics and semantic information of the first text information. Because the first word class sequence includes both the content of the first text information and the word class of each word in the first text information, character/word vectors determined according to the first word class sequence express the knowledge characteristics and semantic information of the first text information more accurately.
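One way to realize this step, shown below as a sketch under assumed architecture choices (the patent does not fix one), is to add a word class embedding to the token embedding so that each character/word vector carries the class information from the word class sequence.

```python
# Sketch (assumed design, not the patent's mandated one): inject the word
# class from the word class sequence into each character/word vector.
import torch
import torch.nn as nn

class ClassAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_classes: int, dim: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)   # characters/words
        self.class_emb = nn.Embedding(num_classes, dim)  # word classes

    def forward(self, token_ids: torch.Tensor, class_ids: torch.Tensor):
        # Summing is one simple way to combine the two signals.
        return self.token_emb(token_ids) + self.class_emb(class_ids)

emb = ClassAwareEmbedding(vocab_size=100, num_classes=10, dim=16)
tokens = torch.tensor([[1, 5, 7]])    # ids from the segmented text
classes = torch.tensor([[2, 2, 4]])   # ids from the word class sequence
print(emb(tokens, classes).shape)     # torch.Size([1, 3, 16])
```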
S204, processing the character/word vectors to obtain a task processing result corresponding to the first text information.
Optionally, the task processing result is one of a text classification result, an information extraction result, a sentiment analysis result, or an intelligent question-answering result.
Optionally, when the current task differs, the processing applied to the character/word vectors differs, and the task processing result obtained also differs. For example, if the current task is a text classification task, text classification processing may be performed on the character/word vectors to obtain a text classification result. If the current task is an intelligent question-answering task, intelligent question-answering processing may be performed on the character/word vectors to obtain an intelligent answer.
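For instance, assuming the current task is text classification, S204 could reduce to pooling the character/word vectors and applying a classification layer, as in this sketch (layer shapes and the pooling choice are assumptions):

```python
# Sketch of S204 for a text classification task: pool the character/word
# vectors from S203 and map the pooled vector to task labels.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_labels)

    def forward(self, vectors: torch.Tensor):  # (batch, seq_len, dim)
        pooled = vectors.mean(dim=1)            # simple mean pooling
        return self.fc(pooled)                  # (batch, num_labels) logits

head = ClassificationHead(dim=16, num_labels=3)
logits = head(torch.randn(1, 3, 16))            # e.g. output of S203
print(logits.argmax(dim=-1))                    # predicted text class
```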
According to the text information processing method provided by the embodiments of the present application, first text information is obtained, word class sequence labeling is performed on the first text information to obtain a first word class sequence corresponding to the first text information, character/word vectors corresponding to the first text information are obtained according to the first word class sequence, and the character/word vectors are processed to obtain a task processing result corresponding to the first text information. In this process, the word class sequence expresses the knowledge characteristics and semantic information of the text information more richly and accurately, so the processing result of the text processing task can be obtained accurately according to the word class sequence, which improves the accuracy of text task processing.
Optionally, in practical applications, S202 to S204 in the embodiment of Fig. 2 may be performed by a task model to obtain the task processing result. Next, the process of generating the task model is described in detail.
Fig. 3 is a schematic flowchart of a method for generating a task model according to an embodiment of the present invention. Referring to fig. 3, the method may include:
S301, determining a training task, wherein the training task comprises a basic task and a word class sequence labeling task.
The basic task comprises a character/word vector task. Alternatively, the basic task comprises a character/word vector prediction task and a context prediction task, where the character/word vector prediction task comprises a character vector prediction task and/or a word vector prediction task.
The character/word vector task may also be referred to as a character/word vector prediction task; it is the task of determining the character/word vectors corresponding to text information. After a model is trained on the character/word vector task, the model has the function of determining character/word vectors, that is, it can determine the character/word vectors corresponding to text information.
The context prediction task may also be called a next-sentence prediction task; it determines the character/word vectors corresponding to text information in combination with context information. After a model is trained on both the character/word vector task and the context prediction task, the model can determine the character/word vectors corresponding to text information in combination with the context information, which allows the vectors to be determined more accurately.
The word class sequence labeling task is the task of determining the word class sequence corresponding to text information. After a model is trained on the word class sequence labeling task, the model has the function of determining the word class sequence corresponding to text information, that is, it can determine the word class sequence corresponding to text information.
S302, learning multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain a pre-training model.
Each group of first samples comprises a first sample text and corresponding sample character/word vectors.
Optionally, the sample character/word vectors in each group of first samples may be manually labeled; the sample character/word vectors can accurately represent the knowledge characteristics and semantic information of the first sample text.
Optionally, the multiple groups of first samples are samples corresponding to a full data set. The full data set comprises the data sets corresponding to multiple text tasks; for example, it comprises a data set corresponding to a text classification task, a data set corresponding to a sentiment analysis task, a data set corresponding to an intelligent question-answering task, and so on. Because the multiple groups of first samples correspond to the full data set, the pre-training model obtained by learning them on the basic task can determine character/word vectors for text information corresponding to multiple tasks.
Each group of second samples comprises a second sample text and a corresponding sample word class sequence.
Optionally, the sample word class sequence in each group of second samples may be manually labeled. To make learning of the sample word class sequences more effective, when the words in the sample text are labeled with word classes, a position label can be attached to each character within a word; the position label indicates the position of the character in the word.
optionally, the category label may include the following two forms:
one form is: b (class start), I (class middle), E (class end), S (individual class).
In another form: b (class start), I (class middle), S (individual class).
For example, for a text message: yanlibei [ names ], after adding category labels according to the first form: populus name B Liren name I Wei name E. Adding category labels in the second form is followed by: populus name B Liren name I Wei name I.
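A small sketch of this label expansion, assuming the B/I/E/S form above (the function and names are illustrative):

```python
# Expand one labeled word into per-character labels in the B/I/E/S form.
from typing import List, Tuple

def expand_bies(word: str, word_class: str) -> List[Tuple[str, str]]:
    if len(word) == 1:
        return [(word, f"{word_class}|S")]                 # single-character word
    return (
        [(word[0], f"{word_class}|B")]                     # class start
        + [(ch, f"{word_class}|I") for ch in word[1:-1]]   # class middle
        + [(word[-1], f"{word_class}|E")]                  # class end
    )

# The three characters of the name Yang Liwei, abbreviated Y/L/W here:
print(expand_bies("YLW", "person name"))
# [('Y', 'person name|B'), ('L', 'person name|I'), ('W', 'person name|E')]
```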
Optionally, the multiple groups of second samples are samples corresponding to the full data set. The full data set comprises the data sets corresponding to multiple text tasks; for example, it comprises a data set corresponding to a text classification task, a data set corresponding to a sentiment analysis task, a data set corresponding to an intelligent question-answering task, and so on. Because the multiple groups of second samples correspond to the full data set, the pre-training model obtained by learning them on the word class sequence labeling task can perform word class sequence labeling on text information corresponding to multiple tasks.
Optionally, the pre-training model may be obtained through two possible implementations as follows:
one possible implementation is:
and performing joint training on the preset model according to the multiple groups of first samples, second samples, basic tasks and word sequence tagging tasks to obtain a pre-training model.
In this feasible implementation manner, a joint training task of the word vector prediction task and the word class sequence tagging task may be used to perform joint training on the preset model according to the plurality of groups of first samples and the plurality of groups of second samples, so as to obtain a pre-training model. Or, performing joint training on the preset model by using a joint training task of 'word vector prediction task + context prediction task + word sequence tagging task' according to the plurality of groups of first samples and the plurality of groups of second samples to obtain a pre-training model.
Optionally, the preset model may be a text training model, for example, a BERT model, a GPT model, or the like.
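The following sketch shows what such joint training can look like, assuming a BERT-style encoder with one head per training task; the architecture, the unweighted loss sum and all names are assumptions, not the patent's specification.

```python
# Sketch of joint pre-training: one head for the character/word vector
# prediction task and one for the word class sequence labeling task,
# trained with a summed loss on a shared encoder.
import torch
import torch.nn as nn

dim, vocab_size, num_classes = 16, 100, 10
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
vector_head = nn.Linear(dim, vocab_size)   # basic task head
label_head = nn.Linear(dim, num_classes)   # word class sequence head
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(2, 5, dim)                        # embedded sample texts
token_targets = torch.randint(0, vocab_size, (2, 5))   # first-sample targets
class_targets = torch.randint(0, num_classes, (2, 5))  # second-sample targets

hidden = encoder(inputs)
loss = (
    loss_fn(vector_head(hidden).reshape(-1, vocab_size), token_targets.reshape(-1))
    + loss_fn(label_head(hidden).reshape(-1, num_classes), class_targets.reshape(-1))
)
loss.backward()  # one joint update over both training tasks
print(float(loss))
```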
Another possible implementation:
and training the preset model according to the multiple groups of first samples and the basic tasks to obtain a first model, and training the first model according to the multiple groups of second samples and the word sequence tagging tasks to obtain a pre-training model.
Optionally, in this feasible implementation manner, a preset model is trained according to multiple groups of first samples and basic tasks to obtain a first model, the first model has a function of outputting word vectors of text information, a pre-training model is obtained by training the first model according to multiple groups of second samples and word sequence tagging tasks, and the pre-training model has a function of performing word sequence tagging on the text information. The pre-training model can acquire the word class sequence of the text information, and the word vector of the text information is determined according to the word class sequence.
S303, training the pre-training model according to the current task to obtain a task model.
Optionally, the pre-training model may be trained according to the current task through the following two feasible implementation manners to obtain a task model:
one possible implementation is:
the pre-training model is directly applied to the current task, namely, the pre-training model is directly trained according to the current task to obtain the task model, so that the task model has the function of determining the task processing result corresponding to the current task.
Because the pre-training model has the functions of determining the word class sequence corresponding to text information and determining character/word vectors according to the word class sequence, the task model obtained by training the pre-training model on the current task has at least the following functions: determining the word class sequence corresponding to text information, determining character/word vectors according to the word class sequence, and determining the task processing result according to the character/word vectors.
Another possible implementation:
the pre-training model is trained again according to a plurality of groups of third samples corresponding to the current task to obtain an updated pre-training model, so that the updated pre-training model is more suitable for determining the word sequence of the text information corresponding to the current task and determining the word vector according to the word sequence corresponding to the current task, and the determined word vector can more accurately represent the knowledge characteristics and semantic information of the text information corresponding to the current task. And then, training the updated pre-training model according to the current task to obtain a task model.
Optionally, each group of third samples includes a third sample text and a corresponding sample part of speech sequence, where the third sample text and the sample part of speech sequence are both samples corresponding to the current task.
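A schematic training loop for this two-stage implementation is sketched below; the function shape, batching and optimizer choice are assumptions for illustration, with `model` and `head` being any modules wired as in the earlier sketches.

```python
# Schematic two-stage fine-tuning: stage 1 updates the pre-training model on
# the third samples (word class sequence labeling for the current task);
# stage 2 trains on the current task itself.
import torch

def train_stage(model, head, batches, loss_fn, lr=1e-4):
    params = list(model.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for inputs, targets in batches:
        opt.zero_grad()
        loss = loss_fn(head(model(inputs)), targets)
        loss.backward()
        opt.step()
    return model

# Usage outline (hypothetical objects):
#   train_stage(pretrained, label_head, third_sample_batches, seq_loss)
#   train_stage(pretrained, task_head, current_task_batches, task_loss)
```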
In the embodiment shown in Fig. 3, the pre-training model is determined first. Because the training tasks used to learn the pre-training model include both the basic task and the word class sequence labeling task, the pre-training model learns not only the co-occurrence characteristics of characters/words within a sentence, but also semantic characteristics, such as word class co-occurrence, that do not appear directly in the sentence; the character/word vectors output by the pre-training model therefore represent the knowledge characteristics and semantic information of the text information more accurately. The pre-training model is then trained according to the current task to obtain the task model, which can determine the task processing result from these accurate character/word vectors, improving the accuracy of text task processing.
Fig. 4 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present invention. Referring to fig. 4, the text information processing apparatus 10 may include: a first acquisition module 11 and a processing module 12, wherein,
the first obtaining module 11 is configured to obtain first text information;
the processing module 12 is configured to perform word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtain character/word vectors corresponding to the first text information according to the first word class sequence, and process the character/word vectors to obtain a task processing result corresponding to the first text information;
wherein the first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information; and the character/word vectors comprise character vectors and/or word vectors.
The text information processing apparatus provided in the embodiment of the present invention may execute the technical solutions shown in the above method embodiments, and the implementation principles and beneficial effects thereof are similar and will not be described herein again.
In a possible implementation, the processing module 12 is specifically configured to:
perform word segmentation processing on the first text information to obtain a plurality of words;
acquire the word class of each word;
and determine the first word class sequence according to the plurality of words and the word class of each word.
Fig. 5 is a schematic structural diagram of another text information processing apparatus according to an embodiment of the present invention. In addition to the embodiment shown in fig. 4, referring to fig. 5, the text information processing apparatus 10 further includes a second obtaining module 13, wherein,
the second obtaining module 13 is configured to obtain a task model corresponding to the current task;
the processing module 12 is specifically configured to perform word class sequence labeling on the first text information through the task model corresponding to the current task to obtain a first word class sequence corresponding to the first text information, obtain character/word vectors corresponding to the first text information according to the first word class sequence, and process the character/word vectors to obtain a task processing result corresponding to the first text information.
In a possible implementation manner, the second obtaining module 13 is specifically configured to:
obtain a pre-training model, wherein the pre-training model is used for obtaining a word class sequence of text information and obtaining character/word vectors of the text information according to the word class sequence, the character/word vectors are used for indicating knowledge characteristics and semantic information of the text information, and the character/word vectors comprise character vectors and/or word vectors;
and train the pre-training model according to the current task to obtain the task model.
In a possible implementation manner, the second obtaining module 13 is specifically configured to:
determine a training task, wherein the training task comprises a basic task and a word class sequence labeling task, the basic task comprises a character/word vector task, or the basic task comprises a character/word vector prediction task and a context prediction task, and the character/word vector prediction task comprises a character vector prediction task and/or a word vector prediction task;
and learn multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model, wherein each group of first samples comprises a first sample text and corresponding sample character/word vectors, and each group of second samples comprises a second sample text and a corresponding sample word class sequence.
In one possible embodiment, the plurality of sets of second samples are samples corresponding to a full data set.
In a possible implementation manner, the second obtaining module 13 is specifically configured to:
perform joint training on a preset model according to the multiple groups of first samples, the multiple groups of second samples, the basic task and the word class sequence labeling task to obtain the pre-training model;
or,
train a preset model according to the multiple groups of first samples and the basic task to obtain a first model, and train the first model according to the multiple groups of second samples and the word class sequence labeling task to obtain the pre-training model.
In a possible implementation manner, the second obtaining module 13 is specifically configured to:
acquire multiple groups of third samples corresponding to the current task, wherein each group of third samples comprises a third sample text and a corresponding sample word class sequence;
train the pre-training model according to the multiple groups of third samples to obtain an updated pre-training model;
and train the updated pre-training model according to the current task to obtain the task model.
The text information processing apparatus provided in the embodiment of the present invention may execute the technical solutions shown in the above method embodiments, and the implementation principles and beneficial effects thereof are similar and will not be described herein again.
Fig. 6 is a schematic diagram of a hardware structure of a text information processing apparatus according to an embodiment of the present invention, and as shown in fig. 6, the text information processing apparatus 20 includes: at least one processor 21 and a memory 22. The processor 21 and the memory 22 are connected by a bus 23.
In a specific implementation, the at least one processor 21 executes computer-executable instructions stored in the memory 22, so that the at least one processor 21 executes the text information processing method as described above.
For a specific implementation process of the processor 21, reference may be made to the above method embodiments, which implement similar principles and technical effects, and this embodiment is not described herein again.
In the embodiment shown in fig. 6, it should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise high speed RAM memory and may also include non-volatile storage NVM, such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The present application also provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the text information processing method as described above is implemented.
The computer-readable storage medium may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.
The division of the units is only a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (18)
1. A text information processing method, comprising:
acquiring first text information;
performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information;
wherein the first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information; and the character/word vectors comprise character vectors and/or word vectors.
2. The method according to claim 1, wherein the performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information comprises:
performing word segmentation processing on the first text information to obtain a plurality of words;
acquiring the word class of each word;
and determining the first word class sequence according to the plurality of words and the word class of each word.
3. The method according to claim 1 or 2, wherein the performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information comprises:
acquiring a task model corresponding to a current task;
and performing word class sequence labeling on the first text information through the task model corresponding to the current task to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information.
4. The method of claim 3, wherein obtaining the task model corresponding to the current task comprises:
obtaining a pre-training model, wherein the pre-training model is used for obtaining a word class sequence of text information and obtaining character/word vectors of the text information according to the word class sequence, the character/word vectors are used for indicating knowledge characteristics and semantic information of the text information, and the character/word vectors comprise character vectors and/or word vectors;
and training the pre-training model according to the current task to obtain the task model.
5. The method of claim 4, wherein the obtaining a pre-trained model comprises:
determining a training task, wherein the training task comprises a basic task and a word class sequence labeling task, the basic task comprises a character/word vector task, or the basic task comprises a character/word vector prediction task and a context prediction task, and the character/word vector prediction task comprises a character vector prediction task and/or a word vector prediction task;
and learning multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model, wherein each group of first samples comprises a first sample text and corresponding sample character/word vectors, and each group of second samples comprises a second sample text and a corresponding sample word class sequence.
6. The method of claim 5, wherein the plurality of sets of second samples are samples corresponding to a full dataset.
7. The method according to claim 5 or 6, wherein the learning multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model comprises:
performing joint training on a preset model according to the multiple groups of first samples, the multiple groups of second samples, the basic task and the word class sequence labeling task to obtain the pre-training model;
or,
training a preset model according to the multiple groups of first samples and the basic task to obtain a first model, and training the first model according to the multiple groups of second samples and the word class sequence labeling task to obtain the pre-training model.
8. The method according to any one of claims 4-6, wherein the training the pre-trained model according to the current task to obtain the task model comprises:
acquiring multiple groups of third samples corresponding to the current task, wherein each group of third samples comprises a third sample text and a corresponding sample word class sequence;
training the pre-training model according to the multiple groups of third samples to obtain an updated pre-training model;
and training the updated pre-training model according to the current task to obtain the task model.
9. A text information processing apparatus characterized by comprising: a first obtaining module and a processing module, wherein,
the first obtaining module is used for obtaining first text information;
the processing module is used for performing word class sequence labeling on the first text information to obtain a first word class sequence corresponding to the first text information, obtaining character/word vectors corresponding to the first text information according to the first word class sequence, and processing the character/word vectors to obtain a task processing result corresponding to the first text information;
wherein the first word class sequence comprises a plurality of words and the word class of each word, the words being words in the first text information; and the character/word vectors comprise character vectors and/or word vectors.
10. The apparatus of claim 9, wherein the processing module is specifically configured to:
perform word segmentation processing on the first text information to obtain a plurality of words;
obtain the word class of each word;
and determine the first word class sequence according to the plurality of words and the word class of each word.
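These three steps map directly onto an off-the-shelf segmenter with part-of-speech output. The patent names no tool; jieba (pip install jieba) is used here purely as an example:

```python
# Example with jieba's part-of-speech mode: each yielded pair exposes
# .word (the word) and .flag (its word class tag).
import jieba.posseg as pseg

def first_word_class_sequence(first_text: str):
    sequence = []
    for pair in pseg.cut(first_text):  # segmentation + word class in one pass
        sequence.append((pair.word, pair.flag))
    return sequence

print(first_word_class_sequence("今天天气很好"))  # e.g. [('今天', 't'), ...]
```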
11. The apparatus of claim 9 or 10, further comprising a second obtaining module, wherein,
the second obtaining module is used for obtaining a task model corresponding to the current task;
the processing module is specifically configured to perform word class sequence labeling on the first text information through the task model corresponding to the current task to obtain a first word class sequence corresponding to the first text information, obtain a word vector corresponding to the first text information according to the first word class sequence, and process the word vector to obtain a task processing result corresponding to the first text information.
12. The apparatus of claim 11, wherein the second obtaining module is specifically configured to:
obtain a pre-training model, wherein the pre-training model is used for obtaining a word class sequence of text information and obtaining a word vector of the text information according to the word class sequence, the word vector is used for indicating knowledge characteristics and semantic information of the text information, and the word vector comprises a word-level vector and/or a character-level vector;
and train the pre-training model according to the current task to obtain the task model.
13. The apparatus of claim 12, wherein the second obtaining module is specifically configured to:
determine a training task, wherein the training task comprises a basic task and a word class sequence labeling task, the basic task comprises a vector prediction task, or the basic task comprises a vector prediction task and a context prediction task, and the vector prediction task comprises a word-level vector prediction task and/or a character-level vector prediction task;
and learn multiple groups of first samples corresponding to the basic task and multiple groups of second samples corresponding to the word class sequence labeling task to obtain the pre-training model, wherein each group of first samples comprises a first sample text and a corresponding sample word vector, and each group of second samples comprises a second sample text and a corresponding sample word class sequence.
14. The apparatus of claim 13, wherein the multiple groups of second samples are samples corresponding to the full dataset.
15. The apparatus according to claim 13 or 14, wherein the second obtaining module is specifically configured to:
perform joint training on a preset model according to the multiple groups of first samples, the multiple groups of second samples, the basic task and the word class sequence labeling task to obtain the pre-training model;
or,
train a preset model according to the multiple groups of first samples and the basic task to obtain a first model, and train the first model according to the multiple groups of second samples and the word class sequence labeling task to obtain the pre-training model.
16. The apparatus according to any one of claims 12 to 14, wherein the second obtaining module is specifically configured to:
acquire multiple groups of third samples corresponding to the current task, wherein each group of third samples comprises a third sample text and a corresponding sample word class sequence;
train the pre-training model according to the multiple groups of third samples to obtain an updated pre-training model;
and train the updated pre-training model according to the current task to obtain the task model.
17. A text information processing device, characterized by comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the text information processing method according to any one of claims 1 to 8.
18. A computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the text information processing method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910720434.1A CN110442871A (en) | 2019-08-06 | 2019-08-06 | Text message processing method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110442871A true CN110442871A (en) | 2019-11-12 |
Family
ID=68433401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910720434.1A Pending CN110442871A (en) | 2019-08-06 | 2019-08-06 | Text message processing method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442871A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105404632A (en) * | 2014-09-15 | 2016-03-16 | 深港产学研基地 | Deep neural network based biomedical text serialization labeling system and method |
CN105095190A (en) * | 2015-08-25 | 2015-11-25 | 众联数据技术(南京)有限公司 | Chinese semantic structure and finely segmented word bank combination based emotional analysis method |
CN108460150A (en) * | 2018-03-23 | 2018-08-28 | 北京奇虎科技有限公司 | The processing method and processing device of headline |
CN109190110A (en) * | 2018-08-02 | 2019-01-11 | 厦门快商通信息技术有限公司 | A kind of training method of Named Entity Extraction Model, system and electronic equipment |
CN109190123A (en) * | 2018-09-14 | 2019-01-11 | 北京字节跳动网络技术有限公司 | Method and apparatus for output information |
CN109299472A (en) * | 2018-11-09 | 2019-02-01 | 天津开心生活科技有限公司 | Text data processing method, device, electronic equipment and computer-readable medium |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111539209A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111539209B (en) * | 2020-04-15 | 2023-09-15 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111611808A (en) * | 2020-05-22 | 2020-09-01 | 北京百度网讯科技有限公司 | Method and apparatus for generating natural language model |
CN111611808B (en) * | 2020-05-22 | 2023-08-01 | 北京百度网讯科技有限公司 | Method and apparatus for generating natural language model |
CN112749256A (en) * | 2020-12-30 | 2021-05-04 | 北京知因智慧科技有限公司 | Text processing method, device, equipment and storage medium |
CN112466472A (en) * | 2021-02-03 | 2021-03-09 | 北京伯仲叔季科技有限公司 | Case text information retrieval system |
CN112466472B (en) * | 2021-02-03 | 2021-05-18 | 北京伯仲叔季科技有限公司 | Case text information retrieval system |
CN113159921A (en) * | 2021-04-23 | 2021-07-23 | 上海晓途网络科技有限公司 | Overdue prediction method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110705302B (en) | Named entity identification method, electronic equipment and computer storage medium | |
US20190287142A1 (en) | Method, apparatus for evaluating review, device and storage medium | |
CN110442871A (en) | Text message processing method, device and equipment | |
CN109165384A (en) | A kind of name entity recognition method and device | |
US11232263B2 (en) | Generating summary content using supervised sentential extractive summarization | |
CN108304387B (en) | Method, device, server group and storage medium for recognizing noise words in text | |
CN113158656B (en) | Ironic content recognition method, ironic content recognition device, electronic device, and storage medium | |
WO2020199600A1 (en) | Sentiment polarity analysis method and related device | |
CN112434520A (en) | Named entity recognition method and device and readable storage medium | |
CN111930792A (en) | Data resource labeling method and device, storage medium and electronic equipment | |
US20220139386A1 (en) | System and method for chinese punctuation restoration using sub-character information | |
CN114757176A (en) | Method for obtaining target intention recognition model and intention recognition method | |
CN116432646A (en) | Training method of pre-training language model, entity information identification method and device | |
CN112148862A (en) | Question intention identification method and device, storage medium and electronic equipment | |
CN109657127B (en) | Answer obtaining method, device, server and storage medium | |
CN112559711A (en) | Synonymous text prompting method and device and electronic equipment | |
CN113254814A (en) | Network course video labeling method and device, electronic equipment and medium | |
CN112527967A (en) | Text matching method, device, terminal and storage medium | |
CN112530402A (en) | Voice synthesis method, voice synthesis device and intelligent equipment | |
CN109902309B (en) | Translation method, device, equipment and storage medium | |
CN111814433B (en) | Uygur language entity identification method and device and electronic equipment | |
CN115292492A (en) | Method, device and equipment for training intention classification model and storage medium | |
CN112989003B (en) | Intention recognition method, device, processing equipment and medium | |
CN114676699A (en) | Entity emotion analysis method and device, computer equipment and storage medium | |
CN113177406A (en) | Text processing method and device, electronic equipment and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||