WO2024114335A1 - Training method and apparatus for a topic recognition model

Publication number: WO2024114335A1
Application number: PCT/CN2023/130802
Authority: WIPO (PCT)
Other languages: English (en), French (fr)
Inventors: 阎覃, 孙子钧, 张天宇, 赵薇, 柳景明
Applicant: 北京猿力未来科技有限公司

Description

  • the present application relates to the field of artificial intelligence technology, and in particular to a method for training a topic identification model.
  • the present application also relates to a training device for a topic identification model, a computing device, and a computer-readable storage medium.
  • the deep neural network method based on supervised learning requires a large amount of labeled data. Since there are many communication scenarios between teachers and parents and the number of corresponding topic types is also large, the cost of data labeling is greatly increased.
  • Although the topic clustering method based on unsupervised learning does not require training data, its automatic clustering effect is poor: a large number of meaningless topic categories are generated, which requires secondary manual intervention, and the recognition accuracy of topic categories is low.
  • the embodiment of the present application provides a training method for a topic identification model.
  • the present application also relates to a training device for a topic identification model, a computing device, and a computer-readable storage medium to solve the above problems existing in the prior art.
  • a method for training a topic identification model, comprising: acquiring first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text; training a reference topic identification model set based on the first training text and the first topic category; acquiring second initial training data, wherein the second initial training data includes a second training text; inputting the second training text into the reference topic identification model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and training a topic recognition model based on the first training data and the second training data.
  • a training device for a topic identification model comprising:
  • a first acquisition module is configured to acquire first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text;
  • a first training module is configured to train a reference topic identification model set based on the first training text and the first topic category;
  • a second acquisition module is configured to acquire second initial training data, wherein the second initial training data includes a second training text;
  • a data prediction module configured to input the second training text into the reference topic recognition model set, obtain a predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category;
  • the second training module is configured to train a topic recognition model according to the first training data and the second training data.
  • a topic identification method, including: receiving a text to be recognized; determining a target sub-text to be recognized in the text to be recognized, wherein the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized; constructing an input text to be recognized based on context information of the target sentence to be recognized; inputting the input text to be recognized into a topic recognition model, wherein the topic recognition model is trained by the training method of the topic recognition model described above; and obtaining the topic category output by the topic recognition model.
  • a topic identification device including:
  • a receiving module configured to receive text to be recognized;
  • a determination module is configured to determine a target sub-text to be recognized in the text to be recognized, wherein the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized;
  • a construction module configured to construct an input text to be recognized based on the context information of the target sentence to be recognized;
  • An input module is configured to input the input text to be recognized into a topic recognition model, wherein the topic recognition model is trained by the training method of the topic recognition model;
  • the acquisition module is configured to obtain the topic category output by the topic identification model.
  • a computing device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the training method of the topic identification model or the steps of the topic identification method when executing the computer instructions.
  • a computer-readable storage medium which stores computer instructions, and when the computer instructions are executed by a processor, the training method of the topic identification model or the steps of the topic identification method are implemented.
  • the training method of the topic identification model includes: obtaining first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text; training a reference topic identification model set based on the first training text and the first topic category; obtaining second initial training data, wherein the second initial training data includes a second training text; inputting the second training text into the reference topic identification model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and training the topic identification model according to the first training data and the second training data.
  • An embodiment of the present application realizes training the topic recognition model with less labeled training data, which reduces the labeling cost and further improves the accuracy of topic recognition.
  • In the training process of the topic recognition model, no manual intervention is required, which reduces the labor cost and improves the processing efficiency of the training process.
  • Compared with the deep neural network method based on supervised learning, the labeling cost is greatly reduced; compared with the topic clustering method based on unsupervised learning, the accuracy of topic recognition is higher.
  • FIG1 is a schematic diagram of a scenario of a method for training a topic recognition model provided by an embodiment of the present application
  • FIG2 is a flow chart of a method for training a topic recognition model provided in an embodiment of the present application
  • FIG3 is a schematic diagram of constructing a first training text provided by an embodiment of the present application.
  • FIG4 is a processing flow chart of a method for training a topic recognition model applied to a dialogue communication scenario provided by an embodiment of the present application
  • FIG5 is a schematic diagram of the structure of a training device for a topic recognition model provided by an embodiment of the present application.
  • FIG6 is a flow chart of a topic identification method provided by an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of a topic identification device provided by an embodiment of the present application.
  • FIG8 is a structural block diagram of a computing device provided in an embodiment of the present application.
  • The terms "first", "second", etc. may be used to describe various information in one or more embodiments of the present application, but the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, without departing from the scope of one or more embodiments of the present application, the first may also be referred to as the second, and similarly, the second may also be referred to as the first.
  • The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • Conversation topic identification refers to using a model to predict the topic of each sentence in a conversation.
  • Semi-supervised learning is a machine learning method that combines a small amount of labeled data with a large amount of unlabeled data during training to enhance the performance of the model.
  • BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model for natural language processing.
  • BERT is an autoencoding language model that can obtain a vector representation of a text by encoding the bidirectional information of the text.
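  • As an illustration of how such a vector representation can be obtained, the following is a minimal sketch using the Hugging Face transformers library; the bert-base-chinese checkpoint and the helper name are illustrative assumptions rather than choices prescribed by the present application.

```python
# Minimal sketch: encode a text with BERT and average-pool the token vectors
# into one sentence vector, as described for the models in this application.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")

def sentence_vector(text: str) -> torch.Tensor:
    """Return an average-pooled vector representation of the input text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state: [1, seq_len, hidden]; mask out padding before pooling
    mask = inputs["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)
```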
  • a training method for a topic identification model is provided.
  • the present application also relates to a training device for a topic identification model, a topic identification method, a topic identification device, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
  • Figure 1 shows a scenario diagram of a method for training a topic identification model provided according to an embodiment of the present application.
  • the method for training a topic identification model provided by the present application is divided into two stages. The first stage includes two processes: data preparation and model training.
  • For the first training data, the topic category corresponding to the received text to be identified is obtained by annotating the text to be identified.
  • the text to be identified and the topic category corresponding to the text to be identified constitute the first training data.
  • the reference topic identification model set is trained based on the first training data.
  • the second initial training data is obtained, and the second initial training data is input into the reference topic recognition model set to obtain the predicted topic category corresponding to the second initial training data.
  • the predicted topic category corresponding to the second initial training data and the second initial training data constitute the second training data.
  • the first training data and the second training data are used as sample training data of the topic recognition model to train the topic recognition model to obtain the trained topic recognition model.
  • the training method of the topic identification model provided in this application trains the topic identification model through a small amount of labeled data, which reduces the labeling cost while further improving the accuracy of topic identification and improving the processing efficiency of the training process.
  • FIG2 shows a flow chart of a method for training a topic recognition model according to an embodiment of the present application, which specifically includes the following steps:
  • Step 202 Acquire first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text.
  • the first training text refers to sample training text data used to train the reference topic recognition model set, which may be dialogue text, dialogue voice, etc.
  • the voice of the teacher and the parent may be the first training text
  • the content of the WeChat chat between the teacher and the parent may be the first training text.
  • the first topic category refers to the actual topic category corresponding to the first training text. For example, if the first training text is "Teacher: Hello, hello, hello parents, I am the academic affairs teacher of Zebra.", then the first topic category corresponding to the first training text is "opening remarks". Accordingly, the first training data is the first training text and the first topic category corresponding to the first training text, that is, the first training data is the training data after the topic category is annotated.
  • obtaining the first training data is obtaining sample training text data for training a reference topic recognition model set and actual topic categories corresponding to the sample training text data.
  • not all the acquired training text data can be used to train the reference topic recognition model set.
  • Some training text data may affect the training of the model set. Therefore, before acquiring the first training data, it is necessary to pre-screen the acquired training text data and use the qualified training text data as the first training text.
  • obtaining first training data includes:
  • a text to be recognized is received; the text to be recognized is filtered according to a filtering rule to obtain a target text to be recognized; a target sub-text to be recognized is determined in the target text to be recognized, and a first training text is constructed based on context information of the target sub-text to be recognized; and a first topic category corresponding to the first training text is obtained based on the first training text.
  • the text to be recognized specifically refers to text derived from the communication voice or communication text generated between the communication objects. Specifically, receiving the text to be recognized means receiving the communication voice or communication text generated between the communication objects.
  • the communication object refers to the object that initiates the communication in the text to be recognized. For example, in the communication between the teacher and the parents, the teacher and the parents are the communication objects.
  • the filtering rule refers to the rule used to filter out the text to be recognized that does not meet the filtering conditions.
  • the filtering rule can be to filter out unconnected calls (busy, powered off, or no signal), conversations with very short call content, etc.
  • the target text to be recognized refers to the text that remains after the text to be recognized is filtered according to the filtering rules.
  • the text to be recognized includes subtext 1 to be recognized, subtext 2 to be recognized, subtext 3 to be recognized, subtext 4 to be recognized and subtext 5 to be recognized, among which subtext 2 to be recognized and subtext 3 to be recognized do not meet the filtering conditions. Therefore, subtext 2 to be recognized and subtext 3 to be recognized are filtered out from the text to be recognized, and the obtained subtext 1 to be recognized, subtext 4 to be recognized and subtext 5 to be recognized constitute the target text to be recognized.
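  • The filtering step can be sketched as follows; the record fields, call statuses, and minimum-length threshold below are hypothetical illustrations of the filtering rules described above, not values prescribed by the present application.

```python
# Sketch of the filtering rules: drop unconnected calls (busy, powered off,
# no signal) and conversations whose call content is too short.
MIN_TURNS = 4  # hypothetical threshold for "short call content"

def passes_filter(record: dict) -> bool:
    """Keep only connected calls whose transcripts are long enough."""
    if record.get("status") in {"busy", "powered_off", "no_signal"}:
        return False  # unconnected call
    return len(record.get("turns", [])) >= MIN_TURNS

def filter_texts(records: list[dict]) -> list[dict]:
    """Return the target texts to be recognized after applying the filter rules."""
    return [r for r in records if passes_filter(r)]
```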
  • the target sub-text to be identified can be understood as the sub-text whose topic category needs to be determined in the target text to be identified, which is composed of the target sentence to be identified and the target object corresponding to the target sentence to be identified.
  • For example, in the target sub-text to be identified "Teacher: Hello, because I am responsible for the baby's future studies.", "Hello, because I am responsible for the baby's future studies." is the target sentence to be identified, and "teacher" is the target object.
  • the context information of the target sub-text to be identified can be understood as the previous subtext and the following subtext adjacent to the target subtext to be identified.
  • the target subtext to be identified is "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.”
  • the previous subtext adjacent to the target subtext to be identified is "Parents: Ah, hello.”
  • the following subtext adjacent to the target subtext to be identified is "Teacher: Is our baby going to the first grade?".
  • the previous subtext and the following subtext adjacent to the target subtext to be identified are "Parents: Ah, hello.” and “Teacher: Is our baby going to the first grade?" respectively.
  • a text to be recognized is received, and the text to be recognized is filtered and screened according to filtering rules, the filtered text to be recognized is determined as a target text to be recognized, a target sub-text to be recognized that needs to be determined as a subject category is determined in the target text to be recognized, and then the corresponding preceding sub-text and following sub-text are obtained based on the target sub-text to be recognized, the target sub-text to be recognized and the preceding sub-text and following sub-text are concatenated to obtain a first training text, and finally, the actual subject category corresponding to the first training text is obtained based on the first training text.
  • the text to be recognized is received, and based on the filtering rules, subtext 2 and subtext 3 in the text to be recognized are filtered out to obtain the target text to be recognized consisting of subtext 1, subtext 4 and subtext 5.
  • In the target text to be recognized, "Teacher: This is the case, because we are old users of Zebra, and now we have a student aid action plan for old users." is determined as the target subtext to be recognized, and based on the target subtext to be recognized, the corresponding previous subtext "Parent: Ah, hello." and the following subtext "Teacher: Is our baby going to the first grade?" are obtained.
  • FIG3 is a schematic diagram of constructing a first training text provided by an embodiment of the present application.
  • the target text to be recognized contains N target sub-texts to be recognized, and each target sub-text to be recognized is composed of a target sentence to be recognized and a target object corresponding to the target sentence to be recognized, expressed as "r1: s1, r2: s2, ..., rN: sN", wherein s is the target sentence to be recognized, expressed as "s1, s2, ..., sN", and r is the target object corresponding to the target sentence to be recognized, expressed as "r1, r2, ..., rN", and the format of the first training text obtained by splicing the target sub-text to be recognized and the context sub-text adjacent to it is "[CLS]r1: s1[SEP]r2: s2[SEP]...[SEP]rN: sN[SEP]".
  • the first training text obtained by concatenating the target sub-text to be identified and its adjacent context sub-text is "[CLS] Parent: Ah, hello. [SEP] Teacher: It's like this, because we are old users of Zebra, and now we have a school aid action plan for old users. [SEP] Teacher: Is our baby going to the first grade? [SEP]", among which the [CLS] symbol is a special classification embedding character, which is a sentence start mark.
  • the [CLS] symbol is inserted at the beginning of the text and is used in text classification tasks; the [SEP] symbol can be understood as a separator, which serves as a segmentation mark and a sentence end mark and is used to separate the sentences in the spliced text.
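  • A minimal sketch of this splicing step is shown below; the helper name and the (role, sentence) tuple layout are assumptions for illustration.

```python
# Sketch: splice the target sub-text with its adjacent context into the
# "[CLS]r1: s1[SEP]r2: s2[SEP]r3: s3[SEP]" format described above.
def build_first_training_text(prev: tuple[str, str] | None,
                              target: tuple[str, str],
                              nxt: tuple[str, str] | None) -> str:
    """Each argument is a (role, sentence) pair; prev/nxt may be None when the
    target sentence is the first or last sentence of the dialogue."""
    parts = [p for p in (prev, target, nxt) if p is not None]
    return "[CLS]" + "".join(f"{role}: {sentence}[SEP]" for role, sentence in parts)

text = build_first_training_text(
    ("Parent", "Ah, hello."),
    ("Teacher", "It's like this, because we are old users of Zebra, and now we "
                "have a student aid action plan for old users."),
    ("Teacher", "Is our baby going to the first grade?"),
)
```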
  • the training method of the topic recognition model provided in the present application pre-processes the acquired text to be recognized by the above process before training the topic recognition model, so that the first training text used for training does not contain redundant data that is not conducive to training, thereby improving the training efficiency.
  • a target subtext to be recognized is determined in a target text to be recognized, and a first training text is constructed based on context information of the target subtext to be recognized, including:
  • context information of the target sentence to be recognized and target objects respectively corresponding to the context information are acquired; and a first training text is constructed according to the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, the context information, and the target objects respectively corresponding to the context information.
  • the target sentence to be recognized refers to the text sentence contained in the target sub-text to be recognized.
  • the target sub-text to be recognized is "Teacher: It is like this, because we are old users of Zebra, and now we have a student aid action plan for old users.”
  • the target sentence to be recognized is "It is like this, because we are old users of Zebra, and now we have a student aid action plan for old users.”
  • the context information of the target sentence to be recognized is the previous sentence and the following sentence adjacent to the target sentence to be recognized.
  • For the target sentence to be recognized "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the previous sentence adjacent to it is "Ah, hello." and the following sentence adjacent to it is "Is our baby going to the first grade?".
  • the target object specifically refers to the communication object that initiates the target sentence to be recognized.
  • the target object of the target sentence to be recognized is the teacher. Accordingly, based on this method, the target object corresponding to the context information of the target sentence to be recognized can be determined.
  • a target sentence to be recognized is determined in the target text to be recognized, and based on the target sentence to be recognized, adjacent preceding and following sentences are obtained, and a target object corresponding to each sentence is determined according to its initiating object, and each sentence is concatenated with its corresponding target object to construct a first training text.
  • the target sentence to be recognized is not necessarily the middle sentence in the target text to be recognized, but can also be the first sentence or the last sentence in the target text to be recognized.
  • the target sentence to be recognized is the first sentence in the target text to be recognized, it is only necessary to obtain the following sentence of the target sentence to be recognized and continue to perform subsequent operations; when the target sentence to be recognized is the last sentence in the target text to be recognized, it is only necessary to obtain the preceding sentence of the target sentence to be recognized and continue to perform subsequent operations.
  • the training method of the topic identification model provided in the present application not only obtains the topic category of the target sentence to be identified, but also further determines the topic category by combining the context information of the target sentence to be identified, thereby improving the accuracy of determining the topic category of the target sentence to be identified.
  • receiving a text to be recognized includes:
  • a speech to be recognized is received, and the speech to be recognized is converted into the text to be recognized.
  • the speech to be recognized can be understood as an audio file generated during the voice communication between the communication objects, which can be a voice recording, a telephone recording, etc.
  • receiving the speech to be recognized refers to receiving the communication speech generated between the communication objects.
  • the telephone communication dialogue between the teacher and the parents is the speech to be recognized.
  • After receiving the speech to be recognized, a speech recognition model is used to convert the speech to be recognized into the text to be recognized, and then the subsequent processing continues. Alternatively, other methods may be used to convert the speech to be recognized into the text to be recognized, which is not limited in this application.
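  • This conversion step can be sketched as follows, assuming the open-source Whisper model as the speech recognition model; the present application does not prescribe a specific recognizer, so this choice is an assumption.

```python
# Sketch: transcribe a recorded call into the text to be recognized with
# Whisper (one possible speech recognition model).
import whisper

asr_model = whisper.load_model("base")

def speech_to_text(audio_path: str) -> str:
    """Convert the speech to be recognized into the text to be recognized."""
    result = asr_model.transcribe(audio_path)
    return result["text"]
```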
  • Not only can the text to be recognized between the communication objects be used as the content for subsequent processing, but the speech to be recognized between the communication objects can also be received and then converted into the text to be recognized for subsequent processing.
  • In this way, the scope of application of the embodiments provided by the present application can be expanded without being limited to communication text, and the user experience can be improved.
  • the training method of the topic recognition model provided in the present application can not only use text files as training data, but also obtain audio data as training data, thereby expanding the scope of application.
  • the obtained training data is preprocessed to remove data that does not meet the conditions, reduce the redundancy of the training data, and thus improve the accuracy of the subsequent training topic recognition model.
  • Step 204 Train a reference topic identification model set based on the first training text and the first topic category.
  • the reference topic identification model is a topic identification model trained based on the first training text and the first topic category corresponding to the first training text, and the reference topic identification model is not used as the final topic identification model.
  • training a reference topic identification model set based on the first training text and the first topic category includes:
  • an initial reference topic identification model is acquired, and a plurality of different hyperparameters are set for the initial reference topic identification model to obtain a reference topic identification model set; and each reference topic identification model in the reference topic identification model set is trained based on the first training text and the first topic category.
  • the initial reference topic recognition model specifically refers to a topic recognition model that has not been trained. Specifically, the initial reference topic recognition model is obtained, and a plurality of different hyperparameters are set for the initial reference topic recognition model for training. Different reference topic recognition models are obtained by training according to different hyperparameters. These different reference topic recognition models constitute a reference topic recognition model set. Then, based on the above acquisition method, the first training text and the first topic category corresponding to the first training text are obtained, and each reference topic recognition model is trained using the first training text and the first topic category.
  • the training method of the topic identification model provided in the present application obtains an initial reference identification model, sets different hyperparameters for the initial reference identification model, obtains a reference topic identification model set, and trains the reference topic identification model using a first training text and a first topic category.
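  • A minimal sketch of building the reference topic identification model set is shown below; the hyperparameter values are illustrative, and TopicClassifier and train_topic_model are assumed helpers that are sketched together with the training-loop discussion further below.

```python
# Sketch: train the same initial model several times under different
# hyperparameters, yielding the reference topic recognition model set.
import torch

HYPERPARAMS = [  # illustrative settings, not values from the application
    {"lr": 2e-5, "dropout": 0.1, "seed": 1},
    {"lr": 3e-5, "dropout": 0.2, "seed": 2},
    {"lr": 5e-5, "dropout": 0.1, "seed": 3},
]

def build_reference_set(train_texts, train_labels, num_classes):
    """Train one reference topic recognition model per hyperparameter setting."""
    models = []
    for hp in HYPERPARAMS:
        torch.manual_seed(hp["seed"])  # vary the initialization per model
        model = TopicClassifier(num_classes, dropout=hp["dropout"])
        train_topic_model(model, train_texts, train_labels, lr=hp["lr"])
        models.append(model)
    return models
```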
  • training each reference topic identification model in the reference topic identification model set based on the first training text and the first topic category includes:
  • a target reference topic recognition model is determined in the reference topic recognition model set; the first training text is input into the target reference topic recognition model to obtain a first predicted topic category; a reference loss value is calculated based on the first topic category and the first predicted topic category; and the model parameters of the target reference topic recognition model are adjusted according to the reference loss value, and the target reference topic recognition model is continued to be trained until a training stop condition is reached.
  • the target reference topic recognition model refers to the reference topic recognition model that needs to be trained and is determined in the reference topic recognition model set;
  • the first training text refers to the communication text obtained in the sample communication text set, which is the training sample of the target reference topic recognition model;
  • the sample communication text set refers to a set of communication texts obtained by collecting the text content in the communication voice or communication text;
  • the first topic category refers to the actual topic category corresponding to the first training text;
  • the first predicted topic category refers to the topic category output by inputting the first training text into the target reference topic recognition model;
  • the reference loss value refers to the difference value between the first topic category and the first predicted topic category, and is used to measure the difference between the first topic category and the first predicted topic category.
  • the first training text is obtained by the above-mentioned method of obtaining the first training text, and the first training text is input into the target reference topic recognition model.
  • the target reference topic recognition model is used to identify the topic category of the first training text.
  • the target reference topic recognition model is a model that has not been trained yet. There will be a deviation between the identified first predicted topic category and the actual first topic category, and the model parameters of the target reference topic recognition model need to be adjusted accordingly.
  • the reference loss value of the target reference topic recognition model is calculated according to the output first predicted topic category and the first topic category.
  • the loss function for calculating the reference loss value can be a 0-1 loss function, an absolute value loss function, a square loss function, a cross entropy loss function, etc.
  • the cross entropy function is selected as the loss function for calculating the reference loss value, and the model parameters of the target reference topic recognition model are adjusted according to the reference loss value.
  • the adjusted model parameters are used for the next batch of first training texts to continue training the target reference topic recognition model until the stop condition of the model training is reached.
  • the model training stopping conditions include that the model reference loss value is less than a preset threshold and/or the training rounds reach a preset round.
  • the preset threshold is 0.3.
  • the preset number of training rounds is 30 rounds.
  • When the training rounds on the first training text reach 30, the training of the target reference topic recognition model is considered to be completed.
  • When the two training stop conditions, a preset threshold and a preset number of training rounds, are both set, the reference loss value and the training rounds are monitored at the same time, and the training of the target reference topic recognition model is considered to be completed when either condition is met.
  • In the process of training the target reference topic recognition model, the model is trained in a supervised manner based on the first topic category corresponding to the first training text: a vector representation of each word is obtained and then average-pooled to obtain a vector representation of each sentence, that is, a vector representation of the target sentence to be recognized combined with its context (the vector representation of the first training text); the vector representation of each sentence is then linearly transformed to obtain the probability of each category corresponding to the first training text, and the topic category with the highest output probability is used as the topic category of the first training text, that is, the first predicted topic category.
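  • The supervised training described above can be sketched as follows: BERT token vectors are average-pooled into a sentence vector and linearly transformed into class probabilities, and the model is optimized with cross-entropy loss until the loss falls below 0.3 or 30 training rounds are reached. The class and function names are assumptions, and the whole training set is processed as a single batch for brevity.

```python
# Sketch of the model and training loop described above.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TopicClassifier(nn.Module):
    def __init__(self, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-chinese")
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # average pooling over words
        return self.linear(self.dropout(pooled))       # logits per topic category

def train_topic_model(model, texts, labels, lr=2e-5, max_rounds=30,
                      loss_threshold=0.3):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    targets = torch.tensor(labels)
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512,
                    return_tensors="pt")
    for _ in range(max_rounds):                 # stop condition: preset rounds (30)
        logits = model(enc["input_ids"], enc["attention_mask"])
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:        # stop condition: threshold (0.3)
            break
    return model
```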
  • the unprocessed communication text is annotated through data annotation so that each communication text corresponds to a labeled topic category.
  • the process of data annotation is: first, a trial annotation is performed on a small-scale data set. If the trial annotation is correct and reasonable, the annotated data is used as a test set, and large-scale data annotation is started to annotate more training data.
  • each communication text is annotated by at least two annotators, and the consistency between the different annotation results is then calculated. If the annotation results do not meet the annotation conditions, the text needs to be re-annotated.
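  • The consistency check between annotators can be sketched with Cohen's kappa, one common agreement measure; the present application does not name a specific metric, and the agreement threshold below is an assumption.

```python
# Sketch: measure agreement between two annotators over the same texts.
from sklearn.metrics import cohen_kappa_score

def annotation_is_consistent(labels_a: list[str], labels_b: list[str],
                             threshold: float = 0.8) -> bool:
    """True when the annotators agree closely enough; otherwise re-annotate."""
    return cohen_kappa_score(labels_a, labels_b) >= threshold
```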
  • the training method of the topic recognition model provided in the present application can obtain multiple different reference topic recognition models by training a reference topic recognition model set through a first training text and a first topic category corresponding to the first training text. In the subsequent process of predicting topic categories, randomness of results can be avoided and the accuracy of results can be improved.
  • Step 206 Acquire second initial training data, wherein the second initial training data includes a second training text.
  • the second training text refers to sample training text data used to train the topic recognition model, which can be dialogue text, dialogue voice, etc.
  • the telephone communication voice between the teacher and the parent can be the second training text
  • the WeChat chat content between the teacher and the parent can be the second training text.
  • the second training text is training text data without subject category annotation, so the second initial training data only includes the second training text.
  • the method process for obtaining the second initial training data is the same as the method process for obtaining the first training text.
  • the specific acquisition process can refer to the acquisition process for obtaining the first training text, and this application will not repeat it here.
  • Step 208 Input the second training text into the reference topic identification model set, obtain the predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category.
  • the predicted topic category specifically refers to the topic category output by the reference topic recognition model set after the second training text is input into it, and the second training data specifically includes the second training text and the predicted topic category corresponding to the second training text.
  • the second training text is input into the reference topic recognition model set, and then the predicted topic category corresponding to the second training text output by the reference topic recognition model set can be obtained, and the second training data can be obtained based on the second training text and the predicted topic category corresponding to the second training text.
  • inputting the second training text into the reference topic identification model set to obtain a predicted topic category corresponding to the second training text includes:
  • the second training text is respectively input into each reference topic identification model in the reference topic identification model set to obtain the topic category output by each reference topic identification model; the output topic categories constitute a set of topic categories to be determined; and the predicted topic category corresponding to the second training text is determined in the set of topic categories to be determined.
  • the set of topic categories to be determined refers to a set consisting of the topic categories corresponding to the second training text output by each reference topic recognition model. For example, if the reference topic recognition model set includes 10 reference topic recognition models, then inputting the second training text into the reference topic recognition model set yields 10 topic categories corresponding to the second training text.
  • the set consisting of these 10 topic categories is the set of topic categories to be determined. It should be noted that these 10 to-be-determined topic categories may all be different, or some of them may be repeated.
  • the second training text is respectively input into each reference topic recognition model in the reference topic recognition model set to obtain the topic category corresponding to the second training text output by each reference topic recognition model
  • the topic categories corresponding to the second training text output by each reference topic recognition model constitute a set of to-be-determined topic categories
  • a to-be-determined topic category is determined in the set of to-be-determined topic categories as the predicted topic category corresponding to the second training text.
  • the second training text is input into the reference topic recognition model set, that is, the second training text is respectively input into the 10 reference topic recognition models in the reference topic recognition model set, based on which 10 output results of the 10 reference topic recognition models can be obtained, that is, 10 undetermined topic categories corresponding to the second training texts, and one undetermined topic category is determined among these 10 undetermined topic categories as the predicted topic category corresponding to the second training text.
  • the training method of the topic identification model provided in the present application inputs the unlabeled second training text into the reference topic identification model set, and determines the predicted topic category corresponding to the second training text in the results output by the reference topic identification model set. By synthesizing the results of multiple topic categories to be determined, the predicted topic category corresponding to the second training text is further determined, so that the accuracy of the predicted topic category of the second training text is higher.
  • the set of topic categories to be determined includes at least one to-be-determined topic category and count information of each to-be-determined topic category;
  • determining the predicted topic category corresponding to the second training text in the set of topic categories to be determined includes:
  • determining the to-be-determined topic category with the largest count as the predicted topic category corresponding to the second training text.
  • the count information of each to-be-determined topic category can be understood as the number of times each to-be-determined topic category appears in the set.
  • For example, the set of topic categories to be determined includes 10 to-be-determined topic categories: 5 of category A, 3 of category B, and 2 of category C.
  • the corresponding representation of the set of topic categories to be determined is {A-5, B-3, C-2}.
  • Based on the corresponding count information, the to-be-determined topic category with the largest count is determined as the predicted topic category corresponding to the second training text.
  • the predicted topic category corresponding to the second training text is determined among the 5 A's, 3 B's, and 2 C's in the set. Since category A has the largest count among the 10 to-be-determined topic categories, category A can be determined as the predicted topic category corresponding to the second training text.
  • When several to-be-determined topic categories are tied for the largest count, one of them can be randomly selected as the predicted topic category of the second training text.
  • For example, if the to-be-determined topic category set is {A-4, B-4, C-1, D-1}, then among the four to-be-determined topic categories A, B, C, and D, categories A and B are tied for the largest count, so either category A or category B can be determined as the predicted topic category of the second training text.
  • the training method of the topic identification model provided in the present application can, after determining each to-be-determined topic category in the to-be-determined topic category set and the count information corresponding to each, determine the to-be-determined topic category with the largest count as the predicted topic category of the second training text.
  • the above method can improve the accuracy of predicting the topic category corresponding to the second training text.
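  • The prediction and majority-vote selection described above can be sketched as follows; predict_topic reuses the TopicClassifier sketched earlier, and the id2label mapping is an illustrative assumption.

```python
# Sketch: each reference model votes a to-be-determined topic category for the
# unlabeled second training text; the category with the largest count wins,
# with ties broken randomly as described above.
import random
from collections import Counter
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def predict_topic(model, text: str, id2label: dict[int, str]) -> str:
    """Run one reference model on the text and return its topic category."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(enc["input_ids"], enc["attention_mask"])
    return id2label[int(logits.argmax(dim=-1))]

def pseudo_label(reference_models, text: str, id2label) -> str:
    counts = Counter(predict_topic(m, text, id2label) for m in reference_models)
    top = max(counts.values())                       # e.g. {A: 5, B: 3, C: 2} -> 5
    tied = [cat for cat, n in counts.items() if n == top]
    return random.choice(tied)                       # random tie-break
```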
  • Step 210 Train a topic recognition model based on the first training data and the second training data.
  • the topic recognition model refers to the topic recognition model that needs to be trained. At this time, the topic recognition model is a topic recognition model that has not been trained. After obtaining the first training data and the second training data, the topic recognition model can be trained according to the first training data and the second training data.
  • the specific training method is as follows:
  • training a topic identification model according to the first training data and the second training data includes:
  • the first training text and the second training text are determined as sample training texts, and the first topic category and the predicted topic category are determined as topic category labels; the sample training texts are input into the topic identification model to obtain predicted category labels; a loss value is calculated based on the topic category labels and the predicted category labels; and the model parameters of the topic identification model are adjusted according to the loss value, and the topic identification model is continuously trained until a training stop condition is reached.
  • the sample training text includes the first training text and the second training text, which are the training samples of the topic identification model;
  • the topic category label includes the first topic category and the predicted topic category, which are the actual topic categories corresponding to the first training text and the second training text;
  • the predicted category label refers to the topic category output by inputting the sample training text into the topic identification model;
  • the loss value refers to the difference value between the topic category label and the predicted category label, which is used to measure the difference between the topic category label and the predicted category label.
  • the first training text and the second training text obtained are determined as sample training texts, and the sample training texts are input into the topic identification model.
  • the topic identification model is used to identify the topic category of the sample training text.
  • the topic identification model is an untrained model, and there will be a deviation between the identified predicted category label and the actual topic category label.
  • the model parameters of the topic identification model need to be adjusted accordingly.
  • the loss value of the topic identification model is calculated according to the output predicted category label and the topic category label.
  • the loss function for calculating the loss value can be a 0-1 loss function, an absolute value loss function, a square loss function, a cross entropy loss function, etc. in actual applications.
  • the cross entropy function is selected as the loss function for calculating the loss value, and the model parameters of the topic identification model are adjusted according to the loss value.
  • the adjusted model parameters are used for the next batch of sample training texts to continue training the topic identification model until the stop condition of the model training is reached.
  • the model training stopping conditions include that the model loss value is less than a preset threshold and/or the training rounds reach a preset round.
  • the preset threshold is 0.3.
  • Taking the preset training rounds as the training stop condition as an example, the preset number of training rounds is 30. When the training rounds on the sample training texts reach 30, the training of the topic identification model is considered completed.
  • When the two training stop conditions, a preset threshold and a preset number of training rounds, are both set, the loss value and the training rounds are monitored at the same time, and the training of the topic identification model is considered to be completed when either condition is met.
  • In the process of training the topic identification model, the model is trained in a supervised manner based on the topic category labels corresponding to the sample training texts: a vector representation of each word is obtained and then average-pooled to obtain a vector representation of each sentence, i.e., a vector representation of the sample training text; the vector representation of each sentence is then linearly transformed to obtain the probability of each category corresponding to the sample training text, and the topic category with the highest output probability is used as the topic category of the sample training text, i.e., the predicted category label.
  • the training method of the topic identification model includes: obtaining first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text; training a reference topic identification model set based on the first training text and the first topic category; obtaining second initial training data, wherein the second initial training data includes a second training text; inputting the second training text into the reference topic identification model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and training the topic identification model according to the first training data and the second training data.
  • An embodiment of the present application realizes training the topic recognition model with less labeled training data, which reduces the labeling cost and further improves the accuracy of topic recognition.
  • In the training process of the topic recognition model, no manual intervention is required, which reduces the labor cost and improves the processing efficiency of the training process.
  • Compared with the deep neural network method based on supervised learning, the labeling cost is greatly reduced; compared with the topic clustering method based on unsupervised learning, the accuracy of topic recognition is higher.
  • Figure 4 shows a processing flow chart of a training method of a topic identification model applied to a dialogue communication scenario provided by an embodiment of the present application, which specifically includes the following steps:
  • Step 402 Receive the text to be recognized, filter the text to be recognized according to filtering rules, and obtain the target text to be recognized.
  • chat texts generated in WeChat chats between parents and teachers are received, chat texts that do not meet the filtering conditions are filtered out according to the filtering rules, and chat texts that meet the filtering rules are obtained and used as target texts to be identified.
  • Step 404 Determine a target sub-text to be recognized in the target text to be recognized, wherein the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized.
  • the sub-text that needs to determine the subject category in the target text to be identified is determined as the target sub-text to be identified.
  • the target sub-text to be identified is "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.”
  • "teacher” is the target object, and "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.” is the target sentence to be identified.
  • Step 406 Obtain the context information of the target sentence to be recognized and the target object corresponding to the context information.
  • For the target sentence to be recognized "It's like this, because we are old users of Zebra, and now we have a learning assistance action plan for old users.", the preceding information is obtained as "Parent: Ah, hello." and the following information is obtained as "Teacher: Is our baby going to the first grade?".
  • "Parent” is the target object corresponding to the preceding information of the target sentence to be recognized
  • "Teacher” is the target object corresponding to the following information of the target sentence to be recognized.
  • Step 408 Construct a first training text based on the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, the context information, and the target object corresponding to the context information.
  • the target sentence to be recognized "It's like this, because we are old users of Zebra, and now we have a learning aid action plan for old users.", the preceding information of the target sentence to be recognized "Parents: Ah, hello.", and the following information "Teacher: Is our baby going to the first grade?" are spliced together, and the first training text after splicing is "[CLS] Parents: Ah, hello. [SEP] Teacher: It's like this, because we are old users of Zebra, and now we have a learning aid action plan for old users. [SEP] Teacher: Is our baby going to the first grade? [SEP]".
  • Step 410 Acquire a first topic category corresponding to the first training text based on the first training text.
  • the first topic category corresponding to the first training text has been annotated, and the specific annotation process is as described above. Therefore, the first topic category corresponding to the first training text can be directly obtained according to the first training text.
  • the first topic category corresponding to the first training text is obtained as "opening remarks, free class-free class inquiry, grade confirmation".
  • Step 412 Obtain an initial reference topic recognition model, and set different hyperparameters for the initial reference topic recognition model to obtain a reference topic recognition model set.
  • an initial reference topic recognition model is obtained, and multiple different hyperparameters are set for the initial reference topic recognition model for training. Based on the different hyperparameters, multiple reference topic recognition models can be obtained.
  • the topic recognition model set composed of these reference topic recognition models is the reference topic recognition model set.
  • Step 414 Input the first training text into each reference topic identification model in the reference topic identification model set, and train each reference topic identification model based on the first topic category.
  • Step 416 Obtain a second training text, and input the second training text into each reference topic identification model in the reference topic identification model set.
  • the second training text is obtained based on the same acquisition method as the first training text, and the second training text is input into each reference topic recognition model.
  • Step 418 Based on the output results of each reference topic identification model, a set of to-be-determined topic categories corresponding to the second training text is obtained.
  • the set of to-be-determined topic categories corresponding to the second training text is obtained as ⁇ A-5, B-3, C-2 ⁇ .
  • Step 420 Determine the to-be-determined topic category with the largest count in the to-be-determined topic category set as the predicted topic category corresponding to the second training text.
  • category A is determined to be the predicted topic category corresponding to the second training text.
  • Step 422 Determine the first training text and the second training text as sample training texts, and determine the first topic category and the predicted topic category as topic category labels.
  • Step 424 Input the sample training text into a topic identification model, and train the topic identification model based on the topic category label.
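  • The overall flow of steps 402 to 424 can be tied together as in the sketch below, reusing the helpers from the earlier sketches; the label set, the data layout, and the function names are illustrative assumptions.

```python
# Sketch: end-to-end semi-supervised training pipeline for the topic model.
LABELS = ["opening remarks", "grade confirmation", "free class inquiry"]  # illustrative
label2id = {c: i for i, c in enumerate(LABELS)}
id2label = {i: c for c, i in label2id.items()}

def train_pipeline(labeled, unlabeled):
    """labeled: list of (spliced_text, topic_category); unlabeled: list of texts."""
    texts = [t for t, _ in labeled]
    ys = [label2id[c] for _, c in labeled]
    # Steps 412-414: train the reference model set on the labeled first data.
    reference_models = build_reference_set(texts, ys, num_classes=len(LABELS))
    # Steps 416-420: pseudo-label the unlabeled second training texts.
    pseudo = [(t, label2id[pseudo_label(reference_models, t, id2label)])
              for t in unlabeled]
    # Steps 422-424: train the final topic model on both labeled and pseudo data.
    all_texts = texts + [t for t, _ in pseudo]
    all_ys = ys + [y for _, y in pseudo]
    final_model = TopicClassifier(num_classes=len(LABELS))
    return train_topic_model(final_model, all_texts, all_ys)
```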
  • the training method of the topic identification model realizes the training of the topic identification model through less labeled training data, while reducing the labeling cost, further improving the accuracy of topic identification, and in the training process of the topic identification model, no manual secondary intervention is required, which reduces the labor cost and can improve the processing efficiency of the training process.
  • Compared with the deep neural network method based on supervised learning, the labeling cost is greatly reduced; compared with the topic clustering method based on unsupervised learning, the accuracy of topic identification is higher.
  • FIG5 shows a schematic diagram of the structure of a training device for a topic identification model provided in an embodiment of the present application.
  • the device includes:
  • a first acquisition module 502 is configured to acquire first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text;
  • a first training module 504 is configured to train a reference topic identification model set based on the first training text and the first topic category;
  • a second acquisition module 506 is configured to acquire second initial training data, wherein the second initial training data includes a second training text;
  • the data prediction module 508 is configured to input the second training text into the reference topic identification model set, obtain the predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category;
  • the second training module 510 is configured to train a topic recognition model according to the first training data and the second training data.
  • the first training module 504 is further configured to:
  • acquire an initial reference topic identification model, and set a plurality of different hyperparameters for the initial reference topic identification model to obtain a reference topic identification model set; and train each reference topic identification model in the reference topic identification model set based on the first training text and the first topic category.
  • the first training module 504 is further configured to:
  • determine a target reference topic recognition model in the reference topic recognition model set; input the first training text into the target reference topic recognition model to obtain a first predicted topic category; calculate a reference loss value based on the first topic category and the first predicted topic category; and adjust the model parameters of the target reference topic recognition model according to the reference loss value, and continue to train the target reference topic recognition model until a training stop condition is reached.
  • the data prediction module 508 is further configured to:
  • input the second training text respectively into each reference topic identification model in the reference topic identification model set to obtain the topic category output by each reference topic identification model, the output topic categories constituting a set of topic categories to be determined; and determine the predicted topic category corresponding to the second training text in the set of topic categories to be determined.
  • the set of topic categories to be determined includes at least one to-be-determined topic category and count information of each to-be-determined topic category;
  • the data prediction module 508 is further configured to:
  • determine the to-be-determined topic category with the largest count as the predicted topic category corresponding to the second training text.
  • the second training module 510 is further configured to:
  • determine the first training text and the second training text as sample training texts, and determine the first topic category and the predicted topic category as topic category labels; input the sample training texts into the topic identification model to obtain predicted category labels; calculate a loss value based on the topic category labels and the predicted category labels; and adjust the model parameters of the topic identification model according to the loss value, and continue to train the topic identification model until a training stop condition is reached.
  • the first acquisition module 502 is further configured to:
  • receive a text to be recognized; filter the text to be recognized according to a filtering rule to obtain a target text to be recognized; determine a target sub-text to be recognized in the target text to be recognized, and construct a first training text based on context information of the target sub-text to be recognized; and obtain a first topic category corresponding to the first training text based on the first training text.
  • the first acquisition module 502 is further configured to:
  • acquire context information of the target sentence to be recognized and target objects respectively corresponding to the context information; and construct a first training text according to the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, the context information, and the target objects respectively corresponding to the context information.
  • the first acquisition module 502 is further configured to:
  • receive a speech to be recognized, and convert the speech to be recognized into the text to be recognized.
  • the training device of the topic identification model includes: a first acquisition module, configured to acquire first training data, wherein the first training data includes a first training text and a first topic category corresponding to the first training text; a first training module, configured to train a reference topic identification model set based on the first training text and the first topic category; a second acquisition module, configured to acquire second initial training data, wherein the second initial training data includes a second training text; a data prediction module, configured to input the second training text into the reference topic identification model set, obtain a predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category; and a second training module, configured to train the topic identification model according to the first training data and the second training data.
  • An embodiment of the present application realizes training the topic recognition model with less labeled training data, which reduces the labeling cost and further improves the accuracy of topic recognition.
  • In the training process of the topic recognition model, no manual secondary intervention is required, which reduces the labor cost and improves the processing efficiency of the training process.
  • Compared with the deep neural network method based on supervised learning, the labeling cost is greatly reduced; compared with the topic clustering method based on unsupervised learning, the accuracy of topic recognition is higher.
  • the above is a schematic scheme of a training device for a topic identification model of this embodiment. It should be noted that the technical scheme of the training device for the topic identification model and the technical scheme of the training method for the topic identification model mentioned above belong to the same concept, and the details of the technical scheme of the training device for the topic identification model that are not described in detail can be found in the description of the technical scheme of the training method for the topic identification model mentioned above.
  • FIG. 6 shows a flowchart of a topic identification method provided according to an embodiment of the present application, which specifically includes the following steps:
  • Step 602: Receive a text to be recognized.
  • The text to be recognized specifically refers to text obtained from the communication speech or communication messages produced between communicating parties; receiving the text to be recognized means receiving such communication speech or messages.
  • A communicating party is a party that initiates communication in the text to be recognized; for example, in communication between a teacher and a parent, the teacher and the parent are the communicating parties.
  • Step 604: Determine a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence.
  • The target sub-text to be recognized can be understood as a sub-text in the text to be recognized whose topic category needs to be determined; it consists of the target sentence to be recognized and the target object corresponding to that sentence.
  • For example, in a dialogue between a teacher and a parent, if the topic category of "Teacher: Hello, because I am responsible for the baby's future studies." needs to be determined, that line is the target sub-text to be recognized, in which "Hello, because I am responsible for the baby's future studies." is the target sentence to be recognized and "Teacher" is the target object.
  • Specifically, a sub-text whose topic category needs to be determined is identified in the text to be recognized and taken as the target sub-text to be recognized, which includes the target sentence to be recognized and its corresponding target object.
  • Step 606: Construct the input text to be recognized based on the context information of the target sentence to be recognized.
  • The context information of the target sentence can be understood as the preceding sentence and the following sentence adjacent to it.
  • For example, if the target sentence is "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the adjacent preceding sentence is "Ah, hello." and the adjacent following sentence is "Is our baby going to the first grade?".
  • The target object of each sentence can be further obtained: the target object of the target sentence is the teacher, the target object of the preceding sentence "Ah, hello." is the parent, and the target object of the following sentence "Is our baby going to the first grade?" is the teacher.
  • From the target sentence, its context sentences and the target object of each sentence, the input text to be recognized is spliced as "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]". A minimal sketch of this splicing follows.
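  • A minimal sketch of this splicing, assuming plain (role, sentence) pairs as input; in a real pipeline the [CLS]/[SEP] markers would normally be supplied by the BERT tokenizer rather than written by hand:

```python
def splice(prev, target, nxt):
    """Join (role, sentence) pairs into the model's input string."""
    parts = [f"{role}: {sent}" for role, sent in (prev, target, nxt)]
    return "[CLS]" + "[SEP]".join(parts) + "[SEP]"

text = splice(
    ("Parent", "Ah, hello."),
    ("Teacher", "It's like this, because we are old users of Zebra, "
                "and now we have a student aid action plan for old users."),
    ("Teacher", "Is our baby going to the first grade?"),
)
print(text)  # [CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, ...[SEP]
```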
  • Step 608: Input the input text to be recognized into the topic recognition model.
  • The topic recognition model is obtained through the training method described in the above embodiments.
  • The input text constructed above, "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]", is input into the topic recognition model.
  • Step 610: Obtain the topic category output by the topic recognition model.
  • The output of the topic recognition model is "opening remarks, free class - free class inquiry, grade confirmation", which is taken as the topic category of the above input text. It can then be determined that the topic category of the target sentence "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is "free class - free class inquiry", that of the preceding sentence "Ah, hello." is "opening remarks", and that of the following sentence "Is our baby going to the first grade?" is "grade confirmation".
  • the topic identification method provided in the present application includes: receiving a text to be identified; determining a target sub-text to be identified in the text to be identified, wherein the target sub-text to be identified includes a target sentence to be identified and a target object corresponding to the target sentence to be identified; constructing an input text to be identified based on context information of the target sentence to be identified; inputting the input text to be identified into a topic identification model, wherein the topic identification model is trained by the above-mentioned topic identification model training method; and obtaining a topic category output by the topic identification model.
  • An embodiment provided in the present application realizes obtaining the topic category of a target sentence to be identified by combining the context information of the target sentence to be identified, thereby improving the accuracy of topic identification.
  • FIG. 7 shows a schematic diagram of the structure of a topic identification apparatus provided by an embodiment of the present application.
  • The apparatus includes:
  • a receiving module 702, configured to receive a text to be recognized;
  • a determination module 704, configured to determine a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence;
  • a construction module 706, configured to construct an input text to be recognized based on the context information of the target sentence;
  • an input module 708, configured to input the input text to be recognized into a topic recognition model;
  • an acquisition module 710, configured to acquire the topic category output by the topic recognition model.
  • the topic identification device includes: a receiving module, configured to receive a text to be identified; a determining module, configured to determine a target sub-text to be identified in the text to be identified, wherein the target sub-text to be identified includes a target sentence to be identified and a target object corresponding to the target sentence to be identified; a constructing module, configured to construct an input text to be identified based on the context information of the target sentence to be identified; an input module, configured to input the input text to be identified into a topic identification model, wherein the topic identification model is trained by the training method of the above-mentioned topic identification model; and an obtaining module, configured to obtain the topic category output by the topic identification model.
  • An embodiment provided in the present application realizes obtaining the topic category of a target sentence to be identified by combining the context information of the target sentence to be identified, thereby improving the accuracy of topic identification.
  • FIG. 8 shows a block diagram of a computing device 800 according to an embodiment of the present application.
  • the components of the computing device 800 include but are not limited to a memory 810 and a processor 820.
  • the processor 820 is connected to the memory 810 via a bus 830, and a database 850 is used to store data.
  • the computing device 800 also includes an access device 840 that enables the computing device 800 to communicate via one or more networks 860.
  • Examples of these networks 860 include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • The access device 840 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and the like.
  • The above components of the computing device 800 and other components not shown in FIG. 8 may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is for illustrative purposes only and is not intended to limit the scope of the present application; those skilled in the art may add or replace components as needed.
  • the computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or PC.
  • the computing device 800 may also be a mobile or stationary server.
  • When the processor 820 executes the computer instructions, the steps of the training method for the topic recognition model or of the topic identification method are implemented.
  • the above is a schematic scheme of a computing device of this embodiment. It should be noted that the technical scheme of the computing device and the above-mentioned training method of the topic identification model or the technical scheme of the topic identification method belong to the same concept, and the details not described in detail in the technical scheme of the computing device can be found in the description of the above-mentioned training method of the topic identification model or the technical scheme of the topic identification method.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer instructions, which, when executed by a processor, implement the training method of the topic identification model as described above or the steps of the topic identification method.
  • the above is a schematic scheme of a computer-readable storage medium of this embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the training method of the topic identification model or the topic identification method mentioned above belong to the same concept, and the details not described in detail in the technical scheme of the storage medium can be found in the description of the technical scheme of the training method of the topic identification model or the topic identification method mentioned above.


Abstract

The present application provides a training method and apparatus for a topic recognition model. The training method includes: acquiring first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text; training a reference topic recognition model set based on the first training text and the first topic category; acquiring second initial training data, where the second initial training data includes a second training text; inputting the second training text into the reference topic recognition model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and training a topic recognition model according to the first training data and the second training data. The method trains the topic recognition model with a small amount of labeled data, which reduces the labeling cost while further improving the accuracy of topic recognition and the processing efficiency of the training process.

Description

Training method and apparatus for a topic recognition model
This application claims priority to Chinese patent application No. 202211519609.0, entitled "Training method and apparatus for a topic recognition model", filed with the Chinese Patent Office on November 30, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to a training method for a topic recognition model. The present application also relates to a training apparatus for a topic recognition model, a computing device, and a computer-readable storage medium.
Background
With the progress and development of electronic technology, artificial intelligence has gradually entered the public eye and is widely applied, affecting people's lives. Through artificial intelligence, speech can be converted into text, and text can be classified and its topics identified. For example, when teachers and parents communicate by telephone or WeChat, the topic of every sentence in the communication needs to be predicted. Intelligent analysis of the teachers' and parents' telephone and WeChat chat content can quickly structure the communication content, and the analysis results can help managers quickly judge a teacher's communication quality and summarize communication skills, thereby improving management efficiency and achieving high-quality inspection.
However, deep neural network methods based on supervised learning require a large amount of labeled data. Since there are many communication scenarios between teachers and parents, and correspondingly many topic types, the cost of data labeling increases greatly. Although topic clustering methods based on unsupervised learning require no training data, the automatic clustering effect is poor and a large number of meaningless topic categories are generated, which requires secondary manual intervention, so the recognition accuracy of topic categories is low.
Therefore, in view of the above problems, a training method for a topic recognition model is urgently needed to solve the above technical problems.
Summary
In view of this, embodiments of the present application provide a training method for a topic recognition model. The present application also relates to a training apparatus for a topic recognition model, a computing device, and a computer-readable storage medium, to solve the above problems existing in the prior art.
According to a first aspect of the embodiments of the present application, a training method for a topic recognition model is provided, including:
acquiring first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text;
training a reference topic recognition model set based on the first training text and the first topic category;
acquiring second initial training data, where the second initial training data includes a second training text;
inputting the second training text into the reference topic recognition model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category;
training a topic recognition model according to the first training data and the second training data.
According to a second aspect of the embodiments of the present application, a training apparatus for a topic recognition model is provided, including:
a first acquisition module, configured to acquire first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text;
a first training module, configured to train a reference topic recognition model set based on the first training text and the first topic category;
a second acquisition module, configured to acquire second initial training data, where the second initial training data includes a second training text;
a data prediction module, configured to input the second training text into the reference topic recognition model set, obtain a predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category;
a second training module, configured to train a topic recognition model according to the first training data and the second training data.
According to a third aspect of the embodiments of the present application, a topic identification method is provided, including:
receiving a text to be recognized;
determining a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized;
constructing an input text to be recognized based on context information of the target sentence to be recognized;
inputting the input text to be recognized into a topic recognition model, where the topic recognition model is trained by the above training method for a topic recognition model;
acquiring a topic category output by the topic recognition model.
According to a fourth aspect of the embodiments of the present application, a topic identification apparatus is provided, including:
a receiving module, configured to receive a text to be recognized;
a determination module, configured to determine a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized;
a construction module, configured to construct an input text to be recognized based on context information of the target sentence to be recognized;
an input module, configured to input the input text to be recognized into a topic recognition model, where the topic recognition model is trained by the above training method for a topic recognition model;
an acquisition module, configured to acquire a topic category output by the topic recognition model.
According to a fifth aspect of the embodiments of the present application, a computing device is provided, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, where when the processor executes the computer instructions, the steps of the training method for a topic recognition model or of the topic identification method are implemented.
According to a sixth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer instructions that, when executed by a processor, implement the steps of the training method for a topic recognition model or of the topic identification method.
The training method for a topic recognition model provided by the present application includes: acquiring first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text; training a reference topic recognition model set based on the first training text and the first topic category; acquiring second initial training data, where the second initial training data includes a second training text; inputting the second training text into the reference topic recognition model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and training a topic recognition model according to the first training data and the second training data.
An embodiment of the present application trains the topic recognition model with a small amount of labeled training data, which reduces the labeling cost while further improving the accuracy of topic recognition. During training of the topic recognition model, no secondary manual intervention is required, which reduces labor costs and improves the processing efficiency of the training process. Compared with deep neural network methods based on supervised learning, the labeling cost is greatly reduced; compared with topic clustering methods based on unsupervised learning, the accuracy of topic recognition is higher.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a scenario of a training method for a topic recognition model provided by an embodiment of the present application;
FIG. 2 is a flowchart of a training method for a topic recognition model provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of constructing a first training text provided by an embodiment of the present application;
FIG. 4 is a processing flowchart of a training method for a topic recognition model applied to a dialogue communication scenario provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a training apparatus for a topic recognition model provided by an embodiment of the present application;
FIG. 6 is a flowchart of a topic identification method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a topic identification apparatus provided by an embodiment of the present application;
FIG. 8 is a structural block diagram of a computing device provided by an embodiment of the present application.
Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the essence of the present application, so the present application is not limited by the specific implementations disclosed below.
The terms used in one or more embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the one or more embodiments of the present application. The singular forms "a", "the" and "said" used in one or more embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various kinds of information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
First, the terms involved in one or more embodiments of the present application are explained.
Dialogue topic identification: dialogue topic identification refers to using a model to predict the topic of each sentence in a call.
Semi-supervised learning: semi-supervised learning is a machine learning method that combines a small amount of labeled data with a large amount of unlabeled data during training to enhance the effect of the model.
BERT: BERT stands for Bidirectional Encoder Representations from Transformers and is a pre-trained language model for natural language processing. BERT is an autoencoding language model that obtains vector representations of text by encoding the text's bidirectional information.
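A minimal sketch of obtaining such vector representations with a BERT encoder; the checkpoint name bert-base-chinese is an illustrative assumption, as the application does not name a specific pre-trained model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")

text = "家长：啊，你好。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)
token_vectors = outputs.last_hidden_state  # (1, seq_len, hidden): one vector per character
```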
In practical applications, more and more scenarios require neural networks to process and solve problems, but there are various ways to train neural networks, each with advantages and disadvantages. Taking dialogue topic identification as an example, current implementations include topic clustering based on unsupervised learning and text classification based on supervised learning. Unsupervised topic clustering requires neither labeling nor predefined topic categories during training, but correspondingly suffers from low topic recognition accuracy and requires secondary manual intervention. Text classification based on supervised learning requires labeling a large amount of training data, which greatly increases the labeling cost.
The present application provides a training method for a topic recognition model, and also relates to a training apparatus for a topic recognition model, a topic identification method, a topic identification apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
FIG. 1 shows a schematic diagram of a scenario of a training method for a topic recognition model provided according to an embodiment of the present application. As shown in FIG. 1, the training method provided by the present application is divided into two stages. The first stage includes two processes, data preparation and model training: the received text to be recognized is annotated to obtain the topic category corresponding to the text to be recognized; the text to be recognized and its corresponding topic category constitute the first training data, and the reference topic recognition model set is trained based on the first training data.
The second stage likewise includes data preparation and model training: second initial training data is acquired and input into the above reference topic recognition model set to predict the corresponding predicted topic category; the second initial training data and the corresponding predicted topic category constitute the second training data. The first training data and the second training data are then used as sample training data for the topic recognition model, and the topic recognition model is trained to obtain a trained topic recognition model.
The training method for a topic recognition model provided by the present application trains the topic recognition model with a small amount of labeled data, which reduces the labeling cost while further improving the accuracy of topic recognition and the processing efficiency of the training process.
FIG. 2 shows a flowchart of a training method for a topic recognition model provided according to an embodiment of the present application, which specifically includes the following steps:
Step 202: Acquire first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text.
The first training text refers to the sample training text data used to train the reference topic recognition model set, and may specifically be dialogue text, dialogue speech, and the like. For example, in a telephone communication scenario between a teacher and a parent, their telephone communication speech may serve as the first training text; in a WeChat chat scenario, their WeChat chat content may serve as the first training text.
The first topic category refers to the actual topic category corresponding to the first training text. For example, if the first training text is "Teacher: Hey, hello, hello parent, I am the academic affairs teacher from Zebra, hello.", the first topic category corresponding to this first training text is "opening remarks". Correspondingly, the first training data consists of the first training text and its corresponding first topic category; that is, the first training data is training data whose topic category has been labeled.
Specifically, acquiring the first training data means acquiring the sample training text data used to train the reference topic recognition model set together with the actual topic category corresponding to that sample training text data.
In practical applications, not all of the acquired training text data can be used to train the reference topic recognition model set; some training text data may interfere with the training of the model set. Therefore, before the first training data is acquired, the acquired training text data needs to be screened in advance, and the training text data that meets the conditions is used as the first training text.
In a specific implementation provided by the present application, acquiring the first training data includes:
receiving a text to be recognized;
filtering the text to be recognized according to a filtering rule to obtain a target text to be recognized;
determining a target sub-text to be recognized in the target text to be recognized, and constructing a first training text based on context information of the target sub-text to be recognized;
acquiring, based on the first training text, a first topic category corresponding to the first training text.
The text to be recognized specifically refers to text obtained from the communication speech or communication messages produced between communicating parties. Specifically, receiving the text to be recognized means receiving the communication speech or communication messages produced between the communicating parties. A communicating party is a party that initiates communication in the text to be recognized; for example, in communication between a teacher and a parent, the teacher and the parent are the communicating parties.
A filtering rule is a rule for filtering out texts to be recognized that do not meet the filtering conditions. For example, the filtering rule may filter out calls that were never connected (busy tone, powered off or no signal) and dialogues whose content is too short. Correspondingly, the target text to be recognized is the text remaining after the text to be recognized is filtered based on the filtering rule. For example, if the text to be recognized includes sub-texts 1 to 5, and sub-texts 2 and 3 do not meet the filtering conditions, sub-texts 2 and 3 are filtered out, and the remaining sub-texts 1, 4 and 5 constitute the target text to be recognized.
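A minimal sketch of such a filtering rule, assuming each call record carries a connection flag and its transcribed sentences (the field names and the length threshold are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    connected: bool       # False for busy-tone, powered-off or no-signal calls
    sentences: list[str]  # transcribed dialogue sentences

MIN_SENTENCES = 5  # assumed threshold below which a dialogue counts as "too short"

def passes_filter(record: CallRecord) -> bool:
    """Keep only connected calls with enough dialogue to be worth labeling."""
    return record.connected and len(record.sentences) >= MIN_SENTENCES

def filter_records(records: list[CallRecord]) -> list[CallRecord]:
    return [r for r in records if passes_filter(r)]
```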
The target sub-text to be recognized can be understood as a sub-text, determined in the target text to be recognized, whose topic category needs to be determined; it consists of a target sentence to be recognized and the target object corresponding to that sentence. For example, in a dialogue between a teacher and a parent, if the topic category of "Teacher: Hello, because I am responsible for the baby's future studies." needs to be determined, then that line is the target sub-text to be recognized, in which "Hello, because I am responsible for the baby's future studies." is the target sentence to be recognized and "Teacher" is the target object. Correspondingly, the context information of the target sub-text can be understood as the preceding and following sub-texts adjacent to it. For example, if the target sub-text is "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the acquired adjacent preceding sub-text is "Parent: Ah, hello." and the adjacent following sub-text is "Teacher: Is our baby going to the first grade?".
Specifically, the text to be recognized is received and filtered according to the filtering rule; the filtered text is determined as the target text to be recognized; the target sub-text whose topic category needs to be determined is determined in the target text; the preceding and following sub-texts corresponding to the target sub-text are acquired; and the target sub-text is spliced with its preceding and following sub-texts to obtain the first training text. Finally, the actual topic category corresponding to the first training text is acquired based on the first training text.
Following the above example, the text to be recognized is received, sub-texts 2 and 3 are filtered out based on the filtering rule, and the target text to be recognized consisting of sub-texts 1, 4 and 5 is obtained. In the target text, "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is determined as the target sub-text, and its preceding sub-text "Parent: Ah, hello." and following sub-text "Teacher: Is our baby going to the first grade?" are acquired.
Further, taking the BERT model as an example for illustration, FIG. 3 is a schematic diagram of constructing a first training text provided by an embodiment of the present application. The target text to be recognized contains N target sub-texts, each consisting of a target sentence and its corresponding target object, denoted "r1:s1, r2:s2, ..., rN:sN", where s denotes the target sentences "s1, s2, ..., sN" and r denotes the corresponding target objects "r1, r2, ..., rN". The first training text obtained by splicing a target sub-text with its adjacent context sub-texts has the format "[CLS]r1:s1r2:s2…[SEP]ri:si[SEP]…rN:sN[SEP]". Following the above example, the spliced first training text is "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]", where the [CLS] symbol is a special classification embedding character marking the beginning of the sequence, inserted before the text and used in text classification tasks, and the [SEP] symbol can be understood as a field separator, serving as a segmentation mark and sentence end mark for splitting the text.
Finally, given that the actual topic category of the target sub-text "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is "free class - free class inquiry", that of the preceding sub-text "Parent: Ah, hello." is "opening remarks", and that of the following sub-text "Teacher: Is our baby going to the first grade?" is "grade confirmation", the topic category of the first training text, i.e., the first topic category, can be determined as "opening remarks, free class - free class inquiry, grade confirmation".
With the training method provided by the present application, the acquired text to be recognized is preprocessed as described above before the topic recognition model is trained, so that the first training text used for training contains no redundant data unfavorable to training, thereby improving training efficiency.
In a specific implementation provided by the present application, determining the target sub-text to be recognized in the target text to be recognized and constructing the first training text based on the context information of the target sub-text includes:
determining a target sentence to be recognized in the target text to be recognized;
acquiring context information of the target sentence to be recognized;
determining, based on the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, and determining, based on the context information, the target objects respectively corresponding to the context information;
constructing the first training text according to the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, the context information, and the target objects respectively corresponding to the context information.
The target sentence to be recognized refers to the text sentence contained in the target sub-text to be recognized. For example, if the target sub-text is "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the target sentence is "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.".
The context information of the target sentence consists of the preceding and following sentences adjacent to it. Following the above example, based on the target sentence "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the adjacent preceding sentence "Ah, hello." and the adjacent following sentence "Is our baby going to the first grade?" are acquired.
The target object specifically refers to the communicating party who utters the target sentence. For example, in a telephone communication scenario between a teacher and a parent, "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is said by the teacher to the parent, so the target object of this target sentence is the teacher. Correspondingly, the target objects corresponding to the context information of the target sentence can be determined in the same way.
Specifically, the target sentence to be recognized is determined in the target text to be recognized, its adjacent preceding and following sentences are acquired, the target object corresponding to each sentence is determined according to the party who utters it, and each sentence is spliced with its corresponding target object to construct the first training text.
Following the above example, "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is determined as the target sentence in the target text; its adjacent preceding sentence "Ah, hello." and adjacent following sentence "Is our baby going to the first grade?" are acquired; according to the party who utters each of the three sentences, the target object of the target sentence is determined to be the teacher, that of the preceding sentence to be the parent, and that of the following sentence to be the teacher. Splicing the sentences with their corresponding target objects yields "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]".
It should be noted that, in practical applications, the target sentence to be recognized is not necessarily a middle sentence of the target text; it may also be the first or last sentence. When the target sentence is the first sentence, only its following sentence needs to be acquired before the subsequent operations continue; when it is the last sentence, only its preceding sentence needs to be acquired. A sketch of this construction, covering both edge cases, is given below.
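A minimal sketch of this construction step, handling the first- and last-sentence edge cases just described (the list-based dialogue layout is an illustrative assumption):

```python
def build_input(roles: list[str], sentences: list[str], i: int) -> str:
    """Splice sentence i with its adjacent context into
    "[CLS]prev[SEP]target[SEP]next[SEP]" form; the preceding/following
    part is simply omitted at the beginning/end of the dialogue."""
    parts = []
    if i > 0:                                    # preceding sentence, if any
        parts.append(f"{roles[i - 1]}: {sentences[i - 1]}")
    parts.append(f"{roles[i]}: {sentences[i]}")  # the target sentence itself
    if i < len(sentences) - 1:                   # following sentence, if any
        parts.append(f"{roles[i + 1]}: {sentences[i + 1]}")
    return "[CLS]" + "[SEP]".join(parts) + "[SEP]"

roles = ["Parent", "Teacher", "Teacher"]
sentences = ["Ah, hello.",
             "It's like this, because we are old users of Zebra, and now we "
             "have a student aid action plan for old users.",
             "Is our baby going to the first grade?"]
print(build_input(roles, sentences, 1))  # the example first training text
```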
With the training method provided by the present application, not only is the topic category of the target sentence to be recognized acquired; the topic categories of the context of the target sentence are also further determined based on its context information, which improves the accuracy of identifying the target sentence to be recognized.
In a specific implementation provided by the present application, receiving the text to be recognized includes:
receiving speech to be recognized;
converting the speech to be recognized into the text to be recognized.
The speech to be recognized can be understood as an audio file generated during voice communication between communicating parties, such as a voice recording or a telephone recording; specifically, receiving the speech to be recognized means receiving the communication speech produced between communicating parties. For example, in telephone communication between a teacher and a parent, their telephone dialogue is the speech to be recognized.
After the speech to be recognized is received, a speech recognition model is used to convert it into the text to be recognized, after which the subsequent processing operations continue. Other methods may also be used to convert the speech to be recognized into the text to be recognized, which is not limited in the present application.
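A minimal sketch of this conversion step; the transcribe interface and its speaker-attributed output format are illustrative assumptions, since the application does not prescribe a particular speech recognition model:

```python
def transcribe(audio_path: str) -> list[dict]:
    """Placeholder for any ASR system that returns speaker-attributed
    sentences, e.g. [{"speaker": "Teacher", "text": "..."}, ...]."""
    raise NotImplementedError  # plug in the speech recognition model of your choice

def speech_to_dialogue(audio_path: str) -> tuple[list[str], list[str]]:
    """Turn the speech to be recognized into (roles, sentences) lists."""
    segments = transcribe(audio_path)
    roles = [seg["speaker"] for seg in segments]
    sentences = [seg["text"] for seg in segments]
    return roles, sentences
```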
In the implementation provided by the present application, not only can the text to be recognized between communicating parties be used as the content for subsequent processing; the speech to be recognized between communicating parties can also be received and converted into the text to be recognized for subsequent processing. This expands the applicable scope of the embodiments provided by the present application, removes the limitation to communication text, and improves the user experience.
The training method provided by the present application can use not only text files but also audio data as training data, which expands the scope of application. Before the training data is used to train the model, the acquired training data is preprocessed to remove data that does not meet the conditions, reducing the redundancy of the training data and thereby improving the accuracy of the subsequently trained topic recognition model.
Step 204: Train a reference topic recognition model set based on the first training text and the first topic category.
A reference topic recognition model is a topic recognition model trained based on the first training text and its corresponding first topic category; the reference topic recognition models are not used as the final topic recognition model.
In a specific implementation provided by the present application, training the reference topic recognition model set based on the first training text and the first topic category includes:
acquiring an initial reference topic recognition model;
setting different hyperparameters for the initial reference topic recognition model to obtain the reference topic recognition model set;
training each reference topic recognition model in the reference topic recognition model set based on the first training text and the first topic category.
The initial reference topic recognition model specifically refers to a topic recognition model that has not yet been trained. Specifically, an initial reference topic recognition model is acquired and multiple different hyperparameters are set for it for training; training with different hyperparameters yields different reference topic recognition models, and these constitute the reference topic recognition model set. The first training text and its corresponding first topic category are then acquired as described above and used to train each reference topic recognition model.
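A minimal sketch of building such a set by varying hyperparameters; the particular grid (learning rate, dropout, random seed), the checkpoint name and the number of topic categories are illustrative assumptions, and for brevity each member here is a single-label sequence classifier rather than the per-sentence head described later:

```python
import itertools
import torch
from transformers import AutoModelForSequenceClassification

NUM_TOPICS = 20  # assumed size of the topic category vocabulary

grid = itertools.product([2e-5, 3e-5], [0.1, 0.2], [0, 1, 2])  # lr, dropout, seed

reference_models = []
for lr, dropout, seed in grid:
    torch.manual_seed(seed)  # different seeds also give different head initializations
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-chinese",
        num_labels=NUM_TOPICS,
        hidden_dropout_prob=dropout,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    reference_models.append((model, optimizer))
```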
In the training method provided by the present application, an initial reference topic recognition model is acquired and different hyperparameters are set for it to obtain the reference topic recognition model set, which is trained with the first training text and the first topic category. Setting different hyperparameters yields multiple different reference topic recognition models, avoiding the error and chance that would result from training only a single reference topic recognition model.
In a specific implementation provided by the present application, training each reference topic recognition model in the reference topic recognition model set based on the first training text and the first topic category includes:
determining a target reference topic recognition model in the reference topic recognition model set;
inputting the first training text into the target reference topic recognition model to obtain a first predicted topic category output by the target reference topic recognition model;
calculating a reference loss value of the target reference topic recognition model according to the first topic category and the first predicted topic category;
adjusting the model parameters of the target reference topic recognition model according to the reference loss value, and continuing to train the target reference topic recognition model until a training stop condition is reached.
The target reference topic recognition model is the reference topic recognition model in the set that currently needs to be trained. The first training text is communication text acquired from a sample communication text set and is the training sample of the target reference topic recognition model; the sample communication text set is a set of communication texts obtained by collecting the text content of communication speech or communication text; the first topic category is the actual topic category corresponding to the first training text; the first predicted topic category is the topic category output when the first training text is input into the target reference topic recognition model; the reference loss value is the difference between the first topic category and the first predicted topic category and is used to measure that difference.
Specifically, the first training text is acquired as described above and input into the target reference topic recognition model, which is used to identify the topic category of the first training text. Since the target reference topic recognition model has not yet been trained, there will be deviations between the identified first predicted topic category and the actual first topic category, so the model parameters need to be adjusted accordingly. Specifically, the reference loss value of the target reference topic recognition model is calculated according to the output first predicted topic category and the first topic category; in practice the loss function may be a 0-1 loss, an absolute-value loss, a squared loss, a cross-entropy loss, and so on, and in the present application the cross-entropy function is preferably selected as the loss function for the reference loss value. The model parameters of the target reference topic recognition model are then adjusted according to the reference loss value, and the adjusted parameters are used to continue training on the next batch of first training texts until the training stop condition of the model is reached.
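A minimal sketch of one such gradient step with the preferred cross-entropy loss, reusing a model/optimizer pair from the sketch above (tokenization and batching are omitted; the batch layout is an illustrative assumption):

```python
import torch.nn.functional as F

def train_step(model, optimizer, input_ids, attention_mask, labels):
    """One parameter update on a batch of spliced training texts."""
    model.train()
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = F.cross_entropy(logits, labels)  # the loss function preferred here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```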
Specifically, the training stop condition includes the model reference loss value being less than a preset threshold and/or the number of training epochs reaching a preset number.
In one specific implementation provided by the present application, taking the model reference loss value being less than a preset threshold as the training stop condition, with a preset threshold of 0.3, the target reference topic recognition model is considered trained when the model reference loss value is less than 0.3.
In another specific implementation provided by the present application, taking a preset number of training epochs as the training stop condition, with 30 preset epochs, the target reference topic recognition model is considered trained when the first training text has been trained for 30 epochs.
In yet another specific implementation provided by the present application, both the preset threshold and the preset number of epochs are set as training stop conditions, and the reference loss value and the epoch count are monitored simultaneously; the target reference topic recognition model is considered trained when either the model reference loss value or the epoch count satisfies its stop condition.
Further, in an implementation provided by the present application, during the training of the target reference topic recognition model, supervised training is performed based on the first topic category corresponding to the first training text to obtain the vector representation of each character; these are mean-pooled to obtain the vector representation of each sentence, i.e., the representation of the target sentence to be recognized combined with its context, which is also the vector representation of the first training text. The sentence vectors are linearly transformed to obtain the probability of each category for the first training text, and the topic category with the highest output probability is taken as the topic category of the first training text, i.e., the first predicted topic category. The above training process is repeated to obtain the topic category output by each reference topic recognition model for the first training text, and voting is used: the topic category output by the most models is taken as the topic category of the first training text.
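A minimal sketch of the per-sentence mean pooling and linear transformation described here; the sentence mask marking which tokens belong to which sentence is assumed to be built elsewhere (e.g. from the positions of the [SEP] markers):

```python
import torch
import torch.nn as nn

class SentenceTopicHead(nn.Module):
    """Mean-pool each sentence's token vectors, then score the topic classes."""

    def __init__(self, hidden_size: int, num_topics: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_topics)

    def forward(self, token_vectors: torch.Tensor, sentence_mask: torch.Tensor):
        # token_vectors: (batch, seq_len, hidden), from the BERT encoder
        # sentence_mask: (batch, num_sent, seq_len), float 0/1 entries, 1 where
        #                a token belongs to the given sentence
        summed = torch.einsum("bnt,bth->bnh", sentence_mask, token_vectors)
        counts = sentence_mask.sum(dim=-1, keepdim=True).clamp(min=1)
        sentence_vectors = summed / counts        # mean pooling per sentence
        return self.classifier(sentence_vectors)  # (batch, num_sent, num_topics)
```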
Before the first topic category corresponding to the first training text is acquired, unprocessed communication texts are labeled through data annotation, so that each communication text corresponds to an annotated topic category. Specifically, the annotation process is as follows: first, trial annotation is performed on a small-scale data set; if the trial annotation is correct and reasonable, the annotated data is used as the test set and large-scale annotation is launched to label more training data. To improve annotation accuracy, each communication text is labeled by at least two annotators during annotation, and the consistency between different annotations is then calculated; if the annotation results cannot satisfy the annotation conditions, re-annotation is required.
In the training method provided by the present application, training the reference topic recognition model set with the first training text and its corresponding first topic category yields multiple different reference topic recognition models, which avoids chance results and improves the accuracy of results in the subsequent topic category prediction process.
Step 206: Acquire second initial training data, where the second initial training data includes a second training text.
The second training text refers to the sample training text data used to train the topic recognition model, and may specifically be dialogue text, dialogue speech, and the like. For example, in a telephone communication scenario between a teacher and a parent, their telephone communication speech may serve as the second training text; in a WeChat chat scenario, their WeChat chat content may serve as the second training text. It should be noted that the second training text is training text data that has not been labeled with a topic category; therefore, the second initial training data includes only the second training text.
The method of acquiring the second initial training data is the same as the method of acquiring the first training text described above; for the specific acquisition process, reference may be made to the above, which is not repeated here.
Step 208: Input the second training text into the reference topic recognition model set to obtain the predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category.
The predicted topic category specifically refers to the topic category output when the second training text is input into the reference topic recognition model set; the second training data specifically includes the second training text and its corresponding predicted topic category.
Specifically, the second training text is input into the reference topic recognition model set to obtain the predicted topic category output by the set for the second training text, and the second training data is obtained based on the second training text and its predicted topic category.
In a specific implementation provided by the present application, inputting the second training text into the reference topic recognition model set to obtain the predicted topic category corresponding to the second training text includes:
inputting the second training text into each reference topic recognition model in the reference topic recognition model set respectively;
obtaining, based on the output result of each reference topic recognition model, a candidate topic category set corresponding to the second training text;
determining, in the candidate topic category set, the predicted topic category corresponding to the second training text.
The candidate topic category set is the set of the topic categories output by each reference topic recognition model for the second training text. For example, if the reference topic recognition model set contains 10 reference topic recognition models, inputting the second training text into the set yields 10 topic categories corresponding to the second training text, and these 10 topic categories constitute the candidate topic category set. It should be noted that the 10 candidate topic categories may be 10 different topic categories or may contain repeated ones.
Specifically, the second training text is input into each reference topic recognition model respectively to obtain the topic category output by each model for the second training text; these outputs constitute the candidate topic category set, and one candidate topic category in the set is determined as the predicted topic category corresponding to the second training text.
Taking a reference topic recognition model set containing 10 reference topic recognition models as an example, inputting the second training text into the set means inputting it into each of the 10 models respectively; 10 output results, i.e., 10 candidate topic categories corresponding to the second training text, are thus obtained, and one of these 10 candidates is determined as the predicted topic category corresponding to the second training text.
In the training method provided by the present application, the unlabeled second training text is input into the reference topic recognition model set, and the predicted topic category corresponding to the second training text is determined among the results output by the set. Aggregating the results of multiple candidate topic categories to further determine the predicted topic category makes the prediction for the second training text more accurate.
In a specific implementation provided by the present application, the candidate topic category set includes at least one candidate topic category and count information of each candidate topic category;
correspondingly, determining, in the candidate topic category set, the predicted topic category corresponding to the second training text includes:
determining the candidate topic category with the largest count as the predicted topic category corresponding to the second training text.
The count information of each candidate topic category can be understood as the tally of each candidate topic category. For example, the candidate topic category set includes 10 candidate topic categories: 5 of category A, 3 of category B and 2 of category C, represented as {A-5, B-3, C-2}.
Specifically, after the candidate topic category set is obtained, the candidate topic category with the largest count is determined as the predicted topic category according to the candidate topic categories and their counts. Following the above example, among the 5 A's, 3 B's and 2 C's, category A has the largest count among the 10 candidates, so category A is determined as the predicted topic category corresponding to the second training text.
It should be noted that if several candidate topic categories tie for the largest count, one of them may be selected at random as the predicted topic category of the second training text. For example, for the candidate topic category set {A-4, B-4, C-1, D-1}, categories A and B tie for the largest count, so either category A or category B may be determined as the predicted topic category of the second training text.
In the training method provided by the present application, after each candidate topic category and its count information are determined, the candidate topic category with the largest count is determined as the predicted topic category of the second training text, which improves the accuracy of predicting the topic category corresponding to the second training text.
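A minimal sketch of this majority vote with random tie-breaking:

```python
import random
from collections import Counter

def vote(candidate_labels: list[str]) -> str:
    """Majority vote over the reference models' outputs; ties broken at random."""
    counts = Counter(candidate_labels)  # e.g. Counter({"A": 5, "B": 3, "C": 2})
    best = max(counts.values())
    tied = [label for label, count in counts.items() if count == best]
    return random.choice(tied)

print(vote(["A", "A", "B", "A", "C", "B", "A", "C", "B", "A"]))  # -> "A"
```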
Step 210: Train the topic recognition model according to the first training data and the second training data.
The topic recognition model here refers to the topic recognition model that needs to be trained; at this point it has not yet been trained. After the first training data and the second training data are acquired, the topic recognition model can be trained accordingly. The specific training method is as follows:
In a specific implementation provided by the present application, training the topic recognition model according to the first training data and the second training data includes:
determining the first training text and the second training text as sample training texts, and determining the first topic category and the predicted topic category as topic category labels;
inputting the sample training texts into the topic recognition model to obtain predicted category labels output by the topic recognition model;
calculating a loss value of the topic recognition model according to the topic category labels and the predicted category labels;
adjusting the model parameters of the topic recognition model according to the loss value, and continuing to train the topic recognition model until a training stop condition is reached.
The sample training texts include the first training text and the second training text and are the training samples of the topic recognition model; the topic category labels include the first topic category and the predicted topic category and are the actual topic categories corresponding to the first and second training texts; a predicted category label is the topic category output when a sample training text is input into the topic recognition model; the loss value is the difference between the topic category label and the predicted category label and is used to measure that difference.
Specifically, the acquired first training text and second training text are determined as sample training texts and input into the topic recognition model, which is used to identify the topic categories of the sample training texts. Since the topic recognition model has not yet been trained, there will be deviations between the identified predicted category labels and the actual topic category labels, so the model parameters need to be adjusted accordingly. Specifically, the loss value of the topic recognition model is calculated according to the output predicted category labels and the topic category labels; in practice the loss function may be a 0-1 loss, an absolute-value loss, a squared loss, a cross-entropy loss, and so on, and in the present application the cross-entropy function is preferably selected. The model parameters of the topic recognition model are adjusted according to the loss value, and the adjusted parameters are used to continue training on the next batch of sample training texts until the training stop condition of the model is reached.
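A minimal sketch of assembling the combined training set for this semi-supervised stage, reusing the train_step sketch above; the two dataset objects, assumed to yield tokenized examples with labels, are illustrative assumptions:

```python
from torch.utils.data import ConcatDataset, DataLoader

# first_dataset: human-labeled pairs (first training text, first topic category)
# second_dataset: pseudo-labeled pairs (second training text, voted predicted category)
combined = ConcatDataset([first_dataset, second_dataset])
loader = DataLoader(combined, batch_size=32, shuffle=True)

for epoch in range(30):  # assumed epoch budget; cf. the stop conditions below
    for batch in loader:
        loss = train_step(topic_model, optimizer,
                          batch["input_ids"], batch["attention_mask"],
                          batch["labels"])
```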
Specifically, the training stop condition includes the model loss value being less than a preset threshold and/or the number of training epochs reaching a preset number.
In one specific implementation provided by the present application, taking the model loss value being less than a preset threshold as the training stop condition, with a preset threshold of 0.3, the topic recognition model is considered trained when the model loss value is less than 0.3.
In another specific implementation provided by the present application, taking a preset number of training epochs as the training stop condition, with 30 preset epochs, the topic recognition model is considered trained when the sample training texts have been trained for 30 epochs.
In yet another specific implementation provided by the present application, both the preset threshold and the preset number of epochs are set as training stop conditions, and the loss value and the epoch count are monitored simultaneously; the topic recognition model is considered trained when either the model loss value or the epoch count satisfies its stop condition.
Further, in an implementation provided by the present application, during the training of the topic recognition model, supervised training is performed based on the topic category labels corresponding to the sample training texts to obtain the vector representation of each character; these are mean-pooled to obtain the vector representation of each sentence, i.e., the vector representation of the sample training text, and the sentence vectors are linearly transformed to obtain the probability of each category for the sample training text, with the topic category of highest output probability taken as the topic category of the sample training text, i.e., the topic category label.
The training method for a topic recognition model provided by the present application includes: acquiring first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text; training a reference topic recognition model set based on the first training text and the first topic category; acquiring second initial training data, where the second initial training data includes a second training text; inputting the second training text into the reference topic recognition model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and training the topic recognition model according to the first training data and the second training data.
An embodiment of the present application trains the topic recognition model with a small amount of labeled training data, which reduces the labeling cost while further improving the accuracy of topic recognition; during training, no secondary manual intervention is required, which reduces labor costs and improves the processing efficiency of the training process. Compared with deep neural network methods based on supervised learning, the labeling cost is greatly reduced; compared with topic clustering methods based on unsupervised learning, the accuracy of topic recognition is higher.
The training method for a topic recognition model is further described below with reference to FIG. 4, taking its application in a dialogue communication scenario as an example. FIG. 4 shows a processing flowchart of a training method for a topic recognition model applied to a dialogue communication scenario provided by an embodiment of the present application, which specifically includes the following steps:
Step 402: Receive a text to be recognized, and filter the text to be recognized according to a filtering rule to obtain a target text to be recognized.
Specifically, chat text generated in WeChat chats between parents and teachers is received, chat texts that do not meet the filtering conditions are filtered out according to the filtering rule, and the chat text that satisfies the filtering rule is taken as the target text to be recognized.
Step 404: Determine a target sub-text to be recognized in the target text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized.
Specifically, the sub-text whose topic category needs to be determined is determined in the target text as the target sub-text, for example "Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", in which "Teacher" is the target object and "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is the target sentence to be recognized.
Step 406: Acquire the context information of the target sentence to be recognized and the target objects corresponding to the context information.
Specifically, for the target sentence "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the acquired preceding information is "Parent: Ah, hello." and the following information is "Teacher: Is our baby going to the first grade?", where "Parent" is the target object corresponding to the preceding information and "Teacher" is the target object corresponding to the following information.
Step 408: Construct a first training text based on the target sentence to be recognized, its corresponding target object, the context information, and the target objects corresponding to the context information.
Specifically, the target sentence "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the preceding information "Parent: Ah, hello." and the following information "Teacher: Is our baby going to the first grade?" are spliced, and the spliced first training text is "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]".
Step 410: Acquire, based on the first training text, the first topic category corresponding to the first training text.
Specifically, the first topic category corresponding to the first training text has already been annotated, the specific annotation process being as described above, so the first topic category can be acquired directly from the first training text: "opening remarks, free class - free class inquiry, grade confirmation".
Step 412: Acquire an initial reference topic recognition model, and set different hyperparameters for the initial reference topic recognition model to obtain a reference topic recognition model set.
Specifically, an initial reference topic recognition model is acquired and multiple different hyperparameters are set for it for training; based on the different hyperparameters, multiple reference topic recognition models are obtained, and the set of these models is the reference topic recognition model set.
Step 414: Input the first training text into each reference topic recognition model in the reference topic recognition model set respectively, and train each reference topic recognition model based on the first topic category.
Step 416: Acquire a second training text, and input the second training text into each reference topic recognition model in the reference topic recognition model set respectively.
Specifically, the second training text is acquired by the same acquisition method as the first training text described above, and is input into each reference topic recognition model.
Step 418: Obtain, based on the output result of each reference topic recognition model, the candidate topic category set corresponding to the second training text.
Specifically, based on the output results of the reference topic recognition models, the candidate topic category set corresponding to the second training text is obtained as {A-5, B-3, C-2}.
Step 420: Determine the candidate topic category with the largest count in the candidate topic category set as the predicted topic category corresponding to the second training text.
Specifically, in the candidate topic category set {A-5, B-3, C-2}, category A is determined as the predicted topic category corresponding to the second training text.
Step 422: Determine the first training text and the second training text as sample training texts, and determine the first topic category and the predicted topic category as topic category labels.
Step 424: Input the sample training texts into the topic recognition model, and train the topic recognition model based on the topic category labels.
The training method for a topic recognition model provided by the present application trains the topic recognition model with a small amount of labeled training data, which reduces the labeling cost while further improving the accuracy of topic recognition; during training, no secondary manual intervention is required, which reduces labor costs and improves the processing efficiency of the training process. Compared with deep neural network methods based on supervised learning, the labeling cost is greatly reduced; compared with topic clustering methods based on unsupervised learning, the accuracy of topic recognition is higher.
Corresponding to the above method embodiment, the present application further provides an embodiment of a training apparatus for a topic recognition model. FIG. 5 shows a schematic structural diagram of a training apparatus for a topic recognition model provided by an embodiment of the present application. As shown in FIG. 5, the apparatus includes:
a first acquisition module 502, configured to acquire first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text;
a first training module 504, configured to train a reference topic recognition model set based on the first training text and the first topic category;
a second acquisition module 506, configured to acquire second initial training data, where the second initial training data includes a second training text;
a data prediction module 508, configured to input the second training text into the reference topic recognition model set, obtain a predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category;
a second training module 510, configured to train a topic recognition model according to the first training data and the second training data.
Optionally, the first training module 504 is further configured to:
acquire an initial reference topic recognition model;
set different hyperparameters for the initial reference topic recognition model to obtain the reference topic recognition model set;
train each reference topic recognition model in the reference topic recognition model set based on the first training text and the first topic category.
Optionally, the first training module 504 is further configured to:
determine a target reference topic recognition model in the reference topic recognition model set;
input the first training text into the target reference topic recognition model to obtain a first predicted topic category output by the target reference topic recognition model;
calculate a reference loss value of the target reference topic recognition model according to the first topic category and the first predicted topic category;
adjust the model parameters of the target reference topic recognition model according to the reference loss value, and continue training the target reference topic recognition model until a training stop condition is reached.
Optionally, the data prediction module 508 is further configured to:
input the second training text into each reference topic recognition model in the reference topic recognition model set respectively;
obtain, based on the output result of each reference topic recognition model, a candidate topic category set corresponding to the second training text;
determine, in the candidate topic category set, the predicted topic category corresponding to the second training text.
Optionally, the candidate topic category set includes at least one candidate topic category and count information of each candidate topic category;
correspondingly, the data prediction module 508 is further configured to:
determine the candidate topic category with the largest count as the predicted topic category corresponding to the second training text.
Optionally, the second training module 510 is further configured to:
determine the first training text and the second training text as sample training texts, and determine the first topic category and the predicted topic category as topic category labels;
input the sample training texts into the topic recognition model to obtain predicted category labels output by the topic recognition model;
calculate the loss value of the topic recognition model according to the topic category labels and the predicted category labels;
adjust the model parameters of the topic recognition model according to the loss value, and continue training the topic recognition model until a training stop condition is reached.
Optionally, the first acquisition module 502 is further configured to:
receive a text to be recognized;
filter the text to be recognized according to a filtering rule to obtain a target text to be recognized;
determine a target sub-text to be recognized in the target text to be recognized, and construct the first training text based on context information of the target sub-text to be recognized;
acquire, based on the first training text, the first topic category corresponding to the first training text.
Optionally, the first acquisition module 502 is further configured to:
determine a target sentence to be recognized in the target text to be recognized;
acquire context information of the target sentence to be recognized;
determine, based on the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, and determine, based on the context information, the target objects respectively corresponding to the context information;
construct the first training text according to the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, the context information, and the target objects respectively corresponding to the context information.
Optionally, the first acquisition module 502 is further configured to:
receive speech to be recognized;
convert the speech to be recognized into the text to be recognized.
The training apparatus for a topic recognition model provided by the present application includes: a first acquisition module configured to acquire first training data, where the first training data includes a first training text and a first topic category corresponding to the first training text; a first training module configured to train a reference topic recognition model set based on the first training text and the first topic category; a second acquisition module configured to acquire second initial training data, where the second initial training data includes a second training text; a data prediction module configured to input the second training text into the reference topic recognition model set, obtain a predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category; and a second training module configured to train the topic recognition model according to the first training data and the second training data.
An embodiment of the present application trains the topic recognition model with a small amount of labeled training data, which reduces the labeling cost while further improving the accuracy of topic recognition; during training, no secondary manual intervention is required, which reduces labor costs and improves the processing efficiency of the training process. Compared with deep neural network methods based on supervised learning, the labeling cost is greatly reduced; compared with topic clustering methods based on unsupervised learning, the accuracy of topic recognition is higher.
The above is a schematic description of a training apparatus for a topic recognition model of this embodiment. It should be noted that the technical solution of the training apparatus and the technical solution of the above training method belong to the same concept; for details not described in the apparatus solution, reference may be made to the description of the training method above.
FIG. 6 shows a flowchart of a topic identification method provided according to an embodiment of the present application, which specifically includes the following steps:
Step 602: Receive a text to be recognized.
The text to be recognized specifically refers to text obtained from the communication speech or communication messages produced between communicating parties; receiving the text to be recognized means receiving such communication speech or messages. A communicating party is a party that initiates communication in the text to be recognized; for example, in communication between a teacher and a parent, the teacher and the parent are the communicating parties.
Step 604: Determine a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized.
The target sub-text to be recognized can be understood as a sub-text in the text to be recognized whose topic category needs to be determined; it consists of the target sentence to be recognized and its corresponding target object. For example, in a dialogue between a teacher and a parent, if the topic category of "Teacher: Hello, because I am responsible for the baby's future studies." needs to be determined, that line is the target sub-text to be recognized, in which "Hello, because I am responsible for the baby's future studies." is the target sentence to be recognized and "Teacher" is the target object.
Specifically, a sub-text whose topic category needs to be determined is identified in the text to be recognized and taken as the target sub-text to be recognized, which includes the target sentence to be recognized and its corresponding target object.
Step 606: Construct an input text to be recognized based on the context information of the target sentence to be recognized.
The context information of the target sentence to be recognized can be understood as the preceding and following sentences adjacent to it. For example, if the target sentence is "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.", the acquired adjacent preceding sentence is "Ah, hello." and the adjacent following sentence is "Is our baby going to the first grade?".
Based on the acquired target sentence and its context sentences, the target object corresponding to each sentence can be further obtained: the target object of the target sentence is the teacher, that of the preceding sentence "Ah, hello." is the parent, and that of the following sentence "Is our baby going to the first grade?" is the teacher.
Further, the target sentence, its context sentences and the target object of each sentence are spliced to obtain the input text to be recognized: "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]".
Step 608: Input the input text to be recognized into the topic recognition model.
Specifically, the topic recognition model is obtained through the training method for a topic recognition model in the above embodiments. The input text constructed above, "[CLS]Parent: Ah, hello.[SEP]Teacher: It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users.[SEP]Teacher: Is our baby going to the first grade?[SEP]", is input into the topic recognition model.
Step 610: Acquire the topic category output by the topic recognition model.
The output result of the topic recognition model is "opening remarks, free class - free class inquiry, grade confirmation", which is taken as the topic category of the above input text. It can then be determined that the topic category of the target sentence "It's like this, because we are old users of Zebra, and now we have a student aid action plan for old users." is "free class - free class inquiry", that of the preceding sentence "Ah, hello." is "opening remarks", and that of the following sentence "Is our baby going to the first grade?" is "grade confirmation".
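A minimal end-to-end sketch of this inference step, reusing the build_input sketch from the description above together with an assumed trained classifier and tokenizer; the three-entry label vocabulary, and scoring the spliced window with a single-label classifier rather than the per-sentence head described earlier, are illustrative simplifications:

```python
import torch

LABELS = ["opening remarks", "free class - free class inquiry", "grade confirmation"]

def predict_topic(model, tokenizer, roles, sentences, i):
    """Predict the topic label for sentence i given its spliced context."""
    text = build_input(roles, sentences, i)
    # build_input already writes [CLS]/[SEP], so skip the tokenizer's own specials
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

for i in range(len(sentences)):
    print(sentences[i], "->", predict_topic(topic_model, tokenizer, roles, sentences, i))
```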
The topic identification method provided by the present application includes: receiving a text to be recognized; determining a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized; constructing an input text to be recognized based on context information of the target sentence to be recognized; inputting the input text to be recognized into a topic recognition model, where the topic recognition model is trained by the above training method for a topic recognition model; and acquiring a topic category output by the topic recognition model.
An embodiment provided by the present application obtains the topic category of the target sentence to be recognized by combining the context information of the target sentence, improving the accuracy of topic identification.
Corresponding to the above method embodiment, the present application further provides an embodiment of a topic identification apparatus. FIG. 7 shows a schematic structural diagram of a topic identification apparatus provided by an embodiment of the present application. As shown in FIG. 7, the apparatus includes:
a receiving module 702, configured to receive a text to be recognized;
a determination module 704, configured to determine a target sub-text to be recognized in the text to be recognized, where the target sub-text to be recognized includes a target sentence to be recognized and a target object corresponding to the target sentence to be recognized;
a construction module 706, configured to construct an input text to be recognized based on context information of the target sentence to be recognized;
an input module 708, configured to input the input text to be recognized into a topic recognition model;
an acquisition module 710, configured to acquire a topic category output by the topic recognition model.
The topic identification apparatus provided by the present application includes: a receiving module configured to receive a text to be recognized; a determination module configured to determine a target sub-text to be recognized in the text to be recognized, where the target sub-text includes a target sentence to be recognized and a target object corresponding to the target sentence; a construction module configured to construct an input text to be recognized based on the context information of the target sentence; an input module configured to input the input text into a topic recognition model, where the topic recognition model is trained by the above training method for a topic recognition model; and an acquisition module configured to acquire the topic category output by the topic recognition model.
An embodiment provided by the present application obtains the topic category of the target sentence to be recognized by combining the context information of the target sentence, improving the accuracy of topic identification.
The above is a schematic description of a topic identification apparatus of this embodiment. It should be noted that the technical solution of the topic identification apparatus and the technical solution of the above topic identification method belong to the same concept; for details not described in the apparatus solution, reference may be made to the description of the topic identification method above.
FIG. 8 shows a structural block diagram of a computing device 800 provided according to an embodiment of the present application. The components of the computing device 800 include but are not limited to a memory 810 and a processor 820. The processor 820 is connected to the memory 810 via a bus 830, and a database 850 is used to store data.
The computing device 800 also includes an access device 840 that enables the computing device 800 to communicate via one or more networks 860. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and the like.
In an embodiment of the present application, the above components of the computing device 800 and other components not shown in FIG. 8 may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is for illustrative purposes only and is not intended to limit the scope of the present application; those skilled in the art may add or replace components as needed.
The computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smart watch, smart glasses, etc.) or another type of mobile device, or a stationary computing device such as a desktop computer or a PC. The computing device 800 may also be a mobile or stationary server.
When the processor 820 executes the computer instructions, the steps of the training method for the topic recognition model or of the topic identification method are implemented.
The above is a schematic description of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above training method for the topic recognition model or of the topic identification method belong to the same concept; for details not described in the computing device solution, reference may be made to the description of the above method solutions.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the training method for the topic recognition model or of the topic identification method as described above.
The above is a schematic description of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above training method for the topic recognition model or of the topic identification method belong to the same concept; for details not described in the storage medium solution, reference may be made to the description of the above method solutions.
Specific embodiments of the present application have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are only intended to help illustrate the present application. The optional embodiments do not describe all details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and changes can be made according to the content of the present application. These embodiments are selected and specifically described in the present application in order to better explain the principles and practical applications of the present application, so that those skilled in the art can well understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (14)

  1. A training method for a topic recognition model, comprising:
    acquiring first training data, wherein the first training data comprises a first training text and a first topic category corresponding to the first training text;
    training a reference topic recognition model set based on the first training text and the first topic category;
    acquiring second initial training data, wherein the second initial training data comprises a second training text;
    inputting the second training text into the reference topic recognition model set to obtain a predicted topic category corresponding to the second training text, and obtaining second training data based on the second training text and the predicted topic category; and
    training a topic recognition model according to the first training data and the second training data.
  2. The method according to claim 1, wherein training the reference topic recognition model set based on the first training text and the first topic category comprises:
    acquiring an initial reference topic recognition model;
    setting different hyperparameters for the initial reference topic recognition model to obtain the reference topic recognition model set; and
    training each reference topic recognition model in the reference topic recognition model set based on the first training text and the first topic category.
  3. The method according to claim 2, wherein training each reference topic recognition model in the reference topic recognition model set based on the first training text and the first topic category comprises:
    determining a target reference topic recognition model in the reference topic recognition model set;
    inputting the first training text into the target reference topic recognition model to obtain a first predicted topic category output by the target reference topic recognition model;
    calculating a reference loss value of the target reference topic recognition model according to the first topic category and the first predicted topic category; and
    adjusting model parameters of the target reference topic recognition model according to the reference loss value, and continuing to train the target reference topic recognition model until a training stop condition is reached.
  4. The method according to claim 1, wherein inputting the second training text into the reference topic recognition model set to obtain the predicted topic category corresponding to the second training text comprises:
    inputting the second training text into each reference topic recognition model in the reference topic recognition model set respectively;
    obtaining, based on an output result of each reference topic recognition model, a candidate topic category set corresponding to the second training text; and
    determining, in the candidate topic category set, the predicted topic category corresponding to the second training text.
  5. The method according to claim 4, wherein the candidate topic category set comprises at least one candidate topic category and count information of each candidate topic category; and
    correspondingly, determining, in the candidate topic category set, the predicted topic category corresponding to the second training text comprises:
    determining the candidate topic category with the largest count as the predicted topic category corresponding to the second training text.
  6. The method according to claim 1, wherein training the topic recognition model according to the first training data and the second training data comprises:
    determining the first training text and the second training text as sample training texts, and determining the first topic category and the predicted topic category as topic category labels;
    inputting the sample training texts into the topic recognition model to obtain predicted category labels output by the topic recognition model;
    calculating a loss value of the topic recognition model according to the topic category labels and the predicted category labels; and
    adjusting model parameters of the topic recognition model according to the loss value, and continuing to train the topic recognition model until a training stop condition is reached.
  7. The method according to claim 1, wherein acquiring the first training data comprises:
    receiving a text to be recognized;
    filtering the text to be recognized according to a filtering rule to obtain a target text to be recognized;
    determining a target sub-text to be recognized in the target text to be recognized, and constructing the first training text based on context information of the target sub-text to be recognized; and
    acquiring, based on the first training text, the first topic category corresponding to the first training text.
  8. The method according to claim 7, wherein determining the target sub-text to be recognized in the target text to be recognized and constructing the first training text based on the context information of the target sub-text to be recognized comprises:
    determining a target sentence to be recognized in the target text to be recognized;
    acquiring context information of the target sentence to be recognized;
    determining, based on the target sentence to be recognized, a target object corresponding to the target sentence to be recognized, and determining, based on the context information, target objects respectively corresponding to the context information; and
    constructing the first training text according to the target sentence to be recognized, the target object corresponding to the target sentence to be recognized, the context information, and the target objects respectively corresponding to the context information.
  9. The method according to claim 7, wherein receiving the text to be recognized comprises:
    receiving speech to be recognized; and
    converting the speech to be recognized into the text to be recognized.
  10. A topic identification method, comprising:
    receiving a text to be recognized;
    determining a target sub-text to be recognized in the text to be recognized, wherein the target sub-text to be recognized comprises a target sentence to be recognized and a target object corresponding to the target sentence to be recognized;
    constructing an input text to be recognized based on context information of the target sentence to be recognized;
    inputting the input text to be recognized into a topic recognition model, wherein the topic recognition model is trained by the training method according to any one of claims 1-9; and
    acquiring a topic category output by the topic recognition model.
  11. A training apparatus for a topic recognition model, comprising:
    a first acquisition module, configured to acquire first training data, wherein the first training data comprises a first training text and a first topic category corresponding to the first training text;
    a first training module, configured to train a reference topic recognition model set based on the first training text and the first topic category;
    a second acquisition module, configured to acquire second initial training data, wherein the second initial training data comprises a second training text;
    a data prediction module, configured to input the second training text into the reference topic recognition model set, obtain a predicted topic category corresponding to the second training text, and obtain second training data based on the second training text and the predicted topic category; and
    a second training module, configured to train a topic recognition model according to the first training data and the second training data.
  12. A topic identification apparatus, comprising:
    a receiving module, configured to receive a text to be recognized;
    a determination module, configured to determine a target sub-text to be recognized in the text to be recognized, wherein the target sub-text to be recognized comprises a target sentence to be recognized and a target object corresponding to the target sentence to be recognized;
    a construction module, configured to construct an input text to be recognized based on context information of the target sentence to be recognized;
    an input module, configured to input the input text to be recognized into a topic recognition model, wherein the topic recognition model is trained by the training method according to any one of claims 1-9; and
    an acquisition module, configured to acquire a topic category output by the topic recognition model.
  13. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein when the processor executes the computer instructions, the steps of the method according to any one of claims 1-9 or claim 10 are implemented.
  14. A computer-readable storage medium storing computer instructions, wherein when the computer instructions are executed by a processor, the steps of the method according to any one of claims 1-9 or claim 10 are implemented.
PCT/CN2023/130802 2022-11-30 2023-11-09 Training method and apparatus for a topic recognition model WO2024114335A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211519609.0A CN115858783A (zh) 2022-11-30 2022-11-30 Training method and apparatus for a topic recognition model
CN202211519609.0 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024114335A1 true WO2024114335A1 (zh) 2024-06-06

Family

ID=85668239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/130802 WO2024114335A1 (zh) 2022-11-30 2023-11-09 主题识别模型的训练方法及装置

Country Status (2)

Country Link
CN (1) CN115858783A (zh)
WO (1) WO2024114335A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858783A (zh) 2022-11-30 2023-03-28 北京猿力教育科技有限公司 Training method and apparatus for a topic recognition model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966712A (zh) * 2021-02-01 2021-06-15 北京三快在线科技有限公司 Language model training method and apparatus, electronic device, and computer-readable medium
WO2022227214A1 (zh) * 2021-04-29 2022-11-03 平安科技（深圳）有限公司 Classification model training method and apparatus, terminal device, and storage medium
CN114358313A (zh) * 2022-01-04 2022-04-15 上海哔哩哔哩科技有限公司 Data processing method and apparatus
CN115858783A (zh) * 2022-11-30 2023-03-28 北京猿力教育科技有限公司 Training method and apparatus for a topic recognition model

Also Published As

Publication number Publication date
CN115858783A (zh) 2023-03-28
