CN112395414A - Text classification method and training method, device, medium and equipment of classification model


Info

Publication number
CN112395414A
CN112395414A (application CN201910759761.8A; granted as CN112395414B)
Authority
CN
China
Prior art keywords
text
data set
processed
samples
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910759761.8A
Other languages
Chinese (zh)
Other versions
CN112395414B (en)
Inventor
马腾岳
周蕾蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201910759761.8A
Publication of CN112395414A
Application granted
Publication of CN112395414B
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure disclose a text classification method and a training method, apparatus, medium, and device for a classification model. The text classification method comprises the following steps: acquiring a text to be processed; performing word slot category labeling on the text to be processed according to predefined word slot categories; and performing domain classification on the text to be processed according to the result of the word slot category labeling, to obtain the domain category of the text to be processed. The method and the device can achieve accurate domain classification of sentences, thereby improving the accuracy of domain classification.

Description

Text classification method and training method, device, medium and equipment of classification model
Technical Field
The present disclosure relates to speech technology, and in particular, to a method, an apparatus, a medium, and a device for text classification and training of classification models.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), is a technology that converts human speech into a computer-readable input form. In the process of speech recognition, after human speech is converted into text, the text needs to be semantically understood so that it can be converted into a computer-readable input form.
Short text classification is a key step of this semantic understanding. Short text classification refers to determining the domain category to which a sentence in the text belongs; for example, "play a children's song" belongs to the "music" domain, and "weather today" belongs to the "weather" domain.
Disclosure of Invention
In order to solve at least one technical problem in the prior art, embodiments of the present disclosure provide a technical solution for text classification and a technical solution for training a classification model.
According to an aspect of an embodiment of the present disclosure, there is provided a text classification method including:
acquiring a text to be processed;
according to the pre-defined word slot type, carrying out word slot type labeling on the text to be processed;
and performing domain classification on the text to be processed according to the result of the word slot category labeling, to obtain the domain category of the text to be processed.
According to another aspect of the embodiments of the present disclosure, there is provided a method for training a classification model, including:
acquiring a first data set, wherein samples in the first data set are marked with domain category information;
carrying out word slot class labeling on the samples in the first data set according to a predefined word slot class;
and training a domain classification model by using the first data set according to the result of the word slot category marking.
According to still another aspect of an embodiment of the present disclosure, there is provided a text classification apparatus including:
the first acquisition module is used for acquiring a text to be processed;
the marking module is used for marking the word slot types of the text to be processed acquired by the first acquisition module according to the predefined word slot types;
and the classification module is used for performing domain classification on the text to be processed according to the word slot category labeling result obtained by the labeling module, to obtain the domain category of the text to be processed.
According to still another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a classification model, including:
the second acquisition module is used for acquiring a first data set, and samples in the first data set are marked with domain category information;
the labeling module is used for labeling the word slot types of the samples in the first data set acquired by the second acquisition module according to the predefined word slot types;
and the first training module is used for training a domain classification model by utilizing the first data set according to the word slot category labeling result obtained by the labeling module.
According to a further aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the method of any of the above embodiments.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any of the embodiments.
Based on the text classification method and apparatus, the computer-readable storage medium, and the electronic device provided by the embodiments of the present disclosure, word slot category labeling is performed on the text to be processed according to predefined word slot categories, and domain classification is performed on the text to be processed according to the result of the labeling. Because the domain classification of the text to be processed does not need to consider specific words, but is completed according to the word slot categories labeled in the text, sentences can be accurately classified by domain, and the accuracy of domain classification is improved.
Based on the training method and apparatus for the classification model, the computer-readable storage medium, and the electronic device provided by the embodiments of the present disclosure, word slot category labeling is performed according to predefined word slot categories on the samples in a first data set, where the samples are labeled with domain category information, and the first data set is then used to train a domain classification model according to the result of the labeling. When the domain classification model trained in this way performs domain classification on a text to be processed, the classification is completed according to the word slot categories labeled in the text without considering specific words; even sentences containing words that do not appear in the training samples can still be accurately classified by domain according to their word slot categories, so the accuracy of domain classification is improved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a scene diagram to which the present disclosure is applicable.
Fig. 2 is a flowchart illustrating a text classification method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a text classification method according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a text classification method according to another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a training method of a classification model according to an exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a training method of a classification model according to another exemplary embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a training method of a classification model according to another exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of a text classification apparatus according to an exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a text classification apparatus according to another exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a training apparatus for a classification model according to an exemplary embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of a training apparatus for a classification model according to another exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure describes only an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In the course of implementing the invention, the inventors found through research that existing domain classification methods extract features directly from the original sentences in the text and then classify the short text through a domain classification model. As a result, when the training samples are small samples (that is, the number of training samples is small and the covered classes are not comprehensive), domain classification is prone to errors for sentences containing words that do not appear in the training samples, which affects the accuracy of domain classification.
For example, assuming that "Ice Rain of Liu De Hua" exists in the training samples, the above prior-art classification method can correctly classify the text "Ice Rain of Liu De Hua"; but for the text "Blue and White Porcelain of Zhou Jie Lun", which does not exist in the training samples, the above prior-art classification method is prone to errors.
In the embodiments of the present disclosure, when domain classification is performed on a text, it is completed according to the word slot categories labeled in the text, without considering specific words. Therefore, even for sentences containing words that do not appear in the training samples, accurate domain classification can still be performed according to their word slot categories, which improves the accuracy of domain classification.
Exemplary System
The embodiments of the present disclosure can be applied to scenarios of voice interaction with robots, children's toys, smart speakers, and the like, and can also be applied to scenarios such as search. Fig. 1 is a diagram of a scenario to which the present disclosure is applicable. As shown in fig. 1, when the embodiments of the present disclosure are applied to a voice interaction scenario, an audio acquisition module (e.g., a microphone) acquires an original audio signal, the front-end signal processing module processes it, and voice recognition is performed on the processed voice to obtain text information; semantic understanding and domain classification are then performed on the text information, a search is made in the information base of the corresponding domain based on the domain classification result, and the search result is output. For example, for the user's voice "Blue and White Porcelain of Zhou Jie Lun", the text may be classified into the music domain based on the embodiments of the present disclosure, and "Blue and White Porcelain of Zhou Jie Lun" is searched from the music database and returned.
In addition, when the embodiments of the present disclosure are applied to a search scenario, the user may input text information, for example "Quiet Night Thoughts by Li Bai". The server performs semantic understanding and domain classification on the text information, searches the information base of the corresponding category based on the classification result, and outputs the search result; for example, "Quiet Night Thoughts by Li Bai" is classified into the poetry domain, and the server searches the poetry database with this keyword and returns the poem to the user.
Exemplary method
Fig. 2 is a flowchart illustrating text classification according to an exemplary embodiment of the disclosure. The present embodiment can be applied to an electronic device, and as shown in fig. 2, the text classification method of the present embodiment includes the following steps:
step 101, obtaining a text to be processed.
The text to be processed may be text input by the user, such as "I want to listen to a song by Zhou Jie Lun"; alternatively, it may be text information obtained by performing voice recognition on voice input by the user. The voice input by the user may be an original audio signal acquired by an audio acquisition module (e.g., a microphone), or may be the voice of the original audio signal after processing by a front-end signal processing module.
The processing of the audio signal by the front-end signal processing module may include, but is not limited to: voice Activity Detection (VAD), noise reduction, Acoustic Echo Cancellation (AEC), dereverberation, sound source localization, Beam Forming (BF), etc.
Voice Activity Detection (VAD), also called voice endpoint detection or voice boundary detection, refers to detecting the presence of voice in an audio signal in a noisy environment and accurately locating the start position of the voice segment. It is generally used in voice processing systems such as voice coding and voice enhancement, where it reduces the voice coding rate, saves communication bandwidth, reduces the energy consumption of mobile equipment, and improves the recognition rate.
Step 102, performing word slot category labeling on the text to be processed according to predefined word slot (slot) categories.
Step 103, performing domain classification on the text to be processed according to the result of the word slot category labeling, to obtain the domain category of the text to be processed.
Based on the text classification method provided by the embodiments of the present disclosure, word slot category labeling is performed on the text to be processed according to predefined word slot categories, and domain classification is performed on the text to be processed according to the result of the labeling. Because the domain classification of the text to be processed does not need to consider specific words, but is completed according to the word slot categories labeled in the text, sentences can be accurately classified by domain, and the accuracy of domain classification is improved.
In the embodiments of the present disclosure, word slots may be predefined over the whole set of domain categories. For example, Table 1 below shows word slots defined in embodiments of the present disclosure:
TABLE 1
Slot category | Meaning     | Examples
artist        | Person name | Zhou Jie Lun, Liu De Hua, ...
title         | Work name   | Blue and White Porcelain, Ice Rain, ...
poi           | Position    | Zhongguancun, Xizhuangmen, ...
time          | Time        | today, tomorrow, Tuesday, ...
location      | Place       | Beijing, Nanjing, ...
...           | ...         | ...
In step 102 of the embodiment shown in fig. 2, according to the predefined word slot categories, the word slot category labeling is performed on the text to be processed, which may be, for example:
for the text "ice rain of Liu De Hua" to be processed, the word groove class labeling is performed based on the step 102, and the following results are obtained: [ Liudebua: artist ] [ ice rain: title ];
for the text to be processed, "navigate to the middle guancun", the word-groove class labeling is performed based on step 102, and the following results are obtained: navigate to [ central guancun: poi ];
for the text to be processed, "weather of today", labeling the word slot category based on step 102, to obtain: [ today: time ] weather.
In some embodiments, in step 102, the text to be processed may be input into a sequence labeling model, and the word slot category of the text to be processed is labeled through the sequence labeling model.
In some alternative examples, the sequence labeling model may be implemented by a Hidden Markov Model (HMM), a Maximum Entropy model (MaxEnt), a Conditional Random Field (CRF), or a neural network such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN); the embodiments of the present disclosure do not limit the implementation of the sequence labeling model.
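For illustration only, a minimal stand-in for the sequence labeling step might look like the sketch below. The gazetteer-style lookup, the lexicon contents, and the function name are assumptions made for demonstration, not the disclosure's method; a real system would use one of the taggers listed above.

```python
# Illustrative stand-in for the sequence labeling model: a hypothetical
# gazetteer lookup that produces the same style of output as step 102.
# A production system would use an HMM, MaxEnt, CRF, or neural tagger.

SLOT_LEXICON = {  # assumed word slot entries, cf. Table 1
    "Liu De Hua": "artist",
    "Zhou Jie Lun": "artist",
    "Ice Rain": "title",
    "Blue and White Porcelain": "title",
    "Zhongguancun": "poi",
    "today": "time",
}

def label_word_slots(text: str) -> str:
    """Wrap each known entity as '[entity: slot]'."""
    # Match longer entries first so multi-word names win over substrings.
    for entity in sorted(SLOT_LEXICON, key=len, reverse=True):
        if entity in text:
            text = text.replace(entity, f"[{entity}: {SLOT_LEXICON[entity]}]")
    return text

print(label_word_slots("Ice Rain of Liu De Hua"))
# [Ice Rain: title] of [Liu De Hua: artist]
print(label_word_slots("navigate to Zhongguancun"))
# navigate to [Zhongguancun: poi]
```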
In this embodiment, word slot category labeling is performed on the text to be processed through a pre-trained sequence labeling model. When word slot category labeling is performed through an HMM, MaxEnt, CRF, or the like, the model outputs a sequence whose elements have contextual associations; by exploiting these associations, the sequence labeling model can achieve higher performance than traditional classification methods when labeling the text to be processed as an input sequence, which improves the accuracy and efficiency of word slot labeling and thus the efficiency of the whole text classification.
Fig. 3 is a flowchart illustrating a text classification method according to another exemplary embodiment of the present disclosure. As shown in fig. 3, based on the embodiment shown in fig. 2, step 103 may include the following steps:
and step 1031, determining sentence patterns corresponding to the texts to be processed according to the result of the word slot class marking.
In some embodiments, the result of the word slot category labeling obtained in step 102 may be used directly as the sentence pattern corresponding to the text to be processed. For example, the labeling results "[Liu De Hua: artist] [Ice Rain: title]", "navigate to [Zhongguancun: poi]", and "[today: time] weather" may be used directly as the sentence patterns corresponding to the respective texts to be processed.
In other embodiments, for the result of the word slot category labeling obtained in step 102, each labeled word slot category replaces the corresponding word in the text to be processed, yielding the sentence pattern. For example, for the labeling results "[Liu De Hua: artist] [Ice Rain: title]", "navigate to [Zhongguancun: poi]", and "[today: time] weather", replacing the corresponding words with the labeled word slot categories gives "[title] of [artist]", "navigate to [poi]", and "weather of [time]" as the sentence patterns corresponding to the respective texts to be processed.
The embodiment of the present disclosure does not limit the form of the sentence pattern corresponding to the text to be processed, as long as the labeled word slot category can be embodied.
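As a sketch of the second variant above (replacing labeled words with their word slot categories), assuming the bracketed "[entity: slot]" annotation format shown in the earlier examples; the function name is illustrative:

```python
import re

def to_sentence_pattern(labeled_text: str) -> str:
    """Collapse every '[entity: slot]' annotation to a bare '[slot]' tag."""
    return re.sub(r"\[[^:\]]+:\s*([^\]]+)\]", r"[\1]", labeled_text)

print(to_sentence_pattern("[Ice Rain: title] of [Liu De Hua: artist]"))
# [title] of [artist]
print(to_sentence_pattern("navigate to [Zhongguancun: poi]"))
# navigate to [poi]
```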
And 1032, determining the field type of the text to be processed based on the sentence pattern corresponding to the text to be processed.
In this embodiment, the sentence pattern corresponding to the text to be processed is determined according to the result of the word slot category labeling. Since the sentence pattern embodies both the labeling result and the structural relationship of the whole text, determining the domain category based on the sentence pattern yields a more accurate domain category, improving the accuracy and efficiency of the whole text classification.
Fig. 4 is a flowchart illustrating text classification according to still another exemplary embodiment of the present disclosure. As shown in fig. 4, based on the embodiment shown in fig. 3, step 1032 may include the following steps:
step 10321, extracting features in the sentence pattern corresponding to the text to be processed to obtain text features of the text to be processed.
The text features in the embodiments of the present disclosure may be represented as a feature vector or a feature map; the embodiments of the present disclosure do not limit the representation of text features.
Step 10322, based on the text features of the text to be processed, performing domain classification on the text to be processed to obtain a domain category of the text to be processed.
For example, based on text features such as "[title] of [artist]", domain classification is performed on the text to be processed, yielding a score for each domain, for example, music domain category: 0.95; navigation domain category: 0.10; weather domain category: 0.10; …; and the domain category with the highest score is selected as the domain category of the text to be processed.
In some embodiments, in step 10321, the sentence pattern corresponding to the text to be processed may be segmented using a fixed-length sliding window based on an n-gram model, and the features are then extracted, where n is an integer greater than 0, for example 2, 3, or 4. Each labeled word slot category counts as one word. For example, for the sentence pattern "navigate to [poi]", "navigate to [poi]" is an n-gram with n = 3 and "navigate to" is an n-gram with n = 2.
For example, using an n-gram model with n = 2 to 4 to segment the sentence patterns and extract features: for the sentence pattern "[title] of [artist]", possible text features include "[title] of", "of [artist]", and "[title] of [artist]"; for the sentence pattern "navigate to [poi]", possible text features include "navigate to", "to [poi]", and "navigate to [poi]"; for the sentence pattern "weather of [time]", possible text features include "weather of", "of [time]", and "weather of [time]".
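A minimal sketch of this n-gram feature extraction, assuming word-level tokens and treating each "[slot]" tag as a single word as described above (the function names are illustrative):

```python
import re

def tokenize(pattern: str) -> list:
    """Split a sentence pattern into tokens; each '[slot]' tag is one word."""
    return re.findall(r"\[[^\]]+\]|\w+", pattern)

def ngram_features(pattern: str, n_min: int = 2, n_max: int = 4) -> list:
    """Slide windows of length n_min..n_max over the token sequence."""
    tokens = tokenize(pattern)
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

print(ngram_features("[title] of [artist]"))
# ['[title] of', 'of [artist]', '[title] of [artist]']
print(ngram_features("navigate to [poi]"))
# ['navigate to', 'to [poi]', 'navigate to [poi]']
```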
In addition, when training samples are scarce, a word such as "Liu De Hua" often appears in the music domain category. If the existing domain classification method is adopted, in which features are extracted directly from the original sentence and the short text is then classified by a domain classification model, a text that does not belong to the music domain, such as "how tall is Liu De Hua", would also be classified into the music category, producing a serious overfitting phenomenon in short-text domain classification. In the embodiments of the present disclosure, the text to be processed is abstracted into a sentence pattern, becoming "how tall is [artist]"; the text features are related only to the sentence pattern and are unrelated to the concrete words "Liu De Hua", so the text can be classified correctly and the overfitting phenomenon of short-text domain classification is reduced.
In some embodiments, in step 10322, the text features of the text to be processed may be input into the domain classification model, and the text to be processed is subjected to domain classification by the domain classification model, so as to obtain the domain category of the text to be processed.
In some optional examples, the domain classification model may be implemented by a Support Vector Machine (SVM), a Maximum Entropy model (MaxEnt), a neural network, and the like, where the neural network may be, for example, a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN); the embodiments of the present disclosure do not limit the implementation of the domain classification model.
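As one hedged illustration of the SVM option, the sketch below uses scikit-learn's LinearSVC; the library choice, the tiny training set, and the wiring to the n-gram features are assumptions for demonstration, not part of the disclosure:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def ngram_features(pattern, n_min=2, n_max=4):
    # Each '[slot]' tag counts as one token (see the earlier sketch).
    toks = re.findall(r"\[[^\]]+\]|\w+", pattern)
    return [" ".join(toks[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(toks) - n + 1)]

# Tiny assumed training set of (sentence pattern, domain category) pairs.
patterns = ["[title] of [artist]", "play [title]",
            "navigate to [poi]", "weather of [time]"]
domains = ["music", "music", "navigation", "weather"]

vectorizer = CountVectorizer(analyzer=ngram_features)  # n-grams as features
clf = LinearSVC().fit(vectorizer.fit_transform(patterns), domains)

print(clf.predict(vectorizer.transform(["[title] of [artist]"])))
# ['music']
```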
In the embodiment, the domain classification is performed on the text to be processed through the pre-trained domain classification model, so that the accuracy and efficiency of the domain classification result are improved, and the whole text classification efficiency is improved.
Before the text classification method of any of the above embodiments of the present disclosure is performed, a domain classification model and a sequence labeling model may be trained in advance, and the corresponding operations are then performed based on the trained domain classification model and sequence labeling model.
Fig. 5 is a flowchart illustrating a training method of a classification model according to an exemplary embodiment of the present disclosure. The present embodiment can be applied to an electronic device, and as shown in fig. 5, the training method of the classification model of the present embodiment includes the following steps:
step 201, a first data set is obtained.
The first data set comprises samples of at least one domain category, and each sample is labeled with domain category information. The domain category information labeled on each sample is relatively accurate.
Step 202, according to the predefined word slot type, performing word slot type labeling on the samples in the first data set.
Step 203, training a domain classification model by using the first data set according to the result of the word slot category labeling.
According to the training method for the classification model provided by the embodiments of the present disclosure, word slot category labeling is performed according to predefined word slot categories on the samples in the first data set, which are labeled with domain category information, and the first data set is then used to train a domain classification model according to the result of the labeling. When the domain classification model trained by the method of this embodiment performs domain classification on a text to be processed, the classification is completed according to the word slot categories labeled in the text without considering specific words; even when the training set is small, sentences containing words that do not appear in the training samples can still be accurately classified by domain according to their word slot categories, so the accuracy of domain classification is improved.
In some embodiments, step 203 comprises: determining the sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling; and training a domain classification model based on the sentence patterns corresponding to the samples in the first data set.
In some optional examples, training the domain classification model based on the sentence pattern corresponding to the sample in the first data set may include: extracting features in sentence patterns corresponding to the samples in the first data set to obtain text features of the samples in the first data set; a domain classification model is trained based on text features of the samples in the first dataset.
Fig. 6 is a flowchart illustrating a training method of a classification model according to another exemplary embodiment of the present disclosure. As shown in fig. 6, the training method of the classification model of the present embodiment includes the following steps:
step 301, a first data set is obtained, and a sample in the first data set is marked with domain category information.
The samples in the first data set may be, for example: "Ice Rain of Liu De Hua", "play a Liu De Hua song", …, "navigate to Zhongguancun", "I want to navigate to West Cuimen", …, "weather today", "it rains in Beijing today", …, and so on. The domain category information labeled on a sample identifies the domain category of the sample, which may include, but is not limited to: music, poetry, navigation, weather, etc. After training of the domain classification model with samples labeled with domain category information is completed, the domain classification model can classify texts belonging to those domain categories.
Step 302, according to the predefined word slot type, performing word slot type labeling on the samples in the first data set.
The word slot category labels are, for example: [Liu De Hua: artist] [Ice Rain: title], navigate to [Zhongguancun: poi], [today: time] weather, etc.
Step 303, determining the sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling.
The sentence pattern may be, for example, the result of the word slot category labeling itself: [Liu De Hua: artist] [Ice Rain: title], navigate to [Zhongguancun: poi], [today: time] weather; or it may be the result obtained by replacing the corresponding words with the labeled word slot categories: [title] of [artist], navigate to [poi], weather of [time], etc.
Step 304, extracting features in the sentence pattern corresponding to the sample in the first data set to obtain text features of the sample in the first data set.
The features in the sentence patterns corresponding to the samples may be text features obtained by performing feature extraction on those sentence patterns in a preset manner. For example, using an n-gram model with n = 2 to 4: feature extraction on the sentence pattern "[title] of [artist]" can yield possible text features such as "[title] of", "of [artist]", and "[title] of [artist]"; feature extraction on the sentence pattern "navigate to [poi]" can yield "navigate to", "to [poi]", and "navigate to [poi]"; and feature extraction on the sentence pattern "weather of [time]" can yield "weather of", "of [time]", and "weather of [time]".
Step 305, training a domain classification model based on the text features of the samples in the first data set.
In this embodiment, word slot category labeling is performed on the samples in the first data set according to predefined word slot categories, the sentence patterns corresponding to the samples are determined according to the result of the labeling, features in those sentence patterns are extracted, and a domain classification model is trained based on the resulting text features. When the domain classification model trained by the method of this embodiment performs domain classification on a text to be processed, the classification does not need to consider specific words; it is completed according to the word slot categories labeled in the text and the features of the corresponding sentence pattern. Therefore, even sentences containing words that do not appear in the training samples can still be accurately classified by domain according to their word slot categories, which improves the accuracy and efficiency of domain classification and thus the efficiency of the whole text classification.
In some embodiments, step 305 may include: inputting text characteristics of the samples in the first data set into a domain classification model, and performing domain prediction on the samples in the first data set through the domain classification model to obtain domain category prediction information of the samples in the first data set; the domain classification model is trained based on a difference between domain class prediction information for the samples in the first data set and domain class information labeled for the samples in the first data set.
Step 305 may be an iterative process. In some optional examples, the parameters of the domain classification model may be adjusted according to the difference between the domain category prediction information of the samples in the first data set and the domain category information labeled on those samples, until a training-completion condition is met; for example, the difference falls below a preset threshold, or the number of training iterations of the domain classification model reaches a preset number.
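A framework-agnostic sketch of this iterative loop follows; `model`, `loss_fn`, and `optimizer_step` are hypothetical placeholders for whatever concrete classifier and update rule are used, and the threshold and epoch limit are illustrative values:

```python
# Placeholder-based sketch of the iterative training loop described above.
MAX_EPOCHS = 100        # assumed preset number of training iterations
LOSS_THRESHOLD = 0.01   # assumed preset difference threshold

def train_domain_classifier(model, samples, labels, loss_fn, optimizer_step):
    for _ in range(MAX_EPOCHS):                  # completion condition 2
        predictions = [model(x) for x in samples]
        # Difference between predicted and labeled domain categories.
        loss = loss_fn(predictions, labels)
        if loss < LOSS_THRESHOLD:                # completion condition 1
            break
        optimizer_step(model, loss)              # adjust model parameters
    return model
```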
In some embodiments, in step 202 or 302 of the embodiments shown in fig. 5 to fig. 6, the samples in the first data set may be input into a trained sequence labeling model, and the word slot categories of the samples in the first data set may be labeled through the trained sequence labeling model.
In the embodiment, the word slot categories of the samples in the first data set are labeled through the trained sequence labeling model, so that the accuracy and efficiency of labeling the word slot categories are improved.
After training of the domain classification model in the embodiment shown in fig. 6 is completed, the model may be used to perform the operation of step 103 in the embodiments shown in fig. 2 to 4, that is, performing domain classification on the text to be processed according to the result of the word slot category labeling to obtain its domain category; for related details, refer to the descriptions of the embodiments shown in fig. 2 to 4, which are not repeated here.
Fig. 7 is a flowchart illustrating a training method of a classification model according to another exemplary embodiment of the present disclosure. As shown in fig. 7, the training method of the classification model of the present embodiment includes the following steps:
step 401, a first data set and a second data set are obtained.
Wherein, the samples in the first data set are marked with domain category information; the samples in the second data set are labeled with word slot class information according to a predefined word slot class.
The samples in the first data set and their labeled domain category information are as described for step 301 in the embodiment shown in fig. 6, and are not repeated here.
The samples in the second data set may be, for example: "Ice Rain of Liu De Hua", "play a Liu De Hua song", …, "navigate to Zhongguancun", "I want to navigate to West Cuimen", …, "weather today", "it rains in Beijing today", …, and so on. The word slot category information labeled on the samples in the second data set corresponds to word slots predefined over the whole set of domain categories, for example artist, title, poi, time, location, and the like; see Table 1 above. After the sequence labeling model has been trained with samples labeled with word slot category information, it can perform the corresponding word slot category labeling on text.
Step 402, training the sequence annotation model with the second data set.
Step 403, inputting the samples in the first data set into a sequence labeling model, and labeling the word slot categories of the samples in the first data set through the sequence labeling model.
Step 404, determining the sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling.
Step 405, extracting features in the sentence pattern corresponding to the sample in the first data set to obtain text features of the sample in the first data set.
Step 406, training a domain classification model based on the text features of the samples in the first dataset.
In this embodiment, the sequence labeling model is trained in advance using a sample data set, and word slot labeling is performed on the text to be processed through the trained sequence labeling model, which improves the accuracy and efficiency of word slot labeling and thus the efficiency of the whole text classification.
In some embodiments, step 402 may comprise:
inputting the samples in the second data set into a sequence labeling model, and performing word-slot class prediction on the samples in the second data set through the sequence labeling model to obtain word-slot class prediction information of the samples in the second data set;
and training the sequence labeling model according to the difference between the word slot category prediction information of the samples in the second data set and the word slot category information labeled on the samples in the second data set.
Step 402 may be an iterative process. In some optional examples, the parameters of the sequence labeling model may be adjusted according to the difference between the word slot category prediction information of the samples in the second data set and the word slot category information labeled on those samples, until a training-completion condition is met; for example, the difference falls below a preset threshold, or the number of training iterations of the sequence labeling model reaches a preset number.
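Since the disclosure leaves the tagger implementation open (HMM, MaxEnt, CRF, or neural network), the sketch below shows one assumed realization: a small BiLSTM tagger trained with token-level cross-entropy in PyTorch. The framework, architecture, and all sizes and names are illustrative assumptions, not the disclosure's prescribed method:

```python
import torch
import torch.nn as nn

NUM_SLOTS = 12  # assumed size of the word slot tag set

class SlotTagger(nn.Module):
    """Tiny BiLSTM sequence labeling model: one slot tag per token."""
    def __init__(self, vocab_size=5000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * dim, NUM_SLOTS)

    def forward(self, token_ids):              # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))  # (batch, seq_len, 2*dim)
        return self.out(h)                     # per-token slot logits

model = SlotTagger()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(token_ids, slot_labels):        # both (batch, seq_len)
    logits = model(token_ids)
    # Difference between predicted and labeled word slot categories.
    loss = loss_fn(logits.reshape(-1, NUM_SLOTS), slot_labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```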
After training of the sequence labeling model and the domain classification model in the embodiment shown in fig. 7 is completed, they can be used to implement the operations of steps 102 and 103 in the embodiments shown in fig. 2 to 4; for related details, refer to the descriptions of those embodiments, which are not repeated here.
Any of the methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to terminal equipment, servers, and the like. Alternatively, any of the methods provided by the embodiments of the present disclosure may be performed by a processor, for example, a processor that executes any of the methods mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This is not described in detail below.
Exemplary devices
Fig. 8 is a schematic structural diagram of a text classification apparatus according to an exemplary embodiment of the present disclosure. The text classification device can be arranged in electronic equipment such as terminal equipment and a server, and executes the text classification method of any one of the above embodiments of the disclosure. As shown in fig. 8, the text classification apparatus includes: a first obtaining module 501, a labeling module 502 and a classifying module 503. Wherein:
a first obtaining module 501, configured to obtain a text to be processed.
The labeling module 502 is configured to perform word slot class labeling on the text to be processed acquired by the first acquiring module 501 according to a predefined word slot class.
In some embodiments, the labeling module 502 may include a sequence labeling model for inputting the text to be processed into the sequence labeling model, and labeling the word slot category of the text to be processed through the sequence labeling model.
The classifying module 503 is configured to perform domain classification on the text to be processed according to the result of the word slot class labeling obtained by the labeling module 502, so as to obtain a domain class of the text to be processed.
According to the text classification apparatus provided by the embodiments of the present disclosure, word slot category labeling is performed on the text to be processed according to predefined word slot categories, and domain classification is performed on the text to be processed according to the result of the labeling. Because the domain classification of the text to be processed does not need to consider specific words, but is completed according to the word slot categories labeled in the text, sentences can be accurately classified by domain, and the accuracy of domain classification is improved.
Fig. 9 is a schematic structural diagram of a text classification apparatus according to another exemplary embodiment of the present disclosure. On the basis of the embodiment shown in fig. 8, the classification module 503 includes: a first determining unit 5031, configured to determine the sentence pattern corresponding to the text to be processed according to the result of the word slot category labeling; and a second determining unit 5032, configured to determine the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed.
In some embodiments, the second determining unit 5032 may comprise: the extraction subunit is used for extracting the features in the sentence patterns corresponding to the text to be processed to obtain the text features of the text to be processed; and the classification subunit is used for performing field classification on the text to be processed based on the text characteristics of the text to be processed to obtain the field type of the text to be processed.
In some optional examples, the classification subunit may include a domain classification model, and is configured to input text features of the text to be processed into the domain classification model, and perform domain classification on the text to be processed through the domain classification model to obtain a domain category of the text to be processed.
Fig. 10 is a schematic structural diagram of a training apparatus for a classification model according to an exemplary embodiment of the present disclosure. The training apparatus for the classification model may be disposed in an electronic device such as a terminal device or a server, and executes the training method for the classification model of any of the above embodiments of the present disclosure. As shown in fig. 10, the training apparatus for the classification model includes: a second acquisition module 601, a labeling module 602, and a first training module 603. Wherein:
the second obtaining module 601 is configured to obtain a first data set, where a sample in the first data set is labeled with domain category information.
The labeling module 602 is configured to perform word slot class labeling on the samples in the first data set acquired by the second acquiring module according to a predefined word slot class.
The first training module 603 is configured to train a domain classification model using the first data set according to the result of the word slot category labeling obtained by the labeling module.
According to the training apparatus for the classification model provided by the embodiments of the present disclosure, word slot category labeling is performed according to predefined word slot categories on the samples in the first data set, which are labeled with domain category information, and the first data set is then used to train a domain classification model according to the result of the labeling. When the domain classification model trained in this way performs domain classification on a text to be processed, the classification is completed according to the word slot categories labeled in the text without considering specific words; even when the training set is small, sentences containing words that do not appear in the training samples can still be accurately classified by domain according to their word slot categories, so the accuracy of domain classification is improved.
Fig. 11 is a schematic structural diagram of a training apparatus for a classification model according to another exemplary embodiment of the present disclosure. On the basis of the embodiment shown in fig. 10, the first training module 603 includes: a third determining unit 6031, configured to determine the sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling; and a first training unit 6032, configured to train a domain classification model based on the sentence patterns corresponding to the samples in the first data set.
Referring again to fig. 11, in some embodiments, the first training unit 6032 may include: the extraction subunit is used for extracting the features in the sentence patterns corresponding to the samples in the first data set to obtain the text features of the samples in the first data set; and the training subunit is used for training the domain classification model based on the text features of the samples in the first data set.
In some optional examples, the training subunit is specifically configured to: inputting text characteristics of the samples in the first data set into a domain classification model, and performing domain prediction on the samples in the first data set through the domain classification model to obtain domain category prediction information of the samples in the first data set; the domain classification model is trained based on a difference between domain class prediction information for the samples in the first data set and domain class information labeled for the samples in the first data set.
In some optional examples, the annotation module 602 is specifically configured to: and inputting the samples in the first data set into a sequence labeling model, and labeling the word slot categories of the samples in the first data set through the sequence labeling model.
Referring to fig. 11 again, a training apparatus for a classification model provided in another exemplary embodiment further includes: a third acquisition module 604, configured to acquire a second data set, where the samples in the second data set are labeled with word slot category information according to predefined word slot categories; and a second training module 605, configured to train the sequence labeling model with the second data set.
In some embodiments, the second training module 605 may include: a prediction unit 6051, configured to input the samples in the second data set into a sequence tagging model, and perform word-slot class prediction on the samples in the second data set through the sequence tagging model to obtain word-slot class prediction information of the samples in the second data set; a second training unit 6052, configured to train the sequence labeling model according to a difference between the word-slot class prediction information of the samples in the second data set and the word-slot class information labeled by the samples in the second data set.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 12. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
FIG. 12 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 12, the electronic device includes one or more processors 701 and memory 702.
The processor 701 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 702 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 701 to implement the methods of the various embodiments of the disclosure described above and/or other desired functionality. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include: an input device 703 and an output device 704, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is a first device or a second device, the input device 703 may be the microphone or the microphone array described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 703 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
The input device 703 may also include, for example, a keyboard, a mouse, and the like.
The output device 704 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 704 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 12, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform steps in methods according to various embodiments of the present disclosure as described in the "exemplary methods" section of this specification above.
Program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in methods according to various embodiments of the present disclosure as described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present disclosure are merely examples and are not limiting; they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the specific details disclosed above are provided for the purposes of illustration and description only; the disclosure is not limited to those specific details.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the system embodiments basically correspond to the method embodiments, their description is relatively brief; for relevant details, reference may be made to the corresponding parts of the method embodiment descriptions.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (16)

1. A method of text classification, comprising:
acquiring a text to be processed;
performing word slot category labeling on the text to be processed according to predefined word slot categories;
and performing domain classification on the text to be processed according to the result of the word slot category labeling, to obtain the domain category of the text to be processed.
2. The method of claim 1, wherein the performing domain classification on the text to be processed according to the result of the word slot category labeling comprises:
determining a sentence pattern corresponding to the text to be processed according to the result of the word slot category labeling;
and determining the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed.
3. The method according to claim 2, wherein the determining the domain category of the text to be processed based on the sentence pattern corresponding to the text to be processed comprises:
extracting features from the sentence pattern corresponding to the text to be processed to obtain text features of the text to be processed;
and performing domain classification on the text to be processed based on the text features of the text to be processed to obtain the domain category of the text to be processed.
4. The method of claim 3, wherein the performing domain classification on the text to be processed based on the text features of the text to be processed comprises:
inputting the text features of the text to be processed into a domain classification model, and performing domain classification on the text to be processed through the domain classification model to obtain the domain category of the text to be processed.
5. The method according to any one of claims 1 to 4, wherein the performing word slot category labeling on the text to be processed according to the predefined word slot categories comprises:
inputting the text to be processed into a sequence labeling model, and labeling the word slot categories of the text to be processed through the sequence labeling model.
6. A training method for a classification model, comprising:
acquiring a first data set, wherein samples in the first data set are labeled with domain category information;
performing word slot category labeling on the samples in the first data set according to predefined word slot categories;
and training a domain classification model with the first data set according to the result of the word slot category labeling.
7. The method of claim 6, wherein the training a domain classification model with the first data set according to the result of the word slot category labeling comprises:
determining sentence patterns corresponding to the samples in the first data set according to the result of the word slot category labeling;
and training the domain classification model based on the sentence patterns corresponding to the samples in the first data set.
8. The method of claim 7, wherein the training the domain classification model based on the sentence patterns corresponding to the samples in the first data set comprises:
extracting features from the sentence patterns corresponding to the samples in the first data set to obtain text features of the samples in the first data set;
and training the domain classification model based on the text features of the samples in the first data set.
9. The method of claim 8, wherein the training the domain classification model based on the text features of the samples in the first data set comprises:
inputting the text features of the samples in the first data set into the domain classification model, and performing domain prediction on the samples in the first data set through the domain classification model to obtain domain category prediction information of the samples in the first data set;
and training the domain classification model according to the difference between the domain category prediction information of the samples in the first data set and the domain category information labeled for the samples in the first data set.
10. The method according to any one of claims 6 to 9, wherein the performing word slot category labeling on the samples in the first data set according to the predefined word slot categories comprises:
inputting the samples in the first data set into a sequence labeling model, and labeling the word slot categories of the samples in the first data set through the sequence labeling model.
11. The method of claim 10, wherein before the inputting the samples in the first data set into a sequence labeling model and labeling the word slot categories of the samples in the first data set through the sequence labeling model, the method further comprises:
acquiring a second data set, wherein the samples in the second data set are labeled with word slot category information according to the predefined word slot categories;
and training the sequence labeling model using the second data set.
12. The method of claim 11, wherein the training the sequence labeling model using the second data set comprises:
inputting the samples in the second data set into the sequence labeling model, and performing word slot category prediction on the samples in the second data set through the sequence labeling model to obtain word slot category prediction information of the samples in the second data set;
and training the sequence labeling model according to the difference between the word slot category prediction information of the samples in the second data set and the word slot category information labeled for the samples in the second data set.
13. A domain classification apparatus, comprising:
a first acquisition module, configured to acquire a text to be processed;
a labeling module, configured to perform word slot category labeling on the text to be processed acquired by the first acquisition module according to predefined word slot categories;
and a classification module, configured to perform domain classification on the text to be processed according to the word slot category labeling result obtained by the labeling module, to obtain the domain category of the text to be processed.
14. A training apparatus for a classification model, comprising:
a second acquisition module, configured to acquire a first data set, wherein samples in the first data set are labeled with domain category information;
a labeling module, configured to perform word slot category labeling on the samples in the first data set acquired by the second acquisition module according to predefined word slot categories;
and a first training module, configured to train a domain classification model with the first data set according to the word slot category labeling result obtained by the labeling module.
15. A computer-readable storage medium storing a computer program for performing the method of any one of claims 1 to 12.
16. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1 to 12.
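To make the claimed flow easier to follow, the sketch below walks the method of claims 1 to 5 end to end: word slot labeling, sentence pattern construction, feature extraction, and domain classification. The slot lexicon, bigram features, and linear scorer are illustrative stand-ins for the trained sequence labeling and domain classification models, which the claims leave open.

```python
# Hedged, self-contained sketch of claims 1-5; every name and value below
# is an illustrative assumption, not part of the claimed subject matter.
from collections import defaultdict

SLOT_LEXICON = {"beijing": "CITY", "tomorrow": "DATE"}  # hypothetical slots

def label_slots(tokens):
    # Claim 1: word slot category labeling (a trained tagger in practice).
    return [SLOT_LEXICON.get(tok.lower(), "O") for tok in tokens]

def to_sentence_pattern(tokens, tags):
    # Claim 2: "weather in beijing tomorrow" -> "weather in CITY DATE".
    return [tag if tag != "O" else tok for tok, tag in zip(tokens, tags)]

def extract_features(pattern):
    # Claim 3: features over the sentence pattern (bigram counts here).
    feats = defaultdict(int)
    for a, b in zip(pattern, pattern[1:]):
        feats[f"{a}_{b}"] += 1
    return feats

def classify_domain(feats, weights, domains):
    # Claim 4: a linear scorer standing in for the domain classification model.
    scores = {d: sum(weights.get((d, f), 0.0) * v for f, v in feats.items())
              for d in domains}
    return max(scores, key=scores.get)

tokens = "weather in beijing tomorrow".split()
pattern = to_sentence_pattern(tokens, label_slots(tokens))
print(classify_domain(extract_features(pattern),
                      {("weather", "in_CITY"): 1.0}, ["weather", "music"]))
```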
CN201910759761.8A 2019-08-16 2019-08-16 Text classification method, training method of classification model, training device of classification model, medium and training equipment Active CN112395414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910759761.8A CN112395414B (en) 2019-08-16 2019-08-16 Text classification method, training method of classification model, training device of classification model, medium and training equipment

Publications (2)

Publication Number Publication Date
CN112395414A true CN112395414A (en) 2021-02-23
CN112395414B CN112395414B (en) 2024-06-04

Family

ID=74602945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910759761.8A Active CN112395414B (en) 2019-08-16 2019-08-16 Text classification method, training method of classification model, training device of classification model, medium and training equipment

Country Status (1)

Country Link
CN (1) CN112395414B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286399A1 (en) * 2016-03-31 2017-10-05 International Business Machines Corporation System, method, and recording medium for corpus pattern paraphrasing
CN110019782A (en) * 2017-09-26 2019-07-16 北京京东尚科信息技术有限公司 Method and apparatus for exporting text categories
CN108170733A (en) * 2017-12-15 2018-06-15 云蜂科技有限公司 A kind of method and system classified to short message text
CN108415897A (en) * 2018-01-18 2018-08-17 北京百度网讯科技有限公司 Classification method of discrimination, device and storage medium based on artificial intelligence
CN109858035A (en) * 2018-12-29 2019-06-07 深兰科技(上海)有限公司 A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing
CN109918673A (en) * 2019-03-14 2019-06-21 湖北亿咖通科技有限公司 Semantic referee method, device, electronic equipment and computer readable storage medium
CN109918680A (en) * 2019-03-28 2019-06-21 腾讯科技(上海)有限公司 Entity recognition method, device and computer equipment
CN109918682A (en) * 2019-03-29 2019-06-21 科大讯飞股份有限公司 A kind of text marking method and device
CN110119786A (en) * 2019-05-20 2019-08-13 北京奇艺世纪科技有限公司 Text topic classification method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738298A (en) * 2023-08-16 2023-09-12 杭州同花顺数据开发有限公司 Text classification method, system and storage medium
CN116738298B (en) * 2023-08-16 2023-11-24 杭州同花顺数据开发有限公司 Text classification method, system and storage medium

Also Published As

Publication number Publication date
CN112395414B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
US11164568B2 (en) Speech recognition method and apparatus, and storage medium
US10192545B2 (en) Language modeling based on spoken and unspeakable corpuses
EP3183728B1 (en) Orphaned utterance detection system and method
US10114809B2 (en) Method and apparatus for phonetically annotating text
WO2019232991A1 (en) Method for recognizing conference voice as text, electronic device and storage medium
CN109686383B (en) Voice analysis method, device and storage medium
CN110209812B (en) Text classification method and device
CN110148416A (en) Audio recognition method, device, equipment and storage medium
US9953644B2 (en) Targeted clarification questions in speech recognition with concept presence score and concept correctness score
WO2020238209A1 (en) Audio processing method, system and related device
WO2020186712A1 (en) Voice recognition method and apparatus, and terminal
US10909972B2 (en) Spoken language understanding using dynamic vocabulary
CN110070859B (en) Voice recognition method and device
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN112071310B (en) Speech recognition method and device, electronic equipment and storage medium
CN114003682A (en) Text classification method, device, equipment and storage medium
WO2020110815A1 (en) Keyword extraction device, keyword extraction method, and program
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN112395414B (en) Text classification method, training method of classification model, training device of classification model, medium and training equipment
CN112487180B (en) Text classification method and apparatus, computer-readable storage medium, and electronic device
US20230169988A1 (en) Method and apparatus for performing speaker diarization based on language identification
US11823671B1 (en) Architecture for context-augmented word embedding
CN110929749B (en) Text recognition method, text recognition device, text recognition medium and electronic equipment
CN114242047A (en) Voice processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant