CN114491034A - Text classification method and intelligent device


Info

Publication number
CN114491034A
CN114491034A (application CN202210080130.5A)
Authority
CN
China
Prior art keywords
text, score, category, classified, classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210080130.5A
Other languages
Chinese (zh)
Inventor
车进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Juhaokan Technology Co Ltd
Original Assignee
Juhaokan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Juhaokan Technology Co Ltd filed Critical Juhaokan Technology Co Ltd
Priority to CN202210080130.5A priority Critical patent/CN114491034A/en
Publication of CN114491034A publication Critical patent/CN114491034A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The application provides a text classification method and a smart device. After an extra-long text to be classified is obtained, the method calculates support-word scores and then segments the text to be classified into a plurality of text segments. A first score and a second score are calculated for each text segment to obtain a comprehensive score for each segment; the text to be classified is re-segmented according to the comprehensive scores to obtain short text data, and finally the short text data is input into a natural language processing model for text classification. By combining zero-shot learning with support-word scores to compute a comprehensive score for each text segment, the method determines the importance of each segment, so that model effectiveness is preserved and semantic loss is reduced while performance is guaranteed.

Description

Text classification method and intelligent device
Technical Field
The application relates to the technical field of natural language processing, and in particular to a text classification method and a smart device.
Background
Text classification is a data processing mode in which a computer or other data processing device automatically classifies and labels a text set according to a certain classification system or standard. Based on deep-learning neural network technology, text classification finds a relational model between text features and text categories in labeled training sample data, and then uses the learned relational model to judge the category of a new text, thereby achieving semantic understanding of natural language text.
In the text classification process, the data processing device needs to train an initial model on sample data to obtain a trained model, and then uses the trained model to recognize new text data and output the classification probability of each category for that text. Because the model processes text through mechanisms such as positional encoding, the text data input to the model is subject to a length limit. For example, the classic BERT natural language model, due to the design of its original positional encoding, supports processing text of at most length 512 and cannot model extra-long text.
To process long texts, the text must be segmented before being input into the model. For example, an extra-long text may be hard-truncated from front to back, cutting a text longer than 512 into several short texts of specified length at most 512, which are then modeled. However, this truncation approach works only when a fixed-length short text can represent the semantics of the complete text, a property that real text data rarely satisfies; hard truncation therefore ignores the sensitivity and usability of the text and causes partial semantic loss.
Disclosure of Invention
The application provides a text classification method and a smart device, aiming to solve the problem of semantic loss that occurs when conventional text classification methods process extra-long text.
In a first aspect, the present application provides a text classification method, including:
acquiring a text to be classified;
calculating a support-word score for the category corresponding to each classification label, wherein the support-word score is the inverse document frequency (IDF) value of a keyword in the text to be classified, and the support words are keywords whose IDF value exceeds a preset IDF threshold;
segmenting the text to be classified into a plurality of text segments;
calculating a first score for each text segment, wherein the first score is the information entropy of a category score vector, the category score vector being the vector formed by the zero-shot learning model's classification results for the text segment over all categories;
calculating a second score for each text segment, wherein the second score is calculated from the scores of the support words contained in the text segment;
calculating a comprehensive score, which is the normalized sum of the first score and the second score;
and re-segmenting the text to be classified according to the comprehensive scores, and inputting the re-segmentation result into a natural language processing model.
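The scoring steps above can be sketched in code. This is an illustrative sketch only, not the patent's implementation: the entropy-based first score, the normalization, and all names are reconstructed from the claim wording, and the zero-shot category scores and support-word scores are stand-in numbers.

```python
import math

def entropy(probs):
    """Information entropy of a category score vector (the first score).
    The vector holds a zero-shot model's classification results per category."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def composite_scores(first_scores, second_scores):
    """Comprehensive score per segment: normalize each score list to [0, 1]
    and sum the two (the 'normalized sum' named in the claim)."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        span = (hi - lo) or 1.0
        return [(x - lo) / span for x in xs]
    return [a + b for a, b in zip(norm(first_scores), norm(second_scores))]

# Toy example: three text segments, three candidate categories
first = [entropy([0.9, 0.05, 0.05]),   # confident zero-shot prediction
         entropy([0.34, 0.33, 0.33]),  # near-uniform: little category signal
         entropy([0.6, 0.3, 0.1])]
second = [2.5, 0.1, 1.2]               # stand-in sums of support-word IDF scores
scores = composite_scores(first, second)
```

Segments would then be ranked by `scores` to decide which parts of the long text survive re-segmentation.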
In a second aspect, the present application further provides a smart device, including a storage module and a processing module, wherein the storage module is configured to store a natural language processing model and a zero-shot learning model, and the processing module is configured to perform the following program steps:
acquiring a text to be classified;
calculating a support-word score for the category corresponding to each classification label, wherein the support-word score is the inverse document frequency (IDF) value of a keyword in the text to be classified, and the support words are keywords whose IDF value exceeds a preset IDF threshold;
segmenting the text to be classified into a plurality of text segments;
calculating a first score for each text segment, wherein the first score is the information entropy of a category score vector, the category score vector being the vector formed by the zero-shot learning model's classification results for the text segment over all categories;
calculating a second score for each text segment, wherein the second score is calculated from the scores of the support words contained in the text segment;
calculating a comprehensive score, which is the normalized sum of the first score and the second score;
and re-segmenting the text to be classified according to the comprehensive scores, and inputting the re-segmentation result into a natural language processing model.
According to the above technical solution, after an extra-long text to be classified is obtained, support-word scores are calculated, and the text to be classified is segmented into a plurality of text segments. A first score and a second score are calculated for each text segment to obtain its comprehensive score; the text to be classified is then re-segmented according to the comprehensive scores to obtain short text data, which is finally input into a natural language processing model for text classification. By combining zero-shot learning with support-word scores to compute a comprehensive score for each text segment, the method determines the importance of each segment, preserving model effectiveness and reducing semantic loss while guaranteeing performance.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings used in the embodiments are briefly described below; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of natural language processing according to an embodiment of the present application;
FIG. 2 is a schematic diagram of intelligent dialogue interaction in an embodiment of the present application;
FIG. 3 is a schematic diagram of the cooperative workflow of a smart device and a server in an embodiment of the present application;
FIG. 4 is a schematic diagram of a network application working scenario in an embodiment of the present application;
FIG. 5 is a flowchart of a text classification method executed by a smart device in an embodiment of the present application;
FIG. 6 is a schematic diagram of a text truncation flow in an embodiment of the present application;
FIG. 7 is a schematic diagram of a support-word determination flow in an embodiment of the present application;
FIG. 8 is a flowchart of a text classification method according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present application; they are merely examples of systems and methods consistent with certain aspects of the application as recited in the claims.
In the embodiments of the application, the text classification method may be applied to smart devices that have data processing capability and natural language processing requirements. Such smart devices include, but are not limited to: computers, intelligent terminals, smart televisions, smart wearable devices, smart display devices, servers, and the like. A smart device may use a built-in or externally connected storage module together with a processing module to form a text classification system capable of executing the text classification method.
For example, the smart device may be a smart television with a built-in memory and controller, where the memory stores data such as text, natural language processing models, and control programs. The controller retrieves data from the memory and processes it by executing a control program.
As shown in FIG. 1, in an embodiment of the present application, natural language processing may include two phases: a model training phase and a text classification phase. In the model training phase, the controller may obtain training sample data, i.e. labeled text data, from a network or other source, and input it into an initial model for training. The model outputs a classification probability for each classification label based on the input sample. The output probabilities are then compared with the classification labels to obtain the error between the classification result and the labels, and the model parameters are adjusted by back-propagating this error. By repeatedly adjusting and optimizing the parameters over a sufficient volume of training samples, a model with high classification accuracy is obtained. After the training process is complete, the controller stores the resulting classification model in memory for later invocation by subsequent applications.
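The training loop just described — forward pass, error against the label, back-propagation, parameter update — can be sketched on a toy one-feature logistic classifier. This is a generic illustration of that loop, not the patent's model; all names and numbers are invented.

```python
import math

def train_step(w, b, x, y, lr=0.1):
    """One iteration of the loop described above: forward pass to get a
    classification probability, error against the label, gradient update."""
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # model's classification probability
    err = p - y                               # error between output and label
    return w - lr * err * x, b - lr * err     # back-propagate and adjust

# Repeated adjustment over labeled samples drives the error down
w, b = 0.0, 0.0
for _ in range(100):
    w, b = train_step(w, b, x=1.0, y=1.0)
```

After enough iterations the predicted probability for the labeled sample approaches its label, which is the convergence behavior the paragraph describes.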
In the text classification stage, the controller may call the trained classification model in memory and input the text data to be classified into it. The classification probability of the current text for each classification label is obtained through the model's internal operations, so that the semantics of the text data can be understood.
The training model or classification model described in the above embodiments may be a natural language processing (NLP) model, for example the BERT model or other NLP models obtained by optimizing or modifying BERT. It should be noted that the model in the training phase is called the training model, and the model in the classification phase is called the classification model. Since these are merely different stages of one model, and both take text data as input during classification, any text data that can be input to the training model can also be input to the classification model. In the embodiments below, unless otherwise noted, the training model and classification model are therefore not distinguished, and the text classification method using a natural language processing model (NLP model) applies both to processing text data in the model training stage and in the text classification stage.
In the text classification process, different classification labels can be set so that the smart device can determine the classification probability of the text data for each label and thereby determine the meaning of the text. That is, text classification derives machine-usable meaning from natural language text, enabling machine learning. The text classification process can therefore be applied in fields related to natural language processing, such as intelligent voice control, intelligent question answering, image recognition processing, and business statistical analysis.
In some embodiments, to implement the above text classification process, the smart device may perform model training and text classification through an artificial intelligence (AI) algorithm embedded in its operating system. For example, as shown in FIG. 2, for an intelligent question-answering robot, an intelligent question-answering system may be built into its operating system. In practical use, the robot acquires text input by the user in real time, such as "Where is Shop XX?". It calls the question-answering system, which inputs the user's text into its classification model to understand the semantics, i.e. "search for the location of Shop XX". Finally, it feeds back the corresponding dialogue content according to the understood semantics, e.g. "Shop XX is on the third floor, 308F", realizing the intelligent question-answering function.
Obviously, smart devices serving different purposes have different functions, so the AI algorithms built into their operating systems differ; but all of them essentially realize a text classification process, differing only in the classification labels they set. For example, the classification labels set for the model built into an intelligent question-answering robot are labels related to the question-answering process, such as "search", "consult", "select" and "suggest", representing user intents. Since different user intents may act on different business objects, the category labels may also include business object names such as "title", "item name", "person name" and "place name". In the intelligent voice system built into a smart television, the classification labels for the model may instead relate to media playback, such as "movie", "TV series" and "cartoon" representing media types, and "movie name", "author" and "genre (comedy, military, rural)" representing media objects.
In addition to being built into the operating system, the AI algorithm for text classification may also be built into an application program. That is, in some embodiments, the smart device may implement the text classification function by installing an application, which may be a system application or a third-party application. For example, to implement the intelligent question-answering function, a computer may download and install an "intelligent question-answering robot" application; running the application calls the classification model, acquires text input by the user in real time, and inputs it into the model to classify it.
In some embodiments, the text classification function is not limited to a single smart device but may be implemented by multiple devices in cooperation. That is, the smart device may establish a communication connection with a server: in practice, the smart device acquires the text input by the user in real time, the server performs model training and text classification, and the smart device displays the classification result.
For example, as shown in FIG. 3, the smart device may acquire text data input by the user in real time and send it to the server. The server contains the AI algorithm and classification model for text classification, so after receiving the text data it inputs the data into the classification model and obtains the model's classification result. The server feeds the result back to the smart device, which presents the classification result and related interaction information to the user.
Obviously, to satisfy more service requirements and reduce the amount of data processing, the specific devices that cooperate to implement the text classification function may be set flexibly according to the functions to be implemented, and the specific classification workflow may be set flexibly according to the devices' hardware configuration and data volume, reducing repeated processing and saving computing capacity. For example, multiple smart devices may connect to a server simultaneously. The server provides a unified classification model to the smart devices, and each device, after obtaining the model, performs data input, model computation, and result output to achieve text classification. Meanwhile, each smart device may report the text data it has processed to the server for further model training, continuously refining the classification model. The server can then push the classification model to the smart devices at preset times, updating the model on each device to keep it current.
In addition, when the text classification function is implemented through multi-device cooperation, the computational load of each device can be monitored in real time, and the device actually executing the model training stage and/or the text classification stage can be adjusted dynamically according to the real-time load. That is, as shown in FIG. 4, in some embodiments the application implementing text classification may be a web application; smart devices and a server on the same network can implement the text classification function by installing and running it. While running, the web application monitors the computational load of each smart device and the server in real time, including CPU usage, memory usage, and network latency. When any load metric becomes abnormal, the device executing the AI algorithm can be switched in real time so that the text classification process keeps running smoothly.
For example, in the normal state the model computation in text classification may be performed by the smart device; when the smart device's memory usage is detected to exceed a threshold, its model computation is suspended and the device is automatically made to send the acquired text data to the server, which performs the computation and feeds back the classification result, reducing the smart device's processing load and improving the timeliness of classification.
As can be seen from the above embodiments, when applying text classification, the smart device or server needs to input text data into the training model (or classification model) in both the model training stage and the text classification stage. Since the text to be classified is natural language text, it takes different forms depending on its source. For example, text generated from a user's voice input tends to be colloquial, matching the user's speech, and is short, generally only one or a few sentences long. Business texts such as contracts, decisions, and agreements are long because of their specific format requirements, generally comprising multiple paragraphs, each containing several sentences.
Because of its original design, a natural language processing model limits the length of text that can be input at one time. For example, the BERT model cannot model extra-long text due to the design of its original position embedding, and because this design is carried over into widely used pre-trained models, a considerable number of pre-trained models cannot properly model extra-long text. In general, a natural language processing model built on BERT sets the input text length to 512 characters; that is, the text input to the NLP model at one time must not exceed 512 characters.
Therefore, when the NLP model is applied in practice, short texts entered by a user through real-time dialogue can be input directly, while long texts such as contracts and other business documents cannot. It should be noted that "short text" and "long text" are relative terms: the dividing criterion may differ across application fields and NLP models. For the BERT model, whose maximum text input length is 512, text longer than 512 is called long text and text of length at most 512 is called short text.
To allow long texts to be input into the NLP model for classification, in some embodiments a text truncation routine may be provided in the AI algorithm. In practice, after receiving text data input by the user, the smart device may first measure the length of the text to determine whether it is a long text. If so, the truncation routine in the AI algorithm is activated: it cuts the long text into several short texts, each no longer than the NLP model's maximum input length, and the resulting short texts are input into the NLP model one by one for classification.
When performing text truncation, the smart device can hard-truncate according to the maximum text input length. For example, when the text length is 804, which exceeds the BERT model's maximum input length of 512, truncation is required: the smart device truncates the text at length 512, obtaining a short text a of length 512 and a short text b of length 292, which are input into the BERT model separately for classification.
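The hard truncation just described can be sketched in a few lines; the function name is illustrative:

```python
def hard_truncate(text, max_len=512):
    """Hard truncation from front to back: cut the text into consecutive
    chunks of at most max_len characters (512 being the BERT limit above)."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

# The length-804 example: one chunk of 512 and one of 292
chunks = hard_truncate("x" * 804)
```

As the surrounding text notes, this preserves nothing about sentence boundaries, which motivates the alternatives below.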
However, since the earlier and later parts of a text are generally correlated and hard truncation may destroy this correlation, the smart device may truncate in different ways to preserve as much of it as possible. To that end, in some embodiments, during preprocessing the text data may be split into sentences or paragraphs according to punctuation marks, paragraph marks, spaces and other markers in the document, and each sentence or paragraph is input into the NLP model as a separate short text.
Splitting by sentence or paragraph preserves the correlation within each unit, but still has serious drawbacks: splitting by sentence severs the relations between sentences, making the finally recognized semantics too thin, while splitting by paragraph fails for some texts because an over-long paragraph exceeds the maximum input length and must be split a second time; paragraph splitting also aggregates too much semantics, which harms the text classification result.
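A minimal sentence-level split of the kind discussed above can be done on punctuation marks; this sketch handles Western and CJK sentence-final punctuation, and the exact delimiter set is an assumption:

```python
import re

def split_sentences(text):
    """Split text at sentence-final punctuation (Western and CJK),
    keeping each delimiter attached to its sentence."""
    parts = re.split(r"(?<=[.!?。！？])", text)
    return [p.strip() for p in parts if p.strip()]

sents = split_sentences("Movie A is great! It won awards. Worth watching?")
```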
To improve the text segmentation effect, some embodiments of the present application provide a text classification method, as shown in FIG. 5. The method is applicable to a smart device or server capable of text classification and specifically includes the following steps.
Acquire the text to be classified. Before classification, the smart device or server obtains the text to be classified; the acquisition method may differ for devices with different functions. For example, an intelligent question-answering robot can collect the user's voice through a voice acquisition device and then convert it into text with a speech-to-text tool, while a computer running an auditing task can obtain the text to be classified by reading business documents stored in a database.
After the text to be classified is obtained, the smart device may also preprocess it. Text preprocessing refers to the series of operations performed before the text is input into the NLP model, so that the text meets the model's input requirements. For example, preprocessing may include removing meaningless characters from the text, converting the text into the tensors needed by the model, normalizing tensor sizes, and so on.
In some embodiments, preprocessing may further include removing stop words: the smart device filters the words in the text against a preset word bank to remove words or symbols that carry no actual meaning, such as filler particles and other symbols that contribute nothing to the semantics.
The preset word bank is a database constructed in advance for the application field, and may include basic words and proper nouns used in the field, conventional grammatical words, and so on. During preprocessing, the smart device performs word segmentation on the text based on the preset word bank, splitting whole sentences according to word rules. For example, when the user inputs "Movie A is a positive-energy movie", the segmentation result based on the preset word bank may be "Movie A / is / a / positive-energy / movie". After segmentation, the smart device filters the segmentation result against the preset word bank to remove meaningless words from the text; after filtering, the preprocessing result "Movie A / positive-energy / movie" may be obtained.
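The filtering step above can be sketched as a simple set lookup; the token list and stop-word set are stand-ins for the preset word bank described in the text:

```python
def remove_stopwords(tokens, stopwords):
    """Filter a word-segmentation result against a stop-word set
    (standing in for the preset word bank described above)."""
    return [t for t in tokens if t not in stopwords]

# Segmentation result for "Movie A is a positive-energy movie"
tokens = ["Movie A", "is", "a", "positive-energy", "movie"]
filtered = remove_stopwords(tokens, {"is", "a", "the"})
```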
Besides preprocessing the text, after obtaining the text to be classified the smart device may determine whether it needs to be truncated. That is, as shown in FIG. 6, in some embodiments the smart device may detect the text length by traversing the number of effective characters in the text to be classified, and compare that length with a preset length threshold. If the length exceeds the threshold, the current text is a long text and must be truncated, so the text truncation routine is activated. Conversely, if the length is at most the threshold, the current text is a short text; no truncation is needed, the truncation routine is not activated, and the text is input directly into the NLP model.
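The length check just described might look like the following, assuming "effective characters" means non-whitespace characters (the text does not define the term precisely):

```python
def needs_truncation(text, max_len=512):
    """Traverse the effective (non-whitespace) characters and compare the
    count with the preset length threshold; only long texts trigger the
    truncation routine."""
    effective = sum(1 for ch in text if not ch.isspace())
    return effective > max_len
```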
After obtaining the text data to be classified, the intelligent device can extract the category information corresponding to the classification labels from the downstream task and calculate the support word score of the category corresponding to each classification label. The support word score is the inverse text frequency (IDF) value of a keyword in the text to be classified; the support words are the keywords whose IDF values are greater than a preset IDF judgment value.
For long texts, each sentence has a different degree of importance for the downstream task, and the importance of a sentence is related both to the sentence semantics and to the downstream task. For example, when the obtained text to be classified includes the sentence "movie A is a positive energy movie", the sentence obviously contributes greatly to the downstream task of "movie classification" and can help the intelligent device correctly classify movie A into the "positive energy" category. However, if the downstream task is "distinguish the country of the movie", the sentence has relatively little effect.
Therefore, in order to determine the degree of association and importance of each sentence to the downstream task, in the embodiment the text segment formed by each sentence may be scored, where a text segment with a higher score is more important to the downstream task. In order to calculate the score of a text segment, the intelligent device needs to determine the support words associated with the classification categories of the downstream task and the score of each support word.
In order to obtain the support words and the support word scores, in some embodiments, in the process of calculating the support word score of the category corresponding to each classification tag, the intelligent device may remove the noise words in the text to be classified based on the preset word bank to obtain a keyword set. For example, after obtaining the text to be classified, the intelligent device may obtain a candidate word set through word segmentation, then call the preset word bank and remove noise words, stop words (including symbols), and other words that do not help the semantics, so that the candidate word set with stop words removed forms the keyword set.
In the process of determining the keyword set, if some words in the candidate word set remaining after stop word removal occur too rarely, randomness appears in the subsequent IDF calculation, affecting the accuracy of the support word scores. Therefore, for the candidate word set with stop words removed, the intelligent device may traverse the total number of occurrences C_k of each keyword k in the keyword set in the text to be classified, and determine the keyword set based on the total number of occurrences C_k. For example, a hyperparameter, i.e. the first hyperparameter α = 100, may be preset; when the total number of occurrences C_k of keyword k is less than 100, the corresponding keyword k can be removed from the keyword set, reducing the keywords that occur too rarely and alleviating the randomness of subsequent IDF calculation results.
In the actual text classification process, the occurrence frequencies of the keywords generally satisfy an exponential distribution. That is, assuming the keyword set includes n keywords, the occurrence frequency of a keyword satisfies:

$$P(C_k) = \lambda e^{-\lambda C_k}, \quad \lambda = \frac{n}{\sum_{k=1}^{n} C_k}$$
therefore, by setting a preset low-frequency word probability θ, the condition satisfied when eliminating the low-frequency words whose occurrence probability is smaller than θ can be determined as the following relational expression:

$$P(C_k < \alpha) = 1 - e^{-\lambda\alpha} = \theta$$
that is, according to the above relation, in some embodiments, after traversing the total number of occurrences of each keyword in the keyword set in the text to be classified, the intelligent device may obtain the preset low-frequency word probability θ and calculate the first hyperparameter α according to the following formula:

$$\alpha = -\frac{\sum_{k=1}^{n} C_k}{n}\,\ln(1-\theta)$$

where α is the first hyperparameter, θ is the preset low-frequency word probability, n is the number of keywords in the keyword set, and C_k is the total number of occurrences of each keyword k.
After the first hyperparameter α is obtained through the above calculation, the intelligent device may filter the candidate word set again based on α to remove low-frequency words from the keyword set, that is, remove the keywords whose total number of occurrences is less than the first hyperparameter. For example, when the total number of occurrences C_k of keyword k is less than α, the corresponding keyword k can be removed from the keyword set, reducing the keywords that occur too rarely and alleviating the randomness of subsequent IDF calculation results.
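Under the exponential-distribution assumption above, the cutoff α can be taken as the θ-quantile of an exponential fit to the keyword counts (rate estimated as the reciprocal of the mean count), giving α = −mean(C)·ln(1−θ). The sketch below encodes that assumption; the θ value used in the example is illustrative.

```python
import math
from collections import Counter

def low_frequency_cutoff(counts, theta):
    """First hyperparameter alpha: the theta-quantile of an exponential
    distribution fitted to the counts (rate = 1 / mean count), i.e.
    alpha = -mean(C) * ln(1 - theta). This fit is an assumption."""
    mean_count = sum(counts.values()) / len(counts)
    return -mean_count * math.log(1.0 - theta)

def remove_low_frequency(keywords, counts, theta=0.2):
    """Drop keywords whose total occurrence count falls below alpha."""
    alpha = low_frequency_cutoff(counts, theta)
    return {k for k in keywords if counts[k] >= alpha}
```

With hypothetical counts `{"movie": 300, "plot": 120, "actor": 80, "rareword": 2}` and θ = 0.2, α ≈ 28, so only the rare keyword is removed.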
After determining the total number of occurrences C_k of each keyword k in the keyword set in the text to be classified, the intelligent device can also traverse the number of occurrences C_{k,j} of each keyword k in each category j, and then calculate the ratio of the number of occurrences C_{k,j} to the total number of occurrences C_k to obtain the IDF value corresponding to each keyword in the keyword set, that is:

$$IDF_{k,j} = \frac{C_{k,j}}{C_k}$$

where IDF_{k,j} is the IDF value corresponding to keyword k and category j; C_{k,j} is the number of occurrences of keyword k in category j; and C_k is the total number of occurrences of keyword k in the text to be classified.
After calculating the IDF value corresponding to a keyword, the calculated IDF value may be used as the support word score. However, in some text classification processes, if the downstream task has many categories, the calculated IDF values may be small, and the scores of different categories are not comparable. To improve this, after obtaining the IDF values, the intelligent device can filter the keyword set again according to the calculated IDF values, that is, normalize the calculated IDF data, e.g. by max normalization. However, direct normalization may produce a large number of inaccurate keywords when a certain category is too broad, leaving no proper support words for that category. For example, for a downstream task of classifying movies by content that includes a category called "scenario", direct normalization will generate a large number of inaccurate keywords because the category is broad, resulting in inaccurate text classification results. In this regard, as shown in FIG. 7, in some embodiments, the smart device may first calculate a normalization component IDF_min, which is the reciprocal of the total number of categories N, namely:

$$IDF_{min} = \frac{1}{N}$$

where IDF_min is the normalization component and N is the total number of categories of the downstream task.
After calculating the normalization component IDF_min, the smart device may set a second hyperparameter β based on IDF_min, where β is a constant greater than 0 and less than or equal to the total number of categories N, that is:

β ∈ (0, N]

After obtaining the second hyperparameter β and the normalization component IDF_min, the smart device may calculate the product of β and IDF_min to obtain the IDF judgment value IDF_H, i.e. the IDF judgment value IDF_H satisfies the following formula:

$$IDF_H = \beta \cdot IDF_{min} = \frac{\beta}{N}$$
according to the IDF judgment value obtained by calculation, the intelligent device may compare the IDF value obtained by calculation with the IDF judgment value in the above embodiment, and mark the keyword corresponding to the IDF value as the support word of the current category if the IDF value is greater than the IDF judgment value, that is, the keyword k is the support word of the category j. Similarly, if the IDF value is less than or equal to the IDF determination value, the keyword corresponding to the IDF value is marked not to be a supporting word of the current category, i.e., the keyword k is a supporting word of the category j. Through the method, the intelligent device can obtain different support words aiming at different categories, so that the IDF value corresponding to the support word in each category is determined, and the score of the support word is obtained.
After the score of the support word is obtained through calculation, the intelligent device can segment the text to be classified into a plurality of text segments. The method can be used for segmenting the text by adopting different modes according to the actual application function of the intelligent equipment and the characteristics of the processed text data. For example, text data of short text can be segmented sentence by sentence, each sentence being a text fragment. For text data of long and multiple paragraphs, segmentation can be performed segment by segment, and each paragraph is used as a text segment.
However, because these two text segmentation modes suffer respectively from overly fine-grained semantics and overly aggregated semantics, a compromise segmentation mode can be adopted. That is, in some embodiments, the smart device may segment the text to be classified into a plurality of text segments in the following manner: traverse the sentence marks in the text to be classified, where the sentence marks include punctuation, paragraph marks, space characters, and the like, and split the text to be classified one by one according to the sentence marks to obtain a sentence set. For example, the smart device may split the text data whenever it traverses to a period, question mark, exclamation mark, ellipsis, or other punctuation mark representing the end of a sentence, thereby separating the text data sentence by sentence.
Meanwhile, a third hyperparameter γ is set, the third hyperparameter γ is used for representing the number of sentences contained in each text segment, and is an integer greater than or equal to 1. The third hyperparameter can be comprehensively set according to the maximum value or the average value of all sentence lengths in the sentence set and the maximum value of the input text length of the NLP model.
For example, after traversing each sentence length D_m in the sentence set, the maximum sentence length max(D_m) is calculated; then the maximum input length D_max of the classification model corresponding to the downstream task is obtained, and the third hyperparameter γ is calculated, i.e. the third hyperparameter γ is the largest integer smaller than the ratio of the maximum input length D_max to the maximum sentence length max(D_m).
After setting the third hyperparameter, the smart device may extract text segments from the sentence set according to the third hyperparameter. For example, for a maximum input length of 512 and a maximum sentence length of 25, the smart device may first calculate 512/25 = 20.48, and then determine that the largest integer smaller than this ratio is 20, i.e. the third hyperparameter γ = 20. The smart device may then compose a text segment from every 20 sentences.
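A minimal sketch of the sentence splitting and the choice of the third hyperparameter γ; the punctuation set and the treatment of a ratio that happens to be an exact integer (step down by one, per "largest integer smaller than the ratio") are assumptions.

```python
import re

def split_sentences(text):
    """Split on sentence-ending punctuation (period, question mark,
    exclamation mark, ellipsis), dropping empty pieces."""
    return [s.strip() for s in re.split(r"[.!?…]+", text) if s.strip()]

def third_hyperparameter(sentences, max_input_len):
    """Largest integer strictly smaller than max_input_len divided by the
    maximum sentence length, floored at 1."""
    longest = max(len(s) for s in sentences)
    ratio = max_input_len / longest
    gamma = int(ratio)
    if gamma == ratio:  # the ratio is itself an integer: step down by one
        gamma -= 1
    return max(gamma, 1)

def group_sentences(sentences, gamma):
    """Compose a text segment from every gamma consecutive sentences."""
    return [sentences[i:i + gamma] for i in range(0, len(sentences), gamma)]
```

With a maximum input length of 512 and a longest sentence of 25 characters, the sketch reproduces γ = 20 as in the example above.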
In order to determine the text segments that are more important for the downstream task, for the text data segmentation result, the intelligent device can calculate a first score and a second score of each text segment, where the first score is a text segment weight calculated based on the zero-time learning model, and the second score is a weighted IDF result obtained based on the support word IDF scores.
For the first score, the intelligent device may calculate a first score for each text segment according to the segmentation result of the text data. Wherein the first score is an information entropy of a category score vector; the category score vector is a vector formed by zero-time learning model classification results of the text segments for each category.
To obtain the first score, a Zero-Shot Learning (ZSL) model may be constructed first; the zero-time learning model is a model applying the zero-shot learning method from transfer learning. In zero-shot learning, the sample labels in the training set are disjoint from the labels in the test set, i.e. the samples of the test-set classes are not seen during training, and the zero-shot learning task is to identify samples of classes not trained on. In the zero-time learning process, a semantic embedding layer is used in the zero-time learning model as a migration bridge (or intermediate representation) between the seen classes and the unseen classes, migrating knowledge from the seen classes to the unseen classes so that classification probabilities can be obtained in zero-shot fashion.
Therefore, in the embodiment, by constructing the zero-time learning model, the model can perform inference for the downstream task without being trained on it. The zero-time learning model can be pre-constructed and trained by the intelligent device, or uniformly constructed and trained by the server. Therefore, when calculating the first score, the smart device may first call the zero-time learning model from memory or request it from a server. The zero-time learning model takes the text segment as the text input and the classification labels as the class description input, and outputs a classification result score for each class; the zero-time learning model obtained after training on the upstream task training data set can output the classification result scores of the downstream task according to the input text segment and classification labels.
That is, the intelligent device can input the text segment and the category labels of the downstream task into the zero-time learning model to obtain the classification result score of each text segment for each category. For an input text segment seg_i, the zero-time learning model can output a score score_{i,j} of the text segment seg_i for each category LABEL_j.

Since the downstream task has multiple category labels, i.e. the total number of categories N > 1, for one text segment seg_i each category label yields a classification result score, and the classification result scores of all categories together form a category score vector, i.e. the classification result scores of text segment seg_i for each LABEL_j constitute the vector:

(score_{i,1}, score_{i,2}, …, score_{i,N})
generally, after the classification result scores are calculated, the degree of association of the current text segment with the downstream task category labels can be determined from them, so the score values in the category score vector can be combined to represent the importance of the current text segment to the downstream task. For example, the per-category scores in the category score vector may be summed or weighted-summed to obtain a total score; a higher total score indicates that the text segment is more important for the downstream task, so text segments of higher importance can be screened out from the plurality of text segments.
However, it should be noted that the importance of each sentence is essentially independent of the category. Therefore, after composing the score vector, the above score vector may be processed to obtain the zero-time learning score, i.e. the first score, of each segment. In order to obtain a score value independent of the category, in the embodiment the information entropy is used to describe the importance of the text segment to the downstream task. The information entropy is the mathematical expectation of a random variable defined on a given discrete probability space representing information, and describes the uncertainty of an event; entropy is a measure of the uncertainty of a random variable and is the expectation of the amount of information produced by all events that may occur. Therefore, the information entropy may be calculated from the scores of the text segment seg_i to obtain the first score, specifically:

$$S_i^{zsl} = -\sum_{j=1}^{N} score_{i,j}\,\log\left(score_{i,j}\right)$$

where S_i^{zsl} is the first score, N is the total number of categories, and score_{i,j} is the classification result score of text segment i for category j.
According to the above formula, the intelligent device can calculate the information entropy of the category score vector, thereby determining the importance of the current text segment seg_i to the downstream task. It can be seen that in the above embodiment the intelligent device applies the zero-time learning model to calculate the first score of a text segment for the downstream classification task, and the zero-shot mode can adapt to the downstream task without training. Therefore, a zero-time learning task is constructed from the text classification of the downstream task, and the information entropy is used to measure the value of the zero-shot output, thereby determining a first score that evaluates the importance of a text segment to the downstream task.
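The first score can be sketched as the information entropy of a segment's category score vector. Where the ZSL output is not already a probability distribution, the scores are normalized first; that normalization step is an assumption added for robustness.

```python
import math

def first_score(category_scores):
    """Information entropy of a segment's category score vector. The scores
    are normalized into a probability distribution first; this normalization
    is an assumption for when the ZSL output does not already sum to one."""
    total = sum(category_scores)
    probs = [s / total for s in category_scores]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform vector yields the maximum entropy ln(N), while a one-hot vector (the model is certain of one category) yields entropy 0.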
While calculating the first score, the smart device may also calculate a second score for each text segment based on the previously calculated support word IDF scores. Since in the above embodiment the intelligent device has already calculated the score of each support word in the text to be classified for each category, the calculated support word scores are still related to the category labels, i.e. each support word score is tied to a category of the downstream task. In order to obtain a score independent of the label, a final score of the text segment, called the segment score, needs to be determined from the support word scores. Thus, when calculating the second score, the smart device may first calculate a final score for each keyword from its per-category scores.
The segment score is based on the maximum or average of a support word's scores over all categories; that is, the intelligent device can determine the final score of a single keyword over all categories through max-pooling or avg-pooling. In the max-pooling approach, the maximum of the scores of keyword k over all categories is used as the final score of keyword k, namely:

$$score_k^{final} = \max_{j} IDF_{k,j}$$

In the avg-pooling approach, the average of the scores of keyword k over all categories is taken as the final score of keyword k, that is:

$$score_k^{final} = \frac{1}{N}\sum_{j=1}^{N} IDF_{k,j}$$
after the final scores of the keywords are obtained, the intelligent device can traverse the times of the keywords corresponding to each category in the text segment. The intelligent device can obtain the times of the occurrence of the keyword k in the text segment i by calling a counting function count (i, k). The same counting mode is adopted for the keywords corresponding to each category, so that the times of the keywords corresponding to each category in the text segment can be traversed.
Based on this, the smart device may calculate the second score as follows:

$$S_i^{idf} = \sum_{k} score_k^{final} \cdot count(i,k)$$

where S_i^{idf} is the second score; score_k^{final} is the final score of keyword k; and count(i, k) is the number of times keyword k appears in text segment i.
The smart device may calculate the second score of a text segment according to the above formula. Since the second score is calculated from the IDF scores of the support words in the text segment, the IDF score can be used to evaluate the importance of a support word to the class labels in the document set or corpus corresponding to the downstream task. The importance of a support word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears in the corpus. Therefore, the second score obtained from the IDF scores statistically captures the IDF value of each keyword in each category and turns it into an importance score.
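The pooling of per-category support word scores and the count-weighted second score can be sketched as follows; the example support word scores in the usage note are hypothetical.

```python
from collections import Counter

def final_keyword_score(idf_by_category, mode="max"):
    """Pool a keyword's per-category IDF scores into one label-independent
    final score via max-pooling or avg-pooling."""
    scores = list(idf_by_category.values())
    return max(scores) if mode == "max" else sum(scores) / len(scores)

def second_score(segment_tokens, support_scores, mode="max"):
    """Sum of final keyword scores weighted by count(i, k), the number of
    times each support word k appears in segment i."""
    counts = Counter(segment_tokens)
    return sum(
        final_keyword_score(idfs, mode) * counts[k]
        for k, idfs in support_scores.items()
        if k in counts
    )
```

For example, a support word "hero" with per-category scores {0.8, 0.2} appearing twice in a segment contributes 1.6 under max-pooling and 1.0 under avg-pooling.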
It should be noted that the first score and the second score calculated in the above embodiments may each be used separately to evaluate the importance of a text segment to the downstream task. For example, a higher first score (or second score) for a text segment indicates a higher degree of association between the text segment and the downstream task, making an accurate classification result easier to obtain. However, in order to obtain a more reasonable importance evaluation, after the first score and the second score are calculated, the smart device may further calculate a composite score from the first score and the second score.
Since the first score and the second score calculated in the above embodiments do not lie in the same value range, and the text classification method provided in the present application aims to compare the scores of the text segments (seg) after one text (text) is cut into a plurality of text segments, the calculated scores only need to be meaningful within one text. In general, the first scores S_i^{zsl} and the second scores S_i^{idf} of all text segments in one text both approximately follow a normal distribution, so the first score and the second score can be normalized based on the characteristics of the normal distribution; i.e. the composite score is a normalized sum of the first score and the second score.
In order to calculate the composite score, the smart device may first set a fourth hyperparameter θ, which characterizes the weight of the IDF score in the composite score and may be manually adjusted according to the actual application environment. Then, a mean function mean(x) and a standard deviation function var(x) are called to calculate the mean mean(S^{zsl}) of the first scores and the mean mean(S^{idf}) of the second scores of all text segments in the text to be classified, as well as the standard deviation var(S^{zsl}) of the first scores and the standard deviation var(S^{idf}) of the second scores. Finally, according to the fourth hyperparameter, the means, and the standard deviations, the composite score is calculated according to the following formula:

$$S_i = (1-\theta)\,\frac{S_i^{zsl} - mean(S^{zsl})}{var(S^{zsl})} + \theta\,\frac{S_i^{idf} - mean(S^{idf})}{var(S^{idf})}$$

where S_i is the composite score of text segment i; θ is the fourth hyperparameter; S_i^{zsl}, mean(S^{zsl}), and var(S^{zsl}) are the first score, its mean, and its standard deviation; and S_i^{idf}, mean(S^{idf}), and var(S^{idf}) are the second score, its mean, and its standard deviation.
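The composite score computation — z-normalizing the two score series within one text and mixing them with weight θ on the IDF side — can be sketched as follows; the default θ = 0.5 is an assumption, tunable per application.

```python
import statistics

def composite_scores(zsl_scores, idf_scores, theta=0.5):
    """Z-normalize the first and second score series within one text and
    mix them with weight theta on the IDF side; theta = 0.5 is an assumed
    default for illustration."""
    def znorm(values):
        mu, sigma = statistics.mean(values), statistics.pstdev(values)
        return [(v - mu) / sigma if sigma else 0.0 for v in values]
    z_first, z_second = znorm(zsl_scores), znorm(idf_scores)
    return [(1 - theta) * a + theta * b for a, b in zip(z_first, z_second)]
```

Guarding against a zero standard deviation (all segments scoring identically) is a defensive addition not spelled out in the description.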
Through the comprehensive score calculation mode provided in the above embodiment, the intelligent device can calculate each text segment in the text data to obtain a comprehensive score. The composite score may be used to characterize how important each text segment is to the downstream task, i.e., text segments with higher composite scores are more important to the downstream task. Based on the above, after the comprehensive score is obtained through calculation, the intelligent device can re-segment the text to be classified according to the comprehensive score so as to input the text segment with high importance degree into the training model for text classification.
For example, for a text data text to be classified, a text segment set (seg_1, seg_2, …, seg_i) can be obtained after preliminary segmentation, and correspondingly the composite score (S_1, S_2, …, S_i) of each text segment can be obtained. The intelligent device can recombine the text segments according to the composite score of each text segment, so that the combined composite score is kept at a high level on the premise that the overall length of the combined text segments is less than or equal to the maximum input length of the training model, and the text data input into the training model retains a strong association with the downstream task.
To enable text data to be input into the training model, the smart device may apply the composite score. In some embodiments, the intelligent device may sort the plurality of text segments according to the calculated composite score of each text segment, and truncate the text data sequentially from front to back as required by the downstream task.
Since the purpose of segmenting the text to be classified is to re-segment the text data, and since a segmentation result of higher importance for the downstream task is wanted, it is desirable to maximize the score per unit length of the retained sentences; how to re-segment the text data is then a typical 0/1 knapsack problem. To solve this 0/1 knapsack problem and re-segment the text data, in some embodiments the smart device may first define a length matrix, a score matrix, the maximum length, and the number of candidate segments; then define a dp matrix of size (num+1, weight_most+1) and a record list; and then traverse, with the candidate segment index x running over the num+1 values and the length budget y running over the weight_most+1 values. If the length of the current segment is less than or equal to y, let dp[x][y] = max(dp[x-1][y - length[x]] + score[x], dp[x-1][y]), appending x to record[x][y] when the first term wins; otherwise let dp[x][y] = dp[x-1][y]. Finally, record[-1][-1] is the final result.
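The 0/1 knapsack formulation above can be sketched with a standard dynamic program that maximizes total score under a total length budget; the backtracking step replaces the record list of the description with an equivalent recovery of the chosen indices.

```python
def select_segments(lengths, scores, max_length):
    """0/1 knapsack over text segments: choose the subset maximizing total
    score with total length <= max_length; returns chosen indices."""
    n = len(lengths)
    # dp[x][y] = best score using the first x segments within length y.
    dp = [[0.0] * (max_length + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        w, v = lengths[x - 1], scores[x - 1]
        for y in range(max_length + 1):
            dp[x][y] = dp[x - 1][y]
            if w <= y and dp[x - 1][y - w] + v > dp[x][y]:
                dp[x][y] = dp[x - 1][y - w] + v
    # Backtrack to recover which segments were taken.
    chosen, y = [], max_length
    for x in range(n, 0, -1):
        if dp[x][y] != dp[x - 1][y]:
            chosen.append(x - 1)
            y -= lengths[x - 1]
    return sorted(chosen)
```

With segment lengths (3, 4, 5), scores (4, 5, 6), and a budget of 7, the optimum keeps the first two segments (total score 9) rather than the single highest-scoring one.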
Since the neural network model has a certain sensitivity to the relative order of sentences, i.e. in some scenarios (e.g. rigorous reasoning) the relative order of sentences has a significant semantic impact, the order of the segments needs to be maintained as much as possible in these scenarios. Accordingly, in some embodiments, in order for the text data to be correctly input into the training model, when re-segmenting the text to be classified according to the composite score, the intelligent device may first traverse the composite score S_i and the length ls_i of each text segment in the text to be classified. It may then sort the text segments by composite score and acquire the input length limit of the training model, thereby extracting at least one target text segment from the sorted text segments according to the segment lengths and the length limit; obviously, the total length Σ ls_i of the extracted target text segments should be less than or equal to the length limit. Finally, the target text segments are input into the training model.
For example, after obtaining the composite score of each text segment, the smart device may sort the text segments by composite score to obtain a segment set, then traverse the segment set with a loop: if the current length sum plus the current segment length does not exceed the limit, the current segment is added and the length sum is updated accordingly; if the length sum would exceed the maximum input length of the training model, the traversal exits. The selected segments are then sorted according to their order in the original text and spliced into the target text segments.
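The greedy loop above — take segments in descending composite score until the input limit would be exceeded, then restore original order before splicing — can be sketched as follows; joining the kept segments with spaces is a simplification.

```python
def truncate_by_score(segments, scores, max_length):
    """Take segments in descending composite score until the next one would
    exceed the model's input limit, then restore original order before
    splicing. Space-joining the result is an illustrative simplification."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    chosen, total = [], 0
    for i in order:
        if total + len(segments[i]) > max_length:
            break  # exit the traversal once the limit would be exceeded
        chosen.append(i)
        total += len(segments[i])
    return " ".join(segments[i] for i in sorted(chosen))
```

Unlike the knapsack variant, this greedy pass is cheaper but may leave budget unused; it does, however, preserve the relative sentence order that the neural network model is sensitive to.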
Based on the text classification method, as shown in fig. 5, in some embodiments of the present application, an intelligent device is further provided, where the intelligent device includes: a storage module and a processing module, wherein the storage module is configured to store a natural language processing model and a zero-time learning model; as shown in fig. 8, the processing module is configured to perform the following program steps:
acquiring a text to be classified;
calculating the support word score of the category corresponding to each classification label, wherein the support word score is the inverse text frequency IDF value of a keyword in the text to be classified; the support words are the keywords whose IDF values are greater than a preset IDF judgment value;
segmenting the text to be classified into a plurality of text segments;
calculating a first score of each text segment, wherein the first score is the information entropy of a category score vector; the category score vector is a vector formed by zero-time learning model classification results of the text segment for each category;
calculating a second score of each text segment, wherein the second score is obtained by calculation according to the scores of the support words in the text segments;
calculating a composite score, which is a normalized sum of the first score and the second score;
and re-segmenting the text to be classified according to the comprehensive score, and inputting a re-segmentation result into a natural language processing model.
According to the above technical scheme, the intelligent device includes a storage module and a processing module. After obtaining an ultra-long text to be classified, the processing module can calculate the support word scores and then segment the text to be classified into a plurality of text segments. It calculates the first score and the second score of each text segment to obtain a composite score for each, re-segments the text to be classified according to the composite score to obtain short text data, and finally inputs the short text data into the natural language processing model for text classification. The intelligent device calculates the composite score of a text segment through the two modes of zero-time learning and support word scoring, determines the importance of each text segment, preserves the model effect as much as possible while ensuring performance, and reduces semantic loss.
The embodiments provided in the present application are only a few examples of the general concept of the present application, and do not limit the scope of the present application. Any other embodiments extended according to the scheme of the present application without inventive efforts will be within the scope of protection of the present application for a person skilled in the art.

Claims (10)

1. A method of text classification, comprising:
acquiring a text to be classified;
calculating a support-word score for the category corresponding to each classification label, wherein the support-word score is the inverse document frequency (IDF) value of a keyword in the text to be classified, and the support words are keywords whose IDF values are greater than a preset IDF judgment value;
segmenting the text to be classified into a plurality of text segments;
calculating a first score for each text segment, wherein the first score is the information entropy of a category score vector, and the category score vector is formed from the zero-shot learning model's classification result scores of the text segment for each category;
calculating a second score for each text segment, wherein the second score is calculated from the scores of the support words appearing in the text segment;
calculating a composite score, which is a normalized sum of the first score and the second score;
and re-segmenting the text to be classified according to the composite score, and inputting the re-segmentation result into a natural language processing model.
2. The method of claim 1, wherein the step of calculating the support-word score for the category corresponding to each classification label comprises:
based on a preset word bank, removing noise words from the text to be classified to obtain a keyword set;
counting the total number of occurrences of each keyword in the keyword set in the text to be classified;
counting the number of occurrences of each keyword in the keyword set in each category;
and calculating the ratio of the per-category occurrence count to the total occurrence count to obtain the IDF value.
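As a rough illustration of the support-word scoring in claim 2, the sketch below computes the claimed ratio-style IDF value (per-category count divided by total count). The whitespace tokenization and the source of per-category counts (labeled reference texts per category) are simplifying assumptions, not part of the claim.

```python
from collections import Counter

def support_word_idf(texts_by_category, noise_words):
    """Per-category IDF value of each keyword as defined in claim 2:
    the ratio of the keyword's occurrence count within one category to
    its total occurrence count (not the classic log-scaled IDF)."""
    per_category = {}
    total_counts = Counter()
    for category, texts in texts_by_category.items():
        counts = Counter()
        for text in texts:
            # Noise-word removal stands in for the preset word bank.
            counts.update(w for w in text.split() if w not in noise_words)
        per_category[category] = counts
        total_counts.update(counts)
    return {category: {w: counts[w] / total_counts[w] for w in counts}
            for category, counts in per_category.items()}
```

A keyword concentrated in one category (ratio near 1) is a strong candidate support word for that category.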
3. The method of claim 2, wherein after the step of traversing the total number of occurrences of each keyword in the set of keywords in the text to be classified, the method further comprises:
acquiring a preset low-frequency word probability;
calculating a first hyperparameter, wherein the first hyperparameter is used to identify low-frequency words among the keywords;
and removing low-frequency words from the keyword set, wherein the low-frequency words are keywords whose total occurrence count is less than the first hyperparameter.
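The low-frequency filtering of claim 3 might look like the sketch below. The claim does not spell out how the first hyperparameter is derived from the preset low-frequency probability; deriving it as that probability times the total token count is an assumption for illustration only.

```python
def drop_low_frequency(total_counts, low_freq_probability):
    """Claim 3 (sketch): remove keywords whose total occurrence count
    falls below the first hyperparameter.  The derivation of that
    hyperparameter below is hypothetical."""
    first_hyperparameter = low_freq_probability * sum(total_counts.values())
    return {w: c for w, c in total_counts.items() if c >= first_hyperparameter}
```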
4. The text classification method according to claim 2, wherein after the step of obtaining the IDF value, the method further comprises:
calculating a normalized component, wherein the normalized component is the reciprocal of the total number of categories;
setting a second hyperparameter, wherein the second hyperparameter is a constant greater than 0 and less than or equal to the total number of categories;
calculating the product of the second hyperparameter and the normalized component to obtain the IDF judgment value;
if the IDF value is greater than the IDF judgment value, marking the corresponding keyword as a support word of the current category;
and if the IDF value is less than or equal to the IDF judgment value, marking the corresponding keyword as not a support word of the current category.
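Claim 4's judgment rule is self-contained enough to sketch directly: the judgment value is the second hyperparameter (between 0 and the number of categories N) times the normalized component 1/N, and a keyword is a support word only when its IDF value exceeds it.

```python
def mark_support_words(idf_values, total_categories, second_hyperparameter):
    """Claim 4: mark keywords whose IDF value exceeds the judgment
    value beta * (1 / N), where beta is the second hyperparameter."""
    normalized_component = 1.0 / total_categories
    judgment_value = second_hyperparameter * normalized_component
    return {w: idf > judgment_value for w, idf in idf_values.items()}
```

With beta = 1 the threshold equals the uniform share 1/N, so only keywords over-represented in the current category qualify.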
5. The text classification method of claim 1, wherein the step of segmenting the text to be classified into a plurality of text segments comprises:
traversing sentence marks in the text to be classified, wherein the sentence marks comprise punctuation marks, paragraph marks and space characters;
splitting the text to be classified at each sentence mark to obtain a sentence set;
setting a third hyper-parameter, wherein the third hyper-parameter is used for representing the number of sentences contained in each text segment;
and extracting text segments from the sentence set according to the third hyperparameter.
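The segmentation of claim 5 can be sketched as follows; restricting the sentence marks to common end-of-sentence punctuation and newlines is a simplification of the claim's fuller list.

```python
import re

def split_into_segments(text, third_hyperparameter):
    """Claim 5 (sketch): split the text at sentence marks, then group
    every `third_hyperparameter` consecutive sentences into one text
    segment."""
    sentences = [s.strip() for s in re.split(r"[.!?\n]+", text) if s.strip()]
    return [" ".join(sentences[i:i + third_hyperparameter])
            for i in range(0, len(sentences), third_hyperparameter)]
```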
6. The text classification method of claim 1, wherein the step of calculating a first score for each of the text segments comprises:
acquiring a zero-shot learning model, wherein the zero-shot learning model takes a text segment as the text input and the classification labels as the category description input, and outputs a classification result score for each category;
inputting the text segments into the zero-shot learning model to obtain the classification result score of each text segment for each category;
combining the classification result scores for each category to form the category score vector;
calculating the information entropy of the category score vector to obtain the first score according to the following formula:

$E_i = -\sum_{j=1}^{N} \mathrm{score}_{i,j} \log \mathrm{score}_{i,j}$

where $E_i$ is the first score, $N$ is the total number of categories, and $\mathrm{score}_{i,j}$ is the classification result score of text segment $i$ for category $j$.
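The entropy-based first score of claim 6 is sketched below. Normalizing the zero-shot scores into a probability distribution before taking the entropy is an assumption; the claim only states that the entropy of the category score vector is computed.

```python
import math

def first_score(category_score_vector):
    """Claim 6 (sketch): information entropy of the category score
    vector produced by the zero-shot model for one text segment."""
    total = sum(category_score_vector)
    probabilities = [s / total for s in category_score_vector]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)
```

A near-uniform score vector (the model is unsure) yields high entropy, while a confident, peaked vector yields entropy near zero.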
7. The text classification method of claim 1, wherein the step of calculating a second score for each of the text segments comprises:
calculating a final score for each keyword according to its support-word scores, wherein the final score of a keyword is the maximum or the average of its support-word scores across all categories;
counting the number of occurrences of each such keyword in the text segment;
calculating the second score according to the following formula:

$T_i = \sum_{k} w_k \cdot \mathrm{count}(i, k)$

where $T_i$ is the second score, $w_k$ is the final score of keyword $k$, and $\mathrm{count}(i, k)$ is the number of occurrences of keyword $k$ in text segment $i$.
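Claim 7's support-word score for a segment can be sketched as a weighted occurrence count; representing the segment as a pre-tokenized word list is an assumption.

```python
def second_score(segment_words, final_keyword_scores):
    """Claim 7 (sketch): sum, over the support keywords, of the
    keyword's final score times its occurrence count in the segment."""
    return sum(score * segment_words.count(keyword)
               for keyword, score in final_keyword_scores.items())
```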
8. The text classification method of claim 1, wherein the step of calculating a composite score comprises:
setting a fourth hyperparameter, wherein the fourth hyperparameter represents the weight of the IDF-based score in the composite score;
calculating the mean value of the first scores and the mean value of the second scores of all text segments in the text to be classified;
calculating the standard deviation of the first scores and the standard deviation of the second scores of all text segments in the text to be classified;
calculating the composite score from the fourth hyperparameter, the means and the standard deviations according to the following formula:

$S_i = \theta \cdot \dfrac{T_i - \bar{T}}{\sigma_T} + (1 - \theta) \cdot \dfrac{E_i - \bar{E}}{\sigma_E}$

where $S_i$ is the composite score of text segment $i$; $\theta$ is the fourth hyperparameter; $E_i$, $\bar{E}$ and $\sigma_E$ are the first score, its mean and its standard deviation; and $T_i$, $\bar{T}$ and $\sigma_T$ are the second score, its mean and its standard deviation.
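The normalization of claim 8 amounts to a z-score of each score across all segments, combined as a weighted sum. Weighting the IDF-based second score by theta and the entropy score by (1 - theta) follows from theta being "the weight of the IDF value in the composite score", but the exact combination is a reconstruction, not quoted from the claim.

```python
import statistics

def composite_scores(first_scores, second_scores, theta):
    """Claim 8 (sketch): z-normalize both score lists across all text
    segments, then take the theta-weighted sum per segment."""
    e_mean = statistics.mean(first_scores)
    e_std = statistics.pstdev(first_scores)
    t_mean = statistics.mean(second_scores)
    t_std = statistics.pstdev(second_scores)
    return [theta * (t - t_mean) / t_std + (1 - theta) * (e - e_mean) / e_std
            for e, t in zip(first_scores, second_scores)]
```

Population standard deviation (`pstdev`) is used here; whether the patent uses the sample or population form is not stated.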
9. The method of claim 1, wherein the step of re-segmenting the text to be classified according to the composite score and inputting the re-segmentation result into a natural language processing model comprises:
traversing the composite score and the length of each text segment in the text to be classified;
sorting the text segments according to the composite score;
acquiring the maximum input length of the natural language processing model;
extracting at least one target text segment from the sorted text segments according to the segment lengths and the maximum input length, wherein the total length of the extracted target text segments is less than or equal to the maximum input length;
inputting the target text segment into the natural language processing model.
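Claim 9's selection step can be sketched as a greedy fill of the model's input budget; taking segments greedily by score and restoring the original document order before concatenation are assumptions beyond the claim text.

```python
def select_segments(segments, composite_scores, max_input_length):
    """Claim 9 (sketch): rank segments by composite score, keep the
    highest-scoring ones whose combined length fits the model's
    maximum input length, then re-join them in original order."""
    ranked = sorted(range(len(segments)),
                    key=lambda i: composite_scores[i], reverse=True)
    chosen, used_length = [], 0
    for i in ranked:
        if used_length + len(segments[i]) <= max_input_length:
            chosen.append(i)
            used_length += len(segments[i])
    return "".join(segments[i] for i in sorted(chosen))
```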
10. An intelligent device, comprising:
a storage module configured to store a natural language processing model and a zero-shot learning model;
a processing module configured to:
acquiring a text to be classified;
calculating a support-word score for the category corresponding to each classification label, wherein the support-word score is the inverse document frequency (IDF) value of a keyword in the text to be classified, and the support words are keywords whose IDF values are greater than a preset IDF judgment value;
segmenting the text to be classified into a plurality of text segments;
calculating a first score for each text segment, wherein the first score is the information entropy of a category score vector, and the category score vector is formed from the zero-shot learning model's classification result scores of the text segment for each category;
calculating a second score for each text segment, wherein the second score is calculated from the scores of the support words appearing in the text segment;
calculating a composite score, which is a normalized sum of the first score and the second score;
and re-segmenting the text to be classified according to the composite score, and inputting the re-segmentation result into a natural language processing model.
CN202210080130.5A 2022-01-24 2022-01-24 Text classification method and intelligent device Pending CN114491034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210080130.5A CN114491034A (en) 2022-01-24 2022-01-24 Text classification method and intelligent device


Publications (1)

Publication Number Publication Date
CN114491034A true CN114491034A (en) 2022-05-13

Family

ID=81474420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210080130.5A Pending CN114491034A (en) 2022-01-24 2022-01-24 Text classification method and intelligent device

Country Status (1)

Country Link
CN (1) CN114491034A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200152183A1 (en) * 2017-07-19 2020-05-14 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing a conversation message
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN110309875A (en) * 2019-06-28 2019-10-08 哈尔滨工程大学 A kind of zero sample object classification method based on the synthesis of pseudo- sample characteristics
CN112989839A (en) * 2019-12-18 2021-06-18 中国科学院声学研究所 Keyword feature-based intent recognition method and system embedded in language model
WO2021159877A1 (en) * 2020-02-14 2021-08-19 华为技术有限公司 Question answering method and apparatus
CN111753088A (en) * 2020-06-28 2020-10-09 汪秀英 Method for processing natural language information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tao Wang et al., "On entropy-based term weighting schemes for text categorization", Knowledge and Information Systems, 7 July 2021, pages 2313-2345 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906170A (en) * 2022-12-02 2023-04-04 杨磊 Safety protection method and AI system applied to storage cluster
CN115906170B (en) * 2022-12-02 2023-12-15 北京金安道大数据科技有限公司 Security protection method and AI system applied to storage cluster
CN117077678A (en) * 2023-10-13 2023-11-17 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium
CN117077678B (en) * 2023-10-13 2023-12-29 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination