CN117473088A - Text classification method, text classification model training method, device and equipment

Text classification method, text classification model training method, device and equipment

Info

Publication number
CN117473088A
CN117473088A (application CN202311490092.1A)
Authority
CN
China
Prior art keywords
text
training
segment
audio
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311490092.1A
Other languages
Chinese (zh)
Inventor
薛振宇
聂文俊
韩瑞
李文深
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202311490092.1A priority Critical patent/CN117473088A/en
Publication of CN117473088A publication Critical patent/CN117473088A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text classification method, a text classification apparatus, an electronic device, and a storage medium, applicable to the fields of artificial intelligence and financial technology. The method comprises: performing feature extraction on a text segment of a text to be classified to obtain a text word feature vector and a text number feature vector; obtaining a text segment feature vector of the text segment from the text word feature vector and the text number feature vector; performing feature extraction on an audio segment of the audio corresponding to the text to be classified to obtain an audio segment feature vector, where the audio segment corresponds to the text segment; and obtaining category information of the text to be classified from the text segment feature vector and the audio segment feature vector.

Description

Text classification method, text classification model training method, device and equipment
Technical Field
The present disclosure relates to the fields of artificial intelligence and financial technology, and in particular to a text classification method, a text classification model training method, an apparatus, a device, a medium, and a program product.
Background
Text classification is the process of automatically assigning text to predefined categories based on its content. It is a basic natural language processing task and can also provide a foundation for more complex language understanding tasks.
In the course of implementing the inventive concept of the present disclosure, the inventors found that related-art approaches classify text with low accuracy.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a text classification method, a text classification model training method, an apparatus, a device, a medium, and a program product.
According to a first aspect of the present disclosure, there is provided a text classification method, comprising: performing feature extraction on a text segment of a text to be classified to obtain a text word feature vector and a text number feature vector; obtaining a text segment feature vector of the text segment from the text word feature vector and the text number feature vector; performing feature extraction on an audio segment of the audio corresponding to the text to be classified to obtain an audio segment feature vector, where the audio segment corresponds to the text segment; and obtaining category information of the text to be classified from the text segment feature vector and the audio segment feature vector.
According to an embodiment of the present disclosure, performing feature extraction on the audio segment of the audio corresponding to the text to be classified to obtain the audio segment feature vector includes: extracting features of the audio segment according to different feature types to obtain audio segment feature vectors of multiple types; and initializing the audio segment feature vectors of the multiple types to obtain audio segment feature vectors of multiple dimensions, where the multiple dimensions correspond one-to-one to the multiple types.
According to an embodiment of the present disclosure, the text classification method further includes: processing the text to be classified and the audio corresponding to the text to be classified with a text processing tool to obtain a text segment and an audio segment corresponding to the text segment.
According to a second aspect of the present disclosure, there is provided a text classification model training method, comprising: performing feature extraction on a training text segment of a training text to obtain a training text word feature vector and a training text number feature vector; obtaining a training text segment feature vector of the training text segment from the training text word feature vector and the training text number feature vector; performing feature extraction on a training audio segment of the training audio corresponding to the training text to obtain a training audio segment feature vector, where the training audio segment corresponds to the training text segment; and training a first model with the training text segment feature vector and the training audio segment feature vector to obtain a first target model, where the first target model is used to determine category information of a text to be classified.
According to an embodiment of the present disclosure, training the first model with the training text segment feature vector and the training audio segment feature vector to obtain the first target model includes: obtaining a target training feature vector from the training text segment feature vector, the training audio segment feature vector, and a noise vector; and inputting the target training feature vector into the first model and training the first model with a target gradient descent algorithm to obtain the first target model, where the target gradient descent algorithm is obtained by applying gradient clipping to an original gradient descent algorithm.
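A minimal Python sketch of this "gradient descent processed with gradient clipping" idea is shown below, in PyTorch. The learning rate, clipping norm, and plain-SGD update rule are illustrative assumptions; the disclosure does not specify them.

```python
import torch

def clipped_sgd_step(model: torch.nn.Module, loss: torch.Tensor,
                     lr: float = 1e-3, max_norm: float = 1.0) -> None:
    """One step of a gradient descent algorithm processed with gradient clipping."""
    model.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the parameter update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
```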
According to an embodiment of the present disclosure, obtaining the target training feature vector from the training text segment feature vector, the training audio segment feature vector, and the noise vector includes: splicing the training text segment feature vector and the training audio segment feature vector to obtain a training segment spliced vector; obtaining a segment position vector of the training text segment from the position of the training text segment in the training text; obtaining a target training segment spliced vector from the segment position vector and the training segment spliced vector; and adding the noise vector to the target training segment spliced vector to obtain the target training feature vector.
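The following PyTorch sketch illustrates this splice-position-noise construction. The vector dimensions, the learned position-embedding table, and the Gaussian noise are illustrative assumptions; the disclosure only states that a segment position vector and a noise vector are combined with the spliced vector.

```python
import torch
import torch.nn as nn

TEXT_DIM, AUDIO_DIM, MAX_SEGMENTS = 300, 300, 512
# One learnable position vector per segment position in the training text.
position_table = nn.Embedding(MAX_SEGMENTS, TEXT_DIM + AUDIO_DIM)

def target_training_vector(text_seg: torch.Tensor, audio_seg: torch.Tensor,
                           position: int, noise_std: float = 0.1) -> torch.Tensor:
    spliced = torch.cat([text_seg, audio_seg], dim=-1)      # training segment spliced vector
    pos_vec = position_table(torch.tensor(position))        # segment position vector
    target_spliced = spliced + pos_vec                      # target training segment spliced vector
    noise = torch.randn_like(target_spliced) * noise_std    # noise vector
    return target_spliced + noise                           # target training feature vector
```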
According to an embodiment of the present disclosure, adding the noise vector to the target training segment spliced vector to obtain the target training feature vector includes: adding the noise vector to the target training segment spliced vector to obtain a target segment noise vector; and encoding the target segment noise vector according to its segment structure to obtain the target training feature vector.
According to the embodiment of the disclosure, there are multiple target training feature vectors and multiple first models, the first models correspond one-to-one to the target training feature vectors, and the target training feature vectors come from different training servers. In this case, inputting the target training feature vectors into the first models and training with the target gradient descent algorithm to obtain the first target model includes repeating the following operations while the model parameters of the first model do not satisfy a preset condition: determining multiple target training servers from the training servers; invoking the target training servers to train their corresponding first models with their respective target training feature vectors, obtaining the model parameters of each first model; and determining new model parameters from those model parameters. Once the model parameters satisfy the preset condition, the first target model is obtained from the first model and the model parameters that satisfy the condition.
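A minimal sketch of one such multi-server round, in the style of federated averaging, is given below. The per-server training function, the deep copy of the shared model, and plain parameter averaging are illustrative assumptions; the disclosure only states that new model parameters are determined from the parameters returned by the target training servers.

```python
import copy
import torch

def federated_round(global_model, target_server_loaders, local_train_fn):
    """Train one first-model copy per target training server, then average
    the resulting model parameters into new model parameters."""
    local_states = []
    for loader in target_server_loaders:           # data held by each target training server
        local_model = copy.deepcopy(global_model)
        local_train_fn(local_model, loader)        # e.g. several clipped-SGD steps
        local_states.append(local_model.state_dict())
    averaged = {name: torch.stack([s[name].float() for s in local_states]).mean(dim=0)
                for name in local_states[0]}
    global_model.load_state_dict(averaged)         # the new model parameters
    return global_model
```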
According to the embodiment of the disclosure, the first model includes a target word-level encoding layer. The target word-level encoding layer is obtained by training an intermediate word-level encoding layer with a text number comparison task, and the intermediate word-level encoding layer is obtained by training an initial word-level encoding layer with a text number classification task, where the text number classification task is the task of classifying the numbers in a text, and the text number comparison task is the task of comparing the values of numbers in the text that belong to the same class. Performing feature extraction on the training text segment of the training text to obtain the training text word feature vector and the training text number feature vector then includes: extracting features of the training text segment with the target word-level encoding layer to obtain the training text word feature vector and the training text number feature vector.
A third aspect of the present disclosure provides a text classification apparatus, comprising: a first extraction module for performing feature extraction on a text segment of a text to be classified to obtain a text word feature vector and a text number feature vector; a first acquisition module for obtaining a text segment feature vector of the text segment from the text word feature vector and the text number feature vector; a second extraction module for performing feature extraction on an audio segment of the audio corresponding to the text to be classified to obtain an audio segment feature vector, where the audio segment corresponds to the text segment; and a second acquisition module for obtaining category information of the text to be classified from the text segment feature vector and the audio segment feature vector.
A fourth aspect of the present disclosure provides a text classification model training apparatus, comprising: a third extraction module for performing feature extraction on a training text segment of a training text to obtain a training text word feature vector and a training text number feature vector; a third acquisition module for obtaining a training text segment feature vector of the training text segment from the training text word feature vector and the training text number feature vector; a fourth extraction module for performing feature extraction on a training audio segment of the training audio corresponding to the training text to obtain a training audio segment feature vector, where the training audio segment corresponds to the training text segment; and a training module for training a first model with the training text segment feature vector and the training audio segment feature vector to obtain a first target model, where the first target model is used to determine category information of a text to be classified.
A fifth aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method described above.
A sixth aspect of the present disclosure also provides a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the above-described method.
A seventh aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the text classification method, text classification model training method, apparatus, device, medium, and program product of the present disclosure, feature extraction over the text yields both a text word feature vector for the words in the text and a text number feature vector for the numbers in the text, so that the text segment feature vector fully exploits the numeric features as well as the word features. On this basis, classification accuracy improves for texts that contain many numbers. In addition, an audio segment feature vector is extracted from the audio corresponding to the text, so that classification uses the text segment feature vector and the audio segment feature vector together rather than the text segment feature vector alone, further improving accuracy.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a text classification method or a text classification model training method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a text classification method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of a text classification model training method according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a schematic diagram of a training word level encoding layer according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates the structure of a first target model according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a text classification device according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a block diagram of a text classification model training apparatus according to an embodiment of the disclosure; and
FIG. 8 schematically illustrates a block diagram of an electronic device adapted to implement a text classification method or a text classification model training method in accordance with an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B, and C" is used, it should generally be interpreted as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together).
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the data involved (including but not limited to users' personal information) comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
According to an embodiment of the present disclosure, among text classification methods based on traditional machine learning, a naive Bayes classifier may be used to classify financial text; a Bayes classifier may also be combined with decision trees to form a multi-model hybrid text classification method applicable to different text structures.
According to an embodiment of the present disclosure, among text classification methods based on deep learning, text may be classified with the TextCNN algorithm, a text classification algorithm based on a CNN (convolutional neural network) that uses convolution to capture local features of the sentences in a text and extract key information from them. A BiLSTM, an attention mechanism, and convolutional layers may also be fused into one neural network, so that the network captures both the local features of the phrases in a segment and the contextual semantic information of the segment.
However, these text classification methods mostly rely on word features extracted from text data alone and rarely use audio data such as conference calls and recordings. Moreover, they usually focus only on the importance of the words in a text, while texts in some fields, such as finance and chemistry, contain many numbers that are just as important. For texts in such fields, fully exploiting the meaning and structure of the numbers can therefore improve classification accuracy.
In view of this, an embodiment of the present disclosure provides a text classification method, including: performing feature extraction on a text segment of a text to be classified to obtain a text word feature vector and a text number feature vector; obtaining a text segment feature vector of the text segment from the text word feature vector and the text number feature vector; performing feature extraction on an audio segment of the audio corresponding to the text to be classified to obtain an audio segment feature vector, where the audio segment corresponds to the text segment; and obtaining category information of the text to be classified from the text segment feature vector and the audio segment feature vector.
Fig. 1 schematically illustrates an application scenario diagram of a text classification method or a text classification model training method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, and the third terminal device 103, to receive or send messages, etc. Various communication client applications may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the text classification method or the text classification model training method provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the text classification device or the text classification model training device provided by the embodiments of the present disclosure may be generally provided in the server 105. The text classification method or the text classification model training method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the text classification device or the text classification model training device provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The text classification method and the text classification model training method of the disclosed embodiments will be described in detail below with reference to FIGS. 2 to 5, in the context of the scenario described in FIG. 1.
Fig. 2 schematically illustrates a flow chart of a text classification method according to an embodiment of the disclosure.
As shown in fig. 2, the text classification method of this embodiment includes operations S210 to S240.
In operation S210, feature extraction is performed on the text segments in the text to be classified, so as to obtain text word feature vectors and text digit feature vectors.
In operation S220, a text segment feature vector of the text segment is obtained according to the text word feature vector and the text digit feature vector.
In operation S230, feature extraction is performed on the audio segments in the audio corresponding to the text to be classified, so as to obtain audio segment feature vectors, where the audio segments and the text segments correspond to each other.
In operation S240, category information of the text to be classified is obtained according to the text segment feature vector and the audio segment feature vector.
According to embodiments of the present disclosure, the category information of the text to be classified may include transaction categories, market categories, institution categories, and the like. The text to be classified may include industry dynamics, industry commentary, expert interpretations of quotations, industry news, institution profiles, transaction information, transaction assessments, transaction analyses, transaction knowledge, deposit rates, and the like. Such texts therefore typically contain many numbers.
According to the embodiment of the disclosure, the text segment may be obtained by splitting the text to be classified, or may be selected from it, but is not limited thereto. The text segment may be a sentence or a paragraph of the text to be classified; more generally, the segment may be selected, or the text split, in whatever way the application requires.
According to embodiments of the present disclosure, the text numbers may include time, currency, percentages, and the like. The text words may include words other than text digits in the text passage, but are not limited thereto.
According to an embodiment of the present disclosure, the text word feature vector may be a feature vector obtained by extracting features of text words in text to be classified.
According to an embodiment of the present disclosure, the text number feature vector may be a feature vector obtained by feature extraction of text numbers in the text to be classified.
According to an embodiment of the present disclosure, the text segment feature vector may be a vector for characterizing features of the above-described text segment to be classified.
According to an embodiment of the present disclosure, the audio corresponding to the text to be classified may be audio having the same meaning as the text to be classified. For example, the text to be classified may read: "By the end of October 2022, the institution's balance in the market was 3.5 trillion yuan, accounting for 2.4% of the market balance." The corresponding audio may say: "The institution's balance accounts for 2.4% of the market balance; the balance is 3.5 trillion yuan; the data is as of the end of October 2022." It may also say exactly the same sentence as the text, but is not limited thereto.
According to the embodiment of the present disclosure, the audio clip may be divided from audio or may be determined from audio, but is not limited thereto. The audio clip may be a sentence in audio or a paragraph in audio, but is not limited thereto, and the audio clip may be determined from audio according to requirements, or the audio may be divided according to requirements to obtain the audio clip.
According to an embodiment of the present disclosure, the audio segment feature vector may be a feature vector obtained by performing feature extraction on the audio segment.
According to an embodiment of the present disclosure, the audio segment corresponds to the text segment. For example, when the text segment is "the institution's balance in the market is 3.5 trillion yuan", the audio segment may correspond to "the institution's balance is 3.5 trillion yuan".
According to embodiments of the present disclosure, a text segment including text words and text numbers may be feature extracted using a trained BiLSTM (Bi-directional Long Short-Term Memory) neural network to obtain text word feature vectors and text number feature vectors. By using the trained BiLSTM, the text segment can be subjected to feature extraction according to semantics, so that the accuracy of the extracted feature vector is improved.
According to embodiments of the present disclosure, the text segment feature vector may be determined as the average of the text word feature vectors and the text number feature vectors. A text segment can contain several text words and several text numbers, so feature extraction over the segment yields several text word feature vectors and several text number feature vectors, and their average is used as the text segment feature vector. For example, if the sum of 7 text word feature vectors and 3 text number feature vectors is 2222220, the text segment feature vector is 2222220 / 10 = 222222.
According to the embodiment of the disclosure, the text segment may also be feature-extracted with a trained BERT (Bidirectional Encoder Representations from Transformers) model, a language representation model, to obtain the text word feature vector and the text number feature vector. The trained BERT model may be obtained by training with text number processing tasks; such training improves its handling of numbers, so the accuracy of the text number feature vector can be higher than with other models. An original text segment feature vector is then determined as the average of the text word feature vector and the text number feature vector, and the trained BiLSTM neural network performs feature extraction on the original text segment feature vector to obtain the text segment feature vector. The text number processing tasks may include a text number comparison task and a text number classification task, where the text number classification task is the task of classifying the numbers in a text, and the text number comparison task is the task of comparing the values of numbers in the text that belong to the same class.
According to the embodiment of the disclosure, the trained BERT model can be used for extracting the characteristics of the text segment to obtain text word characteristic vectors and text digital characteristic vectors; then, feature extraction is carried out on the text word feature vector and the text digital feature vector by utilizing the trained BiLSTM neural network, and a target text word feature vector and a target text digital feature vector are obtained; and determining the text segment feature vector according to the average value of the target text word feature vector and the target text digital feature vector.
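Either ordering of these steps can be sketched in a few lines. The following minimal PyTorch/transformers sketch follows the first variant (BERT token vectors, then a BiLSTM, then mean pooling); the model name, the 300-dimensional BiLSTM output, and pooling over all token vectors are illustrative assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
# Bidirectional LSTM: 2 x 150 hidden units gives a 300-dim output per token.
bilstm = nn.LSTM(input_size=768, hidden_size=150, bidirectional=True, batch_first=True)

def segment_vector(text: str) -> torch.Tensor:
    """Encode one text segment into a single 300-dim text segment feature vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vecs = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
        contextual, _ = bilstm(token_vecs)             # (1, seq_len, 300)
    # Average the word and number token vectors into one segment vector.
    return contextual.mean(dim=1).squeeze(0)           # (300,)

vec = segment_vector("By the end of October 2022 the balance was 3.5 trillion yuan.")
print(vec.shape)  # torch.Size([300])
```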
According to embodiments of the present disclosure, feature extraction may be performed on audio segments using a trained BiLSTM neural network to obtain audio segment feature vectors. By using trained BiLSTM, feature extraction can be performed on audio segments according to semantics, thereby improving the accuracy of the extracted feature vectors.
According to an embodiment of the present disclosure, the audio segment feature vector and the text segment feature vector may be processed with the trained first target model, which classifies the text to be classified according to both vectors and outputs its category information; the category information may include category probabilities, but is not limited thereto. The first target model may be obtained by training the first model with training text segment feature vectors and training audio segment feature vectors, where the training text segment feature vectors are obtained from training text segments of a training text and the training audio segment feature vectors from training audio segments of the corresponding training audio.
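As a rough picture of what such a first target model could look like, the sketch below fuses a text segment feature vector with an audio segment feature vector and outputs category probabilities. The layer sizes and the simple concatenation-plus-feed-forward fusion are illustrative assumptions, not the structure disclosed in FIG. 5.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Illustrative 'first target model': fuse text and audio segment vectors
    and map them to category probabilities."""
    def __init__(self, text_dim: int = 300, audio_dim: int = 300, num_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_vec: torch.Tensor, audio_vec: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_vec, audio_vec], dim=-1)
        return self.net(fused).softmax(dim=-1)  # category probabilities

model = FusionClassifier()
probs = model(torch.randn(300), torch.randn(300))
```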
According to the embodiment of the disclosure, using the audio feature vectors extracted from the audio corresponding to the text to be classified to assist text classification means the text is classified with multimodal information, which can also alleviate model overfitting.
According to the embodiment of the disclosure, feature extraction over the text yields both a text word feature vector for the words in the text and a text number feature vector for the numbers in the text, so that the text segment feature vector fully exploits the numeric features as well as the word features. On this basis, classification accuracy improves for texts that contain many numbers. In addition, an audio segment feature vector is extracted from the audio corresponding to the text, so that classification uses the text segment feature vector and the audio segment feature vector together rather than the text segment feature vector alone, further improving accuracy.
According to an embodiment of the present disclosure, performing feature extraction on the audio segment of the audio corresponding to the text to be classified to obtain the audio segment feature vector includes: extracting features of the audio segment according to different feature types to obtain audio segment feature vectors of multiple types; and initializing the audio segment feature vectors of the multiple types to obtain audio segment feature vectors of multiple dimensions, where the multiple dimensions correspond one-to-one to the multiple types.
According to embodiments of the present disclosure, feature types may include audio types such as pitch, intensity, waveform, formants, pitch contour, spectrogram, and the like.
According to an embodiment of the present disclosure, the multiple dimensions may correspond one-to-one to the multiple types: for example, where the types include a pitch type, an intensity type, and a waveform type, the dimensions likewise include a pitch dimension, an intensity dimension, and a waveform dimension.
According to the embodiment of the disclosure, a Praat script may be used to extract audio feature vectors of different feature types, such as pitch, intensity, waveform, formants, pitch contour, and spectrogram, from the audio segment, and to randomly initialize these audio feature vectors into audio segment feature vectors of multiple dimensions. Praat is a phonetics program that can analyze, annotate, process, and synthesize digitized speech signals. The category information of the text to be classified can then be obtained from the audio segment feature vector and the text segment feature vector.
According to the embodiment of the disclosure, the trained BiLSTM neural network can be used for extracting the characteristics of the audio segment characteristic vectors with multiple dimensions to obtain the target audio segment characteristic vectors, so that the category information of the text to be classified can be obtained according to the target audio segment characteristic vectors and the text segment characteristic vectors. By using trained BiLSTM, feature extraction can be performed on audio segments according to semantics, thereby improving the accuracy of the extracted feature vectors.
According to the embodiment of the disclosure, the audio fragment feature vectors of a plurality of types are obtained by carrying out feature extraction on the audio fragment according to different feature types, and the audio fragment feature vectors with a plurality of dimensions are obtained according to the audio fragment feature vectors of the plurality of types, so that the determination of the category information of the text to be classified by using the audio fragment feature vectors with the plurality of dimensions can be realized, and the accuracy of determining the category information is improved.
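A sketch of this feature extraction using parselmouth, the Python interface to Praat, follows. The file name, the 300-dimension target, and the random-projection initialization are illustrative assumptions; the disclosure only states that Praat-extracted features of different types are randomly initialized into fixed-dimension vectors.

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("segment.wav")                      # one audio segment
pitch = snd.to_pitch().selected_array["frequency"]          # pitch contour
intensity = snd.to_intensity().values.ravel()               # intensity
spectrogram = snd.to_spectrogram().values.ravel()           # spectrogram
formant = snd.to_formant_burg()
f1 = np.array([formant.get_value_at_time(1, t) for t in formant.ts()])  # first formant track

rng = np.random.default_rng(0)

def init_fixed_dim(feature: np.ndarray, dim: int = 300) -> np.ndarray:
    """Randomly initialize a variable-length feature into a fixed-dimension vector."""
    flat = np.nan_to_num(np.asarray(feature, dtype=float).ravel())
    projection = rng.standard_normal((flat.size, dim)) / np.sqrt(flat.size)
    return flat @ projection

# One fixed-dimension vector per feature type, per the one-to-one correspondence.
audio_vectors = {name: init_fixed_dim(f) for name, f in
                 [("pitch", pitch), ("intensity", intensity),
                  ("spectrogram", spectrogram), ("formant", f1)]}
```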
According to an embodiment of the present disclosure, the above text classification method further includes: processing the text to be classified and the audio corresponding to the text to be classified with a text processing tool to obtain a text segment and an audio segment corresponding to the text segment.
According to embodiments of the present disclosure, the text to be classified and its corresponding audio may be processed with a text processing tool. The text processing tool may include the Aeneas alignment tool, a tool for aligning text with speech. With it, a text segment of the text to be classified can be aligned with the audio segment of the corresponding audio, so that a text segment is obtained from the text to be classified and the corresponding audio segment is obtained from the audio. The alignment tool can likewise align the text words of a text segment with the spoken words of an audio segment.
For example: the text content of the text to be classified may include "by the end of 10 months of 2022, the balance of the institution in the market is 3.5 trillion yuan, and the proportion of the balance in the market is 2.4%". The audio content of the audio corresponding to the text to be classified may include: "the balance of the institution was 2.4% of the market balance, the balance of the institution was 3.5 trillion yuan, and the data was 10 months end of 2022".
Thus, the text snippet processed with the alignment tool may include "the institution balance is 2.4% of the market balance, 3.5 trillion yuan, by the end of 10 months 2022".
According to the embodiment of the disclosure, the text to be classified and the audio corresponding to the text to be classified are processed by using the text processing tool to obtain the text segment and the audio segment corresponding to the text segment, so that the determination of the category information of the text to be classified according to the text segment feature vector obtained based on the text segment and the audio segment feature vector obtained based on the audio segment can be realized, and the accuracy of classifying the text to be classified can be improved.
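A sketch of this alignment step with the Aeneas Python API is shown below. The file paths, the language, and the plain-text input with one segment per line are illustrative assumptions.

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One text segment per line in the plain-text file.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/abs/path/report.wav"
task.text_file_path_absolute = "/abs/path/report.txt"

ExecuteTask(task).execute()
# Each sync-map leaf pairs a text segment with the [begin, end] times
# of its corresponding audio segment.
for fragment in task.sync_map_leaves():
    print(fragment.begin, fragment.end, fragment.text_fragment.text)
```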
Fig. 3 schematically illustrates a flow chart of a text classification model training method according to an embodiment of the disclosure.
As shown in fig. 3, the text classification model training method of this embodiment includes operations S310 to S340.
In operation S310, feature extraction is performed on the training text segments in the training text, so as to obtain training text word feature vectors and training text digital feature vectors.
In operation S320, training text segment feature vectors of the training text segments are obtained according to the training text word feature vectors and the training text digital feature vectors.
In operation S330, feature extraction is performed on the training audio segments in the training audio corresponding to the training text, so as to obtain feature vectors of the training audio segments, where the training audio segments and the training text segments correspond to each other.
In operation S340, training a first model using the training text segment feature vector and the training audio segment feature vector to obtain a first target model, wherein the first target model is used for determining category information of the text to be classified.
According to embodiments of the present disclosure, the category information of the training text may include transaction categories, market categories, institution categories, and the like. The training text may include industry dynamics, industry commentary, expert interpretations of quotations, industry news, institution profiles, transaction information, transaction assessments, transaction analyses, transaction knowledge, deposit rates, and the like. Such training texts therefore typically contain many numbers.
According to the embodiment of the disclosure, the training text segment of the training text may be obtained by dividing the training text or may be determined from the training text, but is not limited thereto. The training text segment can be a sentence in the training text or a paragraph in the training text, but is not limited to the sentence, the training text segment can be determined from the training text according to the requirement, and the training text can be segmented according to the requirement to obtain the training text segment.
According to embodiments of the present disclosure, training text numbers may include time, currency, and percentages, among others. The training text words may include words in the training text segment other than training text numbers, but are not limited thereto.
According to an embodiment of the present disclosure, the training text word feature vector may be a feature vector obtained by extracting features of training text words in training text.
According to an embodiment of the present disclosure, the training text number feature vector may be a feature vector obtained by feature extraction of training text numbers in training text.
According to embodiments of the present disclosure, the training text segment feature vector may be a vector for characterizing features of the training text segment described above.
According to embodiments of the present disclosure, the training audio corresponding to the training text may be audio having the same meaning as the training text. For example, the training text may read: "By the end of October 2022, the institution's balance in the market was 3.5 trillion yuan, accounting for 2.4% of the market balance." The corresponding training audio may say: "The institution's balance accounts for 2.4% of the market balance; the balance is 3.5 trillion yuan; the data is as of the end of October 2022." It may also say exactly the same sentence as the training text, but is not limited thereto.
According to the embodiment of the disclosure, the training audio segment may be obtained by dividing the training audio or may be determined from the training audio, but is not limited thereto. The training audio segment can be a sentence in the training audio or a paragraph in the training audio, but is not limited to this, and the training audio segment can be determined from the training audio according to the requirement, or the training audio can be segmented according to the requirement, so as to obtain the training audio segment.
According to an embodiment of the present disclosure, the training audio segment feature vector may be a feature vector obtained by extracting features of the training audio segment.
According to an embodiment of the present disclosure, the training audio segment corresponds to the training text segment. For example, when the training text segment is "the institution's balance in the market is 3.5 trillion yuan", the training audio segment may correspond to "the institution's balance is 3.5 trillion yuan".
According to the embodiment of the disclosure, the trained BiLSTM neural network can be utilized to extract the characteristics of the training text fragments comprising the training text words and the training text numbers, so as to obtain the training text word characteristic vectors and the training text number characteristic vectors. By using the trained BiLSTM, feature extraction can be performed on the training text segment according to semantics, thereby improving the accuracy of the extracted feature vector.
According to embodiments of the present disclosure, the training text segment feature vector may be determined as the average of the training text word feature vectors and the training text number feature vectors. A training text segment can contain several training text words and several training text numbers, so feature extraction over the segment yields several word feature vectors and several number feature vectors, and their average is used as the training text segment feature vector. For example, if the sum of 7 training text word feature vectors and 3 training text number feature vectors is 2222220, the training text segment feature vector is 2222220 / 10 = 222222.
According to the embodiment of the disclosure, the trained BERT model can be used to extract features of the training text segment to obtain training text word feature vectors and training text number feature vectors, where the trained BERT model may be a language representation model obtained by training with text number processing tasks; such training improves the model's handling of numbers, so the accuracy of the training text number feature vectors can be higher than with other models. The original training text segment feature vector is then determined as the average of the training text word feature vectors and the training text number feature vectors. For example, each training text word and each training text number may be initialized to a 300-dimensional feature vector with the trained BERT model, and the arithmetic mean of all word and number feature vectors in a segment used as the 300-dimensional training segment feature vector of that segment. Let $W_i = (w_i^1, w_i^2, \ldots, w_i^{|W_i|})$ denote a training segment, where $|W_i|$ is the length of the segment and $w_i^{|W_i|}$ is the <EOS> special symbol marking the end of the segment. The BERT model initializes $W_i$ into a training segment vector $T_i \in \mathbb{R}^{d_t}$, where $d_t = 300$ is the dimension of each training text word feature vector in the segment.
The trained BiLSTM neural network then performs feature extraction on the original training text segment feature vector to obtain the training text segment feature vector.
According to the embodiment of the disclosure, the trained BERT model can be used for extracting the characteristics of the training text segment to obtain training text word characteristic vectors and training text digital characteristic vectors; then, the trained BiLSTM neural network is utilized to conduct feature extraction on the training text word feature vector and the training text digital feature vector, and a target training text word feature vector and a target training text digital feature vector are obtained; and determining the feature vector of the training text segment according to the average value of the feature vector of the target training text word and the digital feature vector of the target training text.
According to the embodiment of the disclosure, the trained BiLSTM neural network can be utilized to perform feature extraction on the training audio segment, so as to obtain the training audio segment feature vector. By using the trained BiLSTM, feature extraction can be performed on the training audio segments according to semantics, thereby improving the accuracy of the extracted feature vectors.
According to embodiments of the present disclosure, the first model may be trained using training text segment feature vectors and training audio segment feature vectors to obtain a first target model. Thus, the first target model can be utilized to determine the category information of the text to be classified according to the text segment feature vector and the audio segment feature vector.
In some fields numbers have special meanings; examples include, but are not limited to, the financial, biological, and chemical fields. Texts to be classified in such fields may contain many text numbers, and those numbers carry relatively important meanings. Extracting richer text number feature vectors from such texts and training the first model with both the text number feature vectors and the text word feature vectors can therefore improve the performance of the first target model.
According to the embodiment of the disclosure, the text word feature vector of the text word in the text and the text number feature vector of the text number in the text are obtained by extracting the features of the text, so that not only the feature vector of the text word in the text but also the feature vector of the text number are fully utilized, and further the text segment feature vector can be obtained. Based on this, in the case where more numbers are included in the text, accuracy of classifying the text can be improved. In addition, the audio frequency corresponding to the text is used for extracting the audio frequency segment feature vector, so that the text segment feature vector and the audio frequency segment feature vector can be used for text classification together, and the text classification is not performed by using the text segment feature vector only, so that the accuracy of text classification is further improved.
According to an embodiment of the present disclosure, in the text classification model training method, performing feature extraction on the training audio segment of the training audio corresponding to the training text to obtain the training audio segment feature vector includes: extracting features of the training audio segment according to different feature types to obtain training audio segment feature vectors of multiple types; and initializing the training audio segment feature vectors of the multiple types to obtain training audio segment feature vectors of multiple dimensions, where the multiple dimensions correspond one-to-one to the multiple types.
According to embodiments of the present disclosure, feature types may include audio types such as pitch, intensity, waveform, formants, pitch contour, spectrogram, and the like.
According to an embodiment of the present disclosure, the multiple dimensions may correspond one-to-one to the multiple types: for example, where the types include a pitch type, an intensity type, and a waveform type, the dimensions likewise include a pitch dimension, an intensity dimension, and a waveform dimension.
According to the embodiment of the disclosure, a Praat script may be used to extract training audio feature vectors of different feature types, such as pitch, intensity, waveform, formants, pitch contour, and spectrogram, from the training audio segment, and to randomly initialize these feature vectors into training audio segment feature vectors of multiple dimensions. Praat is a phonetics program that can analyze, annotate, process, and synthesize digitized speech signals.
According to the embodiment of the disclosure, the trained BiLSTM neural network can be used for extracting the characteristics of the training audio segment characteristic vectors with multiple dimensions, so as to obtain the target training audio segment characteristic vectors. By using the trained BiLSTM, feature extraction can be performed on the training audio segments according to semantics, thereby improving the accuracy of the extracted feature vectors.
According to the embodiment of the disclosure, the training audio segment is subjected to feature extraction according to different feature types to obtain a plurality of types of training audio segment feature vectors, and the training audio segment feature vectors with a plurality of dimensions are obtained according to the plurality of types of training audio segment feature vectors, so that the first model can be trained by using the multi-dimensional training audio segment feature vectors, and the classification accuracy of the first model can be improved.
According to an embodiment of the present disclosure, the text classification model training method further includes: processing the training text and the training audio corresponding to the training text with a text processing tool to obtain training text segments and the training audio segments corresponding to the training text segments.
According to embodiments of the present disclosure, the training text and its corresponding training audio may be processed with a text processing tool. The text processing tool may include the Aeneas alignment tool, a tool for aligning text and speech. With it, a training text segment in the training text can be aligned with the corresponding training audio segment in the training audio, so that training text segments are obtained from the training text and the corresponding training audio segments are obtained from the training audio. The alignment tool may likewise align the training text words in a training text segment with the training audio words in a training audio segment.
For example, the text content of the training text may read: "As of the end of October 2022, the institution's balance in the market was 3.5 trillion yuan, accounting for 2.4% of the market balance." The training audio corresponding to the training text may state the same content in a different order: "The institution's balance accounted for 2.4% of the market balance; the institution's balance was 3.5 trillion yuan; the data are as of the end of October 2022."

Thus, the training text segment produced with the alignment tool may read: "The institution's balance accounted for 2.4% of the market balance, 3.5 trillion yuan, as of the end of October 2022."
According to the embodiments of the present disclosure, processing the training text and its corresponding audio with a text processing tool yields the training text segments and the corresponding training audio segments. Training on the training text segment feature vectors and training audio segment feature vectors derived from these aligned segments improves the accuracy with which the first target model classifies the text to be classified.
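As one hedged sketch, sentence-level forced alignment with the Aeneas tool might look as follows; the file paths and the Mandarin language code are assumptions for illustration:

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One alignment task: language, plain-text input, JSON sync-map output.
config = u"task_language=cmn|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = u"/data/training_audio.wav"  # hypothetical path
task.text_file_path_absolute = u"/data/training_text.txt"    # hypothetical path

ExecuteTask(task).execute()                    # run the forced alignment

task.sync_map_file_path_absolute = u"/data/syncmap.json"
task.output_sync_map_file()                    # write aligned segments to disk
```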
According to an embodiment of the disclosure, the first model includes a target word-level encoding layer. The target word-level encoding layer is obtained by training an intermediate word-level encoding layer on a text numerical comparison task, and the intermediate word-level encoding layer is obtained by training an initial word-level encoding layer on a text numerical classification task. The text numerical classification task is the task of classifying the numbers in a text; the text numerical comparison task is the task of comparing the values of numbers in a text that belong to the same class.
According to an embodiment of the present disclosure, performing feature extraction on the training text segments in the training text to obtain training text word feature vectors and training text digital feature vectors includes: extracting features from the training text segments with the target word-level encoding layer to obtain the training text word feature vectors and the training text digital feature vectors.
According to an embodiment of the present disclosure, the initial word-level encoding layer may be constructed based on a BERT model. The BERT model may be pre-trained with a domain-adaptive pre-training strategy realized by the text numerical comparison task and the text numerical classification task described above. For example, the initial word-level encoding layer can first be trained on the text numerical classification task to obtain the intermediate word-level encoding layer, and the intermediate word-level encoding layer can then be trained on the text numerical comparison task to obtain the target word-level encoding layer.
For example, for the text numerical classification task, a text segment may be labeled with one or more number types such as currency, time, or percentage; that is, the text numbers serve as the basis for labeling the segment. Consider the sentence "As of the end of October 2022, the institution's balance in the market was 3.5 trillion yuan, accounting for 2.4% of the market balance." This sentence can be labeled with three labels: time (the end of October 2022), currency (3.5 trillion yuan), and percentage (2.4%). Text segments labeled in this way can then be used to train the initial word-level encoding layer on the text numerical classification task to obtain the intermediate word-level encoding layer.
For example, the text numerical comparison task may require the intermediate word-level encoding layer to predict which of a plurality of text digital feature vectors has the largest value. Each list of text numbers may contain digits of the same number type and of similar magnitude, such as 102, 103, 101.2, 107.9, 107.3, and 100.4, and the pre-training goal may be for the intermediate word-level encoding layer to find the largest among these six numbers. Once the intermediate word-level encoding layer completes the text numerical comparison task, the target word-level encoding layer is obtained.
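For illustration, constructing self-supervised examples for this comparison task might be sketched as follows; the helper name and the sampling scheme are assumptions made for this sketch only:

```python
import random

def make_comparison_example(base=100.0, n=6, spread=0.08, seed=None):
    """Build one self-supervised example: n numbers of the same type and
    similar magnitude; the label is the index of the largest number."""
    rng = random.Random(seed)
    numbers = [round(base * (1 + rng.uniform(-spread, spread)), 1)
               for _ in range(n)]
    return numbers, numbers.index(max(numbers))

numbers, label = make_comparison_example(seed=42)
# e.g. numbers = [102.0, 103.0, 101.2, 107.9, 107.3, 100.4] -> label = 3
```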
Since the training process described above is completed in a self-supervised manner, a target word-level encoding layer well suited to classifying texts that contain many text digits can be obtained.
According to the embodiments of the present disclosure, because the target word-level encoding layer is trained with both the text numerical comparison task and the text numerical classification task, it is better adapted to the task of classifying texts that contain numerical values, which improves classification accuracy.
Fig. 4 schematically illustrates the training of the word-level encoding layer according to an embodiment of the present disclosure.

As shown in fig. 4, training the word-level encoding layer in this embodiment includes operations S410 to S420.
In operation S410, the initial word-level encoding layer is trained using the text-to-number classification task, resulting in an intermediate word-level encoding layer.
In operation S420, the intermediate word level coding layer is trained using the text numerical comparison task to obtain a target word level coding layer.
According to an embodiment of the present disclosure, training the first model with the training text segment feature vector and the training audio segment feature vector to obtain the first target model includes: obtaining a target training feature vector from the training text segment feature vector, the training audio segment feature vector, and a noise vector; and inputting the target training feature vector into the first model and training the first model with a target gradient descent algorithm to obtain the first target model, where the target gradient descent algorithm is obtained by processing an original gradient descent algorithm with a gradient clipping method.
According to the embodiments of the present disclosure, the training text segment feature vector, the training audio segment feature vector, and the noise vector may be spliced to obtain the target training feature vector. The noise vector may be a randomly generated vector, but is not limited thereto. For example, if the training text segment feature vector is 3321, the training audio segment feature vector is 3323, and the noise vector is 78, the target training feature vector may be 3321332378. Note that the noise vector may also be preset, and it may also be spliced in at a preset position.
According to the embodiments of the present disclosure, using the target training feature vector with the added noise vector helps ensure the security of information such as identification information and asset information.
According to embodiments of the present disclosure, the gradient values of each backward optimization step may be random values.

According to an embodiment of the present disclosure, the gradient clipping method may preset a maximum gradient threshold and a minimum gradient threshold for the gradient descent. When a backward-optimization gradient value is greater than the maximum gradient threshold, the gradient may be clipped to the maximum gradient threshold; when the gradient value is less than the minimum gradient threshold, the gradient may be clipped to the minimum gradient threshold. By controlling the backward gradient values with this clipping method, the range of the gradient values is bounded and the optimization of the model is improved.
According to the embodiments of the present disclosure, because transactions involve information such as object identification information, account information, and asset information that is recorded in the text or audio, protecting this information is important when collecting and storing financial data to train the first target model on text and audio. Gradient computation is therefore controlled during training with DP-SGD (i.e., a differentially private stochastic gradient descent algorithm): differential privacy is achieved by adding noise within the gradient descent algorithm, and information protection is realized by first clipping the gradient values and then adding noise during the gradient clipping process.
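As a hedged sketch, one round of the target gradient descent described above (clip, add noise, descend) might look as follows; the thresholds, noise scale, learning rate, and helper name are illustrative assumptions rather than values given in this disclosure:

```python
import torch

def dp_gradient_step(model, loss, max_g=1.0, min_g=-1.0, noise_std=0.01, lr=0.1):
    """One target-gradient-descent update: clip each gradient value into
    [min_g, max_g], add Gaussian noise, then take a descent step."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            p.grad.clamp_(min=min_g, max=max_g)                # gradient clipping
            p.grad.add_(noise_std * torch.randn_like(p.grad))  # privacy noise
            p.add_(p.grad, alpha=-lr)                          # descent update
```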
According to the embodiments of the present disclosure, because the target training feature vector containing the noise vector is obtained from the text segment feature vector, the audio segment feature vector, and the noise vector, training the first model with it helps the resulting first target model classify texts that themselves contain noise, which protects the data security of the text to be classified and avoids the loss of classification accuracy that noise would otherwise cause. Moreover, the target gradient descent algorithm obtained by applying the gradient clipping method to the original gradient descent algorithm improves the training of the first target model. On this basis, the noise immunity of the first target model is further improved, and classification accuracy is maintained even when the text to be classified contains substantial noise.
According to an embodiment of the present disclosure, obtaining the target training feature vector from the training text segment feature vector, the training audio segment feature vector, and the noise vector includes: splicing the training text segment feature vector and the training audio segment feature vector to obtain a training segment spliced vector; obtaining a segment position vector for the training text segment from the position information of the training text segment in the training text; obtaining a target training segment spliced vector from the segment position vector and the training segment spliced vector; and adding the noise vector to the target training segment spliced vector to obtain the target training feature vector.

According to an embodiment of the present disclosure, the training text segment feature vector and the training audio segment feature vector are spliced to obtain the training segment spliced vector. For example, if the training text segment feature vector is 12345 and the training audio segment feature vector is 67891, the training segment spliced vector may be 1234567891.

According to the embodiments of the present disclosure, the segment position vector for the training text segment is obtained from the position information of the training text segment in the training text. For example, if the training text segment is a sentence and that sentence is the 7th sentence of the training text, the segment position vector may be 7.

According to the embodiments of the present disclosure, the target training segment spliced vector is obtained from the segment position vector and the training segment spliced vector. For example, if the segment position vector is 7 and the training segment spliced vector is 1234567891, adding them may yield the target training segment spliced vector 1234567898, but this is not limiting.
According to the embodiments of the present disclosure, a BiLSTM neural network may be used to extract the single-modal features of the training text and of the corresponding training audio, yielding the text segment feature vectors and audio segment feature vectors; each training text segment feature vector is then spliced with its training audio segment feature vector to obtain a training segment spliced vector, and the segment position vector is added to the training segment spliced vector to obtain the target training segment spliced vector. On this basis, the set of target training segment spliced vectors obtained from a training text and its training audio may be written as $D^{(k)} = (S_1^{(k)}, S_2^{(k)}, \dots, S_M^{(k)})$.

Here $k$ denotes the training text, and each target training segment spliced vector is $S_i^{(k)} = (T_i^{(k)}, A_i^{(k)}) + P_i^{(k)}$, where $T_i^{(k)}$ is the training text segment feature vector of the $i$-th segment of the training text, $A_i^{(k)}$ is the training audio segment feature vector corresponding to the $i$-th segment, $P_i^{(k)}$ is a trainable segment-level position vector, $M$ is the number of segments in the training text, and $d_s$ is the dimension of the target training segment spliced vector, so that $D^{(k)} \in \mathbb{R}^{M \times d_s}$ and the stacked position matrix $P^{(k)} \in \mathbb{R}^{M \times d_s}$.
According to the embodiments of the present disclosure, the target training segment spliced vector may be input into a BiLSTM network, which extracts features according to the semantics of the target training segment spliced vector and outputs a new target training segment spliced vector of improved accuracy.

According to the embodiments of the present disclosure, the noise vector may be added to the target training segment spliced vector to obtain the target training feature vector. For example, if the target training segment spliced vector is 258369 and the noise vector is 654, the target training feature vector may be 258369654.
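As a hedged sketch, the splice/position/noise assembly described above might look as follows; the dimensions, the trainable embedding table, and the noise scale are assumptions made for illustration:

```python
import torch
import torch.nn as nn

d = 128                                   # per-modality dimension (assumption)
text_vec = torch.randn(d)                 # training text segment feature vector T_i
audio_vec = torch.randn(d)                # training audio segment feature vector A_i

splice = torch.cat([text_vec, audio_vec])            # training segment spliced vector
pos_embed = nn.Embedding(512, 2 * d)                 # trainable segment positions P_i
target_splice = splice + pos_embed(torch.tensor(6))  # the 7th sentence (index 6)

noise = 0.01 * torch.randn(2 * d)                    # noise vector
target_feature = target_splice + noise               # target training feature vector
```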
According to the embodiments of the present disclosure, the training text segment feature vector and the training audio segment feature vector are spliced into the training segment spliced vector, the segment position vector is obtained from the position of the training text segment in the training text, and the two are combined into the target training segment spliced vector. The segment position vector thus injects the positional information of the segment, strengthening the representation of the target training segment spliced vector and improving the effect of training with it. In addition, because the noise vector is added to the target training segment spliced vector to form the target training feature vector, training helps the first target model classify texts that contain noise, which protects the data security of the text to be classified and avoids the loss of classification accuracy that noise would otherwise cause.
According to an embodiment of the present disclosure, adding the noise vector to the target training segment spliced vector to obtain the target training feature vector includes: adding the noise vector to the target training segment spliced vector to obtain a target segment noise vector, and encoding the target segment noise vector according to its segment structure to obtain the target training feature vector.

According to an embodiment of the present disclosure, the first model further includes a segment-level encoding layer, which may be built on a self-attention mechanism. The segment-level encoding layer may encode the target segment noise vector according to its segment structure to obtain the target training feature vector.

Because the segment-level encoding layer is built on a self-attention mechanism, it can better learn the semantic relations among the words within the target segment noise vector and thereby capture its internal structure, which improves the accuracy of the target training feature vector output by the segment-level encoding layer.

According to the embodiments of the present disclosure, the noise vector is first added to the target training segment spliced vector to obtain the target segment noise vector, and the target segment noise vector is then encoded according to its segment structure to obtain the target training feature vector. The segment structure of the target segment noise vector is thus fully exploited, the representation of the target training feature vector is strengthened, and the effectiveness of the first target model obtained by training is improved.
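By way of illustration, a self-attention segment-level encoding layer might be sketched with a standard Transformer encoder layer standing in for the layer described above; the sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# One self-attention layer as a stand-in for the segment-level encoding layer.
segment_encoder = nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                             batch_first=True)

target_segment_noise = torch.randn(1, 10, 256)  # M=10 target segment noise vectors
target_training_features = segment_encoder(target_segment_noise)
```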
According to the embodiments of the present disclosure, the target training feature vector output by the segment-level encoding layer may be passed through a fully connected layer, whose output is converted into class probabilities by a softmax activation function; the class with the largest probability is selected as the model's predicted label, thereby classifying the text.
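For illustration, the fully connected layer and softmax step might be sketched as follows; the 256-dimensional input is an assumption, while the nine classes follow the experiments reported below:

```python
import torch
import torch.nn as nn

fc = nn.Linear(256, 9)                           # fully connected layer, 9 classes
segment_repr = torch.randn(1, 256)               # pooled segment-level representation
probs = torch.softmax(fc(segment_repr), dim=-1)  # softmax -> class probabilities
predicted = probs.argmax(dim=-1)                 # largest probability = predicted label
```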
According to the embodiments of the present disclosure, there may be a plurality of target training feature vectors and a plurality of first models in one-to-one correspondence, with the target training feature vectors coming from different training servers.

According to an embodiment of the disclosure, inputting the target training feature vector into the first model and training the first model with the target gradient descent algorithm to obtain the first target model includes repeating the following operations while the model parameters of the first model do not satisfy a preset condition: determining a plurality of target training servers from the plurality of training servers; invoking the target training servers and training the first models corresponding to them on their respective target training feature vectors to obtain the model parameters of each first model; and determining new model parameters from the plurality of model parameters. Once the model parameters satisfy the preset condition, the first target model is obtained from the first model and the model parameters that satisfy the condition.
According to an embodiment of the present disclosure, there may be K training servers, each holding its own first model and its own target training feature vectors.

In each round, k target training servers may be selected from the K training servers, and each selected server trains its first model on its own target training feature vectors, so that each of the k servers holds a trained first model. The model parameters of these k trained first models may then be collected, and new model parameters determined from the average of the k sets of model parameters.

If the new model parameters do not satisfy the preset condition, target training servers are selected again from the K training servers and the training process is repeated until the new model parameters satisfy the preset condition.

Once the new model parameters satisfy the preset condition, they may be taken as the model parameters of the first model, thereby obtaining the first target model.

For example, the new model parameters may be determined to satisfy the preset condition when they result from a preset number of training rounds; the preset number of rounds is not particularly limited herein.
According to the embodiments of the present disclosure, in practical production the resources owned by a single training subject are limited, which makes it difficult to train a classification model with high performance and strong robustness. A training subject may be an institution, but is not limited thereto; its resources may include training data, namely training texts, training audio, and the target training feature vectors derived from them. In such cases, multiple training subjects need to train a single model jointly, but because each training subject protects its own training data, it is difficult to share the training data among the subjects.

On this basis, the classification model can be trained with federated learning, in which the training subjects exchange information and model parameters in encrypted form while remaining independent. When trained with federated learning, the first model uses FedAvg (i.e., the federated averaging algorithm).
For a federated learning scenario that combines multiple training subjects, a client/server architecture may be adopted. Assume there are K training subjects in total, where K is a positive integer. The central server initializes the model parameters of the first model and then performs multiple rounds of model training; in each round, k of the K training subjects are selected to participate, where k is a positive integer with $1 \le k \le K$. The central server sends an invocation instruction to the training server of each selected training subject. Starting from the current round's (i.e., round $t$'s) model parameters $W_t$ sent by the central server, each training server trains its own model parameters $W_{t+1}^{k}$ on its own target training feature vectors and, after training, uploads $W_{t+1}^{k}$ back to the central server. The central server collects the model parameters $W_{t+1}^{k}$ of the selected training subjects and aggregates them by a weighted average, weighted by the number of target training feature vectors each subject holds, to obtain the next round's model parameters $W_{t+1}$. The aggregation is shown in formula (1):

$$W_{t+1} = \sum_{k} \frac{n_k}{n}\, W_{t+1}^{k} \tag{1}$$

where the sum runs over the selected training subjects, $n_k$ is the number of target training feature vectors owned by training subject $k$, and $n$ is the total number of target training feature vectors of all selected training subjects.
On this basis, multiple training subjects can collaboratively train a classification model with high performance and strong robustness.
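As a hedged sketch, the server-side aggregation of formula (1) can be written in a few lines; the `fedavg` helper name and the use of PyTorch state dictionaries are assumptions for illustration only:

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client model parameters per formula (1):
    client_states is a list of state_dicts, client_sizes the n_k values."""
    n = float(sum(client_sizes))
    return {
        key: sum((n_k / n) * state[key].float()
                 for state, n_k in zip(client_states, client_sizes))
        for key in client_states[0]
    }
```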
According to the embodiments of the present disclosure, the target training servers are determined from the plurality of training servers and invoked, and the first models corresponding to them are trained on their respective target training feature vectors to obtain the model parameters of each first model, from which the new model parameters are determined. In this way the first models are trained without the training servers exchanging raw training data, so the target training feature vectors of each training server are fully used for model training, the data security of each training server is ensured, and the model effectiveness of the trained first target model is improved.
Fig. 5 schematically illustrates the structure of the first target model according to an embodiment of the present disclosure.

As shown in fig. 5, the first target model includes a target word-level encoding layer 510, a multi-modal information fusion layer 520, a segment-level encoding layer 530, and a fully connected layer 540. The multi-modal information fusion layer 520 may include a BiLSTM neural network and may be used to extract the training text segment feature vectors and training audio segment feature vectors and splice them into the target training segment spliced vectors, but is not limited thereto.
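For illustration, the layer stack of fig. 5 might be sketched end to end as follows; all sizes and the pooling step are assumptions made for this sketch:

```python
import torch
import torch.nn as nn

class FirstTargetModelSketch(nn.Module):
    """Layer stack mirroring fig. 5; sizes are illustrative assumptions."""
    def __init__(self, d=256, num_classes=9):
        super().__init__()
        # 520: multi-modal information fusion layer (BiLSTM over spliced segments)
        self.fusion = nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)
        # 530: segment-level encoding layer (self-attention)
        self.segment_encoder = nn.TransformerEncoderLayer(2 * d, nhead=4,
                                                          batch_first=True)
        # 540: fully connected classification layer
        self.fc = nn.Linear(2 * d, num_classes)

    def forward(self, text_segs, audio_segs):
        # text_segs, audio_segs: (batch, M, d) from the target word-level
        # encoding layer 510 and the audio feature pipeline, respectively
        spliced = torch.cat([text_segs, audio_segs], dim=-1)   # (batch, M, 2d)
        fused, _ = self.fusion(spliced)                        # (batch, M, 2d)
        encoded = self.segment_encoder(fused)                  # (batch, M, 2d)
        return self.fc(encoded.mean(dim=1))                    # (batch, classes)
```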
According to the embodiments of the present disclosure, six training subjects were set up in the experiment, and 63,000 training texts with corresponding training audio were collected in total. The data for which training text and training audio could not be aligned at the sentence level were discarded, leaving 42,500 training texts and corresponding training audio as the final data set.

The first target model of the present disclosure may be compared with baseline models to verify the effectiveness of the proposed method. The experimental results are shown in Table 1:
Table 1. Classification accuracy of the first target model compared with the baseline models
Here, LSTM (Long Short-Term Memory) is a recurrent neural network, TextCNN is an algorithm that classifies text with a convolutional neural network, and BiLSTM+CNN is a bidirectional long short-term memory model combined with a convolutional neural network.

The comparison in Table 1 shows that the first target model outperforms the other baseline models. Compared with text classification methods based on traditional machine learning (naive Bayes, support vector machine, and random forest), the deep-learning methods (LSTM, TextCNN, BiLSTM+CNN, and the first target model) improve classification accuracy significantly. This is because deep-learning text classification extracts features automatically during training and offers stronger learning ability and more efficient feature expression than traditional machine learning. The classification accuracy of the first target model reaches 97.28%: it not only introduces audio-assisted classification of the training texts but also adds the two domain-adaptive pre-training tasks, better captures long-distance dependencies between words within a segment, extracts the more important segment information, and focuses on the key features, thereby achieving better classification performance on the training texts.
According to an embodiment of the present disclosure, the classification performance of the proposed model on nine types of text is evaluated with precision, recall, and the F1 value, where the F1 value is a statistical measure of the accuracy of a binary classification model. The experimental results are shown in Table 2:
Table 2. Classification evaluation of the various types of training text
The results in Table 2 show that the first target model classifies transaction-class and market-class financial texts well. Its classification performance on the institution class is weaker, because the content of institution-class texts may be expressed less clearly; for example, such texts often contain the status of an institution, credit evaluations of the institution, and the like, and may therefore be mislabeled as texts of other classes. Overall, however, the first target model performs well in classifying training texts that contain training text digits, with precision, recall, and F1 values all reaching 0.90.
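By way of illustration, the three metrics above can be computed with scikit-learn; the label arrays below are placeholders, not the experimental data:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]   # placeholder gold labels over the 9 classes
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder model predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```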
According to embodiments of the present disclosure, some existing text classification methods ignore the importance of the digits in texts of certain fields and do not fully exploit the meaning and structure of those digits when classifying the text. The first target model described above, by contrast, classifies the text to be classified by making full use of the meaning and structure of its digits.
Based on the text classification method, the disclosure also provides a text classification device. The device will be described in detail below in connection with fig. 6.
Fig. 6 schematically shows a block diagram of a text classification apparatus according to an embodiment of the disclosure.
As shown in fig. 6, the text classification apparatus 600 of this embodiment includes a first extraction module 610, a first acquisition module 620, a second extraction module 630, and a second acquisition module 640.
The first extraction module 610 is configured to perform feature extraction on text segments in the text to be classified, so as to obtain a text word feature vector and a text digit feature vector. In an embodiment, the first extraction module 610 may be used to perform the operation S210 described above, which is not described herein.
The first obtaining module 620 is configured to obtain a text segment feature vector of the text segment according to the text word feature vector and the text digit feature vector. In an embodiment, the first obtaining module 620 may be configured to perform the operation S220 described above, which is not described herein.
The second extraction module 630 is configured to perform feature extraction on an audio segment in audio corresponding to the text to be classified, so as to obtain an audio segment feature vector, where the audio segment corresponds to the text segment. In an embodiment, the second extraction module 630 may be used to perform the operation S230 described above, which is not described herein.
The second obtaining module 640 is configured to obtain category information of the text to be classified according to the text segment feature vector and the audio segment feature vector. In an embodiment, the second obtaining module 640 may be configured to perform the operation S240 described above, which is not described herein.
According to an embodiment of the present disclosure, the second extraction module 630 includes a first extraction sub-module and an initialization module. The first extraction submodule is used for extracting the characteristics of the audio fragments according to different characteristic types to obtain a plurality of types of audio fragment characteristic vectors; the initialization module is used for initializing the audio segment feature vectors of a plurality of types to obtain the audio segment feature vectors with a plurality of dimensions, wherein the dimensions and the types are in one-to-one correspondence.
According to an embodiment of the present disclosure, the text classification apparatus further includes a processing sub-module. The processing sub-module is used for processing the text to be classified and the audio corresponding to the text to be classified by using a text processing tool to obtain a text fragment and an audio fragment corresponding to the text fragment.
According to embodiments of the present disclosure, any of the first extraction module 610, the first acquisition module 620, the second extraction module 630, and the second acquisition module 640 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first extraction module 610, the first acquisition module 620, the second extraction module 630, the second acquisition module 640 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the first extraction module 610, the first acquisition module 620, the second extraction module 630, and the second acquisition module 640 may be at least partially implemented as computer program modules that, when executed, may perform the corresponding functions.
Based on the text classification model training method, the disclosure also provides a text classification model training device. The device will be described in detail below in connection with fig. 7.
Fig. 7 schematically illustrates a block diagram of a text classification model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the text classification model training apparatus 700 of this embodiment includes a third extraction module 710, a third acquisition module 720, a fourth extraction module 730, and a training module 740.
The third extraction module 710 is configured to perform feature extraction on the training text segment in the training text, so as to obtain a training text word feature vector and a training text digital feature vector. In an embodiment, the third extraction module 710 may be configured to perform the operation S310 described above, which is not described herein.
The third obtaining module 720 is configured to obtain a training text segment feature vector of the training text segment according to the training text word feature vector and the training text digital feature vector. In an embodiment, the third obtaining module 720 may be configured to perform the operation S320 described above, which is not described herein.
The fourth extraction module 730 is configured to perform feature extraction on a training audio segment in the training audio corresponding to the training text, so as to obtain a training audio segment feature vector, where the training audio segment corresponds to the training text segment. In an embodiment, the fourth extraction module 730 may be configured to perform the operation S330 described above, which is not described herein.
The training module 740 is configured to train a first model by using the training text segment feature vector and the training audio segment feature vector to obtain a first target model, where the first target model is used to determine category information of the text to be classified. In an embodiment, the training module 740 may be configured to perform the operation S340 described above, which is not described herein.
According to an embodiment of the present disclosure, training module 740 includes an acquisition sub-module and a training sub-module. The acquisition sub-module is used for acquiring a target training feature vector according to the training text segment feature vector, the training audio segment feature vector and the noise vector; the training submodule is used for inputting the target training feature vector into the first model, and training the first model by utilizing a target gradient descent algorithm to obtain the first target model, wherein the target gradient descent algorithm is obtained by processing an original gradient descent algorithm by utilizing a gradient clipping method.
According to an embodiment of the disclosure, the acquisition sub-module includes a splicing unit, a first acquisition unit, a second acquisition unit, and an adding unit. The splicing unit is used for carrying out splicing processing on the training text segment feature vector and the training audio segment feature vector to obtain a training segment splicing vector; the first acquisition unit is used for obtaining a segment position vector corresponding to the training text segment according to the position information of the training text segment in the training text; the second acquisition unit is used for obtaining a target training segment splicing vector according to the segment position vector and the training segment splicing vector; the adding unit is used for adding the noise vector to the target training segment splicing vector to obtain a target training feature vector.
According to an embodiment of the present disclosure, the adding unit includes an adding subunit and an encoding subunit. The adding subunit is used for adding the noise vector to the target training segment splicing vector to obtain a target segment noise vector; the coding subunit is used for coding the target segment noise vector according to the segment structure of the target segment noise vector to obtain a target training feature vector.
According to an embodiment of the present disclosure, the training subunit includes a first determining unit, a training unit, a second determining unit, and a third acquiring unit. The first determining unit is used for determining a plurality of target training servers from the plurality of training servers; the training unit is used for calling a plurality of target training servers, and training a plurality of first models corresponding to the plurality of target training servers by utilizing a plurality of target training feature vectors corresponding to the plurality of target training servers respectively to obtain a plurality of model parameters corresponding to the plurality of first models respectively; the second determining unit is used for determining new model parameters according to the plurality of model parameters; the third obtaining unit is used for obtaining a first target model according to the model parameters meeting the preset conditions and the first model under the condition that the model parameters meet the preset conditions.
According to an embodiment of the present disclosure, the third extraction module 710 includes a second extraction sub-module. The second extraction sub-module is used for extracting features from the training text segments with the target word-level encoding layer to obtain the training text word feature vectors and the training text digital feature vectors.
According to an embodiment of the present disclosure, any of the third extraction module 710, the third acquisition module 720, the fourth extraction module 730, and the training module 740 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the third extraction module 710, the third acquisition module 720, the fourth extraction module 730, and the training module 740 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the third extraction module 710, the third acquisition module 720, the fourth extraction module 730, and the training module 740 may be at least partially implemented as computer program modules, which when executed, may perform the respective functions.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a text classification method or a text classification model training method in accordance with an embodiment of the disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, the input/output (I/O) interface 805 also being connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to an input/output (I/O) interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to an input/output (I/O) interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The functions defined above in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, apparatuses, modules, units, and the like described above may be implemented by computer program modules according to embodiments of the disclosure.

In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, and downloaded and installed via the communication section 809 and/or installed from the removable medium 811. The program code contained in the computer program may be transmitted over any appropriate medium, including but not limited to wireless or wired media, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may connect to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (14)

1. A text classification method, comprising:
extracting the characteristics of the text fragments in the text to be classified to obtain text word characteristic vectors and text digital characteristic vectors;
Obtaining a text segment feature vector of the text segment according to the text word feature vector and the text digital feature vector;
extracting the characteristics of the audio fragments in the audio corresponding to the text to be classified to obtain audio fragment characteristic vectors, wherein the audio fragments and the text fragments correspond to each other;
and obtaining the category information of the text to be classified according to the text segment feature vector and the audio segment feature vector.
2. The method of claim 1, wherein the feature extraction of the audio segment in the audio corresponding to the text to be classified to obtain an audio segment feature vector comprises:
extracting the characteristics of the audio fragments according to different characteristic types to obtain a plurality of types of audio fragment characteristic vectors;
initializing the audio segment feature vectors of the multiple types to obtain the audio segment feature vectors with multiple dimensions, wherein the multiple dimensions and the multiple types are in one-to-one correspondence.
3. The method of claim 1 or 2, further comprising:
and processing the text to be classified and the audio corresponding to the text to be classified by using a text processing tool to obtain the text fragment and the audio fragment corresponding to the text fragment.
4. A text classification model training method, comprising:
extracting features of training text fragments in the training text to obtain training text word feature vectors and training text digital feature vectors;
obtaining training text segment feature vectors of the training text segments according to the training text word feature vectors and the training text digital feature vectors;
extracting features of training audio fragments in training audio corresponding to the training text to obtain training audio fragment feature vectors, wherein the training audio fragments correspond to the training text fragments;
and training a first model by using the training text segment feature vector and the training audio segment feature vector to obtain a first target model, wherein the first target model is used for determining the category information of the text to be classified.
5. The method of claim 4, wherein the training the first model with the training text segment feature vector and the training audio segment feature vector to obtain a first target model comprises:
obtaining a target training feature vector according to the training text segment feature vector, the training audio segment feature vector and the noise vector;
Inputting the target training feature vector into the first model, and training the first model by using a target gradient descent algorithm to obtain the first target model, wherein the target gradient descent algorithm is obtained by processing an original gradient descent algorithm by using a gradient clipping method.
6. The method of claim 5, wherein the deriving a target training feature vector from the training text segment feature vector, the training audio segment feature vector, and a noise vector comprises:
performing splicing processing on the training text segment feature vector and the training audio segment feature vector to obtain a training segment spliced vector;
obtaining a segment position vector corresponding to the training text segment according to the position information of the training text segment in the training text;
obtaining a target training segment splicing vector according to the segment position vector and the training segment splicing vector;
and adding the noise vector to the target training segment spliced vector to obtain the target training feature vector.
7. The method of claim 6, wherein the adding the noise vector to the target training segment splice vector results in the target training feature vector, comprising:
Adding the noise vector to the target training segment splicing vector to obtain a target segment noise vector;
and according to the segment structure of the target segment noise vector, carrying out coding processing on the target segment noise vector to obtain the target training feature vector.
8. The method of any of claims 5-7, wherein the target training feature vectors are a plurality of, the first model is a plurality of, the plurality of first models are in one-to-one correspondence with the plurality of target training feature vectors, the plurality of target training feature vectors are each from a different training server;
inputting the target training feature vector into the first model, training the first model by using a target gradient descent algorithm to obtain the first target model, and repeating the following operations under the condition that model parameters of the first model do not meet preset conditions:
from the plurality of training servers, a plurality of target training servers are determined,
invoking the plurality of target training servers, training a plurality of first models corresponding to the plurality of target training servers by using a plurality of target training feature vectors corresponding to the plurality of target training servers respectively to obtain a plurality of model parameters corresponding to the plurality of first models respectively,
Determining new model parameters according to the plurality of model parameters;
and under the condition that the model parameters meet preset conditions, obtaining the first target model according to the model parameters meeting the preset conditions and the first model.
9. The method of any of claims 4-7, wherein the first model includes a target word level encoding layer that is derived from training an intermediate word level encoding layer with a text numerical comparison task that is derived from training an initial word level encoding layer with a text numerical classification task that characterizes a task that classifies numbers in text, and a text numerical comparison task that characterizes a task that numerically compares numbers in text that belong to the same class;
the feature extraction is performed on training text fragments in training texts to obtain training text word feature vectors and training text digital feature vectors, and the feature extraction comprises the following steps:
and extracting features of the training text fragments by using the target word level coding layer to obtain the training text word feature vectors and the training text digital feature vectors.
10. A text classification device, comprising:
a first extraction module configured to perform feature extraction on a text segment in a text to be classified to obtain a text word feature vector and a text digital feature vector;
a first acquisition module configured to obtain a text segment feature vector of the text segment according to the text word feature vector and the text digital feature vector;
a second extraction module configured to perform feature extraction on an audio segment in the audio corresponding to the text to be classified to obtain an audio segment feature vector, wherein the audio segment corresponds to the text segment; and
a second acquisition module configured to obtain category information of the text to be classified according to the text segment feature vector and the audio segment feature vector.
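A minimal sketch of how the four modules of the device in claim 10 could compose at inference time. Every layer, dimension, and input format below is a placeholder assumption standing in for the claimed modules.

```python
import torch
import torch.nn as nn

class TextClassificationDevice(nn.Module):
    def __init__(self, d=128, num_classes=5):
        super().__init__()
        self.word_encoder = nn.Linear(300, d)     # first extraction: word features
        self.num_encoder = nn.Linear(8, d)        # first extraction: digital features
        self.segment_fuser = nn.Linear(2 * d, d)  # first acquisition: segment vector
        self.audio_encoder = nn.Linear(40, d)     # second extraction: audio segment
        self.classifier = nn.Linear(2 * d, num_classes)  # second acquisition

    def forward(self, word_feats, num_feats, audio_feats):
        text_seg = self.segment_fuser(torch.cat(
            [self.word_encoder(word_feats), self.num_encoder(num_feats)], dim=-1))
        audio_seg = self.audio_encoder(audio_feats)
        return self.classifier(torch.cat([text_seg, audio_seg], dim=-1))

device = TextClassificationDevice()
logits = device(torch.randn(1, 300), torch.randn(1, 8), torch.randn(1, 40))
```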
11. A text classification model training device, comprising:
a third extraction module configured to perform feature extraction on a training text segment in a training text to obtain a training text word feature vector and a training text digital feature vector;
a third acquisition module configured to obtain a training text segment feature vector of the training text segment according to the training text word feature vector and the training text digital feature vector;
a fourth extraction module configured to perform feature extraction on a training audio segment in the training audio corresponding to the training text to obtain a training audio segment feature vector, wherein the training audio segment corresponds to the training text segment; and
a training module configured to train a first model with the training text segment feature vector and the training audio segment feature vector to obtain a first target model, wherein the first target model is used to determine category information of a text to be classified.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1-3 or 4-9.
13. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-3 or 4-9.
14. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-3 or 4-9.
CN202311490092.1A 2023-11-09 2023-11-09 Text classification method, text classification model training method, device and equipment Pending CN117473088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311490092.1A CN117473088A (en) 2023-11-09 2023-11-09 Text classification method, text classification model training method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311490092.1A CN117473088A (en) 2023-11-09 2023-11-09 Text classification method, text classification model training method, device and equipment

Publications (1)

Publication Number Publication Date
CN117473088A true CN117473088A (en) 2024-01-30

Family

ID=89632686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311490092.1A Pending CN117473088A (en) 2023-11-09 2023-11-09 Text classification method, text classification model training method, device and equipment

Country Status (1)

Country Link
CN (1) CN117473088A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination