CN113919338A - Method and device for processing text data - Google Patents


Info

Publication number
CN113919338A
Authority
CN
China
Prior art keywords
text
processing model
sample
lightweight
model
Legal status
Pending
Application number
CN202010655433.6A
Other languages
Chinese (zh)
Inventor
彭颖鸿
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010655433.6A
Publication of CN113919338A

Classifications

    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 16/353 Clustering; Classification into predefined classes
    • G06F 16/9536 Search customisation based on social or collaborative filtering
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/30 Semantic analysis
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods

Abstract

A method and apparatus for processing text data, a method and apparatus for reducing a complex text processing model into a lightweight text processing model, and a computer-readable storage medium are disclosed. The method for processing text data comprises the following steps: acquiring text data to be classified; converting the text data to be classified into a numerical vector; converting the numerical vector into a sentence vector by using a lightweight text processing model; and determining a category label of the text data based on the sentence vector. The method uses a lightweight text processing model with three branch models, and can quickly and accurately identify and classify text data.

Description

Method and device for processing text data
Technical Field
The present disclosure relates to the field of artificial intelligence services, and more particularly, to a method, an apparatus, and a computer-readable storage medium for processing text data. The present disclosure also relates to a method and an apparatus for reducing a complex text processing model to a lightweight text processing model.
Background
There is currently a huge amount of information on the internet. Many mobile applications have built-in content aggregators, which aggregate information already published on these applications. The content aggregation server corresponding to a content aggregator can push data sources such as articles, pictures, long videos, short videos, and music to a user according to the user's subscription information, interests, and the like.
Currently, in order to attract readers or viewers, some data source publishers (such as public account bloggers, video account bloggers, and music creators) add exaggerated, misleading, false, pornographic, vulgar, or policy-violating titles to the data sources they publish. Some data source publishers may even set false, fraudulent, or misleading usernames (nicknames) and profiles to attract readers or viewers.
If such content appears in large amounts, it lowers content quality and the user's experience with the application, and it negatively affects content aggregation products. At present, text information such as titles, usernames, and profiles is identified and classified mainly through manual review and user reports, which has a low identification rate and a high cost.
Disclosure of Invention
Embodiments of the present disclosure provide a method and apparatus for processing text data, a method and apparatus for reducing a complex text processing model into a lightweight text processing model, and a computer-readable storage medium.
An embodiment of the present disclosure provides a method for processing text data, including: acquiring text data to be classified; converting the text data to be classified into a numerical vector; converting the numerical vector into a sentence vector by using a lightweight text processing model; and determining a category label of the text data based on the sentence vector; wherein converting the numerical vector into a sentence vector using the lightweight text processing model comprises: acquiring a first clause vector representing sequence information of the text data from the numerical vector by using a first branch model of the lightweight text processing model; acquiring a second clause vector representing the association relationship among words in the text data from the numerical vector by using a second branch model of the lightweight text processing model; acquiring a third clause vector representing keyword information in the text data from the numerical vector by using a third branch model of the lightweight text processing model; and fusing the first clause vector, the second clause vector, and the third clause vector into a sentence vector.
An embodiment of the present disclosure provides a method for simplifying a complex text processing model into a lightweight text processing model, including: acquiring a complex text processing model trained on the basis of a first training text library, wherein each sample in the first training text library comprises text data of the sample; obtaining a second training text library, wherein each sample in the second training text library comprises a category label of the sample and a word segmentation sequence of the sample, and the sample amount in the second training text library is smaller than that of the first training text library; converting the category labels and the word segmentation sequences of the samples in the second training text library into first sample sentence vectors by using the complex text processing model; and training a lightweight text processing model based on the category label, the word segmentation sequence, and the first sample sentence vector of each sample in the second training text library, wherein the complexity of the lightweight text processing model is lower than that of the complex text processing model.
An embodiment of the present disclosure provides an apparatus for processing text data, including: a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method described above.
An embodiment of the present disclosure provides an apparatus for simplifying a complex text processing model into a lightweight text processing model, including: a processor; a memory storing computer instructions that, when executed by the processor, implement the above-described method.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the above aspects or various alternative implementations of the above aspects.
The embodiment of the present disclosure provides a method for processing text data, which uses a lightweight text processing model with three branch models to quickly and accurately identify and classify text data such as titles, usernames, and profiles, thereby helping a content aggregator avoid recommending data sources with exaggerated, misleading, false, pornographic, vulgar, or policy-violating content to users, and further improving the quality of the content provided by a content aggregation platform.
The method for processing text data provided by the embodiment of the present disclosure also fuses information from the complex text processing model into the lightweight text processing model, so that text data can still be rapidly and accurately identified and classified while the lightweight text processing model keeps a low complexity, which improves the training speed and inference speed of the lightweight text processing model.
The embodiment of the present disclosure provides a method for simplifying a complex text processing model into a lightweight text processing model, which improves the processing efficiency of text data. In industrial applications, processing text data with the simplified lightweight text processing model can achieve accuracy and recall similar to those obtained with the complex text processing model, while greatly improving inference and training efficiency, so the method can be applied more widely on devices with limited computing power.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly introduced below. The drawings in the following description are merely exemplary embodiments of the disclosure.
Fig. 1 is an example diagram illustrating a scenario in which a data source is recommended to a user by analyzing text data related to the data source according to an embodiment of the present disclosure.
Fig. 2A is a flowchart illustrating a method of processing text data according to an embodiment of the present disclosure.
Fig. 2B is a schematic diagram illustrating a method of processing text data according to an embodiment of the present disclosure.
Fig. 2C and 2D are schematic diagrams illustrating a lightweight text processing model according to an embodiment of the disclosure.
FIG. 3A is a flow diagram illustrating a process of training a lightweight text processing model according to an embodiment of the disclosure.
FIG. 3B is a schematic diagram illustrating training a lightweight text processing model according to an embodiment of the disclosure.
Fig. 3C is a schematic diagram illustrating a complex text processing model according to an embodiment of the present disclosure.
Fig. 3D is a schematic diagram illustrating computing a processing loss according to an embodiment of the present disclosure.
FIG. 4A is a flow diagram illustrating a method of reducing a complex text processing model to a lightweight text processing model according to an embodiment of the disclosure.
FIG. 4B is a schematic diagram illustrating a method of reducing a complex text processing model to a lightweight text processing model according to an embodiment of the disclosure.
Fig. 5 is a block diagram illustrating an apparatus for processing text data according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.
The content aggregator described above may be Artificial Intelligence (AI) based. Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. For example, for an artificial intelligence based content aggregator, it can classify data sources in a manner similar to human reading of text data related to the data sources. The artificial intelligence enables the content aggregator to have the functions of sensing text data and reasoning and deciding the text data by researching the design principle and the implementation method of various intelligent machines.
The title of the data source, the publisher's username (nickname), profile, etc. may all be referred to as textual data associated with the data source. In particular, each data source also has different text data. For example, for data sources of the photo class and the article class, the textual data associated with the data source may include comments, titles, summaries, authors, author nicknames, and the like. For music-like data sources, its associated text data may include singers, composers, album names, music reviews, lyrics, and the like. For a video-like data source, its associated text data may include actors, directors, dramas, lines, movie titles, scripts, etc.
The content aggregator that processes the above-described text data employs a Natural Language Processing (NLP) technique. Natural language processing technology is an important direction in the fields of computer science and artificial intelligence, and can implement various theories and methods for effectively communicating between human and computer by using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like. Thus, based on natural language processing techniques, the content aggregator may analyze text data associated with the data sources, classify the text data, and identify poor text data (e.g., text data with exaggerated words, misleading, false, pornographic, vulgar, violating national policy and regulation) for further processing by operators of the content community.
Natural language processing techniques may also be based on Machine Learning (ML) and deep Learning. Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The natural language processing technology utilizes machine learning to study how a computer simulates or realizes the behavior of human learning language, acquires new knowledge or skills by analyzing the existing and classified text data, and reorganizes the existing knowledge structure to continuously improve the performance of the knowledge structure. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
Alternatively, each model available to the content aggregator hereinafter may be an artificial intelligence model, in particular a neural network model based on artificial intelligence. Typically, artificial intelligence based neural network models are implemented as acyclic graphs, with neurons arranged in different layers. Typically, the neural network model comprises an input layer and an output layer, which are separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are connected to nodes in adjacent layers via edges, and no edge exists between nodes within the same layer. Data received at a node of the input layer of a neural network is propagated to a node of the output layer via hidden layers, activation layers, pooling layers, convolutional layers, and the like. The input and output of the neural network model may take various forms, which the present disclosure does not limit.
The scheme provided by the embodiment of the disclosure relates to technologies such as artificial intelligence, natural language processing and machine learning, and is specifically described by the following embodiment.
Fig. 1 is an example schematic diagram illustrating a scenario 100 in which a data source is recommended to a user by analyzing textual data related to the data source, according to an embodiment of the present disclosure.
Currently, there are already a number of content aggregation and sharing platforms. A data source publisher can upload a data source to a server of a content aggregation and sharing platform through a network, so that the data source can be published on the content aggregation and sharing platform. The network may be an Internet of Things based on the Internet and/or a telecommunication network, and it may be a wired network or a wireless network, for example, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a cellular data communication network, or another electronic network capable of implementing information exchange functions.
As shown in fig. 1, a server of a content aggregation and sharing platform may receive data sources published by a plurality of data source publishers. Assume that there are two data source publishers (a publisher of data source A and a publisher of data source B) that upload data source A and data source B, respectively, to the server. The text data associated with data source A is: "A professional teaches you how to place an order, with a winning rate as high as 90%!". The text data associated with data source B is: "The provincial meteorological station has issued a yellow rainstorm warning, and most universities in the cities of this province are within the warning range!".
Using the method disclosed in the embodiments of the present disclosure, the server may determine that the content of data source A relates to gambling, which violates national policies and regulations. Therefore, the server may classify data source A as gambling, may restrict the account of the publisher of data source A, and may not publish data source A on the content aggregation and sharing platform.
The server may determine that data source B is classified as a weather forecast using the method disclosed in the embodiments of the present disclosure, and the server may also identify the issuer of data source B as an organization number. Thus, if the data source recipient subscribes to weather-related information, the server may push data source B to the data source recipient.
The method for processing text data provided by the embodiment of the present disclosure also fuses information from the complex text processing model into the lightweight text processing model, so that text data can still be rapidly and accurately identified and classified while the lightweight text processing model keeps a low complexity, which improves the training speed and inference speed of the lightweight text processing model.
The embodiment of the present disclosure improves the processing efficiency of text data by providing a method for simplifying a complex text processing model into a lightweight text processing model. In industrial applications, processing text data with the simplified lightweight text processing model can achieve accuracy and recall similar to those obtained with the complex text processing model, while greatly improving inference and training efficiency, so the method can be applied more widely on devices with limited computing power.
Fig. 2A is a flow chart illustrating a method 200 of processing text data according to an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating a method 200 of processing text data according to an embodiment of the present disclosure. Fig. 2C and 2D are schematic diagrams illustrating a lightweight text processing model according to an embodiment of the disclosure.
The method 200 of processing text data according to an embodiment of the present disclosure may be applied to any electronic device. It is understood that the electronic device may be any of various kinds of hardware devices, such as a Personal Digital Assistant (PDA), an audio/video device, a mobile phone, an MP3 player, a personal computer, a laptop computer, a server, and the like. For example, the electronic device may be the server in fig. 1, an application terminal of the publisher of data source A, an application terminal of the publisher of data source B, an application terminal of a recipient of the data sources, and the like. In the following, the present disclosure is described by taking a server as an example, but those skilled in the art should understand that the present disclosure is not limited thereto.
First, in step S201, the server acquires text data to be classified.
Optionally, the text data to be classified is associated with at least one data source, and the text data to be classified characterizes the data source in the form of text. The title of the data source, the publisher's username (nickname), profile, etc. may all be referred to as textual data associated with the data source. In particular, each data source also has different text data. For example, for data sources of the photo class and the article class, the textual data associated with the data source may include comments, titles, summaries, authors, author nicknames, and the like. For music-like data sources, its associated text data may include singers, composers, album names, music reviews, lyrics, and the like. For a video-like data source, its associated text data may include actors, directors, dramas, lines, movie titles, scripts, etc.
Next, in step S202, the server converts the text data to be classified into a numerical vector.
Referring to fig. 2B, the server may convert text data to be classified into a numerical vector through an Embedding (Embedding) operation. For example, the server may segment text data to be classified into a plurality of participles, then convert the participles into word vectors by word embedding (word embedding), and finally concatenate the word vectors as a numerical value vector.
Optionally, the server may also segment the text data to be classified into a plurality of participles, where the participles form a participle sequence. For example, assuming that the text data to be classified is "how a professional teaches you to order", the sequence of participles may be { professional, teach, you, how, order }.
The server then encodes each participle in the sequence of participles into a numerical value.
Alternatively, the server may convert each participle into a numerical value using a preset dictionary. The preset dictionary may be a set in which each element includes a participle and its corresponding numerical value. Each element in the dictionary is expressed as <participle, numerical value>. Suppose the dictionary is { <professional, 5>, <teach, 7>, <you, 4>, <how, 1>, <order, 2> }. Each participle in the sequence of participles may then be converted in turn into: 5, 7, 4, 1, 2.
Optionally, the server may also dynamically encode each participle in the sequence of participles. For example, the server may dynamically build a dictionary: whenever a participle is not yet included in the dictionary, it is encoded as the number of elements in the dictionary plus one and added to the dictionary. For example, for the above word segmentation sequence { professional, teach, you, how, order }, the server can dynamically construct the dictionary { <professional, 1>, <teach, 2>, <you, 3>, <how, 4>, <order, 5> }. Thus, each participle in the sequence of participles may be converted in turn into: 1, 2, 3, 4, 5. After dynamically building the dictionary, if a new word segmentation sequence { professional, teach, you, how, dressing } is obtained, the dictionary will correspondingly become { <professional, 1>, <teach, 2>, <you, 3>, <how, 4>, <order, 5>, <dressing, 6> }. From this dictionary, each participle in the new participle sequence can be converted in turn into: 1, 2, 3, 4, 6.
Then, the server combines the numerical values corresponding to each participle in the participle sequence to convert the participle sequence into a numerical value vector.
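As an illustration only, the dynamic encoding described above can be sketched in a few lines of Python; the function name, the example tokens, and the use of a plain Python dictionary are assumptions for this sketch and are not taken from the patent.

```python
def encode(tokens, vocab):
    """Dynamically grow the dictionary: an unseen token gets the id len(vocab) + 1."""
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
        ids.append(vocab[tok])
    return ids

vocab = {}
print(encode(["professional", "teach", "you", "how", "order"], vocab))     # [1, 2, 3, 4, 5]
print(encode(["professional", "teach", "you", "how", "dressing"], vocab))  # [1, 2, 3, 4, 6]
```

The resulting list of numerical values corresponds to the numerical vector that is fed into the lightweight text processing model.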
Next, in step S203, the server converts the numeric vector into a sentence vector using a lightweight text processing model.
The lightweight text processing model may be as shown in fig. 2C and 2D. Referring to fig. 2C and 2D, the lightweight text processing model may include three branch models: a first branch model, a second branch model, and a third branch model.
Step S203 may include: acquiring a first clause vector representing sequence information of the text data from the numerical vector by using a first branch model of the lightweight text processing model; acquiring a second clause vector representing the association relationship among words in the text data from the numerical vector by using a second branch model of the lightweight text processing model; acquiring a third clause vector representing keyword information in the text data from the numerical vector by using a third branch model of the lightweight text processing model; and fusing the first clause vector, the second clause vector, and the third clause vector into a sentence vector.
Alternatively, the first branch model may be a Bi-directional Long Short-Term Memory (Bi-LSTM) model. The Bi-LSTM model is a temporally recursive neural network that is suitable for processing and predicting sequence information in an ordered sequence, and it can effectively alleviate the long-range dependence problem of the traditional recurrent neural network. The Bi-LSTM model realizes two-way memory on the basis of a common long short-term memory model (LSTM only realizes forward memory, while Bi-LSTM realizes both forward and backward memory). Referring to fig. 2D, the Bi-LSTM model fully considers the sequential relationships between context words, and makes full use of bidirectional information, so that the sequence information of the text data can be fully extracted and a first clause vector containing the sequence information of the text data can be constructed.
Alternatively, the second branch model may be TextCNN (a text convolutional neural network model). The TextCNN model is a text processing model based on a convolutional neural network (CNN). A typical TextCNN model may include convolutional, pooling, and fully connected (FC) layers, which can effectively capture local co-occurrence features between words. Therefore, information indicating the association relationship between words in the text data (also referred to as co-occurrence information, as shown in fig. 2D) can be efficiently extracted from the numerical vector by using the TextCNN model, and a second clause vector containing the co-occurrence information of the text data can be constructed.
Optionally, the third branch model is a FastText model. The FastText model is a fast text classification algorithm that accelerates training and testing while maintaining high precision. The FastText model takes the similarity between words into account, which facilitates fast training of word vectors. The FastText model generally uses a single-layer neural network, so it has the advantage of high learning and prediction speed.
Optionally, as shown in fig. 2D, the third branch model may also be a modified FastText model. For example, the third branch model may be a model in which the FastText model and the TextBank model are mixed, such as a model obtained by connecting the FastText model and the TextBank model in series or in parallel. The TextBank model is a neural network model that can rapidly extract the weight of a keyword in a sentence, using the similarity between sentences. By mixing the FastText model and the TextBank model (for example, matching the weight of a keyword to the keyword by hashing), the third branch model can construct a third clause vector that contains the keyword information of the text data and capture the keyword information in the text data more accurately.
The present disclosure does not limit the manner in which the weight of a keyword is extracted. For example, the third branch model can also be a mixture of the FastText model and an N-gram model (a bag-of-words model), where the N-gram model extracts the weight of a keyword from the order of the words. The third branch model performs a weighted average of the word vectors acquired through the FastText model using the keyword weights acquired through the TextBank model, so as to construct a third clause vector containing the keyword information of the text data. As another example, the third branch model may extract the weights through TF-IDF (term frequency-inverse document frequency), which gives higher weight to words that appear frequently in the text data but rarely in the overall language environment. The third branch model may even combine the weight values obtained by the above three methods to obtain a more accurate third clause vector. A small sketch of such a keyword-weighted average is given below.
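Purely as an illustrative sketch of the keyword-weighted average described above, the snippet below averages word vectors using per-word weights; the word vectors, the weight table, and the example tokens are invented placeholders, and in practice the vectors would come from the FastText branch and the weights from a keyword-weighting method such as TF-IDF.

```python
import numpy as np

def weighted_sentence_vector(tokens, word_vecs, keyword_weights):
    # Stack the word vectors of the tokens and weight each one by its keyword weight.
    vecs = np.stack([word_vecs[t] for t in tokens])                  # (num_tokens, dim)
    w = np.array([keyword_weights.get(t, 1.0) for t in tokens])      # default weight 1.0
    return (w[:, None] * vecs).sum(axis=0) / w.sum()                 # weighted average

# Toy example with random 4-dimensional word vectors.
rng = np.random.default_rng(0)
word_vecs = {t: rng.normal(size=4) for t in ["storm", "warning", "issued"]}
weights = {"storm": 2.0, "warning": 1.5}    # hypothetical keyword weights (e.g. TF-IDF)
print(weighted_sentence_vector(["storm", "warning", "issued"], word_vecs, weights))
```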
The three branch models described above may be in a parallel relationship. Numerical value vectors are respectively input into the three branch models to obtain three clause vectors, and then the three clause vectors are fused to obtain a sentence vector.
For example, as shown in fig. 2C, each branch model may obtain a weighted clause vector through a fully connected layer. The three clause vectors are combined and then passed through a pooling layer (i.e., the three clause vectors are averaged) to be fused into the final sentence vector. Of course, other ways may also be used to fuse the clause vectors output by the three branch models, which is not limited in this application.
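The following PyTorch sketch illustrates one possible way to wire the three branch models and fuse their clause vectors by averaging; it is an assumption-laden illustration rather than the implementation described in the patent. The class name, layer sizes, and the max pooling inside the convolutional branch are invented for the example, and the keyword weighting of the third branch is omitted for brevity.

```python
import torch
import torch.nn as nn

class LightweightTextModel(nn.Module):
    """Sketch of a three-branch lightweight model: Bi-LSTM, TextCNN-style
    convolutions, and an averaged-embedding (FastText-like) branch."""
    def __init__(self, vocab_size, emb_dim=128, sent_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Branch 1: Bi-LSTM captures word-order (sequence) information.
        self.bilstm = nn.LSTM(emb_dim, sent_dim // 2, batch_first=True, bidirectional=True)
        self.lstm_fc = nn.Linear(sent_dim, sent_dim)
        # Branch 2: TextCNN captures local word co-occurrence information.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, sent_dim, k, padding=k // 2) for k in (2, 3, 4)])
        self.cnn_fc = nn.Linear(3 * sent_dim, sent_dim)
        # Branch 3: FastText-like branch based on averaged word embeddings.
        self.fast_fc = nn.Linear(emb_dim, sent_dim)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                        # (batch, seq_len, emb_dim)
        # First clause vector: final hidden states of the Bi-LSTM.
        _, (h, _) = self.bilstm(x)
        v1 = self.lstm_fc(torch.cat([h[0], h[1]], dim=-1))
        # Second clause vector: max-pooled convolution features.
        c = x.transpose(1, 2)                            # (batch, emb_dim, seq_len)
        feats = [torch.relu(conv(c)).max(dim=-1).values for conv in self.convs]
        v2 = self.cnn_fc(torch.cat(feats, dim=-1))
        # Third clause vector: mean of word embeddings, projected.
        v3 = self.fast_fc(x.mean(dim=1))
        # Fuse the three clause vectors by averaging (the pooling step).
        return torch.stack([v1, v2, v3], dim=0).mean(dim=0)

model = LightweightTextModel(vocab_size=10000)
sentence_vec = model(torch.tensor([[1, 2, 3, 4, 5]]))    # shape: (1, 256)
```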
Finally, in step S204, the server determines a category label of the text data based on the sentence vector.
For example, based on the sentence vector, the server may calculate the probability that the text data belongs to each category label using a fully connected layer/classification model. For example, based on the text data associated with data source A, "A professional teaches you how to place an order, with a winning rate as high as 90%!", the server calculates a sentence vector of the text data associated with data source A. Then, according to the values of the elements in the sentence vector, the probability that the category label of the sentence is gambling is calculated to be far higher than that of the other categories, and the category label of the text data is determined to be gambling. Similarly, if the text data associated with data source B is "The provincial meteorological station has issued a yellow rainstorm warning, and most universities in the cities of this province are within the warning range!", the server may determine the category label of that text data to be organization number in a similar manner.
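A minimal sketch of this classification step is shown below; the layer size, number of categories, and random sentence vector are placeholders, not values taken from the patent.

```python
import torch
import torch.nn as nn

num_classes = 8                              # hypothetical number of category labels
classifier = nn.Linear(256, num_classes)     # 256 is the assumed sentence-vector size

sentence_vec = torch.randn(1, 256)           # stands in for the fused sentence vector
probs = torch.softmax(classifier(sentence_vec), dim=-1)   # probability per category label
category = probs.argmax(dim=-1)              # index of the most probable category label
```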
Optionally, the method 200 further includes generating recommendation information for the data source based on the category label of the text data to be classified. The server automatically generates category labels for the text data to be classified using the method 200, and recommends data sources whose content is benign and compliant with national policies and regulations to users interested in data sources of that category. Therefore, while the exposure of benign content is increased, inferior content is prevented from being recommended to users, and the quality of the content provided by the content aggregation platform is improved.
According to the method for processing text data of the embodiment of the present disclosure, text data such as titles, usernames, and profiles can be rapidly and accurately identified and classified by using the lightweight text processing model with three branch models, thereby helping the content aggregator avoid recommending data sources with exaggerated, misleading, false, pornographic, vulgar, or policy-violating content to users, and improving the quality of the content provided by the content aggregation platform.
FIG. 3A is a flow diagram illustrating a process 300 of training a lightweight text processing model according to an embodiment of the disclosure. FIG. 3B is a schematic diagram illustrating training a lightweight text processing model according to an embodiment of the disclosure. Fig. 3C is a schematic diagram illustrating a complex text processing model according to an embodiment of the present disclosure. Fig. 3D is a schematic diagram illustrating computing a processing loss according to an embodiment of the present disclosure.
The lightweight text processing model is trained based on a complex text processing model, wherein the lightweight text processing model is less complex than the complex text processing model.
Referring to FIG. 3A, training a lightweight text processing model may include the following steps.
In step S301, a complex text processing model trained based on a first training text base is obtained, where each sample in the first training text base includes text data of the sample.
The complex text processing model may include a plurality of Transformers (converters). For example, the complex text processing model may be a BERT (Bidirectional Encoder Representations from Transformers) model as shown in fig. 3C. The BERT model is a semantic coding model: after training, corresponding semantic information (i.e., a sentence vector) can be obtained by inputting a character, word, or sentence. The BERT model uses bidirectional Transformers for language modeling, whereas traditional language models are unidirectional. Through the bidirectional Transformer structure, the BERT model can obtain a deeper understanding than a unidirectional language model, i.e., it captures more context information between characters within words and between words within sentences. Compared with a traditional language model, the BERT model has a stronger learning ability and a better prediction effect.
The BERT model may be a BERT-BASE model, which includes 12 operation layers, i.e., 12 Transformers, and each Transformer can perform feature extraction on the text data based on an attention mechanism and encode and decode the text data. The BERT-BASE model also includes 768 hidden units and 12 attention heads, and has about 110 million parameters. Furthermore, the BERT model may also be a BERT-LARGE model, which includes 24 Transformers, 1024 hidden units, and 16 attention heads, and has about 340 million parameters. The present disclosure is not limited as to which BERT model is used.
If the BERT-BASE model is used, the output will be 12 subvectors with 768 dimensions. The 12 subvectors can then be combined into one 768-dimensional vector using a pooling layer (e.g., a mean pooling layer) as the sentence vector output by the complex text processing model. If the BERT-LARGE model is used, its output will be 24 subvectors of dimension 1024. The 24 subvectors can then be combined into one 1024-dimensional vector using a pooling layer (e.g., mean pooling layer) as the sentence vector output by the complex text processing model. The present disclosure does not limit the dimensionality of the final output sentence vector.
The complex text processing model may be trained in advance using a first training text library. Each sample in the first training text library comprises text data of the sample. For example, the first training text library may be corpus information published on a network. For example, where the complex text processing model is a BERT model, the first training text library may be Wikipedia, which includes approximately 2.5 billion words and the corresponding interpretations of those words. The first training text library may also include text data obtained from various public or private databases, such as encyclopedias, dictionaries, news, and question-and-answer data. The sample size of the first training text library is typically large. Thus, a complex text processing model trained on the first training text library contains a general understanding of prior knowledge (e.g., most words, terms, etc.). Therefore, processing text data (especially short text data) with the complex text processing model can yield relatively accurate sentence vectors containing sufficient information.
Due to the large number of parameters of the complex text processing model, directly using it in industry results in low training and inference efficiency. In order to improve the processing efficiency of the text processing model in industrial settings, the complex text processing model needs to be compressed and simplified to obtain a lightweight text processing model that can be used industrially. The lightweight text processing model obtained by compression and simplification has fewer parameters and also incorporates the complex text processing model's understanding of the first training text library, so the training and inference efficiency is improved. Meanwhile, the lightweight text processing model and the complex text processing model can obtain approximately the same sentence vectors for the same input, so the accuracy of text data processing is ensured.
In step S302, information in the complex text processing model about a first training text base is fused to the lightweight text processing model.
Optionally, step S302 may further include step S3021, step S3022, and step S3023.
In step S3021, a second training text library is obtained, where each sample in the second training text library includes the class label of the sample and the word segmentation sequence of the sample, and the sample amount in the second training text library is smaller than the sample amount in the first training text library.
Several examples of samples in the second training text library are given below. The category labels of some samples and the word segmentation sequences of those samples are shown in the following example, in which the participles in each word segmentation sequence are separated by slashes.
[Example table of sample category labels and slash-separated word segmentation sequences, reproduced as an image in the original publication.]
The samples in the second training text library may be collected from a content aggregation platform. For example, text data of the data source may be marked by manual review and user reporting. And then, the text data with the category labels are subjected to word segmentation processing and then stored in a second training text library. The present disclosure does not limit the expression form of the word sequence and the manner of obtaining the samples in the second training text base.
In step S3022, the word segmentation sequence of the samples in the second training text base is converted into the first sample sentence vector by using the complex text processing model.
Optionally, the word segmentation sequence may be preprocessed before the samples in the second training text library are input into the complex text processing model. The word segmentation sequence may be encoded, for example, using the BERT tokenizer (a segmentation tool built into the BERT model). For example, assume that the word segmentation sequence includes three participles { fund, combination, info }. Each of these three participles is assigned a code, e.g., "fund" may have a code of 0, "combination" may have a code of 1, and "info" may have a code of 2. Thus, the participle-code dictionary corresponding to the word segmentation sequence may be { fund-0, combination-1, info-2 }. It should be understood by those skilled in the art that this dictionary may be the same as or different from the dictionary in step S202, and the disclosure is not limited thereto.
Each participle in the word segmentation sequence is then replaced based on the dictionary, converting each participle into a numerical value and yielding a numerical value sequence corresponding to the word segmentation sequence. The numerical value sequence is input into the complex text processing model. In the case where the complex text processing model is a BERT-BASE model, the numerical value sequence is encoded and decoded by the 12 Transformers in the BERT-BASE model to extract the text features of the sample, and is labeled using the category label, so as to form a first sample sentence vector including the text features of the sample.
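As an illustration only, the following sketch uses the open-source Hugging Face transformers library (which the patent does not name) to obtain a teacher sentence vector from a BERT-BASE checkpoint; the checkpoint name, the example text, and the choice of mean pooling over the 12 Transformer layers are assumptions made for this sketch.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
teacher = BertModel.from_pretrained("bert-base-chinese")
teacher.eval()

text = "基金 组合 资讯"                      # a word-segmented sample joined with spaces
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = teacher(**inputs, output_hidden_states=True)

# hidden_states holds the embedding layer plus the outputs of the 12 Transformer layers.
layer_vecs = [h.mean(dim=1) for h in out.hidden_states[1:]]      # one 768-d vector per layer
first_sample_sentence_vec = torch.stack(layer_vecs).mean(dim=0)  # pooled teacher vector, shape (1, 768)
```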
In step S3023, a lightweight text processing model is trained based on the category label, the segmentation sequence, and the first sample sentence vector for each sample in the second training text corpus.
Referring to fig. 3B, training the lightweight text processing model includes: converting the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of the sample by using the lightweight text processing model, and determining the processing loss of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector.
The word segmentation sequence corresponding to the sample in the second training text library may be encoded by the method described in step S202 to obtain a numerical vector. The numeric vector is then converted to a second sample sentence vector by a lightweight text processing model.
The processing loss of the lightweight text processing model relative to the complex text processing model may then be determined by comparing the difference between the first sample sentence vector and the second sample sentence vector.
For example, a processing loss of the lightweight text processing model may be determined using a loss function.
The processing loss may be an L2 loss (a squared loss function) over Euclidean space. For example, denote the i-th element of the first sample sentence vector as T(x_i) and the i-th element of the second sample sentence vector as S(x_i). The first sample sentence vector and the second sample sentence vector have the same dimension N, with 0 < i ≤ N.

The loss function corresponding to the L2 loss (denoted as L_2) can be written as:

    L_2 = Σ_{j=1}^{N} (T(x_j) − S(x_j))²        (1)

where 0 < j ≤ N.
The processing loss can also be a kernel loss on the RKHS space (reproducing kernel Hilbert space); that is, the loss function is a kernel function based on the reproducing kernel Hilbert space, and the processing loss is a kernel loss. The L2 loss calculation described above involves a large number of inner product terms and is computationally expensive. Therefore, the L2 loss on Euclidean space can be converted into a kernel loss on the RKHS space for calculation.
In calculating the kernel loss over the RKHS space, kernel functions can be used instead of the inner product terms to simplify the computational effort. Meanwhile, the RKHS space is a high-dimensional space, which is often capable of capturing more association information between vectors.
Suppose that the kernel loss is calculated using a kernel function K(m, n), where m and n represent different parameters. The loss function corresponding to the kernel loss (denoted as L_kernel) can then be written as:

    L_kernel = Σ_{i=1}^{N} [ K(T(x_i), T(x_i)) − 2·K(T(x_i), S(x_i)) + K(S(x_i), S(x_i)) ]        (2)

Since the kernel function K(m, n) has reproducibility, positivity, and symmetry, equation (2) can be rewritten using the kernel trick as:

    L_kernel = Σ_{i=1}^{N} ‖ φ(T(x_i)) − φ(S(x_i)) ‖²        (3)

where φ(T(x_i)) denotes mapping T(x_i) to the RKHS space and φ(S(x_i)) denotes mapping S(x_i) to the RKHS space.

For example, assuming that the kernel function K(m, n) is a Gaussian kernel function, the Gaussian kernel function can be represented by the following formula:

    K(m, n) = exp( −(m − n)² / (2σ²) )        (4)

Then equation (2) above can be written as:

    L_kernel = Σ_{i=1}^{N} [ 2 − 2·exp( −(T(x_i) − S(x_i))² / (2σ²) ) ]        (5)

It can be seen that the amount of calculation of formula (5) is greatly reduced relative to the amount of calculation of formula (1).
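Purely as an illustration of how these two losses could be computed for a pair of teacher and student sentence vectors, the snippet below implements equations (1) and (5) element-wise in PyTorch; the tensor shapes, the value of σ, and the random example vectors are assumptions for the sketch.

```python
import torch

def l2_loss(t, s):
    # Equation (1): sum of squared element-wise differences between teacher and student vectors.
    return ((t - s) ** 2).sum()

def gaussian_kernel_loss(t, s, sigma=1.0):
    # Equation (5): element-wise kernel loss with a Gaussian kernel, where K(m, m) = 1.
    return (2.0 - 2.0 * torch.exp(-((t - s) ** 2) / (2 * sigma ** 2))).sum()

teacher_vec = torch.randn(256)   # stands in for a first sample sentence vector
student_vec = torch.randn(256)   # stands in for a second sample sentence vector
print(l2_loss(teacher_vec, student_vec), gaussian_kernel_loss(teacher_vec, student_vec))
```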
Furthermore, the processing loss may also be a cosine loss. The present disclosure does not further limit the processing loss.
Optionally, the processing loss may also be calculated between the first sample sentence vector and each of the three clause vectors obtained through the three branch models of the lightweight text processing model, and the three processing loss components may then be added to obtain the final processing loss.
That is, the total processing loss of the model, L_processing, can be calculated using the following equation (6):

    L_processing = L_first branch + L_second branch + L_third branch        (6)

where L_first branch represents the processing loss between the first sample sentence vector and the first clause vector, L_second branch represents the processing loss between the first sample sentence vector and the second clause vector, and L_third branch represents the processing loss between the first sample sentence vector and the third clause vector.
The processing loss can then be minimized by updating parameters in the lightweight text processing model. For example, parameters in the lightweight text processing model may be updated iteratively, with each iteration attempting to reduce processing losses. When the processing loss converges, it can be stated that the training of the lightweight text processing model is completed using the second training text library.
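A minimal, self-contained sketch of this iterative update is shown below; the toy student model, the synthetic teacher vectors, the optimizer choice, and the number of epochs are all invented for the example, and the per-branch losses of equation (6) are collapsed into a single loss on the fused vector for brevity.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a tiny student model and synthetic (token_ids, teacher_vector) pairs.
student = nn.Sequential(nn.Embedding(100, 32), nn.Flatten(), nn.Linear(32 * 8, 256))
data = [(torch.randint(1, 100, (4, 8)), torch.randn(4, 256)) for _ in range(16)]

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
for epoch in range(3):                          # iterate until the processing loss converges
    for token_ids, teacher_vec in data:
        student_vec = student(token_ids)        # second sample sentence vector
        loss = ((teacher_vec - student_vec) ** 2).sum()   # L2 processing loss, equation (1)
        optimizer.zero_grad()
        loss.backward()                         # update parameters to reduce the processing loss
        optimizer.step()
```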
Furthermore, in industrial practice, the lightweight text processing model (i.e., the downstream task in FIG. 3B) can also be trained dynamically. For example, more text data and corresponding category labels are collected from the content aggregation platform, the text data is converted into numerical vectors, and these are continuously input into the lightweight text processing model. A predicted category label of the text data is calculated from the sentence vector output by the lightweight text processing model. The predicted category label is compared with the category label corresponding to the text data in order to adjust the parameters of the lightweight text processing model.
Therefore, the method for processing text data provided by the embodiment of the present disclosure also fuses information from the complex text processing model into the lightweight text processing model, so that text data can still be quickly and accurately recognized and classified while the lightweight text processing model keeps a low complexity, which improves the training speed and inference speed of the lightweight text processing model.
FIG. 4A is a flow diagram illustrating a method 400 of reducing a complex text processing model to a lightweight text processing model according to an embodiment of the disclosure. FIG. 4B is a schematic diagram illustrating a method 400 of reducing a complex text processing model to a lightweight text processing model according to an embodiment of the disclosure.
Referring to FIG. 4A, the reduction of a complex text processing model to a lightweight text processing model may include the following steps.
In step S401, a complex text processing model trained based on a first training text base is obtained, where each sample in the first training text base includes text data of the sample.
The complex text processing model may include a plurality of Transformers (converters). For example, the complex text processing model may be a BERT (Bidirectional Encoder Representations from Transformers) model. The BERT model is a semantic coding model: after training, corresponding semantic information (i.e., a sentence vector) can be obtained by inputting a character, word, or sentence. The BERT model uses bidirectional Transformers for language modeling, whereas traditional language models are unidirectional. Through the bidirectional Transformer structure, the BERT model can obtain a deeper understanding than a unidirectional language model, i.e., it captures more context information between characters within words and between words within sentences. Compared with a traditional language model, the BERT model has a stronger learning ability and a better prediction effect.
The BERT model may be a BERT-BASE model, which includes 12 operation layers, i.e., 12 Transformers, and each Transformer can perform feature extraction on the text data based on an attention mechanism and encode and decode the text data. The BERT-BASE model also includes 768 hidden units and 12 attention heads, and has about 110 million parameters. Furthermore, the BERT model may also be a BERT-LARGE model, which includes 24 Transformers, 1024 hidden units, and 16 attention heads, and has about 340 million parameters. The present disclosure is not limited as to which BERT model is used.
The complex text processing model may be trained in advance using a first training text library. Each sample in the first training text library comprises text data of the sample. For example, the first training text library may be corpus information published on a network. For example, where the complex text processing model is a BERT model, the first training text library may be Wikipedia, which includes approximately 2.5 billion words and the corresponding interpretations of those words. The first training text library may also include text data obtained from various public or private databases, such as encyclopedias, dictionaries, news, and question-and-answer data. The sample size of the first training text library is typically large. Thus, a complex text processing model trained on the first training text library contains a general understanding of prior knowledge (e.g., most words, terms, etc.). Therefore, processing text data (especially short text data) with the complex text processing model can yield relatively accurate sentence vectors containing sufficient information.
Due to the large number of parameters of the complex text processing model, directly using it in industry results in low training and inference efficiency. In order to improve the processing efficiency of the text processing model in industrial settings, the complex text processing model needs to be compressed and simplified to obtain a lightweight text processing model that can be used industrially. The lightweight text processing model obtained by compression and simplification has fewer parameters and also incorporates the complex text processing model's understanding of the first training text library, so the training and inference efficiency is improved. Meanwhile, the lightweight text processing model and the complex text processing model can obtain approximately the same sentence vectors for the same input, so the accuracy of text data processing is ensured.
In step S402, a second training text library is obtained, where each sample in the second training text library includes the category label of the sample and the word segmentation sequence of the sample, and the sample amount in the second training text library is smaller than the sample amount in the first training text library.
The samples in the second training text library may be collected from a content aggregation platform. For example, text data of the data source may be marked by manual review and user reporting. And then, the text data with the category labels are subjected to word segmentation processing and then stored in a second training text library. The present disclosure does not limit the expression form of the word sequence and the manner of obtaining the samples in the second training text base.
In step S403, the complex text processing model is used to convert the word segmentation sequence of the samples in the second training text base into the first sample sentence vector.
Optionally, the word segmentation sequence may be preprocessed before the samples in the second training text library are input into the complex text processing model. The word segmentation sequence may be encoded, for example, using the BERT tokenizer (a segmentation tool built into the BERT model). For example, assume that the word segmentation sequence includes three participles { fund, combination, info }. Each of these three participles is assigned a code, e.g., "fund" may have a code of 0, "combination" may have a code of 1, and "info" may have a code of 2. Thus, the participle-code dictionary corresponding to the word segmentation sequence may be { fund-0, combination-1, info-2 }. It should be understood by those skilled in the art that this dictionary may be the same as or different from the dictionary in step S202, and the disclosure is not limited thereto.
Based on this dictionary, each word in the segmentation sequence is replaced with its numerical code, yielding a numerical sequence corresponding to the segmentation sequence. The numerical sequence is then input into the complex text processing model. Where the complex text processing model is a BERT-BASE model, the numerical sequence is subjected to encoding and decoding operations by the 12 converters (transformer layers) in the BERT-BASE model to extract the text features of the sample, which are labeled with the sample's category label to form a first sample sentence vector containing those text features.
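A hedged sketch of obtaining the first sample sentence vector from a BERT-BASE teacher via the Hugging Face transformers library. The "bert-base-chinese" checkpoint and the use of the [CLS] hidden state as the sentence vector are assumptions made for illustration; the patent only states that the 12 converter layers encode the sequence into a sentence vector.

```python
# Produce a teacher (first sample) sentence vector with a pretrained BERT-BASE model.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
teacher = BertModel.from_pretrained("bert-base-chinese")
teacher.eval()

text = "基金组合资讯"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = teacher(**inputs)

# Use the [CLS] position of the last hidden layer as the sentence vector.
first_sample_sentence_vector = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)
```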
In step S404, a lightweight text processing model is trained based on the category label, the segmentation sequence, and the first sample sentence vector for each sample in the second training text library.
The lightweight text processing model comprises a first branch model used for extracting sequence information of a word segmentation sequence, a second branch model used for extracting association information between words in the word segmentation sequence, and a third branch model used for extracting keyword information in the word segmentation sequence.
Training the lightweight text processing model includes: converting the category label and the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of that sample by using the lightweight text processing model, and determining the processing loss of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector.
The word segmentation sequence corresponding to the sample in the second training text library may be encoded by the method described in step S202 to obtain a numerical vector. The numeric vector is then converted to a second sample sentence vector by a lightweight text processing model.
For example, converting the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of that sample by using the lightweight text processing model further comprises: acquiring a first sample clause vector representing the sequence information of the text data from the word segmentation sequence by using the first branch model of the lightweight text processing model; acquiring a second sample clause vector representing the association relation among the words in the text data from the word segmentation sequence by using the second branch model of the lightweight text processing model; acquiring a third sample clause vector representing the keyword information in the text data from the word segmentation sequence by using the third branch model of the lightweight text processing model; and fusing the first sample clause vector, the second sample clause vector, and the third sample clause vector into the second sample sentence vector, as in the sketch below.
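The patent does not specify the internal architecture of the three branch models, so the following PyTorch sketch is purely illustrative: a bidirectional GRU stands in for the sequence branch, multi-head self-attention for the association branch, and attention pooling for the keyword branch. The layer sizes, vocabulary size, and the linear fusion layer are likewise assumptions.

```python
# A minimal three-branch lightweight model that fuses three clause vectors
# into one sentence vector of the same dimension as the BERT teacher (768).
import torch
import torch.nn as nn

class LightweightTextModel(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, out_dim=768):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Branch 1: sequence (word-order) information.
        self.seq_branch = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
        # Branch 2: association between words.
        self.assoc_branch = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        # Branch 3: keyword information via learned attention pooling.
        self.keyword_score = nn.Linear(emb_dim, 1)
        # Fuse the three clause vectors into one sentence vector.
        self.fuse = nn.Linear(2 * emb_dim + emb_dim + emb_dim, out_dim)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)                    # (batch, seq_len, emb_dim)
        seq_out, _ = self.seq_branch(x)
        clause1 = seq_out.mean(dim=1)                    # sequence-information clause vector
        attn_out, _ = self.assoc_branch(x, x, x)
        clause2 = attn_out.mean(dim=1)                   # association clause vector
        weights = torch.softmax(self.keyword_score(x), dim=1)
        clause3 = (weights * x).sum(dim=1)               # keyword clause vector
        sentence_vector = self.fuse(torch.cat([clause1, clause2, clause3], dim=-1))
        return sentence_vector, (clause1, clause2, clause3)
```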
The processing loss of the lightweight text processing model relative to the complex text processing model may then be determined by comparing the difference between the first sample sentence vector and the second sample sentence vector.
For example, the processing loss of the lightweight text processing model may be determined using a loss function.
The processing loss may be an L2 loss (squared loss) in Euclidean space, a kernel loss in an RKHS (reproducing kernel Hilbert space), or a cosine loss. The present disclosure does not further limit the processing loss.
Optionally, the loss function may also compute a processing loss component between the first sample sentence vector and each of the three clause vectors produced by the three branch models of the lightweight text processing model, and add the three processing loss components to obtain the final processing loss, as illustrated below.
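The following is a hedged illustration of the candidate processing losses. The L2 and cosine forms are standard; the kernel loss is sketched as the squared distance in the feature space of an RBF kernel, which is one plausible reading of the RKHS loss rather than the patent's exact formula, and the branch-wise variant assumes the clause vectors have already been projected to the teacher's dimension.

```python
import torch
import torch.nn.functional as F

def l2_loss(teacher_vec, student_vec):
    # Squared (L2) loss in Euclidean space.
    return F.mse_loss(student_vec, teacher_vec)

def cosine_loss(teacher_vec, student_vec):
    # 1 - cosine similarity, averaged over the batch.
    return 1.0 - F.cosine_similarity(student_vec, teacher_vec, dim=-1).mean()

def rbf_kernel_loss(teacher_vec, student_vec, sigma=1.0):
    # ||phi(x) - phi(y)||^2 in the RKHS of an RBF kernel k(x, y):
    # k(x, x) + k(y, y) - 2 k(x, y) = 2 - 2 * exp(-||x - y||^2 / (2 sigma^2)).
    sq_dist = ((student_vec - teacher_vec) ** 2).sum(dim=-1)
    return (2.0 - 2.0 * torch.exp(-sq_dist / (2.0 * sigma ** 2))).mean()

def branchwise_loss(teacher_vec, projected_clauses, loss_fn=l2_loss):
    # Optional variant: compare the teacher sentence vector with each clause
    # vector (assumed already projected to the teacher's dimension) and sum.
    return sum(loss_fn(teacher_vec, clause) for clause in projected_clauses)
```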
The processing loss can then be minimized by updating the parameters of the lightweight text processing model. When the processing loss converges, the training of the lightweight text processing model on the second training text library can be considered complete.
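A minimal distillation loop under the assumptions above might look as follows. It reuses the LightweightTextModel and l2_loss sketches from earlier; the train_batches iterator is hypothetical, and the optimizer, learning rate, and convergence threshold are illustrative choices, not values taken from the patent.

```python
import torch

student = LightweightTextModel()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

prev_loss, tol = float("inf"), 1e-4
for epoch in range(100):
    epoch_loss = 0.0
    for token_ids, teacher_vec in train_batches:         # hypothetical iterator of batches
        student_vec, _ = student(token_ids)
        loss = l2_loss(teacher_vec, student_vec)          # processing loss vs. teacher vector
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if abs(prev_loss - epoch_loss) < tol:                 # treat the loss as converged
        break
    prev_loss = epoch_loss
```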
In addition, in industrial practice, the lightweight text processing model can also be trained dynamically. For example, more text data and the corresponding category labels are continuously collected from the content aggregation platform; the labeled text data are converted into numerical vectors and input into the lightweight text processing model to obtain sentence vectors. A predicted category label of the text data is then computed by normalizing the sentence vector. The predicted category label is compared with the reviewed category label of the text data to adjust the parameters of the lightweight text processing model.
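The dynamic (online) training step could be sketched as below. The linear classification head with a softmax-based cross-entropy loss is an illustrative stand-in for "normalizing the sentence vector" to obtain a predicted category label; the number of classes is assumed, and the sketch reuses the student model defined above.

```python
import torch
import torch.nn as nn

num_classes = 5                                           # illustrative number of category labels
classifier = nn.Linear(768, num_classes)
online_optim = torch.optim.Adam(
    list(student.parameters()) + list(classifier.parameters()), lr=1e-4
)

def online_update(token_ids, label_ids):
    # token_ids: (batch, seq_len) encoded word sequences from the platform;
    # label_ids: (batch,) long tensor of reviewed category labels.
    sentence_vec, _ = student(token_ids)
    logits = classifier(sentence_vec)                     # softmax is applied inside the loss
    loss = nn.functional.cross_entropy(logits, label_ids)
    online_optim.zero_grad()
    loss.backward()
    online_optim.step()
    return loss.item()
```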
Therefore, according to the method for simplifying a complex text processing model into a lightweight text processing model provided by the embodiments of the present disclosure, the information in the complex text processing model is fused into the lightweight text processing model, so that text data can still be identified and classified quickly and accurately despite the low complexity of the lightweight text processing model, while the training speed and inference speed of the lightweight text processing model are improved.
Fig. 5 is a block diagram illustrating an apparatus 500 for processing text data according to an embodiment of the present disclosure.
Referring to fig. 5, device 500 may include a processor 501 and a memory 502. The processor 501 and the memory 502 may be connected by a bus 503.
The processor 501 may perform various actions and processes according to the programs stored in the memory 502. In particular, the processor 501 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logical blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 or ARM architecture.
The memory 502 stores computer instructions that, when executed by the processor 501, implement the method 200. The memory 502 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchronous Link Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DR RAM). It should be noted that the memory of the methods described in this disclosure is intended to comprise, without being limited to, these and any other suitable types of memory.
The device for simplifying a complex text processing model into a lightweight text processing model provided by the embodiments of the present disclosure has the same or a similar structure as the device 500, and therefore its description is not repeated here.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the above aspects or in the various alternative implementations of the above aspects.
The embodiments of the present disclosure provide a method for processing text data that uses a lightweight text processing model with three branch models to quickly and accurately identify and classify text data such as titles, user names, and profile introductions, thereby helping a content aggregator avoid recommending to users data sources that involve exaggeration, misleading content, counterfeiting, pornography, vulgarity, or violations of national policies and regulations, and thus improving the quality of the content provided by the content aggregation platform.
The method for processing text data provided by the embodiments of the present disclosure also fuses the information in the complex text processing model into the lightweight text processing model, so that text data can still be identified and classified quickly and accurately despite the low complexity of the lightweight text processing model, while the training speed and inference speed of the lightweight text processing model are improved.
The embodiments of the present disclosure further provide a method for simplifying a complex text processing model into a lightweight text processing model, which improves the efficiency of text data processing. In industrial applications, processing text data with the simplified lightweight text processing model can achieve accuracy and recall close to those obtained with the complex text processing model, while greatly improving inference and training efficiency, so that the method can be applied more widely on devices with limited computing power.
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention, as set forth in detail above, are intended to be illustrative, not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof may be made without departing from the principles and spirit of the invention, and that such modifications are intended to be within the scope of the invention.

Claims (15)

1. A method of processing text data, comprising:
acquiring text data to be classified;
converting the text data to be classified into numerical vectors;
converting the numerical value vector into a sentence vector by using a lightweight text processing model; and
determining a category label of the text data based on the sentence vector;
wherein converting the numeric vector into a sentence vector using the lightweight text processing model comprises:
acquiring a first clause vector representing sequence information of the text data from the numerical value vector by using a first branch model of the lightweight text processing model;
acquiring a second clause vector representing the association relation among words in the text data from the numerical value vector by using a second branch model of the lightweight text processing model;
acquiring a third clause vector representing keyword information in the text data from the numerical value vector by using a third branch model of the lightweight text processing model;
and fusing the first clause vector, the second clause vector and the third clause vector into a sentence vector.
2. The method of claim 1, wherein the lightweight text processing model is trained based on a complex text processing model, wherein the lightweight text processing model is less complex than the complex text processing model, the training comprising:
acquiring a complex text processing model trained on the basis of a first training text base, wherein each sample in the first training text base comprises text data of the sample;
fusing information in the complex text processing model about a first training text base to the lightweight text processing model.
3. The method of claim 2, wherein the fusing of information in the complex text processing model to the lightweight text processing model comprises:
obtaining a second training text library, wherein each sample in the second training text library comprises a class label of the sample and a word segmentation sequence of the sample, and the sample amount in the second training text library is smaller than that of the first training text library;
converting the word segmentation sequence of the samples in the second training text base into a first sample sentence vector by using the complex text processing model; and
training a lightweight text processing model based on the category label, the segmentation sequence and the first sample sentence vector of each sample in the second training text base.
4. The method of claim 3, wherein the training a lightweight text processing model comprises:
converting the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of the sample by using a lightweight text processing model,
determining a processing loss of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector;
updating parameters in the lightweight text processing model to minimize the processing loss.
5. The method of claim 1, wherein the converting the text data to be classified into a numerical vector further comprises:
dividing the text data to be classified into a plurality of participles, wherein the participles form a participle sequence;
encoding each participle in the participle sequence into a numerical value;
and combining the numerical values corresponding to each participle in the participle sequence to convert the participle sequence into a numerical value vector.
6. The method of claim 1, wherein the text data to be classified is associated with at least one data source, and the text data to be classified characterizes the data source in the form of text.
7. The method of claim 6, further comprising: and generating recommendation information of the data source based on the category label of the text data to be classified.
8. The method of claim 2, wherein the complex text processing model comprises a plurality of converters.
9. The method of claim 3, wherein the determining a processing loss of the lightweight text processing model further comprises:
determining a processing loss of the lightweight text processing model using a loss function;
wherein the loss function is a kernel function based on a reproducing kernel Hilbert space, and the processing loss is a kernel loss.
10. A method of reducing a complex text processing model to a lightweight text processing model, comprising:
acquiring a complex text processing model trained on the basis of a first training text base, wherein each sample in the first training text base comprises text data of the sample;
obtaining a second training text library, wherein each sample in the second training text library comprises a class label of the sample and a word segmentation sequence of the sample, and the sample amount in the second training text library is smaller than that of the first training text library;
converting the word segmentation sequence of the samples in the second training text base into a first sample sentence vector by using the complex text processing model; and
training a lightweight text processing model based on the class label, the segmentation sequence and the first sample sentence vector of each sample in a second training text library, wherein the lightweight text processing model is less complex than a complex text processing model.
11. The method of claim 10, wherein the training a lightweight text processing model comprises:
converting the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of the sample by using a lightweight text processing model,
determining a processing loss of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector;
updating parameters in the lightweight text processing model to minimize the processing loss.
12. The method of claim 11, wherein said converting, using the lightweight text processing model, the sequence of participles for each sample in a second library of training texts into a second sample sentence vector for the sample comprises:
acquiring a first sample clause vector representing sequence information of the text data from the word segmentation sequence by using a first branch model of the lightweight text processing model;
acquiring a second sample clause vector representing the association relation among the words in the text data from the word segmentation sequence by utilizing a second branch model of the lightweight text processing model;
acquiring a third sample clause vector representing keyword information in the text data from the word segmentation sequence by using a third branch model of the lightweight text processing model;
and merging the first sample clause vector, the second sample clause vector and the third sample clause vector into the second sample sentence vector.
13. An apparatus for processing text data, comprising:
a processor; and
memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-9.
14. An apparatus for reducing a complex text processing model to a lightweight text processing model, comprising:
a processor; and
memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 10-12.
15. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1-12.
CN202010655433.6A 2020-07-09 2020-07-09 Method and device for processing text data Pending CN113919338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010655433.6A CN113919338A (en) 2020-07-09 2020-07-09 Method and device for processing text data

Publications (1)

Publication Number Publication Date
CN113919338A true CN113919338A (en) 2022-01-11

Family

ID=79231777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010655433.6A Pending CN113919338A (en) 2020-07-09 2020-07-09 Method and device for processing text data

Country Status (1)

Country Link
CN (1) CN113919338A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468037A (en) * 2023-03-17 2023-07-21 北京深维智讯科技有限公司 NLP-based data processing method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7251777B1 (en) * 2003-04-16 2007-07-31 Hypervision, Ltd. Method and system for automated structuring of textual documents
US20160188686A1 (en) * 2012-12-28 2016-06-30 Xsb, Inc. Systems and methods for creating, editing, storing and retrieving knowledge contained in specification documents
CN108427754A (en) * 2018-03-15 2018-08-21 京东方科技集团股份有限公司 A kind of information-pushing method, computer storage media and terminal
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN109857865A (en) * 2019-01-08 2019-06-07 北京邮电大学 A kind of file classification method and system
CN110597991A (en) * 2019-09-10 2019-12-20 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111046656A (en) * 2019-11-15 2020-04-21 北京三快在线科技有限公司 Text processing method and device, electronic equipment and readable storage medium
CN111339751A (en) * 2020-05-15 2020-06-26 支付宝(杭州)信息技术有限公司 Text keyword processing method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEZHI PENG: "A Fast and Accurate Fully Convolutional Network for End-to-End Handwritten Chinese Text Segmentation and Recognition", 2019 International Conference on Document Analysis and Recognition (ICDAR), 29 February 2020 (2020-02-29) *
毛文梁 (MAO Wenliang): "Research on Deep Learning Text Classification Technology Incorporating Topic Features" (融合主题特征的深度学习文本分类技术研究), China Master's Theses Full-text Database (Information Science and Technology), No. 2, 15 February 2020 (2020-02-15) *

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40065970

Country of ref document: HK

SE01 Entry into force of request for substantive examination