CN113919338B - Method and device for processing text data - Google Patents

Method and device for processing text data

Info

Publication number
CN113919338B
CN113919338B (application CN202010655433.6A)
Authority
CN
China
Prior art keywords
text
processing model
model
lightweight
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010655433.6A
Other languages
Chinese (zh)
Other versions
CN113919338A (en)
Inventor
彭颖鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010655433.6A priority Critical patent/CN113919338B/en
Publication of CN113919338A publication Critical patent/CN113919338A/en
Application granted granted Critical
Publication of CN113919338B publication Critical patent/CN113919338B/en


Classifications

    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A method and apparatus for processing text data, a method and apparatus for simplifying a complex text processing model into a lightweight text processing model, and a computer-readable storage medium are disclosed. The method for processing text data comprises the following steps: acquiring text data to be classified; converting the text data to be classified into a numerical vector; converting the numerical vector into a sentence vector by using a lightweight text processing model; and determining a category label of the text data based on the sentence vector. By utilizing a lightweight text processing model with three branch models, the method can rapidly and accurately identify and classify the text data.

Description

Method and device for processing text data
Technical Field
The present disclosure relates to the field of artificial intelligence services, and more particularly, to a method and apparatus for processing text data and a computer-readable storage medium. The present disclosure also relates to a method and apparatus for simplifying a complex text processing model into a lightweight text processing model.
Background
There is currently a vast amount of information on the Internet. Many mobile-side applications have built-in content aggregators, which aggregate information that has been published on the applications. The content aggregation server corresponding to a content aggregator can push corresponding data sources, such as articles, pictures, long videos, short videos, and music, to a user according to the user's subscription information, interests, and the like.
Currently, in order to attract readers or viewers, some data source publishers (such as public account bloggers, video account bloggers, and music creators) may add titles that are exaggerated, misleading, false, pornographic, vulgar, or in violation of national policies and regulations to the data sources they publish. Some data source publishers may even set false, fraudulent, or misleading usernames (nicknames) and profiles to attract readers or viewers.
If such content appears in large quantities, the overall quality of the content and the user's experience with the application are degraded, which negatively affects the content aggregation product. At present, text information such as titles, usernames, and profiles is mainly identified and classified through manual auditing and user reporting, which has a low identification rate and a high cost.
Disclosure of Invention
Embodiments of the present disclosure provide a method and apparatus for processing text data, a method and apparatus for simplifying a complex text processing model into a lightweight text processing model, and a computer readable storage medium.
Embodiments of the present disclosure provide a method of processing text data, comprising: acquiring text data to be classified; converting the text data to be classified into a numerical vector; converting the numerical vector into a sentence vector by using a lightweight text processing model; and determining a category label of the text data based on the sentence vector; wherein said converting said numerical vector into a sentence vector using said lightweight text processing model comprises: obtaining a first clause vector representing sequence information of the text data from the numerical vector by using a first branch model of the lightweight text processing model; obtaining a second clause vector representing the association relationship between the words in the text data from the numerical vector by using a second branch model of the lightweight text processing model; obtaining a third clause vector representing keyword information in the text data from the numerical vector by using a third branch model of the lightweight text processing model; and merging the first clause vector, the second clause vector, and the third clause vector into the sentence vector.
Embodiments of the present disclosure provide a method of simplifying a complex text processing model into a lightweight text processing model, comprising: acquiring a complex text processing model trained based on a first training text library, wherein each sample in the first training text library comprises text data of the sample; acquiring a second training text library, wherein each sample in the second training text library comprises a category label of the sample and a word segmentation sequence of the sample, and the sample size in the second training text library is smaller than that in the first training text library; converting the class labels and word segmentation sequences of the samples in the second training text library into first sample sentence vectors by using the complex text processing model; and training a lightweight text processing model based on the class label, the word segmentation sequence, and the first sample sentence vector for each sample in the second training text base, wherein the lightweight text processing model is less complex than the complex text processing model.
Embodiments of the present disclosure provide an apparatus for processing text data, including: a processor; and a memory, wherein the memory stores a computer executable program that, when executed by the processor, performs the method described above.
Embodiments of the present disclosure provide an apparatus for simplifying a complex text processing model into a lightweight text processing model, comprising: a processor; and a memory storing computer instructions which, when executed by the processor, implement the above-described method.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium and executes the computer instructions to cause the computer device to perform the aspects described above or methods provided in various alternative implementations of the aspects described above.
Embodiments of the present disclosure provide a method for processing text data, which can quickly and accurately identify and classify text data such as titles, usernames, and profiles by utilizing a lightweight text processing model with three branch models, thereby helping a content aggregator avoid recommending to users data sources whose text is exaggerated, misleading, false, pornographic, vulgar, or in violation of national policies and regulations, and further improving the quality of the content provided by the content aggregation platform.
The method for processing text data provided by the embodiments of the present disclosure further fuses the information in the complex text processing model into the lightweight text processing model, so that the lightweight text processing model can still identify and classify text data quickly and accurately despite its low complexity, which improves both the training speed and the inference speed of the lightweight text processing model.
Embodiments of the present disclosure provide a method for simplifying a complex text processing model into a lightweight text processing model, which improves the efficiency of processing text data. In industrial applications, processing text data with the simplified lightweight text processing model achieves accuracy and recall similar to those obtained with the complex text processing model while greatly improving inference and training efficiency, so the method can be applied more widely on devices with limited computing power.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. The drawings in the following description are only exemplary embodiments of the present disclosure.
Fig. 1 is an example schematic diagram illustrating a scenario in which a data source is recommended to a user by analyzing text data related to the data source according to an embodiment of the present disclosure.
Fig. 2A is a flowchart illustrating a method of processing text data according to an embodiment of the present disclosure.
Fig. 2B is a schematic diagram illustrating a method of processing text data according to an embodiment of the present disclosure.
Fig. 2C and 2D are schematic diagrams illustrating a lightweight text processing model according to an embodiment of the present disclosure.
Fig. 3A is a flowchart illustrating a process of training a lightweight text processing model according to an embodiment of the disclosure.
Fig. 3B is a schematic diagram illustrating training of a lightweight text processing model according to an embodiment of the disclosure.
Fig. 3C is a schematic diagram illustrating a complex text processing model according to an embodiment of the present disclosure.
Fig. 3D is a schematic diagram illustrating computation of a processing loss according to an embodiment of the present disclosure.
Fig. 4A is a flowchart illustrating a method of simplifying a complex text processing model into a lightweight text processing model according to an embodiment of the present disclosure.
Fig. 4B is a schematic diagram illustrating a method of simplifying a complex text processing model into a lightweight text processing model according to an embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating an apparatus for processing text data according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements are omitted. Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish the descriptions and are not to be construed as indicating or implying relative importance or order.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
The content aggregator described above may be based on Artificial Intelligence (AI). Artificial intelligence is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. For example, an artificial-intelligence-based content aggregator can categorize data sources in a manner similar to a human reading the text data related to the data sources. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that a content aggregator has the functions of sensing text data and of reasoning and making decisions about text data.
The title of the data source, the user name (nickname) and profile of the publisher, etc. may all be referred to as text data associated with the data source. Specifically, each data source also has different text data. For example, for data sources of picture and article classes, text data associated with the data source may include comments, titles, summaries, authors, author nicknames, and the like. For a data source of the music class, its associated text data may include singers, composers, album names, music reviews, lyrics, and the like. For video-type data sources, their associated text data may include actors, directors, dramas, speech, movie names, scripts, etc.
The content aggregator that processes the text data described above employs Natural Language Processing (NLP) techniques. Natural language processing technology is an important direction in the fields of computer science and artificial intelligence, studying various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like. Thus, based on natural language processing techniques, the content aggregator may analyze the text data associated with a data source, classify the text data, and identify inferior text data (e.g., text data that is exaggerated, misleading, false, pornographic, vulgar, or in violation of national policies and regulations) for further processing by the operators of the content community.
The natural language processing technique may also be based on Machine Learning (ML) and deep learning. Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. Natural language processing techniques use machine learning to study how a computer can simulate or implement human language-learning behavior: by analyzing existing, categorized text data, new knowledge or skills are acquired and existing knowledge structures are reorganized to continuously improve performance. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Alternatively, each of the models described hereinafter for the content aggregator may be an artificial intelligence model, in particular a neural network model based on artificial intelligence. Typically, artificial-intelligence-based neural network models are implemented as acyclic graphs, in which neurons are arranged in different layers. Typically, the neural network model includes an input layer and an output layer, which are separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating an output in the output layer. The network nodes are fully connected to nodes in adjacent layers via edges, and there are no edges between nodes within each layer. Data received at a node of the input layer of the neural network is propagated to a node of the output layer via hidden layers, activation layers, pooling layers, convolutional layers, and the like. The input and output of the neural network model may take various forms, which is not limited by the present disclosure.
Embodiments of the present disclosure provide solutions related to techniques such as artificial intelligence, natural language processing, and machine learning, and are specifically described by the following embodiments.
FIG. 1 is an example schematic diagram illustrating a scenario 100 of recommending a data source to a user by analyzing text data related to the data source according to an embodiment of the present disclosure.
Currently, there are a number of content aggregation and sharing platforms. A data source publisher can upload a data source to the server of a content aggregation and sharing platform through a network, so that the data source is published by the content aggregation and sharing platform. The network may be an Internet-of-Things network based on the Internet and/or a telecommunications network; it may be a wired or wireless network, for example an electronic network capable of implementing information exchange functions, such as a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), or a cellular data communications network.
As shown in Fig. 1, a server of a content aggregation and sharing platform may receive data sources published by a plurality of data source publishers. It is assumed that two data source publishers (the publisher of data source A and the publisher of data source B) have uploaded data source A and data source B, respectively, to the server. The text data associated with data source A is: "Professionals teach you how to place an order, the winning rate is up to 90%!". The text data associated with data source B is: "The provincial meteorological observatory issues a heavy rain yellow warning; most cities in the province will enter the warning range today!".
Using the methods disclosed in embodiments of the present disclosure, the server may determine that the content of data source A relates to gambling and betting, which violates national policies and regulations. Thus, the server may classify data source A as betting content, may impose account restrictions on the publisher of data source A, and may refrain from publishing data source A on the content aggregation and sharing platform.
The server may determine that data source B is classified as a weather forecast using the methods disclosed by embodiments of the present disclosure, while the server may also identify the publisher of data source B as an organization number. Thus, if the data source recipient subscribes to weather-related information, the server may push data source B to the data source recipient.
The method for processing text data provided by the embodiments of the present disclosure further fuses the information in the complex text processing model into the lightweight text processing model, so that the lightweight text processing model can still identify and classify text data quickly and accurately despite its low complexity, which improves both the training speed and the inference speed of the lightweight text processing model.
The embodiments of the present disclosure improve the efficiency of processing text data by providing a method for simplifying a complex text processing model into a lightweight text processing model. In industrial applications, processing text data with the simplified lightweight text processing model achieves accuracy and recall similar to those obtained with the complex text processing model while greatly improving inference and training efficiency, so the method can be applied more widely on devices with limited computing power.
Fig. 2A is a flowchart illustrating a method 200 of processing text data according to an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating a method 200 of processing text data according to an embodiment of the present disclosure. Fig. 2C and 2D are schematic diagrams illustrating a lightweight text processing model according to an embodiment of the present disclosure.
The method 200 of processing text data according to embodiments of the present disclosure may be applied to any electronic device. It is understood that the electronic device may be a different kind of hardware device, such as a Personal Digital Assistant (PDA), an audio/video device, a mobile phone, an MP3 player, a personal computer, a laptop computer, a server, etc. For example, the electronic device may be the server in FIG. 1, the application terminal of the publisher of data source A, the application terminal of the publisher of data source B, the application terminal of the data source recipient, and so on. Hereinafter, the present disclosure is described by taking a server as an example, and those skilled in the art should understand that the present disclosure is not limited thereto.
First, in step S201, the server acquires text data to be classified.
Optionally, the text data to be classified is associated with at least one data source, and the text data to be classified characterizes the data source in terms of text. The title of the data source, the user name (nickname) and profile of the publisher, etc. may all be referred to as text data associated with the data source. Specifically, each data source also has different text data. For example, for data sources of picture and article classes, text data associated with the data source may include comments, titles, summaries, authors, author nicknames, and the like. For a data source of the music class, its associated text data may include singers, composers, album names, music reviews, lyrics, and the like. For video-type data sources, their associated text data may include actors, directors, dramas, speech, movie names, scripts, etc.
Next, in step S202, the server converts the text data to be classified into a numeric vector.
Referring to fig. 2B, the server may convert text data to be classified into a numerical vector through an embedding (Embedding) operation. For example, the server may segment the text data to be classified into a plurality of segmented words, then convert the segmented words into word vectors by word embedding (word embedding), and finally splice the word vectors together as numerical vectors.
Optionally, the server may divide the text data to be classified into a plurality of word segments, where the plurality of word segments form a word segmentation sequence. For example, assuming that the text data to be classified is "Professionals teach you how to place an order", the word segmentation sequence may be {professionals, teach, you, how to, place an order}.
The server then encodes each word in the sequence of words into a numerical value.
Alternatively, the server may convert each word segment into a numerical value using a preset dictionary. The preset dictionary may be a collection in which each element includes a word segment and its corresponding numerical value; the elements of the dictionary are represented in the form <word segment, numerical value>. Let the dictionary be {<professionals, 5>, <teach, 7>, <you, 4>, <how to, 1>, <place an order, 2>}. Each word in the word segmentation sequence may then be converted, in turn, into: 5, 7, 4, 1, 2.
Optionally, the server may also dynamically encode each word in the word segmentation sequence. For example, the server may dynamically construct a dictionary and, whenever a certain word segment is not included in the dictionary, encode that word segment as the number of elements of the dictionary plus one and add it to the dictionary. For example, for the word segmentation sequence {professionals, teach, you, how to, place an order}, the server can dynamically build the dictionary {<professionals, 1>, <teach, 2>, <you, 3>, <how to, 4>, <place an order, 5>}. Thus, each word in the word segmentation sequence may be converted, in turn, into: 1, 2, 3, 4, 5. After the dictionary is built dynamically, if a new word segmentation sequence {professionals, teach, you, how to, get dressed} is obtained, the dictionary will correspondingly become {<professionals, 1>, <teach, 2>, <you, 3>, <how to, 4>, <place an order, 5>, <get dressed, 6>}. According to this dictionary, each word in the new word segmentation sequence can be converted, in turn, into: 1, 2, 3, 4, 6.
Then, the server combines the numerical values corresponding to each word in the word segmentation sequence to convert the word segmentation sequence into a numerical vector.
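The encoding procedure described above can be illustrated with a short sketch. The following Python snippet is only an illustration and is not taken from the patent; the function name, the example tokens, and the use of pre-segmented English tokens are assumptions made for the example (a real system would first run a Chinese word segmenter).

    from typing import Dict, List

    def encode_tokens(tokens: List[str], dictionary: Dict[str, int]) -> List[int]:
        """Convert a word segmentation sequence into a numerical vector.

        The dictionary is built dynamically: a token that is not yet in the
        dictionary is assigned a code equal to the current number of dictionary
        elements plus one, mirroring the dynamic encoding described above.
        """
        values = []
        for token in tokens:
            if token not in dictionary:
                dictionary[token] = len(dictionary) + 1
            values.append(dictionary[token])
        return values

    # Hypothetical tokens corresponding to the example in the text.
    dictionary: Dict[str, int] = {}
    print(encode_tokens(["professionals", "teach", "you", "how to", "place an order"],
                        dictionary))  # [1, 2, 3, 4, 5]
    print(encode_tokens(["professionals", "teach", "you", "how to", "get dressed"],
                        dictionary))  # [1, 2, 3, 4, 6]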
Next, in step S203, the server converts the numeric vector into a sentence vector using a lightweight text processing model.
The lightweight text processing model may be as shown in Figs. 2C and 2D. Referring to Figs. 2C and 2D, the lightweight text processing model may include three branch models: a first branch model, a second branch model, and a third branch model.
Step S203 may include: obtaining a first clause vector representing sequence information of the text data from the numerical vector by using a first branch model of the lightweight text processing model; obtaining a second clause vector representing the association relationship between each word in the text data from the numerical vector by using a second branch model of the lightweight text processing model; obtaining a third clause vector representing keyword information in the text data from the numerical vector by using a third branch model of the lightweight text processing model; and merging the first clause vector, the second clause vector and the third clause vector into a sentence vector.
Alternatively, the first branch model may be a Bi-LSTM (Bi-directional Long Short-Term Memory) model. The Bi-LSTM model is a recurrent neural network suitable for processing and predicting sequence information in an ordered sequence, and it can effectively alleviate the long-range dependency problem of the traditional recurrent neural network. The Bi-LSTM model realizes bidirectional memory on the basis of the ordinary long short-term memory model (LSTM memorizes only in the forward direction, while Bi-LSTM memorizes in both the forward and backward directions). Referring to Fig. 2D, the Bi-LSTM model fully considers the sequence relationship between context words, makes full use of the bidirectional information, and can therefore fully extract the sequence information of the text data and construct a first clause vector containing the sequence information of the text data.
Alternatively, the second branch model may be TextCNN (a text convolutional neural network model). The TextCNN model is a text processing model based on the convolutional neural network (CNN). A typical TextCNN model may include a convolution layer, a pooling layer, and a fully connected layer (FC), and can effectively capture local co-occurrence features between words. Therefore, the TextCNN model can effectively extract from the numerical vector the information representing the association relationships between the words in the text data (also called co-occurrence information, as shown in Fig. 2D), so as to construct a second clause vector containing the co-occurrence information of the text data.
Optionally, the third branch model is a FastText model. FastText is a fast text classification algorithm that speeds up training and testing while maintaining high accuracy. The FastText model takes into account the similarity between words, which makes it well suited to training word vectors quickly. The FastText model typically has a shallow, single-layer neural network, so it has the advantage of fast learning and prediction.
Optionally, as shown in Fig. 2D, the third branch model may also be a modified FastText model. For example, the third branch model may be a model obtained by mixing the FastText model and the TextRank model, for example by connecting a FastText model and a TextRank model in series or in parallel. The TextRank model can rapidly extract the weights of the keywords in a sentence, using the similarity between sentences to compute these weights. When the third branch model mixes the FastText model and the TextRank model (e.g., by using hashing to match the keyword weights to their keywords), it can capture the keyword information in the text data more accurately: the word vectors obtained by the FastText model are weighted-averaged with the keyword weights obtained by the TextRank model to construct a third clause vector containing the keyword information of the text data.
The present disclosure does not limit the method used to extract the keyword weights. For example, the third branch model may also be a mixture of the FastText model and an N-gram (bag-of-words) model, in which the N-gram model extracts the keyword weights from the order of the words; the third branch model then performs a weighted average of the word vectors obtained by the FastText model and the keyword weights obtained by the N-gram model to construct a third clause vector containing the keyword information of the text data. As another example, the third branch model may extract weights by TF-IDF (term frequency-inverse document frequency), which gives higher weight to words that appear frequently in the text data but infrequently in the overall language environment. The third branch model may even integrate the weight values obtained by the above three methods to obtain a more accurate third clause vector.
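As a rough sketch of the keyword-weighted averaging described for the third branch model, the following snippet combines word vectors with TF-IDF-style keyword weights. It is an assumed illustration rather than the patent's exact procedure; the word vectors and weights are random placeholders standing in for FastText-style vectors and TextRank/TF-IDF weights.

    import numpy as np

    def third_branch_clause_vector(word_vectors: np.ndarray,
                                   keyword_weights: np.ndarray) -> np.ndarray:
        """Weighted average of word vectors by keyword weights.

        word_vectors:    (num_words, dim) array, e.g. FastText-style word vectors.
        keyword_weights: (num_words,) array, e.g. TF-IDF or TextRank-style weights.
        Returns a (dim,) clause vector carrying the keyword information.
        """
        weights = keyword_weights / (keyword_weights.sum() + 1e-8)  # normalize
        return (weights[:, None] * word_vectors).sum(axis=0)

    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(5, 8))                # 5 words, 8-dim word vectors
    tfidf = np.array([0.1, 0.9, 0.3, 0.05, 0.6])  # placeholder keyword weights
    print(third_branch_clause_vector(vecs, tfidf).shape)  # (8,)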
The three branch models described above may be arranged in parallel. The numerical vector is input into each of the three branch models to obtain three clause vectors, and the three clause vectors are then fused to obtain the sentence vector.
For example, as shown in Fig. 2C, each branch model may obtain a weighted clause vector through a fully connected layer. The three clause vectors are combined and then passed through a pooling layer (i.e., the three clause vectors are averaged) to be fused into the final sentence vector. Of course, the clause vectors output by the three branch models may also be fused in other manners, which is not limited by the present disclosure.
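To make the three-branch structure of Figs. 2C and 2D more concrete, here is a minimal PyTorch-style sketch assuming a Bi-LSTM first branch, a TextCNN-style second branch, and an embedding-averaging (FastText-style) third branch, each followed by a fully connected layer, with the three clause vectors fused by averaging and an optional classification head. All dimensions, layer choices, and the class name are illustrative assumptions, not values taken from the patent.

    import torch
    import torch.nn as nn

    class LightweightTextModel(nn.Module):
        """Illustrative three-branch lightweight text processing model."""

        def __init__(self, vocab_size=10000, emb_dim=128, sent_dim=768, num_labels=10):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            # Branch 1: Bi-LSTM for sequence (order) information.
            self.bilstm = nn.LSTM(emb_dim, emb_dim, batch_first=True, bidirectional=True)
            self.fc1 = nn.Linear(2 * emb_dim, sent_dim)
            # Branch 2: TextCNN-style convolution for word co-occurrence information.
            self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
            self.fc2 = nn.Linear(emb_dim, sent_dim)
            # Branch 3: FastText-style averaging for keyword information.
            self.fc3 = nn.Linear(emb_dim, sent_dim)
            # Classification head producing category-label scores.
            self.classifier = nn.Linear(sent_dim, num_labels)

        def forward(self, token_ids):                       # (batch, seq_len)
            x = self.embedding(token_ids)                   # (batch, seq_len, emb)
            h1, _ = self.bilstm(x)                          # (batch, seq_len, 2*emb)
            clause1 = self.fc1(h1.mean(dim=1))              # first clause vector
            h2 = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, emb, seq_len)
            clause2 = self.fc2(h2.max(dim=2).values)        # second clause vector
            clause3 = self.fc3(x.mean(dim=1))               # third clause vector
            sentence = (clause1 + clause2 + clause3) / 3    # pooled sentence vector
            return sentence, self.classifier(sentence)

    model = LightweightTextModel()
    sent_vec, logits = model(torch.randint(1, 10000, (2, 12)))
    print(sent_vec.shape, logits.shape)  # torch.Size([2, 768]) torch.Size([2, 10])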
Finally, in step S204, the server determines a category label of the text data based on the sentence vector.
For example, the server may calculate the probability that the text data belongs to each category label based on the sentence vector, using a fully connected layer/classification model. For example, the server may calculate the sentence vector of the text data associated with data source A ("Professionals teach you how to place an order, the winning rate is up to 90%!"). Based on the values of the elements in the sentence vector, the server then calculates that the probability that the category label of the sentence is gambling is far greater than that of the other categories, and further determines that the category label of the text data is gambling. Similarly, if the text data associated with data source B is "The provincial meteorological observatory issues a heavy rain yellow warning; most cities in the province will enter the warning range today!", the server may also determine, using a similar method, that the category label of the text data is an organization number.
Optionally, the method 200 further includes generating recommendation information for the data source based on the category label of the text data to be classified. Using the method 200, the server automatically generates category labels for text data to be classified and recommends data sources whose content is benign and compliant with national policies and regulations to users who are interested in that class of data sources. This increases the exposure of benign content, prevents inferior content from being recommended to users, and improves the quality of the content provided by the content aggregation platform.
According to the method for processing text data provided by the embodiments of the present disclosure, by utilizing a lightweight text processing model with three branch models, text data such as titles, usernames, and profiles can be identified and classified quickly and accurately, thereby helping a content aggregator avoid recommending to users data sources whose text is exaggerated, misleading, false, pornographic, vulgar, or in violation of national policies and regulations, and improving the quality of the content provided by the content aggregation platform.
Fig. 3A is a flowchart illustrating a process 300 of training a lightweight text processing model according to an embodiment of the disclosure. Fig. 3B is a schematic diagram illustrating training of a lightweight text processing model according to an embodiment of the disclosure. Fig. 3C is a schematic diagram illustrating a complex text processing model according to an embodiment of the present disclosure. Fig. 3D is a schematic diagram illustrating a computational processing penalty according to an embodiment of the present disclosure.
The lightweight text processing model described above is trained based on a complex text processing model, wherein the lightweight text processing model is less complex than the complex text processing model.
Referring to fig. 3A, training a lightweight text processing model may include the following steps.
In step S301, a complex text processing model trained based on a first training text library is acquired, each sample in the first training text library including text data of the sample.
The complex text processing model may include a plurality of converters (Transformers). For example, the complex text processing model may be the BERT (Bidirectional Encoder Representations from Transformers) model shown in Fig. 3C. The BERT model is a semantic coding model: after training, a word or a sentence is input and the corresponding semantic information (i.e., a sentence vector) is obtained. The BERT model uses bidirectional Transformers for the language model, whereas the conventional language model is unidirectional. Through its bidirectional structure, the BERT model can achieve a deeper understanding than a unidirectional language model, i.e., it captures more context information between characters within a word and between words within a sentence. Compared with the traditional language model, the BERT model has a stronger learning ability and a better prediction effect.
The BERT model may be a BERT-BASE model, which includes 12 operation layers, i.e., 12 converters, each of which can perform feature extraction on the text data based on an attention mechanism and encode and decode the text data. The BERT-BASE model may also include 768 hidden units and 12 attention heads. The BERT-BASE model has about 110 million parameters. The BERT model may also be a BERT-LARGE model, which includes 24 converters, 1024 hidden units, and 16 attention heads. The BERT-LARGE model has about 340 million parameters. The present disclosure does not limit which BERT model is used.
If the BERT-BASE model is used, its output will be 12 sub-vectors of dimension 768. The 12 sub-vectors may then be combined by a pooling layer (e.g., a mean pooling layer) into a 768-dimensional vector used as the sentence vector output by the complex text processing model. If the BERT-LARGE model is used, its output will be 24 sub-vectors of dimension 1024, which may then be combined by a pooling layer (e.g., a mean pooling layer) into a 1024-dimensional vector used as the sentence vector output by the complex text processing model. The present disclosure does not limit the dimension of the final output sentence vector.
The complex text processing model described above may be trained in advance using a first training text library. Each sample in the first training text library includes the text data of the sample. For example, the first training text library may be corpus information published on a network. For example, where the complex text processing model is a BERT model, the first training text library may be Wikipedia, which includes approximately 2.5 billion words and their corresponding interpretations. The first training text library may also include text data obtained from various public/private databases, such as encyclopedias, dictionaries, news, and question-and-answer data. The sample size of the first training text library is typically large. Thus, a complex text processing model trained on the first training text library contains a general understanding of prior knowledge (e.g., most words, terms, etc.). Processing text data (especially short text data) with such a complex text processing model therefore yields a relatively accurate sentence vector containing sufficient information.
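As a rough sketch of obtaining sentence vectors from a pre-trained BERT model, the snippet below uses the Hugging Face transformers library with simple mean pooling over the final token states. This is an assumption made for illustration: the patent describes pooling the outputs of the 12 converter layers instead, and the checkpoint name used here is just a publicly available stand-in for the complex (teacher) text processing model.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    teacher = BertModel.from_pretrained("bert-base-uncased")
    teacher.eval()

    def teacher_sentence_vector(text: str) -> torch.Tensor:
        """Encode text with BERT and mean-pool token states into a 768-dim vector."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = teacher(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape (768,)

    vec = teacher_sentence_vector("Professionals teach you how to place an order")
    print(vec.shape)  # torch.Size([768])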
Because of the large number of parameters of the complex text processing model, it is industrially inefficient to train and perform inference with the complex text processing model directly. In order to improve the processing efficiency of the text processing model in industrial settings, it is necessary to compress and simplify the complex text processing model to obtain a lightweight text processing model that can be used industrially. The lightweight text processing model obtained through compression and simplification has far fewer parameters while incorporating the complex text processing model's understanding of the first training text library, so training and inference efficiency are improved. Meanwhile, the lightweight text processing model and the complex text processing model produce approximately the same sentence vector for the same input, thereby ensuring the accuracy of text data processing.
In step S302, information about the first training text library in the complex text processing model is fused to the lightweight text processing model.
Optionally, step S302 may further include step S3021, step S3022, and step S3023.
In step S3021, a second training text library is obtained, where each sample in the second training text library includes a category label of the sample and a word segmentation sequence of the sample, and a sample size in the second training text library is smaller than a sample size in the first training text library.
Examples of samples in several second training text libraries are given below. Labels of some of the samples in the second training text library and word segmentation sequences of the samples are shown in the following examples. In the following example, each word in the word sequence is separated by a slash.
The samples in the second training text library may be collected from a content aggregation platform. For example, text data of a data source may be marked by manual auditing and user reporting. And then, the text data with the category labels are subjected to word segmentation processing and then stored in a second training text library. The present disclosure is not limited to the expression form of the word segmentation sequence and the acquisition manner of the sample in the second training text library.
In step S3022, the word segmentation sequences of the samples in the second training text library are converted into first sample sentence vectors using the complex text processing model.
Alternatively, the word segmentation sequence may be pre-processed before the samples in the second training text library are input into the complex text processing model. For example, the word segmentation sequence may be encoded using the BERT tokenizer (a word segmentation tool built into the BERT model). For example, assume that the word segmentation sequence includes three word segments {fund, combination, information}. Each of the three word segments is assigned a code, e.g., "fund" may have code 0, "combination" may have code 1, and "information" may have code 2. Thus, the word-encoding dictionary corresponding to the word segmentation sequence may be {fund: 0, combination: 1, information: 2}. It should be understood by those skilled in the art that this dictionary may be the same as or different from the dictionary in step S202, which is not limited by the present disclosure.
Each word segment in the word segmentation sequence is then replaced according to the dictionary, converting each word segment into a numerical value, so as to obtain the numerical sequence corresponding to that word segmentation sequence. The numerical sequence is input into the complex text processing model. In the case where the complex text processing model is a BERT-BASE model, the numerical sequence is encoded and decoded by the 12 converters in the BERT-BASE model to extract the text features of the sample, and is labeled with the class label, so as to form a first sample sentence vector comprising the text features of the sample.
In step S3023, a lightweight text processing model is trained based on the class labels, the word segmentation sequences, and the first sample sentence vectors for each sample in the second training text base.
Referring to Fig. 3B, training the lightweight text processing model includes: converting the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of the sample by using the lightweight text processing model, and determining the processing loss of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector.
The word sequences corresponding to the samples in the second training text library may be encoded using the method described in step S202 to obtain a vector of values. The numeric vector is then converted to a second sample sentence vector by a lightweight text processing model.
The processing loss of the lightweight text processing model relative to the complex text processing model may then be determined by comparing the difference between the first sample sentence vector and the second sample sentence vector.
For example, a loss function may be utilized to determine a processing loss of the lightweight text processing model.
The processing loss may be an L2 loss (squared loss) in Euclidean space. For example, assume that the i-th element of the first sample sentence vector is denoted as T(x_i) and the i-th element of the second sample sentence vector is denoted as S(x_i). The first sample sentence vector and the second sample sentence vector have the same dimension N, where i is less than or equal to N and greater than 0.
The loss function corresponding to the L2 loss (denoted as L2) can be written as:
where j is less than or equal to N and greater than 0.
The processing loss may also be a kernel loss in an RKHS (Reproducing Kernel Hilbert Space). In this case, the loss function is based on a kernel function of the reproducing kernel Hilbert space, and the processing loss is a kernel loss. The calculation of the L2 loss involves a large number of inner product terms and is therefore computationally expensive. Thus, the L2 loss in Euclidean space can be converted into a kernel loss in the RKHS for computation.
When calculating the kernel loss in the RKHS, kernel functions may be used in place of the inner product terms to simplify the computation. Meanwhile, the RKHS is a high-dimensional space and is often able to capture more association information between vectors.
Assume that a kernel function K(m, n) is used to calculate the kernel loss, where m and n represent different parameters. The loss function corresponding to the kernel loss (denoted as L_kernel) can be written as:
Since the kernel function K(m, n) has the reproducing property, positive definiteness, and symmetry, by using the kernel trick, formula (2) can be rewritten as:
where φ(T(x_i)) denotes the mapping of T(x_i) into the RKHS and φ(S(x_i)) denotes the mapping of S(x_i) into the RKHS.
For example, assuming that the kernel function K(m, n) is a Gaussian kernel function, the Gaussian kernel function can be expressed as follows:
The above equation (2) can then be written as:
It can be seen that the amount of calculation of equation (5) is greatly reduced relative to that of equation (1).
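The equation bodies for formulas (1) through (5) did not survive in this text. Based on the surrounding definitions, one plausible reconstruction is given below in LaTeX; it is an assumption rather than the patent's verbatim formulas, with T and S denoting the first and second sample sentence vectors of dimension N, and σ introduced here as an assumed Gaussian kernel bandwidth.

    % Plausible reconstruction of equations (1)-(5); not verbatim from the patent.
    \begin{align}
    L_{2} &= \sum_{j=1}^{N} \bigl(T(x_{j}) - S(x_{j})\bigr)^{2} \tag{1} \\
    L_{kernel} &= K(T, T) - 2\,K(T, S) + K(S, S) \tag{2} \\
    L_{kernel} &= \bigl\lVert \phi(T) - \phi(S) \bigr\rVert_{\mathcal{H}}^{2} \tag{3} \\
    K(m, n) &= \exp\!\left(-\frac{\lVert m - n \rVert^{2}}{2\sigma^{2}}\right) \tag{4} \\
    L_{kernel} &= 2 - 2\exp\!\left(-\frac{\lVert T - S \rVert^{2}}{2\sigma^{2}}\right) \tag{5}
    \end{align}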
Furthermore, the processing loss may also be a cosine loss. The present disclosure does not further limit the processing loss.
Alternatively, the loss function may also compute the processing-loss components between the first sample sentence vector and each of the three clause vectors obtained by the three branch models of the lightweight text processing model, and add the three processing-loss components to obtain the final processing loss.
That is, the total processing loss of the model, denoted L_processing, can be calculated using the following equation (6):

L_processing = L_first branch model + L_second branch model + L_third branch model    (6)

where L_first branch model represents the processing loss between the first sample sentence vector and the first clause vector, L_second branch model represents the processing loss between the first sample sentence vector and the second clause vector, and L_third branch model represents the processing loss between the first sample sentence vector and the third clause vector.
The parameters of the lightweight text processing model can then be updated to minimize the processing loss. For example, the parameters may be updated iteratively, reducing the processing loss at each iteration. When the processing loss converges, training of the lightweight text processing model on the second training text library can be considered complete.
Furthermore, in industrial practice, the lightweight text processing model (i.e., the downstream task in Fig. 3B) can also be trained dynamically. For example, more text data and their corresponding category labels are continually collected from the content aggregation platform and then converted into numerical vectors that are input into the lightweight text processing model. A predicted category label for the text data is calculated from the sentence vector output by the lightweight text processing model. The predicted category label is compared with the collected category label to adjust the parameters of the lightweight text processing model.
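A minimal sketch of the distillation-style training loop described above is shown below. It assumes the LightweightTextModel class sketched earlier, a Gaussian-kernel processing loss of the assumed form above, dummy training pairs in place of real (word segmentation sequence, teacher sentence vector) data, and an Adam optimizer; batching, padding, and the per-branch loss decomposition of equation (6) are simplified away.

    import torch

    def gaussian_kernel_loss(t: torch.Tensor, s: torch.Tensor, sigma: float = 1.0):
        """Kernel (RKHS) loss between teacher vectors t and student vectors s.

        With a Gaussian kernel, K(x, x) = 1, so the per-sample loss reduces to
        2 - 2 * exp(-||t - s||^2 / (2 * sigma^2))  (an assumed form).
        """
        sq_dist = ((t - s) ** 2).sum(dim=-1)
        return (2 - 2 * torch.exp(-sq_dist / (2 * sigma ** 2))).mean()

    student = LightweightTextModel()  # the three-branch sketch shown earlier
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

    # Dummy stand-ins for (numeric token sequence, teacher sentence vector) pairs
    # that would come from the second training text library and the BERT teacher.
    samples = [(torch.randint(1, 10000, (12,)), torch.randn(768)) for _ in range(8)]

    for epoch in range(3):
        for token_ids, teacher_vec in samples:
            student_vec, _ = student(token_ids.unsqueeze(0))
            loss = gaussian_kernel_loss(teacher_vec.unsqueeze(0), student_vec)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()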
Therefore, the method for processing text data provided by the embodiments of the present disclosure also fuses the information in the complex text processing model into the lightweight text processing model, so that the lightweight text processing model can still identify and classify text data quickly and accurately despite its low complexity, which improves both the training speed and the inference speed of the lightweight text processing model.
Fig. 4A is a flowchart illustrating a method 400 of simplifying a complex text processing model into a lightweight text processing model according to an embodiment of the present disclosure. Fig. 4B is a schematic diagram illustrating a method 400 of simplifying a complex text processing model into a lightweight text processing model according to an embodiment of the present disclosure.
Referring to fig. 4A, the simplification of the complex text processing model into a lightweight text processing model may include the following steps.
In step S401, a complex text processing model trained based on a first training text library is acquired, each sample in the first training text library including text data of the sample.
The complex text processing model may include a plurality of converters (Transformers). For example, the complex text processing model may be a BERT (Bidirectional Encoder Representations from Transformers) model. The BERT model is a semantic coding model: after training, a word or a sentence is input and the corresponding semantic information (i.e., a sentence vector) is obtained. The BERT model uses bidirectional Transformers for the language model, whereas the conventional language model is unidirectional. Through its bidirectional structure, the BERT model can achieve a deeper understanding than a unidirectional language model, i.e., it captures more context information between characters within a word and between words within a sentence. Compared with the traditional language model, the BERT model has a stronger learning ability and a better prediction effect.
The BERT model may be a BERT-BASE model, which includes 12 operation layers, i.e., 12 converters, each of which can perform feature extraction on the text data based on an attention mechanism and encode and decode the text data. The BERT-BASE model may also include 768 hidden units and 12 attention heads. The BERT-BASE model has about 110 million parameters. The BERT model may also be a BERT-LARGE model, which includes 24 converters, 1024 hidden units, and 16 attention heads. The BERT-LARGE model has about 340 million parameters. The present disclosure does not limit which BERT model is used.
The complex text processing model described above may be trained in advance using a first training text library. Each sample in the first training text library includes the text data of the sample. For example, the first training text library may be corpus information published on a network. For example, where the complex text processing model is a BERT model, the first training text library may be Wikipedia, which includes approximately 2.5 billion words and their corresponding interpretations. The first training text library may also include text data obtained from various public/private databases, such as encyclopedias, dictionaries, news, and question-and-answer data. The sample size of the first training text library is typically large. Thus, a complex text processing model trained on the first training text library contains a general understanding of prior knowledge (e.g., most words, terms, etc.). Processing text data (especially short text data) with such a complex text processing model therefore yields a relatively accurate sentence vector containing sufficient information.
Because of the large number of parameters of the complex text processing model, it is industrially inefficient to train and perform inference with the complex text processing model directly. In order to improve the processing efficiency of the text processing model in industrial settings, it is necessary to compress and simplify the complex text processing model to obtain a lightweight text processing model that can be used industrially. The lightweight text processing model obtained through compression and simplification has far fewer parameters while incorporating the complex text processing model's understanding of the first training text library, so training and inference efficiency are improved. Meanwhile, the lightweight text processing model and the complex text processing model produce approximately the same sentence vector for the same input, thereby ensuring the accuracy of text data processing.
In step S402, a second training text library is obtained, where each sample in the second training text library includes a category label of the sample and a word segmentation sequence of the sample, and the sample size in the second training text library is smaller than the sample size in the first training text library.
The samples in the second training text library may be collected from a content aggregation platform. For example, text data of a data source may be marked by manual auditing and user reporting. And then, the text data with the category labels are subjected to word segmentation processing and then stored in a second training text library. The present disclosure is not limited to the expression form of the word segmentation sequence and the acquisition manner of the sample in the second training text library.
In step S403, the word segmentation sequence of the samples in the second training text library is converted into a first sample sentence vector using the complex text processing model.
Optionally, the word segmentation sequence may be pre-processed before the samples in the second training text library are input into the complex text processing model. For example, the word segmentation sequence may be encoded using the BERT tokenizer (a word segmentation tool built into the BERT model). For example, assume that the word segmentation sequence includes three words {fund, combination, information}. Each of the three words is assigned a code, e.g., "fund" may have code 0, "combination" may have code 1, and "information" may have code 2. Thus, the word-encoding dictionary corresponding to the word segmentation sequence may be {fund: 0, combination: 1, information: 2}. It should be understood by those skilled in the art that this dictionary may be the same as or different from the dictionary in step S202, which is not limited by the present disclosure.
Each word in the word segmentation sequence is then replaced based on this dictionary, converting each word into a numerical value and yielding a numerical value sequence corresponding to the word segmentation sequence. The numerical value sequence is input into the complex text processing model. In the case where the complex text processing model is a BERT-BASE model, the numerical value sequence is encoded and decoded by the 12 converters in the BERT-BASE model to extract the text features of the sample and, together with the class label, form a first sample sentence vector comprising the text features of the sample.
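The dictionary-based replacement can be illustrated with a short sketch mirroring the {fund: 0, combination: 1, information: 2} example above. The helper function name and the handling of unknown words are illustrative assumptions of the sketch.

```python
# A minimal sketch of converting a word segmentation sequence into a numeric
# sequence using a token-to-code dictionary.
token_dictionary = {"fund": 0, "combination": 1, "information": 2}

def encode(tokens, dictionary, unknown_code=0):
    """Replace each token with its dictionary code; unknown tokens map to unknown_code."""
    return [dictionary.get(token, unknown_code) for token in tokens]

word_sequence = ["fund", "combination", "information"]
numeric_sequence = encode(word_sequence, token_dictionary)
print(numeric_sequence)  # [0, 1, 2]
```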
In step S404, a lightweight text processing model is trained based on the class labels, the word segmentation sequences, and the first sample sentence vectors for each sample in the second training text base.
The lightweight text processing model includes a first branch model for extracting order information of a word sequence, a second branch model for extracting association information between words in the word sequence, and a third branch model for extracting keyword information in the word sequence.
Training the lightweight text processing model includes: converting the class label and the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of the sample using the lightweight text processing model, and determining the processing loss of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector.
The word sequences corresponding to the samples in the second training text library may be encoded using the method described in step S202 to obtain a vector of values. The numeric vector is then converted to a second sample sentence vector by a lightweight text processing model.
For example, the converting the word segmentation sequence of each sample in the second training text library into the second sample sentence vector of the sample using the lightweight text processing model further comprises: acquiring a first sample clause vector representing the sequence information of the text data from the word segmentation sequence by using the first branch model of the lightweight text processing model; acquiring a second sample clause vector representing the association relationship between the words in the text data from the word segmentation sequence by using the second branch model of the lightweight text processing model; acquiring a third sample clause vector representing the keyword information in the text data from the word segmentation sequence by using the third branch model of the lightweight text processing model; and fusing the first sample clause vector, the second sample clause vector, and the third sample clause vector into the second sample sentence vector.
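Since the present disclosure specifies what each branch extracts but not its internal structure, the following PyTorch sketch is only one possible instantiation: a GRU for the order information, a single-head self-attention layer for the inter-word association information, max-pooling for the keyword information, and concatenation followed by a linear layer as the fusion step. All of these architectural choices are assumptions made for illustration.

```python
# A minimal PyTorch sketch of a three-branch lightweight text processing model.
# The branch architectures and the fusion-by-concatenation are illustrative
# assumptions, not the method prescribed by the present disclosure.
import torch
import torch.nn as nn

class LightweightTextModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, sentence_dim=768):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Branch 1: order information of the word segmentation sequence.
        self.order_branch = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Branch 2: association information between words.
        self.assoc_branch = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
        # Branch 3: keyword information (salient tokens via max-pooling).
        self.keyword_proj = nn.Linear(embed_dim, embed_dim)
        # Fusion of the three clause vectors into one sentence vector
        # (sentence_dim=768 so it can be compared with a BERT-BASE vector).
        self.fuse = nn.Linear(3 * embed_dim, sentence_dim)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                          # (B, L, E)
        _, h = self.order_branch(x)
        order_vec = h[-1]                                      # first sample clause vector
        attn_out, _ = self.assoc_branch(x, x, x)
        assoc_vec = attn_out.mean(dim=1)                       # second sample clause vector
        keyword_vec = self.keyword_proj(x).max(dim=1).values   # third sample clause vector
        clause_vectors = torch.cat([order_vec, assoc_vec, keyword_vec], dim=-1)
        return self.fuse(clause_vectors)                       # second sample sentence vector
```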
The processing loss of the lightweight text processing model relative to the complex text processing model may then be determined by comparing the difference between the first sample sentence vector and the second sample sentence vector.
For example, a loss function may be utilized to determine a processing loss of the lightweight text processing model.
The processing loss may be an L2 loss (squared loss function) in Euclidean space, a kernel loss in a reproducing kernel Hilbert space (RKHS), or a cosine loss. The present disclosure does not further limit the processing loss.
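As an illustration only, one possible form for each of these losses is sketched below. The Gaussian (RBF) kernel used for the RKHS kernel loss is an assumption of the sketch; any positive-definite kernel could be substituted.

```python
# Minimal sketches of the three processing-loss choices mentioned above.
import torch
import torch.nn.functional as F

def l2_loss(teacher_vec, student_vec):
    """Squared (L2) loss in Euclidean space."""
    return F.mse_loss(student_vec, teacher_vec)

def cosine_loss(teacher_vec, student_vec):
    """1 - cosine similarity, averaged over the batch."""
    return (1.0 - F.cosine_similarity(student_vec, teacher_vec, dim=-1)).mean()

def rbf_kernel_loss(teacher_vec, student_vec, sigma=1.0):
    """Kernel loss in the RKHS induced by a Gaussian kernel k(x, y)."""
    sq_dist = (teacher_vec - student_vec).pow(2).sum(dim=-1)
    # In that RKHS, ||phi(x) - phi(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y) = 2 - 2 k(x,y).
    return (2.0 - 2.0 * torch.exp(-sq_dist / (2.0 * sigma ** 2))).mean()
```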
Alternatively, the loss function may also compute a processing loss component between the first sample sentence vector and each of the three clause vectors obtained by the three branch models of the lightweight text processing model, respectively, and sum the three processing loss components to obtain the final processing loss.
The parameters of the lightweight text processing model can then be updated to minimize the processing loss. When the processing loss converges, training of the lightweight text processing model on the second training text library can be considered complete.
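A minimal sketch of this parameter-update procedure is given below. The Adam optimizer, the learning rate, and the simple change-in-loss convergence test are assumptions of the sketch and are not prescribed by the present disclosure; any of the loss functions sketched after the previous paragraph could be passed as loss_fn.

```python
# A minimal sketch of the distillation training loop: the processing loss
# between the teacher's (complex model's) sentence vector and the student's
# (lightweight model's) sentence vector is minimized until it converges.
import torch

def train_lightweight_model(student, teacher_vectors, token_id_batches,
                            loss_fn, lr=1e-3, tol=1e-4, max_epochs=100):
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    previous_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for token_ids, teacher_vec in zip(token_id_batches, teacher_vectors):
            student_vec = student(token_ids)          # second sample sentence vector
            loss = loss_fn(teacher_vec, student_vec)  # processing loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Training is considered complete once the processing loss converges.
        if abs(previous_loss - epoch_loss) < tol:
            break
        previous_loss = epoch_loss
    return student
```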
In addition, in industrial practice, the lightweight text processing model can also be trained dynamically. For example, more text data and corresponding category labels are continuously collected from the content aggregation platform; the labeled text data is converted into numeric vectors, which are input into the lightweight text processing model to obtain sentence vectors. The sentence vectors are then normalized to compute predicted category labels for the text data. Each predicted category label is compared with the category label of the text data to adjust the parameters of the lightweight text processing model.
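This dynamic training step can be sketched as follows, assuming a linear classification head and cross-entropy as the comparison between the predicted and true category labels; both choices are illustrative assumptions and are not prescribed by the present disclosure.

```python
# A minimal sketch of the dynamic (online) training step: newly collected,
# labelled text is encoded into a sentence vector, classifier scores are
# normalized to obtain a predicted category label, and the difference from
# the true label is used to adjust the model parameters.
import torch
import torch.nn.functional as F

def online_update(student, classifier, optimizer, token_ids, true_label):
    sentence_vec = student(token_ids)            # (B, sentence_dim)
    logits = classifier(sentence_vec)            # (B, num_classes), e.g. nn.Linear head
    probs = F.softmax(logits, dim=-1)            # normalized prediction
    predicted_label = probs.argmax(dim=-1)
    loss = F.cross_entropy(logits, true_label)   # compare with the true category label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return predicted_label, loss.item()
```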
Therefore, by fusing the information in the complex text processing model into the lightweight text processing model, the method provided by the embodiments of the present disclosure for simplifying a complex text processing model into a lightweight text processing model can identify and classify text data quickly and accurately at low complexity, improving the training speed and inference speed of the lightweight text processing model.
Fig. 5 is a block diagram illustrating an apparatus 500 for processing text data according to an embodiment of the present disclosure.
Referring to fig. 5, a device 500 may include a processor 501 and a memory 502. The processor 501 and the memory 502 may be connected by a bus 503.
The processor 501 may perform various actions and processes according to programs stored in the memory 502. In particular, the processor 501 may be an integrated circuit chip with signal processing capabilities. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
The memory 502 has stored thereon computer instructions that, when executed by the processor 501, implement the method 200. The memory 502 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described in this disclosure is intended to comprise, without being limited to, these and any other suitable types of memory.
The apparatus for simplifying a complex text processing model into a lightweight text processing model provided by the embodiments of the present disclosure has the same or a similar structure as the apparatus 500, and its description is therefore not repeated here.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium and executes the computer instructions to cause the computer device to perform the aspects described above or methods provided in various alternative implementations of the aspects described above.
The embodiments of the present disclosure provide a method for processing text data, which can quickly and accurately identify and classify text data such as titles, usernames, and profiles by utilizing a lightweight text processing model with three branch models, thereby helping a content aggregator avoid recommending to users data sources that are exaggerated, misleading, false, pornographic, vulgar, or in violation of national policies and regulations, and further improving the quality of content provided by the content aggregation platform.
The method for processing text data provided by the embodiments of the present disclosure further fuses the information in the complex text processing model into the lightweight text processing model, so that the lightweight text processing model can still quickly and accurately identify and classify text data at low complexity, and the training speed and inference speed of the lightweight text processing model are improved.
The embodiments of the present disclosure provide a method for simplifying a complex text processing model into a lightweight text processing model, which improves the efficiency of text data processing. In industrial applications, processing text data with the simplified lightweight text processing model achieves accuracy and recall similar to processing text data with the complex text processing model, while greatly improving inference and training efficiency, so that the method can be applied more widely to devices with limited computing power.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention described in detail above are illustrative only and are not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof can be made without departing from the principles and spirit of the invention, and such modifications are intended to be within the scope of the invention.

Claims (11)

1. A method of processing text data, comprising:
Acquiring text data to be classified;
converting the text data to be classified into a numerical vector;
Converting the numerical vector into a sentence vector by using a lightweight text processing model; and
Determining a category label of the text data based on the sentence vector;
wherein said converting said numeric vector into a sentence vector using said lightweight text processing model comprises:
Obtaining a first clause vector representing sequence information of the text data from the numerical vector by using a first branch model of the lightweight text processing model;
Obtaining a second clause vector representing the association relationship between each word in the text data from the numerical vector by using a second branch model of the lightweight text processing model;
Obtaining a third clause vector representing keyword information in the text data from the numerical vector by using a third branch model of the lightweight text processing model;
And merging the first clause vector, the second clause vector and the third clause vector into a sentence vector.
2. The method of claim 1, wherein the lightweight text processing model is trained based on a complex text processing model, wherein the lightweight text processing model is less complex than the complex text processing model, the training comprising:
Acquiring a complex text processing model trained based on a first training text library, wherein each sample in the first training text library comprises text data of the sample;
information in the complex text processing model about the first training text library is fused to the lightweight text processing model.
3. The method of claim 2, the fusing information in the complex text processing model to the lightweight text processing model comprising:
Acquiring a second training text library, wherein each sample in the second training text library comprises a category label of the sample and a word segmentation sequence of the sample, and the sample size in the second training text library is smaller than that in the first training text library;
Converting the word segmentation sequence of the sample in the second training text library into a first sample sentence vector by using the complex text processing model; and
A lightweight text processing model is trained based on the class labels, the word segmentation sequences, and the first sample sentence vectors for each sample in the second training text base.
4. The method of claim 3, wherein the training a lightweight text processing model comprises:
Converting the word segmentation sequence of each sample in the second training text library into a second sample sentence vector of the sample by using a lightweight text processing model,
Determining a processing penalty of the lightweight text processing model based on the first sample sentence vector and the second sample sentence vector;
parameters in the lightweight text processing model are updated to minimize the processing penalty.
5. The method of claim 1, wherein the converting the text data to be classified into a numerical vector further comprises:
Dividing the text data to be classified into a plurality of word segments, wherein the word segments form word segment sequences;
Encoding each word in the word segmentation sequence into a numerical value;
And combining the numerical values corresponding to each word in the word segmentation sequence to convert the word segmentation sequence into a numerical value vector.
6. The method of claim 1, wherein the text data to be classified is associated with at least one data source, and the text data to be classified characterizes the data source in terms of text.
7. The method of claim 6, further comprising: and generating recommendation information of the data source based on the class label of the text data to be classified.
8. The method of claim 2, wherein the complex text processing model comprises a plurality of converters.
9. The method of claim 3, wherein the determining the processing penalty of the lightweight text processing model further comprises:
determining a processing penalty of the lightweight text processing model using a penalty function;
Wherein the loss function is a kernel function based on a reproducing kernel Hilbert space, and the loss is a kernel loss.
10. An apparatus for processing text data, comprising:
A processor; and
A memory, wherein the memory has stored therein a computer executable program which, when executed by the processor, performs the method of any of claims 1-9.
11. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of claims 1-9.
CN202010655433.6A 2020-07-09 2020-07-09 Method and device for processing text data Active CN113919338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010655433.6A CN113919338B (en) 2020-07-09 2020-07-09 Method and device for processing text data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010655433.6A CN113919338B (en) 2020-07-09 2020-07-09 Method and device for processing text data

Publications (2)

Publication Number Publication Date
CN113919338A CN113919338A (en) 2022-01-11
CN113919338B true CN113919338B (en) 2024-05-24

Family

ID=79231777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010655433.6A Active CN113919338B (en) 2020-07-09 2020-07-09 Method and device for processing text data

Country Status (1)

Country Link
CN (1) CN113919338B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468037A (en) * 2023-03-17 2023-07-21 北京深维智讯科技有限公司 NLP-based data processing method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7251777B1 (en) * 2003-04-16 2007-07-31 Hypervision, Ltd. Method and system for automated structuring of textual documents
CN108427754A (en) * 2018-03-15 2018-08-21 京东方科技集团股份有限公司 A kind of information-pushing method, computer storage media and terminal
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN109857865A (en) * 2019-01-08 2019-06-07 北京邮电大学 A kind of file classification method and system
CN110597991A (en) * 2019-09-10 2019-12-20 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111046656A (en) * 2019-11-15 2020-04-21 北京三快在线科技有限公司 Text processing method and device, electronic equipment and readable storage medium
CN111339751A (en) * 2020-05-15 2020-06-26 支付宝(杭州)信息技术有限公司 Text keyword processing method, device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2013370424A1 (en) * 2012-12-28 2015-07-23 Xsb, Inc. Systems and methods for creating, editing, storing and retrieving knowledge contained in specification documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Fast and Accurate Fully Convolutional Network for End-to-End Handwritten Chinese Text Segmentation and Recognition;Dezhi Peng;《2019 International Conference on Document Analysis and Recognition (ICDAR)》;20200229;full text *
Research on Deep Learning Text Classification Technology Incorporating Topic Features;Mao Wenliang;《China Master's Theses Full-text Database (Information Science and Technology)》;20200215(No. 2);full text *

Also Published As

Publication number Publication date
CN113919338A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN110866398B (en) Comment text processing method and device, storage medium and computer equipment
CN113255755A (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN110134780B (en) Method, device, equipment and computer readable storage medium for generating document abstract
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
CN114969316B (en) Text data processing method, device, equipment and medium
CN113032525A (en) False news detection method and device, electronic equipment and storage medium
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN113761250A (en) Model training method, merchant classification method and device
CN116977701A (en) Video classification model training method, video classification method and device
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN116383517A (en) Dynamic propagation feature enhanced multi-modal rumor detection method and system
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
Wang et al. Weighted graph convolution over dependency trees for nontaxonomic relation extraction on public opinion information
CN113919338B (en) Method and device for processing text data
CN116958997B (en) Graphic summary method and system based on heterogeneous graphic neural network
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN112925983A (en) Recommendation method and system for power grid information
CN110321565B (en) Real-time text emotion analysis method, device and equipment based on deep learning
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN114638222B (en) Natural disaster data classification method and model training method and device thereof
CN113157892B (en) User intention processing method, device, computer equipment and storage medium
CN113741759B (en) Comment information display method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40065970

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant