CN110750987B - Text processing method, device and storage medium


Info

Publication number: CN110750987B
Authority: CN (China)
Prior art keywords: word vector, text, network model, processed, sample data
Legal status: Active
Application number: CN201911032610.9A
Other languages: Chinese (zh)
Other versions: CN110750987A (en)
Inventor: 李快
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911032610.9A
Publication of CN110750987A
Application granted
Publication of CN110750987B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The embodiments of the present application provide a text processing method, a text processing device, and a storage medium, the method comprising: acquiring a text to be processed and a label of the text to be processed; extracting features from the first part and the second part of the text to be processed according to the label, to obtain a first word vector and a second word vector respectively; determining a distance between the first word vector and the second word vector; and determining the degree of matching between the first part and the second part according to the distance. With the method and device, the feature data corresponding to the input text to be processed can be extracted accurately, so that the text to be processed is matched or classified accurately.

Description

Text processing method, device and storage medium
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence, and relate to, but are not limited to, a text processing method, a text processing device, and a storage medium.
Background
Classifying and matching information streams (feed streams), i.e., continuously updated content presented to users, requires first obtaining word vector features of the feed stream text, performing text semantic analysis, and deriving a matching result from the semantic analysis result, so that the feed streams can be classified.
At present, commonly used text semantic analysis methods mainly include: text semantic analysis through a bag-of-words model, through a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN), and through Bidirectional Encoder Representations from Transformers (BERT).
However, the above text semantic analysis methods in the related art cannot accurately extract the feature data corresponding to the input text, and therefore, the text cannot be accurately matched and classified.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing device and a storage medium, which can accurately extract feature data corresponding to an input text to be processed, so that the text to be processed is accurately matched or classified.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a text processing method, which comprises the following steps:
acquiring a text to be processed and a label of the text to be processed;
respectively extracting features of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector;
determining a distance between the first word vector and the second word vector;
and determining the matching degree between the first part and the second part according to the distance.
The embodiment of the application provides a text processing method, which comprises the following steps:
acquiring a text to be processed and a label of the text to be processed;
respectively extracting features of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector;
determining a first label corresponding to the first word vector and a second label corresponding to the second word vector;
and classifying the texts to be processed according to the first label and the second label to obtain a classification result of the texts to be processed.
An embodiment of the present application provides a text processing apparatus, including:
the first acquisition module is used for acquiring a text to be processed and a label of the text to be processed;
the first feature extraction module is used for respectively extracting features of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector;
a first determining module for determining a distance between the first word vector and the second word vector;
and the second determining module is used for determining the matching degree between the first part and the second part according to the distance.
An embodiment of the present application provides a text processing apparatus, including:
the second acquisition module is used for acquiring a text to be processed and a label of the text to be processed;
the second feature extraction module is used for respectively extracting features of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector;
a third determining module, configured to determine a first tag corresponding to the first word vector and a second tag corresponding to the second word vector;
and the classification module is used for classifying the texts to be processed according to the first label and the second label to obtain a classification result of the texts to be processed.
An embodiment of the present application provides a text processing apparatus, including:
a memory for storing executable instructions; and a processor for implementing the above method when executing the executable instructions stored in the memory.
An embodiment of the present application provides a storage medium storing executable instructions which, when executed, cause a processor to implement the above method.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of obtaining a text to be matched, extracting the feature of the text to be matched according to the obtained feature of the text to be matched, and obtaining a first word vector and a second word vector.
Drawings
FIG. 1 is an alternative architectural diagram of a text processing system provided by an embodiment of the present application;
FIG. 2A is a schematic diagram of an alternative structure of a text processing system applied to a blockchain system according to an embodiment of the present disclosure;
FIG. 2B is an alternative block diagram according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a server provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative text processing method provided in the embodiments of the present application;
FIG. 5 is a schematic flow chart of an alternative text processing method provided in the embodiments of the present application;
FIG. 6A is an alternative flowchart of a training method for a text feature extraction network model according to an embodiment of the present application;
FIG. 6B is a schematic diagram of an overall structure of a text processing model provided in an embodiment of the present application;
FIG. 7 is an alternative flowchart of a training method for a text feature extraction network model according to an embodiment of the present disclosure;
FIG. 8 is an alternative flowchart of a training method for a text feature extraction network model according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a pseudo-twin neural network;
FIG. 11 is a schematic structural diagram of a text feature extraction network model provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a BERT network structure provided in an embodiment of the present application;
FIG. 13 is a schematic diagram of a classification case provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of the present application belong. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Convolutional Neural Network (CNN): a class of feedforward neural networks that contain convolution computations and have a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have a feature learning capability and can perform shift-invariant classification of input information according to their hierarchical structure, and are therefore also referred to as "Shift-Invariant Artificial Neural Networks" (SIANN).
2) Pseudo-twin neural network (also known as a pseudo-siamese network): in a pseudo-twin neural network, the two neural networks corresponding to the two inputs may be different neural networks, or may be neural networks of the same type. The pseudo-twin neural network is suited to handling cases where the two inputs are "somewhat different". The text processing method of the embodiments of the present application is implemented on a pseudo-twin neural network.
3) Word vector (word embedding): the collective name for a set of language modeling and feature learning techniques in natural language processing, in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word into a continuous vector space of much lower dimension.
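As a concrete illustration, the mapping from words to real-valued vectors and the resulting notion of semantic closeness can be sketched in a few lines of Python (a toy sketch with made-up vectors, not the application's trained embeddings):

```python
import numpy as np

# Toy vocabulary mapped to dense vectors; in practice these are learned,
# e.g. by word2vec or as the embedding layer of a BERT-style model.
embeddings = {
    "cat": np.array([0.21, -0.43, 0.65, 0.10]),
    "dog": np.array([0.19, -0.40, 0.60, 0.15]),
    "car": np.array([-0.55, 0.30, 0.02, -0.71]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Semantically related words end up close in the vector space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low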
In order to better understand the text processing method provided in the embodiment of the present application, a text semantic analysis method in the related art is first described:
in the related art, commonly used text semantic analysis methods mainly include the following. Text semantic analysis through a bag-of-words model: all words are put into a bag without considering grammar or word order, i.e., each word is regarded as independent, and the text semantics are finally expressed as a discrete vector. Text semantic analysis of the input text through a CNN or an RNN: through their own network structures, CNNs and RNNs can convert the word vector (word embedding) of each word in the text into a vector representation of the whole text in a low-dimensional real space. Text semantic analysis through BERT: a bidirectional transformer encoding network learns the semantic representation of text from massive data, which is currently the mainstream method of text semantic representation.
However, among the above text semantic analysis methods in the related art, the bag-of-words model suffers from high dimensionality and ignores the grammatical and sequential relations among words, which results in a severe lack of expressiveness for similar words. The CNN and RNN models, due to weaknesses of their network structures, either cannot effectively abstract the deep semantic information of the text or cannot be trained in parallel on large-scale data, so the models lack strong generalization capability. The transformer is currently the mainstream feature extractor in the field of Natural Language Processing (NLP) and has strong abstract expression capability; BERT uses the encoder part of the transformer to learn abstract text representations on a large-scale corpus through specific tasks and serves as a base model for other NLP tasks. Therefore, the text semantic analysis methods in the related art cannot accurately extract the feature data corresponding to the input text, so the text cannot be accurately matched or classified.
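To make the word-order weakness of the bag-of-words model concrete, a minimal sketch (vocabulary and sentences invented for illustration):

```python
from collections import Counter

def bag_of_words(text: str, vocab: list[str]) -> list[int]:
    """Discrete count vector: one dimension per vocabulary word."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

vocab = ["dog", "bites", "man"]
# Word order is discarded, so these opposite sentences collide:
print(bag_of_words("dog bites man", vocab))  # [1, 1, 1]
print(bag_of_words("man bites dog", vocab))  # [1, 1, 1]
```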
In view of at least one of the above problems in the related art, the embodiments of the present application provide a text processing method: by further acquiring the label of the text to be matched when the text to be matched is acquired, feature extraction can be performed on the text to be matched according to the label, and the first word vector and the second word vector are obtained respectively, so that the feature data corresponding to the input text are extracted accurately and the text is matched or classified accurately.
In addition, the solution provided in the embodiment of the present application relates to an artificial intelligence model building technology, for example, a text feature extraction network model and the like for respectively performing feature extraction on the first part and the second part are built, which will be described below.
Here, it should be noted that artificial intelligence is a theory, method, technique and application system that simulates, extends and expands human intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
An exemplary application of the text processing device provided by the embodiments of the present application is described below. The device provided by the embodiments of the present application can be implemented as various types of terminals, such as a notebook computer, a tablet computer, a desktop computer, or a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), and can also be implemented as a server. In the following, an exemplary application in which the device is implemented as a server is explained.
Referring to fig. 1, fig. 1 is a schematic diagram of an alternative architecture of a text processing system 10 according to an embodiment of the present application. To support a feed stream push application, terminals (for example, a terminal 100-1 and a terminal 100-2) are connected to a server 300 through a network 200. A terminal acquires a text to be processed and the label of the text to be processed and sends them to the server 300 through the network 200, so that the server 300 extracts features from the first part and the second part of the text to be processed according to the label to obtain a first word vector and a second word vector, determines the distance between the first word vector and the second word vector, determines the degree of matching between the first part and the second part according to the distance to obtain a matching result, and sends the matching result to the terminal through the network 200. The network 200 may be a wide area network, a local area network, or a combination of the two. The terminal may display the matching result on the current page (the current page 110-1 and the current page 110-2 are shown as examples).
The text processing system 10 related to the embodiments of the present application may also be a distributed system 101 of a blockchain system. Referring to fig. 2A, fig. 2A is an optional structural schematic diagram of the text processing system 10 provided in the embodiments of the present application applied to a blockchain system. The distributed system 101 may be formed by a plurality of nodes 102 (computing devices in any form in an access network, such as servers and user terminals) and clients 103; a Peer-to-Peer (P2P) network is formed between the nodes, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join to become a node, and a node comprises a hardware layer, a middle layer, an operating system layer, and an application layer.
Referring to the functions of each node in the blockchain system shown in fig. 2A, the functions involved include:
1) Routing: a basic function of a node, used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) Application: deployed in a blockchain to implement specific services according to actual service requirements; it records data related to the realized functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are verified successfully.
For example, the services implemented by the application include:
2.1) Wallet: provides the function of conducting transactions in electronic money, including initiating a transaction (i.e., sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction are stored in a temporary block of the blockchain as a response confirming that the transaction is valid). Of course, the wallet also supports querying the electronic money remaining at an electronic money address.
2.2) Shared ledger: provides functions such as storage, query, and modification of account data. Record data of operations on the account data are sent to other nodes in the blockchain system; after the other nodes verify their validity, the record data are stored in a temporary block as a response acknowledging that the account data are valid, and a confirmation may be sent to the node that initiated the operation.
2.3) Smart contract: a computerized agreement that can enforce the terms of a contract, implemented through code deployed on the shared ledger and executed when certain conditions are met; it is used to complete automated transactions according to actual business requirements, for example querying the logistics status of goods purchased by a buyer, or transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods. Of course, smart contracts are not limited to contracts for executing transactions, and may also execute contracts that process received information.
3) Blockchain: a series of blocks connected to one another in the chronological order of their generation. A new block cannot be removed once added to the blockchain, and the blocks record the data submitted by nodes in the blockchain system.
4) Consensus: a process in a blockchain network used to reach agreement on the transactions in a block among the plurality of nodes involved; the agreed block is appended to the end of the blockchain. Mechanisms for achieving consensus include Proof of Work (PoW), Proof of Stake (PoS), Delegated Proof of Stake (DPoS), Proof of Elapsed Time (PoET), and so on.
Referring to fig. 2B, fig. 2B is an optional schematic diagram of a block structure provided in this embodiment. Each block includes the hash value of the transaction records stored in the block (the hash value of the block) and the hash value of the previous block, and the blocks are connected by their hash values to form a blockchain. A block may also include information such as a timestamp of the time of block generation. A blockchain is essentially a decentralized database, a string of data blocks associated using cryptography; each data block contains related information for verifying the validity (anti-counterfeiting) of its information and for generating the next block.
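As an illustrative aside, the hash linking described above can be shown with a toy Python sketch (fields and structure invented for illustration, not the patent's implementation):

```python
import hashlib
import json
import time

def make_block(records: list, prev_hash: str) -> dict:
    """Each block stores the hash of its predecessor, chaining the blocks."""
    block = {
        "timestamp": time.time(),
        "records": records,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

genesis = make_block(["genesis"], prev_hash="0" * 64)
block1 = make_block(["record: node 102 submits data"], prev_hash=genesis["hash"])
# Tampering with genesis would change its hash and break the link to block1.
```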
Referring to fig. 3, fig. 3 is a schematic structural diagram of a server 300 according to an embodiment of the present application, where the server 300 shown in fig. 3 includes: at least one processor 310, memory 350, at least one network interface 320, and a user interface 330. The various components in server 300 are coupled together by a bus system 340. It will be appreciated that the bus system 340 is used to enable communications among the components connected. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 340 in fig. 3.
The processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 330 includes one or more output devices 331, including one or more speakers and/or one or more visual display screens, that enable presentation of media content. The user interface 330 also includes one or more input devices 332, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310. The memory 350 may include either volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 350 described in embodiments herein is intended to comprise any suitable type of memory. In some embodiments, memory 350 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.
An operating system 351 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 352, for communicating with other computing devices via one or more (wired or wireless) network interfaces 320; exemplary network interfaces 320 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
an input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 3 illustrates a text processing apparatus 354 stored in the memory 350, where the text processing apparatus 354 may be a text processing apparatus in the server 300, and may be software in the form of programs and plug-ins, and the like, and includes the following software modules: the first obtaining module 3541, the first feature extraction module 3542, the first determination module 3543, and the second determination module 3544 are logical and thus may be arbitrarily combined or further separated depending on the functionality implemented. The functions of the respective modules will be explained below.
In other embodiments, the text processing device 354 may also be disposed on another second server, may be a text processing device in the second server, and may also be software in the form of programs and plug-ins, and the like, including the following software modules: a second obtaining module, a second feature extracting module, a third determining module and a classifying module (not shown in the figure), which are logical, so that they can be arbitrarily combined or further separated according to the implemented functions. The functions of the respective modules will be explained below.
In other embodiments, the apparatus provided in the embodiments of the present Application may be implemented in hardware, and for example, the apparatus provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the text processing method provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The text processing method provided by the embodiment of the present application will be described below with reference to an exemplary application and implementation of the server 300 provided by the embodiment of the present application. Referring to fig. 4, fig. 4 is an alternative flowchart of a text processing method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
Step S401, a text to be processed and a label of the text to be processed are obtained.
Here, the text to be processed includes at least a first part and a second part; the first part may be related or unrelated to the second part. When the first part is related to the second part, the two parts describe the same content; when the first part is unrelated to the second part, they describe different content.
The label of the text to be processed identifies the category to which the text belongs. For example, when the text to be processed concerns political figures, the label is the political field; when the text to be processed is gossip about celebrities, the label is the entertainment field; when the text to be processed is common life knowledge, the label is the life field; and so on. The label may be a primary category label of the text to be processed input by the user.
In the embodiment of the application, a user can input the text to be processed and the label of the text to be processed through a terminal, and the terminal sends the text to be processed and the label of the text to be processed to a server, so that the server can acquire the text to be processed and the label. A text processing application may be run on the terminal, and a user may input the text to be processed and the tag on a client of the text processing application.
Step S402, respectively extracting the characteristics of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector.
Here, feature extraction is performed on the first part of the text to be processed according to the label to obtain a first word vector, and feature extraction is performed on the second part of the text to be processed according to the label to obtain a second word vector. Feature extraction can be realized through a pre-trained text feature extraction network model: the first part and the second part of the text to be processed are respectively input into the text feature extraction network model, and the first word vector and the second word vector obtained after feature extraction are output.
Step S403, determining a distance between the first word vector and the second word vector.
Here, after obtaining the first word vector and the second word vector, the first word vector and the second word vector may be input into a preset loss model, and a loss calculation is performed on the first word vector and the second word vector through the preset loss model to obtain a distance between the first word vector and the second word vector, where the distance is used to represent a difference between a first portion corresponding to the first word vector and a second portion corresponding to the second word vector.
And S404, determining the matching degree between the first part and the second part according to the distance.
Here, when the distance is greater than a preset threshold, the degree of matching between the first part and the second part is determined to be low; when the distance is less than the preset threshold, the degree of matching between the first part and the second part is determined to be high. After the matching degree is obtained, it is taken as the matching result of the text to be processed, and the matching result is output.
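A minimal sketch of steps S403 and S404 (the application fixes neither the distance metric nor the threshold, so the Euclidean distance and the threshold of 1.0 below are assumptions):

```python
import numpy as np

def matching_degree(v1, v2, threshold: float = 1.0) -> str:
    """Steps S403-S404: distance between the two word vectors,
    thresholded into a matching result."""
    distance = float(np.linalg.norm(np.asarray(v1) - np.asarray(v2)))
    return "high" if distance < threshold else "low"

# Nearby vectors -> the first and second parts match.
print(matching_degree([0.2, 0.7, -0.1], [0.25, 0.65, -0.05]))  # high
print(matching_degree([0.2, 0.7, -0.1], [-0.9, 0.1, 0.8]))     # low
```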
According to the text processing method provided by the embodiments of the present application, the label of the text to be matched is also acquired when the text to be matched is acquired, so that feature extraction can be performed on the text to be matched according to the label, and the first word vector and the second word vector are obtained respectively. Because features are extracted from a text with a given label, effective words can be extracted quickly and accurately to obtain the corresponding first word vector and second word vector; the feature data corresponding to the input text to be processed are thus extracted accurately, and the text to be processed is matched or classified accurately.
In some embodiments, the embodiments of the present application may correspond to the following scenario: the title and the body of feed stream texts of the same category or the same field need to be matched. The title and the body of a feed stream text can be combined into a title-body pair, and the title-body pair is then processed as the text to be processed by the text processing method provided by the embodiments of the present application, where the first part is the title of the text to be processed and the second part is the part other than the title, i.e., the body part of the title-body pair; the category of the feed stream text is acquired as the label of the text to be processed, thereby realizing the method of the embodiments of the present application.
Fig. 5 is an alternative flowchart of a text processing method provided in an embodiment of the present application, and as shown in fig. 5, the method includes the following steps:
step S501, a text to be processed and a label of the text to be processed are obtained.
Here, the text to be processed is a title-body pair, which consists of a title and a body; the title and the body may come from the same article or from different articles. The label of the text to be processed is the category label of the feed stream text corresponding to the title-body pair.
Step S502, according to the labels, feature extraction is respectively carried out on the title part and other parts of the text to be processed, and a first word vector and a second word vector are correspondingly obtained.
Here, the other part is the remaining part other than the header part, and for example, the other part may be a body part.
In the embodiment of the application, feature extraction is performed on the title parts according to the labels respectively to obtain first word vectors, and feature extraction is performed on other parts according to the labels to obtain second word vectors. When feature extraction is carried out, feature extraction can be realized through a pre-trained text feature extraction network model, the title part and other parts of the text to be processed are respectively input into the text feature extraction network model, and a first word vector and a second word vector obtained after feature extraction are output.
Step S503, determining a distance between the first word vector and the second word vector.
And step S504, determining the matching degree between the title part and other parts according to the distance.
Here, after determining the distance between the first word vector and the second word vector, when the distance is greater than a preset threshold, it is determined that the title portion has a low degree of matching with the other portion, and when the distance is less than the preset threshold, it is determined that the title portion has a high degree of matching with the other portion. And after the matching degree is obtained, taking the matching degree as a matching result of the title text pair, and outputting the matching result.
The text processing method provided by the embodiments of the present application can match the body of a text with its title and determine whether an input title and body come from the same article. In this way, the large number of feed stream files in current applications with feed stream data can be matched quickly, and articles whose title and body correspond and that suit the user can be screened out and pushed in time, improving the user experience.
In some embodiments, the feature extraction performed on the first part and the second part to obtain the first word vector and the second word vector can be implemented with a text feature extraction network model: feature extraction is performed on the acquired text to be processed through the text feature extraction network model to obtain the first word vector and the second word vector, which are then output to a preset loss model.
Here, a method for training a text feature extraction network model is provided, as shown in fig. 6A, which is an optional flowchart illustration of a method for training a text feature extraction network model provided in an embodiment of the present application, and the method includes:
step S601, inputting the first part and the second part of the sample data into a BERT network model respectively, and correspondingly obtaining a third word vector and a fourth word vector.
Here, the sample data are input data for model training; the sample data are text data comprising a first part and a second part that are related, or a first part and a second part that are unrelated.
In the embodiment of the present application, two BERT network models are included, and the first part and the second part of the sample data are respectively input, that is, the first part of the sample data is input into one BERT network model, and the second part of the sample data is input into the other BERT network model. One BERT network model extracts the first part of features of the input sample data to obtain a third word vector, and the other BERT network model extracts the second part of features of the input sample data to obtain a fourth word vector.
Step S602, inputting the third word vector and the fourth word vector into a preset loss model, and obtaining a loss result.
Here, after the third word vector and the fourth word vector are obtained, they are input into a preset loss model. That is, in the embodiments of the present application, the text feature extraction network model comprises two BERT network models, and the text feature extraction network model and the preset loss model together form the whole text processing model.
As shown in fig. 6B, which is a schematic diagram of an overall structure of the text processing model provided in the embodiment of the present application, the text processing model 600 includes a first BERT network model 601, a second BERT network model 602, and a preset loss model 603, where the first BERT network model 601 and the second BERT network model 602 form the text feature extraction network model.
The preset loss model performs loss calculation on the input third word vector and fourth word vector. The preset loss model comprises a loss function; the distance between the third word vector and the fourth word vector is calculated through the loss function, and the preset loss model determines the calculated distance as the loss result. It should be noted that the loss function in the preset loss model is not fixed; the parameters of the loss function can be adjusted, and the loss function can be replaced, according to the actual text processing requirements.
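Since the application leaves the loss function replaceable, the following is only one plausible choice: a contrastive loss over the distance between the third and fourth word vectors, sketched in PyTorch with an assumed Euclidean distance and a margin of 1.0:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v3: torch.Tensor, v4: torch.Tensor,
                     label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """label = 1 for related (positive) pairs, 0 for unrelated (negative)
    pairs. Related pairs are pulled together; unrelated pairs are pushed
    at least `margin` apart. d is the distance the loss model outputs."""
    d = F.pairwise_distance(v3, v4)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()
```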
And S603, correcting the BERT network model according to the loss result to obtain the text feature extraction network model.
Here, when the loss result indicates that the distance between the third word vector and the fourth word vector is greater than the threshold, it indicates that the first part and the second part are not related, that is, the text processing model gives a processing result that the first part and the second part of the input sample data are not related. If the processing result is correct, the BERT network model is not corrected, or the BERT network model is finely adjusted according to the distance; and if the processing result is incorrect, correcting the BERT network model according to the distance.
When the loss result indicates that the distance between the third word vector and the fourth word vector is smaller than the threshold value, it indicates that the first part and the second part are related, that is, the processing result given by the text processing model is that the first part and the second part of the input sample data are related, and if the processing result is correct, the BERT network model is not corrected, or the BERT network model is finely adjusted according to the distance; and if the processing result is incorrect, correcting the BERT network model according to the distance.
In the embodiment of the application, a new BERT network model is formed after the BERT network model is corrected, and the text feature extraction network model is formed through the two new BERT network models.
According to the training method for the text feature extraction network model provided by the embodiments of the present application, the first part and the second part of the sample data are respectively input into BERT network models, feature extraction is performed on them through the two BERT network models, the distance between the extracted third word vector and fourth word vector is determined based on the preset loss model, and the BERT network models are corrected accordingly, so that an accurate text feature extraction network model can be obtained.
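An end-to-end sketch of one correction step under stated assumptions (the bert-base-chinese checkpoint, the pooler output as the sentence vector, and the contrastive form of the loss are illustrative choices, not the patent's prescription; the application notes that an open-source ERNIE 1.0 base model may be used in practice):

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert_first = BertModel.from_pretrained("bert-base-chinese")   # encodes the first part
bert_second = BertModel.from_pretrained("bert-base-chinese")  # encodes the second part
optimizer = torch.optim.AdamW(
    list(bert_first.parameters()) + list(bert_second.parameters()), lr=2e-5)

def train_step(first_part: str, second_part: str, related: float) -> float:
    """One correction step: encode both parts, compute a distance-based
    loss, update both BERT models. related = 1.0 for positive sample
    data, 0.0 for negative sample data."""
    x1 = tokenizer(first_part, return_tensors="pt", truncation=True)
    x2 = tokenizer(second_part, return_tensors="pt", truncation=True)
    v3 = bert_first(**x1).pooler_output    # third word vector
    v4 = bert_second(**x2).pooler_output   # fourth word vector
    d = F.pairwise_distance(v3, v4)
    label = torch.tensor([related])
    loss = (label * d.pow(2) + (1 - label) * F.relu(1.0 - d).pow(2)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```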
Based on fig. 6A, in some embodiments, the preset loss model determines the loss result through a loss function. As shown in fig. 7, which is an optional flowchart of the training method for a text feature extraction network model provided in the embodiment of the present application, the process of obtaining the loss result in step S602 may include the following steps:
step S701, inputting the third word vector and the fourth word vector into the preset loss model, and determining a distance between the third word vector and the fourth word vector through a loss function in the preset loss model.
After a third word vector corresponding to the first part of the sample data and a fourth word vector corresponding to the second part of the sample data are extracted through a text feature extraction network model, the third word vector and the fourth word vector are respectively input into a preset loss model, wherein the preset loss model comprises a loss function, and the distance between the third word vector and the fourth word vector is determined by calculating the third word vector and the fourth word vector through the loss function.
Step S702, determining the matching degree between the first part and the second part according to the distance.
Here, when the distance is greater than a preset threshold, it is determined that there is a low degree of matching between the first portion and the second portion, and when the distance is less than the preset threshold, it is determined that there is a high degree of matching between the first portion and the second portion.
Step S703, determining the matching degree as the loss result.
And after the matching degree is obtained, taking the matching degree as a loss result between the third word vector and the fourth word vector.
Based on fig. 6A, in some embodiments, the sample data include positive sample data and negative sample data. As shown in fig. 8, which is an optional flowchart of the training method for the text feature extraction network model provided in the embodiment of the present application, the method includes the following steps:
step S801, inputting the first part and the second part of the positive sample data into a BERT network model respectively, and correspondingly obtaining a third word vector and a fourth word vector.
Here, the positive sample data is data in which the first part and the second part of the sample data match, and for example, when the first part is a headline part and the second part is a body part, the positive sample data is data in which the headline part and the body part are from the same article.
The embodiment of the application provides a method for determining positive sample data, which comprises the following steps:
step S8011, the first information pair formed by the header information and the body information having the same text type label is determined as the positive sample data.
Step S802, inputting the third word vector and the fourth word vector into the preset loss model, and determining a distance between the third word vector and the fourth word vector through a loss function in the preset loss model.
Step S803, determining the matching degree between the first part and the second part determined according to the distance as a positive sample matching degree.
In the embodiment of the application, the first part and the second part in the positive sample data are input into a BERT network model, the third word vector and the fourth word vector are output, and then the third word vector and the fourth word vector are input into a preset loss model, so that the distance between the third word vector and the fourth word vector is obtained, and the matching degree of the positive sample is determined according to the distance.
Step S804, the first part and the second part of the negative sample data are respectively input into the BERT network model, and a third word vector and a fourth word vector are correspondingly obtained.
Here, negative sample data are data in which the first part and the second part of the sample data do not match; for example, when the first part is a title part and the second part is a body part, negative sample data are data in which the title part and the body part come from different articles.
The embodiment of the application provides a method for determining negative sample data, which comprises the following steps:
step S8041, determining a second information pair formed by the header information and the body information having different text type labels as the negative sample data.
Step S805, inputting the third word vector and the fourth word vector into the preset loss model, and determining a distance between the third word vector and the fourth word vector through a loss function in the preset loss model.
Step S806, determining the matching degree between the first part and the second part determined according to the distance as a negative sample matching degree.
In the embodiment of the application, the first part and the second part in the negative sample data are input into a BERT network model, the third word vector and the fourth word vector are output, and then the third word vector and the fourth word vector are input into a preset loss model, so that the distance between the third word vector and the fourth word vector is obtained, and the matching degree of the negative sample is determined according to the distance.
Correspondingly, the method for obtaining the text feature extraction network model in step S603 can be implemented by the following steps:
step S807, when the matching degree of the negative sample is greater than a threshold value or the matching degree of the positive sample is less than the threshold value, correcting the BERT network model according to the matching degree of the positive sample and the matching degree of the negative sample.
Here, after the positive sample matching degree and the negative sample matching degree are determined, whether the positive sample matching degree is greater than a threshold value or not and whether the negative sample matching degree is less than the threshold value or not are determined, and when the negative sample matching degree is greater than the threshold value or the positive sample matching degree is less than the threshold value, it indicates that the current model cannot correctly extract features of the first part and the second part of the sample data, so that an accurate matching result cannot be obtained, and therefore, the BERT network model needs to be corrected.
In the embodiment of the application, the BERT network model can be corrected according to the positive sample matching degree and the negative sample matching degree, that is, the difference between the positive sample matching degree and the negative sample matching degree and the actual matching degree is determined, and the BERT network model is corrected according to the difference.
In some embodiments, the method further comprises: and step S808, when the matching degree of the negative sample is smaller than a threshold value or the matching degree of the positive sample is larger than a threshold value, determining the BERT network model as the text feature extraction network model.
Here, when the matching degree of the negative sample is smaller than the threshold or the matching degree of the positive sample is larger than the threshold, it indicates that the current model can perform correct feature extraction on the first part and the second part of the sample data, so that an accurate matching result can be obtained.
Fig. 9 is an optional flowchart of a text processing method provided in an embodiment of the present application, and as shown in fig. 9, the method implements classification of an input text to be processed, and includes the following steps:
step S901, a text to be processed and a tag of the text to be processed are acquired.
And S902, respectively extracting the characteristics of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector.
It should be noted that steps S901 and S902 are the same as steps S401 and S402, and are therefore not described again in this embodiment of the present application.
Step S903, determining a first label corresponding to the first word vector and a second label corresponding to the second word vector.
Here, the first tag is used to identify a category of the extracted first word vector, and the second tag is used to identify a category of the extracted second word vector. The first label and the second label may be the same as or different from the label of the text to be processed.
Step S904, classifying the text to be processed according to the first label and the second label, to obtain a classification result of the text to be processed.
Here, the first part and the second part of the text to be processed are classified into one class when the first label is the same as the second label, and into different classes when the first label is different from the second label.
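A minimal sketch of this classification path; the hidden size, the number of categories, and the linear label head are hypothetical choices for illustration, not taken from the application:

```python
import torch
import torch.nn as nn

HIDDEN = 768       # assumed word-vector dimension (BERT base hidden size)
NUM_CLASSES = 30   # assumed number of primary category labels

classifier = nn.Linear(HIDDEN, NUM_CLASSES)  # hypothetical shared label head

def classify_text(v1: torch.Tensor, v2: torch.Tensor):
    """Steps S903-S904: predict a label for each word vector; the two
    parts fall into one class only if the predicted labels agree."""
    first_label = classifier(v1).argmax(dim=-1)
    second_label = classifier(v2).argmax(dim=-1)
    same_class = bool((first_label == second_label).all())
    return first_label.item(), second_label.item(), same_class
```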
In some embodiments, the feature extraction performed on the first part and the second part to obtain the first word vector and the second word vector can be implemented with a text feature extraction network model: feature extraction is performed on the acquired text to be processed through the text feature extraction network model to obtain the first word vector and the second word vector, which are then output to a preset loss model. The training process of the text feature extraction network model is the same as in any of the above embodiments and is not described in detail here.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiments of the present application provide a text processing method and a training method for a text feature extraction network model, using a pseudo-twin neural network (pseudo-siamese network) as the model framework and BERT as the base model to realize the semantic representation of the body part of a feed stream article. A key point of the embodiments of the present application lies in the construction of the training data. The training data may be feed stream articles with primary classification labels, comprising positive sample data and negative sample data, where the positive sample data are data whose title and body are consistent, and the negative sample data are data whose title and body are inconsistent.
The pseudo-twin neural network is explained below. Fig. 10 is a schematic structural diagram of a pseudo-twin neural network 1000. In a pseudo-twin neural network, the two neural networks corresponding to the input data may be different neural networks (for example, fig. 10 shows one neural network as an LSTM network 1001 and the other as a CNN network 1002); of course, the two neural networks may also be of the same type. The pseudo-twin neural network is suited to handling cases where the two inputs are "somewhat different", for example verifying whether a title is consistent with the description in the body (where the title and body differ greatly in length), or whether a body describes a picture (i.e., one input is a picture and one input is a body). As shown in fig. 10, a loss model 1003 is connected to the outputs of the two neural networks (the LSTM network 1001 and the CNN network 1002) and performs a loss calculation on their output data to obtain the difference, or distance, between the two outputs, so as to verify whether the title and body descriptions are consistent or whether the body describes the picture. That is, the outputs of the LSTM network 1001 and the CNN network 1002 are the inputs of the loss model 1003.
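A sketch of such a two-branch model in PyTorch, with invented sizes, showing that the branches can differ in architecture as in fig. 10:

```python
import torch
import torch.nn as nn

class PseudoSiamese(nn.Module):
    """The two branches need not share architecture or weights: here an
    LSTM encodes one input and a CNN encodes the other, mirroring the
    structure sketched in FIG. 10 (all sizes are illustrative)."""
    def __init__(self, vocab_size: int = 30000, emb: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv1d(emb, hidden, kernel_size=3, padding=1),
            nn.AdaptiveMaxPool1d(1),
        )

    def forward(self, ids_a: torch.Tensor, ids_b: torch.Tensor):
        _, (h, _) = self.lstm(self.embed(ids_a))
        out_a = h[-1]                                             # (batch, hidden)
        out_b = self.cnn(self.embed(ids_b).transpose(1, 2)).squeeze(-1)
        return out_a, out_b  # fed to the loss model (e.g. pairwise distance)
```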
In the embodiments of the present application, a pseudo-twin neural network model is combined with a classification model in a multi-task learning model, i.e., the text feature extraction network model, to implement the text processing method of the embodiments of the present application, where the sub-models may use BERT network models. Fig. 11 is a schematic structural diagram of the text feature extraction network model 1100 provided in the embodiments of the present application; the two neural networks performing feature extraction on the input data are both BERT network models, shown in fig. 11 as BERT network models 1101 and 1102, together with the loss model 1003.
With reference to fig. 11, the two input data (i.e., the text to be processed or the sample data) in the figure are the title and the body of an article, respectively, where the BERT network model 1101 and the BERT network model 1102 are two BERT network models that have the same structure and the same pre-training weights but do not share weights. The BERT network structure is shown in fig. 12; the BERT network model is currently the mainstream text feature extraction model and adopts a two-stage training process, comprising a large-scale unlabeled data pre-training process 1201 and a downstream task-related fine-tuning process 1202.
In some embodiments, the present application provides two training tasks. Since the purpose of the present application is to learn precise semantic expression vectors of titles and bodies, reasonable downstream fine-tuning tasks are specially designed for this purpose: the downstream tasks are the multi-task training of title-body matching and of title and body classification. The advantage of multi-task training is that more supervision information is given to the model during training, making the model more accurate. For the pre-trained model part, the open-source ERNIE 1.0 base model may be used in the embodiments of the present application.
For the title-body matching task, the model learns whether a paired title and body in the given corpus match, i.e., whether the title and the given article are talking about the same thing.
For the title and body classification tasks, the title and the body are each passed through the classification model, so that the model can give the correct category label for a given title or body.
Through the multi-task learning of the embodiment of the application, the model can learn not only the category distribution of articles but also fine-grained semantic information.
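A hedged sketch of this multi-task objective follows: one loss term for title-body matching, plus one classification loss on each branch vector. The head shapes, the number of first-level categories, and the equal loss weighting are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 768        # BERT-base hidden size
num_classes = 28    # assumed number of first-level category labels

match_head = nn.Linear(hidden * 2, 2)       # match / no-match
cls_head = nn.Linear(hidden, num_classes)   # category prediction

def multi_task_loss(title_vec, body_vec, match_label, class_label):
    # Task 1: do this title and body describe the same thing?
    match_logits = match_head(torch.cat([title_vec, body_vec], dim=-1))
    loss_match = F.cross_entropy(match_logits, match_label)
    # Task 2: each part must also predict the article's category label.
    loss_cls = (F.cross_entropy(cls_head(title_vec), class_label)
                + F.cross_entropy(cls_head(body_vec), class_label))
    return loss_match + loss_cls  # equal weighting assumed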
In some embodiments, the present application uses different sources of training data; for example, the online data reported every day can be used as the model's training data. The innovation of the present application lies in the construction of this data.
When constructing the data, the matching task requires positive and negative sample data, where the positive sample data are samples whose title and body match, and the negative sample data are samples whose title and body do not match. The positive sample data need not be specially constructed: real online articles can be used directly as positive sample data. When constructing negative sample data, titles and bodies are randomly shuffled so that a title is combined with the body of a non-corresponding article. Since the training data itself carries first-level classification information, shuffling proceeds along two lines: first, articles under the same category label are shuffled, so that the model learns the subtle differences between different titles within the same domain; second, articles under different category labels are shuffled, so that the model learns the essential differences between titles from different domains.
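The data construction can be sketched as follows, assuming each article is a dict with "title", "body", and a first-level "label": real articles become positives, while negatives pair a title with the body of another article, drawn both from the same category and from a different category.

import random

def build_pairs(articles):
    pairs = []  # (title, body, match_label): 1 = match, 0 = no match
    by_label = {}
    for a in articles:
        by_label.setdefault(a["label"], []).append(a)
        pairs.append((a["title"], a["body"], 1))  # online article as positive
    for a in articles:
        # Negative within the same category: subtle in-domain differences.
        same = [x for x in by_label[a["label"]] if x is not a]
        if same:
            pairs.append((a["title"], random.choice(same)["body"], 0))
        # Negative across categories: essential cross-domain differences.
        other = [l for l in by_label if l != a["label"]]
        if other:
            b = random.choice(by_label[random.choice(other)])
            pairs.append((a["title"], b["body"], 0))
    return pairs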
For the classification task, the existing first-level category labels in the data can be used directly as supervision information to train the model. The classification task enables the model to distinguish the same or similar subjects across different domains. If the matching task were used alone, the model might be unable to distinguish "Guo Degang" from "Zhou Jielun", believing only that both are Chinese; the first-level category data can explicitly tell the model that "Guo Degang" is a folk-art performer while "Zhou Jielun" is an entertainment figure. Fig. 13 is a schematic diagram of a classification case provided in the embodiment of the present application: the text "Why do today's pupils like Zhou Jielun?" is input as input data 1301 to the trained model, related items in the application's feed stream data are matched, a matching similarity value 1303 is displayed for each feed stream item 1302, and the feed stream items 1302 belonging to the same category as the input data 1301 can be pushed to the user.
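The matching stage of this case can be illustrated with a short ranking helper; it assumes the feed items have already been encoded into a matrix of vectors by the hypothetical encode() helper above, and simply returns the most similar items together with their similarity values.

import torch
import torch.nn.functional as F

def rank_feed(query_vec, feed_vectors, feed_items, top_k=5):
    # query_vec: (1, H); feed_vectors: (N, H); feed_items: list of N items.
    sims = F.cosine_similarity(query_vec, feed_vectors)  # (N,)
    scores, idx = sims.topk(min(top_k, len(feed_items)))
    return [(feed_items[i], float(s))
            for i, s in zip(idx.tolist(), scores.tolist())]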
The text processing method provided by the embodiment of the application can be applied to applications with feed stream data, for example, to the second- and third-level classification corpus expansion of information stream products such as Tiantian Kuaibao, QQ Kandian, and the QQ Browser. The importance of data in the deep learning era is self-evident. Because the third-level classification has a large number of categories (1000+) and manually labeled data is extremely costly, the method provided by the embodiment of the application can quickly and effectively expand corpora even when the number of categories is large: given a portion of labeled data, articles similar in content can be expanded rapidly and accurately for training the second- and third-level classification. The indices for evaluating the expansion effect are mainly the expansion ratio and the expansion accuracy. The average expansion ratio is 1:40, that is, one article can find 40 similar articles (the ratio depends on the amount of base data), and the accuracy reaches 50% at third-level classification granularity. This is of great significance for the construction of third-level classification corpora.
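Under the stated assumptions, corpus expansion reduces to nearest-neighbor retrieval in the learned vector space: each labeled seed article pulls in the most similar unlabeled articles, which inherit its fine-grained label. The per-seed budget of 40 and the similarity threshold below are illustrative values, not prescriptions from this embodiment.

import torch
import torch.nn.functional as F

def expand_corpus(seed_vecs, seed_labels, pool_vecs, pool_texts,
                  per_seed=40, min_sim=0.8):
    expanded = []  # (text, inherited_label, similarity)
    for vec, label in zip(seed_vecs, seed_labels):
        sims = F.cosine_similarity(vec.unsqueeze(0), pool_vecs)  # (N,)
        scores, idx = sims.topk(min(per_seed, len(pool_texts)))
        for s, i in zip(scores.tolist(), idx.tolist()):
            if s >= min_sim:  # threshold is an assumption, not from the text
                expanded.append((pool_texts[i], label, s))
    return expanded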
Continuing with the exemplary structure in which the text processing device 354 provided by the embodiments of the present application is implemented as software modules: in some embodiments, as shown in fig. 3, the software modules stored in the text processing device 354 of the memory 350 may form a text processing device in the server 300, including:
a first obtaining module 3541, configured to obtain a text to be processed and a tag of the text to be processed;
a first feature extraction module 3542, configured to perform feature extraction on the first part and the second part of the text to be processed respectively according to the tag, so as to obtain a first word vector and a second word vector correspondingly;
a first determining module 3543 for determining a distance between the first word vector and the second word vector;
a second determining module 3544, configured to determine a matching degree between the first portion and the second portion according to the distance.
In some embodiments, the first part is the title of the text to be processed, and the second part is the part of the text to be processed other than the title;
correspondingly, the first feature extraction module is further configured to perform feature extraction on the title part and the other part of the text to be processed respectively according to the tag, so as to obtain the first word vector and the second word vector correspondingly; the first determining module is further configured to determine the distance between the first word vector and the second word vector; and the second determining module is further configured to determine the matching degree between the title part and the other part according to the distance.
In some embodiments, feature extraction is performed on the first part and the second part respectively by using a text feature extraction network model, so as to obtain the first word vector and the second word vector; the text feature extraction network model is obtained by training through the following steps: respectively inputting the first part and the second part of the sample data into a BERT network model, and correspondingly obtaining a third word vector and a fourth word vector; inputting the third word vector and the fourth word vector into a preset loss model to obtain a loss result; and correcting the BERT network model according to the loss result to obtain a text feature extraction network model.
In some embodiments, the text feature extraction network model is trained by: inputting the third word vector and the fourth word vector into the preset loss model, and determining a distance between the third word vector and the fourth word vector through a loss function in the preset loss model; determining a degree of matching between the first portion and the second portion according to the distance; and determining the matching degree as the loss result.
In some embodiments, the sample data comprises positive and negative sample data; the text feature extraction network model is obtained by training through the following steps: determining the matching degree corresponding to the positive sample data as a positive sample matching degree; determining the matching degree corresponding to the negative sample data as a negative sample matching degree; and when the matching degree of the negative sample is greater than a threshold value or the matching degree of the positive sample is smaller than the threshold value, correcting the BERT network model according to the matching degree of the positive sample and the matching degree of the negative sample.
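This threshold rule behaves like a margin-based contrastive objective, sketched below: a positive pair is penalized when its matching degree falls below the threshold, and a negative pair when its matching degree rises above it. The threshold value is an assumption; the embodiment only names "a threshold".

import torch
import torch.nn.functional as F

def threshold_loss(pos_match, neg_match, threshold=0.5):
    # pos_match / neg_match: matching degrees in [0, 1] for a batch.
    loss_pos = F.relu(threshold - pos_match)  # positive pair too dissimilar
    loss_neg = F.relu(neg_match - threshold)  # negative pair too similar
    return (loss_pos + loss_neg).mean()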
In some embodiments, the text feature extraction network model is trained by: determining a first information pair formed by the title information and the text information with the same text type label as the positive sample data; and determining a second information pair formed by the title information and the text information with different text type labels as the negative sample data.
In other embodiments, the text processing device 354 may also be disposed on a second server as a text processing device in that server, and may likewise be software in the form of programs, plug-ins, and the like, including the following software modules:
the second acquisition module is used for acquiring a text to be processed and a label of the text to be processed;
the second feature extraction module is used for respectively extracting features of the first part and the second part of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector;
a third determining module, configured to determine a first tag corresponding to the first word vector and a second tag corresponding to the second word vector;
and the classification module is used for classifying the texts to be processed according to the first label and the second label to obtain a classification result of the texts to be processed.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated. For technical details not disclosed in the embodiments of the apparatus, reference is made to the description of the embodiments of the method of the present application for understanding.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, the method as illustrated in fig. 4.
In some embodiments, the storage medium may be a Ferroelectric Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read Only Memory (CD-ROM); or may be any device including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). By way of example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (11)

1. A method of text processing, comprising:
acquiring a text to be processed and a label of the text to be processed;
according to the label, respectively performing feature extraction on the first part of the text to be processed and the second part of the text to be processed by adopting a text feature extraction network model to correspondingly obtain a first word vector and a second word vector;
the text feature extraction network model is obtained by training through the following steps: respectively inputting a first part of sample data and a second part of the sample data into a BERT network model, and correspondingly obtaining a third word vector and a fourth word vector; inputting the third word vector and the fourth word vector into a preset loss model, and determining the distance between the third word vector and the fourth word vector through a loss function in the preset loss model; determining the matching degree between the first part of the sample data and the second part of the sample data according to the distance between the third word vector and the fourth word vector to obtain a loss result; according to the loss result, correcting the BERT network model to obtain the text feature extraction network model;
determining a distance between the first word vector and the second word vector;
and determining the matching degree between the first part of the text to be processed and the second part of the text to be processed according to the distance between the first word vector and the second word vector.
2. The method according to claim 1, wherein the first part of the text to be processed is a title of the text to be processed, and the second part of the text to be processed is the other part of the text to be processed except the title;
correspondingly, respectively extracting the characteristics of the title part and other parts of the text to be processed according to the labels to correspondingly obtain a first word vector and a second word vector;
determining a distance between the first word vector and the second word vector;
and determining the matching degree between the title part and the other parts according to the distance.
3. The method of claim 1, wherein the BERT network model comprises a first BERT network model and a second BERT network model; the inputting the first part of the sample data and the second part of the sample data into the BERT network model respectively and correspondingly obtaining a third word vector and a fourth word vector comprises:
correspondingly inputting a first part of the sample data and a second part of the sample data into the first BERT network model and the second BERT network model respectively;
extracting features of the first part of the sample data through the first BERT network model to obtain the third word vector;
and performing feature extraction on the second part of the sample data through the second BERT network model to obtain the fourth word vector.
4. The method according to claim 1, wherein the modifying the BERT network model according to the loss result to obtain the text feature extraction network model comprises:
when the loss result shows that the distance between the third word vector and the fourth word vector is greater than a threshold value, determining whether a processing result given by a text processing model corresponding to the text feature extraction network model is correct;
if the processing result is correct, the BERT network model is not corrected, or the BERT network model is finely adjusted according to the distance;
and if the processing result is incorrect, correcting the BERT network model according to the distance.
5. The method according to claim 1, wherein the modifying the BERT network model according to the loss result to obtain the text feature extraction network model comprises:
when the loss result shows that the distance between the third word vector and the fourth word vector is smaller than a threshold value, determining whether a processing result given by a text processing model corresponding to the text feature extraction network model is correct;
if the processing result is correct, the BERT network model is not corrected, or the BERT network model is finely adjusted according to the distance;
and if the processing result is incorrect, correcting the BERT network model according to the distance.
6. The method of claim 1, wherein the sample data comprises positive and negative sample data; the method further comprises the following steps:
determining the matching degree corresponding to the positive sample data as a positive sample matching degree; determining the matching degree corresponding to the negative sample data as a negative sample matching degree;
and according to the loss result, correcting the BERT network model to obtain the text feature extraction network model, wherein the method comprises the following steps:
and when the matching degree of the negative sample is greater than a threshold value or the matching degree of the positive sample is smaller than the threshold value, correcting the BERT network model according to the matching degree of the positive sample and the matching degree of the negative sample.
7. The method of claim 6, further comprising:
determining a first information pair formed by the title information and the text information with the same text type label as the positive sample data;
and determining a second information pair formed by the title information and the text information with different text type labels as the negative sample data.
8. A method of text processing, comprising:
acquiring a text to be processed and a label of the text to be processed;
according to the label, respectively performing feature extraction on the first part of the text to be processed and the second part of the text to be processed by adopting a text feature extraction network model to correspondingly obtain a first word vector and a second word vector;
the text feature extraction network model is obtained by training through the following steps: respectively inputting a first part of sample data and a second part of the sample data into a BERT network model, and correspondingly obtaining a third word vector and a fourth word vector; inputting the third word vector and the fourth word vector into a preset loss model, and determining the distance between the third word vector and the fourth word vector through a loss function in the preset loss model; determining the matching degree between the first part of the sample data and the second part of the sample data according to the distance between the third word vector and the fourth word vector to obtain a loss result; according to the loss result, correcting the BERT network model to obtain the text feature extraction network model;
determining a first label corresponding to the first word vector and a second label corresponding to the second word vector;
and classifying the texts to be processed according to the first label and the second label to obtain a classification result of the texts to be processed.
9. A text processing apparatus, comprising:
the first acquisition module is used for acquiring a text to be processed and a label of the text to be processed;
the first feature extraction module is used for respectively extracting features of the first part of the text to be processed and the second part of the text to be processed by adopting a text feature extraction network model according to the label to correspondingly obtain a first word vector and a second word vector; the text feature extraction network model is obtained by training through the following steps: respectively inputting a first part of sample data and a second part of the sample data into a BERT network model, and correspondingly obtaining a third word vector and a fourth word vector; inputting the third word vector and the fourth word vector into a preset loss model, and determining the distance between the third word vector and the fourth word vector through a loss function in the preset loss model; determining the matching degree between the first part of the sample data and the second part of the sample data according to the distance between the third word vector and the fourth word vector to obtain a loss result; according to the loss result, correcting the BERT network model to obtain the text feature extraction network model;
a first determining module for determining a distance between the first word vector and the second word vector;
and the second determining module is used for determining the matching degree between the first part of the text to be processed and the second part of the text to be processed according to the distance between the first word vector and the second word vector.
10. A text processing apparatus, comprising:
the second acquisition module is used for acquiring a text to be processed and a label of the text to be processed;
the second feature extraction module is used for respectively extracting features of the first part of the text to be processed and the second part of the text to be processed by adopting a text feature extraction network model according to the label to correspondingly obtain a first word vector and a second word vector; the text feature extraction network model is obtained by training through the following steps: respectively inputting a first part of sample data and a second part of the sample data into a BERT network model, and correspondingly obtaining a third word vector and a fourth word vector; inputting the third word vector and the fourth word vector into a preset loss model, and determining the distance between the third word vector and the fourth word vector through a loss function in the preset loss model; determining the matching degree between the first part of the sample data and the second part of the sample data according to the distance between the third word vector and the fourth word vector to obtain a loss result; according to the loss result, correcting the BERT network model to obtain the text feature extraction network model;
a third determining module, configured to determine a first tag corresponding to the first word vector and a second tag corresponding to the second word vector;
and the classification module is used for classifying the texts to be processed according to the first label and the second label to obtain a classification result of the texts to be processed.
11. A storage medium having stored thereon executable instructions for causing a processor to perform the method of any one of claims 1 to 7 or 8 when executed.
CN201911032610.9A 2019-10-28 2019-10-28 Text processing method, device and storage medium Active CN110750987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911032610.9A CN110750987B (en) 2019-10-28 2019-10-28 Text processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110750987A CN110750987A (en) 2020-02-04
CN110750987B (en) 2021-02-05




