CN112598039A - Method for acquiring positive sample in NLP classification field and related equipment - Google Patents

Method for acquiring positive sample in NLP classification field and related equipment

Info

Publication number
CN112598039A
CN112598039A
Authority
CN
China
Prior art keywords
text
vector
proprietary
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011480250.1A
Other languages
Chinese (zh)
Other versions
CN112598039B (en)
Inventor
魏万顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heliang Technology Shanghai Co ltd
Shenzhen Lian Intellectual Property Service Center
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011480250.1A priority Critical patent/CN112598039B/en
Publication of CN112598039A publication Critical patent/CN112598039A/en
Application granted granted Critical
Publication of CN112598039B publication Critical patent/CN112598039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/24 Classification techniques
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23213 Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/30 Semantic analysis
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application belong to the technical field of artificial intelligence and relate to a method for obtaining positive samples in the NLP classification field and related equipment. The method comprises the following steps: acquiring a public data pre-training model and a proprietary data pre-training model; splicing the coding layers of the public data pre-training model and the proprietary data pre-training model to obtain a vector coding model; acquiring the texts to be identified in the seed sample set and the proprietary text data, inputting them into the vector coding model for encoding, determining seed vectors and proprietary text vectors, and constructing an index for the proprietary text vectors; and performing similar-vector search in the proprietary data set based on the seed vectors and acquiring the corresponding proprietary texts through the vector index so as to update the seed sample set until the expected number of positive samples is obtained. In addition, the present application also relates to blockchain technology, in which the positive samples can be stored. Positive samples that cannot be matched by various prior-art schemes can be screened out, and the resulting model has a high recall rate.

Description

Method for acquiring positive sample in NLP classification field and related equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method for obtaining positive samples in the NLP classification field and related equipment.
Background
In recent years, with the development of artificial intelligence technology, Natural Language Processing (NLP) has stood out among the many fields of artificial intelligence and become an important direction. NLP has many advantages over traditional template-based language generation techniques: it minimizes human involvement and can automatically learn input-to-output mappings from data. In the NLP classification data labeling process, both positive samples and negative samples must be labeled. In general, the number of negative samples (texts without the business attribute) is far larger than the number of positive samples (texts with the business attribute), so this sample distribution means that a great deal of time is wasted labeling negative samples.
The main current solutions improve labeling efficiency by increasing the density of positive samples, and there are three main technical schemes: rule-based screening, dictionary-based keyword filtering, and full-text retrieval based on BM25. However, for samples whose text content exceeds the rule range but whose semantics are positive, these schemes cannot detect such positive samples, and a model built on such data has a low recall rate.
Disclosure of Invention
The embodiments of the application aim to provide a method, a device, computer equipment and a storage medium for obtaining positive samples in the NLP classification field, so as to solve the technical problem that positive samples cannot be detected when a sample's semantics are positive but its text content exceeds the rule range.
In order to solve the above technical problem, an embodiment of the present application provides a method for obtaining a positive sample in an NLP classification field, which adopts the following technical scheme:
a method for obtaining positive samples in the NLP classification field comprises the following steps:
acquiring a public data pre-training model and a proprietary data pre-training model;
splicing the coding layers of the public data pre-training model and the proprietary data pre-training model to obtain a vector coding model;
acquiring texts to be identified in a seed sample set and a proprietary data set, encoding the seed sample and the proprietary text data, determining a seed vector and a proprietary text vector, and constructing an index for the proprietary text vector, wherein the seed sample set is composed of positive samples;
and performing similar vector search in a proprietary data set based on the seed vector, and acquiring a corresponding proprietary text through the vector index so as to update the seed sample set and obtain the expected number of positive samples.
Further, the steps of obtaining the text to be identified in the seed sample and the proprietary text data, encoding the seed sample and the proprietary text data, determining the seed vector and the proprietary text vector, and constructing an index for the proprietary text vector specifically include:
acquiring a text to be identified in a seed sample and the special text data, inputting the text to be identified into a vector coding model for coding, and acquiring a seed vector and a special text vector;
and establishing a vector index for the private text vector, and storing the corresponding relation between the private text vector and the private text.
Further, the step of performing similar vector search in a proprietary dataset based on the seed vector, and obtaining a corresponding proprietary text through the vector index to update the seed sample set to obtain a desired number of positive samples specifically includes:
step A: in the proprietary data set, searching the proprietary texts corresponding to the similar vectors by using the seed vectors, and marking the searched proprietary texts as positive samples;
and B: merging the marked positive samples into the seed sample set to serve as a new seed sample set;
and repeating the step A to the step B until the number of positive samples in the new seed sample set reaches a preset number.
Further, the step of searching for the proprietary text corresponding to the similar vector by using the seed vector in the proprietary dataset specifically includes:
sequentially comparing the distance between the target vector and each clustering center in the proprietary data set, and selecting a plurality of clustering centers which are closest to the target vector;
acquiring all vectors in a cluster corresponding to the cluster center, sequentially calculating the distance between each vector and a target vector, and selecting a plurality of similar vectors with the closest distances;
and determining the proprietary text corresponding to the similar vector through the corresponding relation between the proprietary text vector and the proprietary text.
Further, the steps of obtaining the text to be recognized in the seed sample and the proprietary text data, and encoding the seed sample and the proprietary text data, and obtaining the seed vector and the proprietary text vector specifically include:
acquiring a text to be recognized in a seed sample and proprietary text data, and determining a plurality of coding types corresponding to the text to be recognized;
identifying characters in a text to be identified, and determining a language used by the text to be identified;
and determining the encoding type corresponding to the text to be recognized according to the preset corresponding relation between various languages and the encoding type.
Further, the step of obtaining the public data pre-training model and the proprietary data pre-training model specifically includes:
pre-training the pre-training model by adopting an open data set to obtain an open data pre-training model;
and extracting a special data set under a special scene from a preset database, and pre-training the pre-training model to obtain the special data pre-training model.
Further, the pre-training of the pre-training model by using the public data set to obtain the public data pre-training model or the pre-training of the pre-training model by using the proprietary data set to obtain the proprietary data pre-training model specifically includes:
acquiring an initial training model, an initial denoising self-coding model and an initial sequence-to-sequence model, wherein the initial denoising self-coding model and the initial sequence-to-sequence model are respectively connected with the output end of the initial training model;
acquiring a public data set and a proprietary data set as training sample sets, wherein the training samples comprise sample data, the masked words in the original text and the original text phonological information;
inputting the sample data in the public data set or the proprietary data set into the initial training model, predicting the randomly modified words in the input text through the initial denoising self-coding model, and predicting output text data containing the information of the input text through the initial sequence-to-sequence model;
and taking the masked words in the original text as the expected output of the initial denoising self-coding model, taking the original text phonological information as the expected output of the initial sequence-to-sequence model, respectively calculating the loss values of the initial denoising self-coding model and the initial sequence-to-sequence model, and carrying out weighted averaging until the weighted-averaged value meets the preset convergence condition, so as to obtain the trained public data pre-training model or proprietary data pre-training model.
In order to solve the above technical problem, an embodiment of the present application further provides an apparatus for obtaining a positive sample in the NLP classification field, which adopts the following technical scheme:
an apparatus for obtaining positive samples in NLP classification field, comprising:
the acquisition module is used for acquiring a public data pre-training model and a special data pre-training model;
the splicing module is used for connecting the coding layers of the public data pre-training model and the special data pre-training model to obtain a vector coding model;
the construction module is used for acquiring texts to be identified in the seed sample set and the special data set, inputting the texts to be identified into the vector coding model for coding, determining a seed vector and a special text vector, and constructing an index for the special text vector, wherein the seed sample set is composed of positive samples;
and the searching module is used for searching similar vectors in a proprietary data set based on the seed vectors, and acquiring corresponding proprietary texts through the vector indexes so as to update the seed sample set and obtain positive samples with expected quantity.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising at least one memory in which computer-readable instructions are stored and at least one processor which, when executing the computer-readable instructions, implements the steps of the method for obtaining positive samples in the NLP classification field as described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the method of obtaining NLP classification domain positive samples as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
A public data pre-training model and a proprietary data pre-training model are acquired and spliced to obtain a vector coding model; the seed samples and the proprietary text data are encoded to obtain seed vectors and proprietary text vectors, an index is constructed for the text vectors, and vector search is performed to obtain the expected number of positive samples. Positive samples that cannot be matched by the various prior-art schemes can thus be screened out, the situation in which positive samples whose text data exceeds the rule range but whose semantics are positive go undetected is avoided, and the model has a higher recall rate.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flow diagram of one embodiment of a method of obtaining NLP classification domain positive samples according to the present application;
FIG. 2 is a schematic structural diagram of an embodiment of an apparatus for obtaining NLP classification domain positive samples according to the present application;
FIG. 3 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: 2. acquiring a positive sample device in the NLP classification field; 201. an acquisition module; 202. a splicing module; 203. building a module; 204. a search module; 3. a computer device; 301. a memory; 302. a processor; 303. a network interface.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The system architecture may include a terminal device, a network, and a server. The network serves as a medium for providing a communication link between the terminal device and the server. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use a terminal device to interact with a server over a network to receive or send messages, etc. The terminal device can be provided with various communication client applications, such as a web browser application, a shopping application, a searching application, an instant messaging tool, a mailbox client, social platform software and the like.
The terminal device may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), a laptop portable computer, a desktop computer, and the like.
The server may be a server providing various services, such as a background server providing support for pages displayed on the terminal device.
It should be noted that, the method for obtaining the positive sample in the NLP classification field provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the apparatus for obtaining the positive sample in the NLP classification field is generally disposed in the server/terminal device. It should be understood that there may be any number of end devices, networks, and servers, as desired for an implementation.
Referring to fig. 1, a flow diagram of one embodiment of a method of obtaining NLP classification domain positive samples according to the present application is shown. The method for acquiring the positive sample in the NLP classification field comprises the following steps:
and step S1, acquiring a public data pre-training model and a special data pre-training model.
Specifically, a pre-training model with a vector size as large as possible is selected according to the actual machine performance. The model can be chosen from the transformer-base series; both the model capability and the upper limit of the machine capability need to be considered, and an overly large model, such as T5 or GPT-2, is not recommended.
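As an illustration only (the patent does not name a toolkit), the following sketch assumes the Hugging Face transformers library and a hypothetical transformer-base checkpoint to show what selecting and loading such a pre-training model can look like in practice:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-chinese"  # hypothetical transformer-base choice, not named in the patent
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

inputs = tokenizer("待识别文本", return_tensors="pt")  # an example text to be encoded
outputs = encoder(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```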
In some optional implementations, the step S1 specifically includes:
pre-training the pre-training model by adopting an open data set to obtain an open data pre-training model;
and pre-training the pre-training model by adopting a special data set to obtain the special data pre-training model.
The proprietary data pre-training model is obtained by pre-training with the proprietary data set using a Masked Language Model (MLM) objective. The proprietary data set is a proprietary text data set extracted from a preset database under a proprietary scenario, such as a text data set in a company's internal database. The text data must satisfy the following two points: the original form of the text data is kept, i.e. the text data must be consistent with the text data in the production environment and must not be manually pre-processed in any way; and the randomness of the text data is kept, i.e. the text data must be the result of random sampling from the real environment. The public data set can be a text data set collected from public sources, such as text data in databases like microblogs. Both the public and proprietary data sets consist of positive and negative samples. In NLP text classification, a positive sample can be a specific kind of text, e.g. abusive text, while other non-abusive text belongs to the negative samples; a positive sample can also be text containing other sensitive words.
Given a sentence or paragraph as input, each word in the input sequence is first converted into a corresponding word vector, and the position vector of each word is added to reflect its position in the sequence. The word vectors are input into a multi-layer Transformer network, which learns the relations between words and encodes the contextual information of each word; through the non-linear transformations of the feed-forward network it outputs, for each word, a vector representation that integrates contextual features. Each layer of the Transformer network mainly comprises a multi-head self-attention layer and a feed-forward network layer; the multi-head self-attention layer can execute several self-attention computations with different parameters in parallel and splices their results as the input of the subsequent network. The representation of each word containing the current context information is thus obtained and input to the feed-forward network layer to compute non-linear, higher-level features.
In some embodiments of the present application, the step of pre-training the pre-training model with the public data set to obtain the public data pre-training model, or pre-training the pre-training model with the proprietary data set to obtain the proprietary data pre-training model, includes:
and acquiring an initial training model and a training sample set for training the initial training model, and pre-training the initial training model to acquire a pre-training model.
The method comprises the following steps of obtaining an initial training model and a training sample set for training the initial training model, and pre-training the initial training model, wherein the step of obtaining the pre-training model specifically comprises the following steps:
acquiring an initial training model, an initial denoising self-coding model and an initial sequence-to-sequence model, wherein the initial denoising self-coding model and the initial sequence-to-sequence model are respectively connected with the output end of the initial training model;
acquiring a public data set and a proprietary data set as training sample sets, wherein the training samples comprise sample data, the masked words in the original text and the original text phonological information;
inputting the sample data in the public data set or the proprietary data set into the initial training model, predicting the randomly modified words in the input text through the initial denoising self-coding model, and predicting output text data containing the information of the input text through the initial sequence-to-sequence model;
and taking the masked words in the original text as the expected output of the initial denoising self-coding model, taking the original text phonological information as the expected output of the initial sequence-to-sequence model, respectively calculating the loss values of the initial denoising self-coding model and the initial sequence-to-sequence model, and carrying out weighted averaging until the weighted-averaged value meets the preset convergence condition, so as to obtain the trained public data pre-training model or proprietary data pre-training model.
The initial training model is used for determining the association between the words contained in the text data input to it, and its output end is respectively connected with the initial denoising self-coding model and the initial sequence-to-sequence model. The initial training model is a neural network language model constructed according to a neural network algorithm. In some alternatives, the initial training model includes a character encoder and a language model constructed based on the BERT mechanism.
The character encoder is used for converting each word in the text data input to the initial training model into a corresponding word vector, and adding the word vector, the sentence vector in which the word is located and the position vector of the word to obtain the input vector of the language model constructed based on the BERT mechanism. That language model adopts the Transformer Encoder as its main model structure. The Transformer Encoder passes the input vector through a multi-head self-attention layer to obtain a vector matrix; the vector matrix is multiplied by a coefficient matrix and compressed to obtain a first feature matrix, the feature matrix and the input vector are sequentially subjected to residual connection and normalization to obtain a second feature matrix, the second feature matrix is input to a fully-connected feed-forward neural network, and residual connection and normalization are performed again in sequence to obtain a pre-trained semantic vector.
Passing the input vector through the multi-head self-attention layer to obtain the vector matrix may specifically include: the input vector passes through several self-attention heads, and in each head a linear transformation is applied to the input vector to obtain a query vector, a key vector and a value vector, wherein the linear transformation comprises multiplying the input vector by a first weight matrix to obtain the query vector, by a second weight matrix to obtain the key vector, and by a third weight matrix to obtain the value vector. The attention weights of the other words with respect to the word to be encoded are obtained from the query vector of the word to be encoded and the key vectors of the other words; the values obtained by multiplying each attention weight by the corresponding value vector are accumulated to obtain the self-attention output of each word, and the self-attention outputs of all heads are spliced to obtain the vector matrix of the multi-head attention layer.
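The following is a minimal NumPy sketch (an assumption; the patent names no implementation) of the computation just described: the input is multiplied by the first, second and third weight matrices to obtain query, key and value vectors, attention weights are computed, the weighted values are accumulated, and the outputs of several heads are spliced. All dimensions are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q = X @ Wq  # query vectors (input times the first weight matrix)
    K = X @ Wk  # key vectors (second weight matrix)
    V = X @ Wv  # value vectors (third weight matrix)
    scores = Q @ K.T / np.sqrt(K.shape[-1])                            # attention scores
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # softmax over the other words
    return weights @ V                                                 # accumulated, weighted value vectors

seq_len, d_model, d_head = 5, 16, 8            # illustrative dimensions
X = np.random.randn(seq_len, d_model)          # one input vector per word

heads = []
for _ in range(2):  # two heads with different parameters (executed sequentially here for clarity)
    Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(self_attention(X, Wq, Wk, Wv))
vector_matrix = np.concatenate(heads, axis=-1)  # splice the self-attention outputs of all heads
print(vector_matrix.shape)                      # (5, 16)
```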
The method comprises the steps of obtaining a training sample set for training an initial training model, namely a training sample set of a public data set and a training sample set of a private data set, wherein training samples in the training sample set comprise sample data, first information and second information.
Pre-training the initial training model, the initial denoising self-coding model and the initial sequence-to-sequence model until the initial denoising self-coding model and the initial sequence-to-sequence model meet preset conditions, such as loss values meet the preset conditions or the maximum iteration times is reached; for example, loss values of the initial denoising self-coding model and the initial sequence to sequence model are respectively calculated, then weighted averaging is performed, and if the weighted averaged value meets a preset convergence condition, a trained pre-training model is obtained.
The sample data of a training sample comprises the masked original text and the original text phonological information. The phonological information can comprise pinyin information and tone information, or only pinyin information; for example, for "I want to go to a Chinese class" (我要上语文课) the phonological information is "wo3 yao4 shang4 yu3 wen2 ke4", where the letters represent the pinyin of the original text and the numbers represent the tone information. The first information of the training sample is the masked word in the original text, and the second information of the training sample is the original text phonological information; the original text phonological information is thus not only sample data but also the second information. The denoising self-coding model is mainly used for predicting the words that were randomly modified in the input text, such as replaced, masked or deleted words, and the sequence-to-sequence model is used for predicting, from the input text data, output text data containing the information of the input text data. For example, if the sample data is "I want to go to a # class" together with the pinyin "wo yao shang yu wen ke", and the corresponding original text is "I want to go to a Chinese class", then the masked word is "Chinese"; accordingly, the first information is "Chinese" and the second information, i.e. the original text phonological information, is "woyaoshangyuwenke". The sample data is taken as the input, the first information such as "Chinese" is taken as the expected output of the initial denoising self-coding model connected with the output end of the initial training model, and the second information such as "woyaoshangyuwenke" is taken as the expected output of the initial sequence-to-sequence model connected with the output end of the initial training model. The loss values of the initial denoising self-coding model and the initial sequence-to-sequence model are respectively calculated and weighted averaging is performed; when the weighted-averaged value meets the preset convergence condition, i.e. when the denoising self-coding model and the sequence-to-sequence model converge, the trained public data pre-training model or proprietary data pre-training model is obtained.
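A minimal sketch of the joint objective described above, assuming a PyTorch-style setup; the weighting coefficients, the shapes and all names are illustrative assumptions rather than values from the patent.

```python
import torch

def combined_loss(dae_logits, dae_targets, seq_logits, seq_targets, w_dae=0.5, w_seq=0.5):
    ce = torch.nn.CrossEntropyLoss()
    loss_dae = ce(dae_logits, dae_targets)  # denoising self-coding loss: predict the masked words
    loss_seq = ce(seq_logits, seq_targets)  # sequence-to-sequence loss: predict the phonological output
    return (w_dae * loss_dae + w_seq * loss_seq) / (w_dae + w_seq)  # weighted average of the two losses

# dummy shapes: 4 masked positions over a 100-word vocabulary, 6 pinyin tokens over 50 symbols
dae_logits, dae_targets = torch.randn(4, 100), torch.randint(0, 100, (4,))
seq_logits, seq_targets = torch.randn(6, 50), torch.randint(0, 50, (6,))
loss = combined_loss(dae_logits, dae_targets, seq_logits, seq_targets)
# training would continue until this weighted-averaged value meets the preset convergence condition
print(loss.item())
```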
And step S2, splicing the coding layers of the public data pre-training model and the proprietary data pre-training model to obtain the vector coding model.
In the embodiment of the application, the public pre-training model and the proprietary data pre-training model are loaded simultaneously, that is, the weights of the coding layers (encoders) of both models are loaded at the same time, and the two models are spliced; the spliced model is the vector coding model, so its output end can simultaneously output the data produced by the coding layers of the public and proprietary pre-training models. Encoding is realized through the coding layers of an auto-encoder; the NLP pre-training models in the transformer-base family can be regarded as auto-encoders, and the first 6-12 layers of such a pre-training model are coding layers, the exact number depending on the specific model. After the weights of the coding layers of the public and proprietary pre-training models are loaded simultaneously, the outputs of the two coding layers are spliced and the result serves as the vector coding model, which can be used as a text encoder; the input text can thus be encoded by the vector coding model.
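A minimal sketch of the splicing described above, assuming the Hugging Face transformers library, hypothetical local checkpoint paths, and mean pooling over the coding-layer outputs (the pooling choice is an assumption, not specified by the patent):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")                # hypothetical tokenizer
public_encoder = AutoModel.from_pretrained("./public_pretrained")             # hypothetical checkpoint path
proprietary_encoder = AutoModel.from_pretrained("./proprietary_pretrained")   # hypothetical checkpoint path

def encode(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        pub = public_encoder(**inputs).last_hidden_state.mean(dim=1)        # public coding-layer output
        pri = proprietary_encoder(**inputs).last_hidden_state.mean(dim=1)   # proprietary coding-layer output
    return torch.cat([pub, pri], dim=-1)  # spliced outputs serve as the text vector

print(encode("待识别文本").shape)  # (1, 2 * hidden size)
```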
Step S3, obtaining the text to be identified in the seed sample set and the proprietary data set, coding the seed sample and the proprietary text data, determining the seed vector and the proprietary text vector, and constructing an index for the proprietary text vector, wherein the seed sample set is composed of positive samples.
In the embodiment of the present application, the seed sample set is composed of positive samples, that is, the seed sample set only contains positive samples, and the seed sample set is composed of positive samples that are as comprehensive as possible from scenes collected manually, in the NLP text classification, the positive samples may be specific texts, such as abuse texts, and other non-abuse texts belong to the negative samples, and the positive samples may also be texts of other sensitive words.
The method comprises the steps of obtaining texts to be recognized in a seed sample set and a special data set, inputting the texts to be recognized into a vector coding model for coding, obtaining corresponding text vectors after the texts to be recognized are coded, namely coding the seed samples in the seed sample set and the special text data in the special data set, obtaining the seed vectors and the special text vectors, and then constructing a vector index for the special text vectors.
The step S3 specifically includes:
acquiring a text to be identified in a seed sample and the special text data, inputting the text to be identified into a vector coding model for coding, and acquiring a seed vector and a special text vector;
and establishing a vector index for the private text vector, and storing the corresponding relation between the private text vector and the private text.
In the embodiment of the application, the text to be recognized in the seed sample and the proprietary text data is obtained, the text to be recognized is input into the vector coding model for coding, and is output after being coded by the coding layer of the vector coding model, so that the corresponding seed vector and the proprietary text vector are obtained.
A vector index is constructed for the proprietary text vectors, and the correspondence between each vector and its text is kept. The vector index is constructed by a clustering method that partitions the vectors in the vector set: the proprietary text vector set can be divided into several clusters by k-means or another clustering method, satisfying the requirement that the vector similarity within the same cluster is high and the vector similarity between different clusters is low. The coordinates of the center point of each cluster are recorded, and the clustering result is used as the basis for establishing the vector index.
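A minimal sketch of the clustering-based index described above, assuming scikit-learn k-means; the cluster count, vector dimensionality and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(text_vectors, texts, n_clusters=20):
    vectors = np.asarray(text_vectors)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    return {
        "centers": kmeans.cluster_centers_,  # center point of every cluster
        "labels": kmeans.labels_,            # cluster id of every proprietary text vector
        "vectors": vectors,
        "texts": list(texts),                # keeps the vector-to-text correspondence
    }

vectors = np.random.randn(1000, 64)                      # stand-in proprietary text vectors
texts = [f"proprietary_text_{i}" for i in range(1000)]   # stand-in proprietary texts
index = build_index(vectors, texts)
```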
The steps of obtaining the text to be identified in the seed sample and the proprietary text data, encoding the seed sample and the proprietary text data, and obtaining the seed vector and the proprietary text vector specifically include:
acquiring a text to be recognized in a seed sample and proprietary text data, and determining a plurality of coding types corresponding to the text to be recognized;
identifying characters in a text to be identified to obtain a language used by the text to be identified;
and determining the encoding type corresponding to the text to be recognized according to the preset corresponding relation between various languages and the encoding type.
In the embodiment of the application, the text to be identified in the seed sample and the proprietary text data is obtained, and a plurality of coding types corresponding to the text to be identified are determined; the text to be recognized refers to a text composed of characters of any language and any coding type, and before the text is recognized, the adopted coding of the text cannot be known; after the text to be recognized is obtained, the language used by the text can be known by recognizing characters in the text; and determining the encoding type corresponding to the text to be recognized according to the preset corresponding relation between various languages and encoding types. If it is recognized that the text to be recognized includes 3 languages, 3 kinds of encoding can be determined.
Extracting character strings from a text to be recognized, respectively coding the character strings according to a plurality of coding types, and generating a vector coding result corresponding to each coding type; the method for extracting the character strings from the text to be recognized can adopt various modes, including but not limited to extracting a preset number of character strings, extracting a preset percentage of character strings, extracting an input number of character strings in a system interface or extracting an input percentage of character strings in the system interface. In a specific extraction method, a sequential extraction method, a reverse extraction method, a random extraction method, or the like may be selected. For example, if the text to be recognized includes 3 languages, the character strings of each language can be proportionally extracted according to the proportion of each language character to the total number of characters of the text. Encoding refers to the process of converting information from one form or format to another. One text has one or more coding types which can correctly identify the text, and the correct coding result can be generated by coding the text to be identified through the correct coding types. The encoding result refers to the seed vector and the proprietary text vector.
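A minimal sketch of the language-to-encoding mapping described above; the correspondence table and the per-character detection rule are illustrative assumptions, not taken from the patent.

```python
LANGUAGE_TO_ENCODING = {"zh": "GB18030", "en": "ASCII"}  # preset language-to-encoding correspondence (assumed)

def detect_language(ch):
    # crude illustrative rule: CJK unified ideographs are treated as Chinese, everything else as English
    return "zh" if "\u4e00" <= ch <= "\u9fff" else "en"

def encoding_types(text):
    languages = {detect_language(ch) for ch in text}             # languages used by the text to be recognized
    return sorted(LANGUAGE_TO_ENCODING[lang] for lang in languages)

print(encoding_types("NLP正样本"))  # ['ASCII', 'GB18030']
```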
Step S4, similar vector search is carried out in a proprietary data set based on the seed vector, and a corresponding proprietary text is obtained through the vector index, so as to update the seed sample set and obtain positive samples with expected quantity.
In the embodiment of the application, similar-vector search is performed in the proprietary data set based on the seed vectors, the corresponding proprietary texts are obtained through the vector index, and these proprietary texts are added to the seed sample set as positive samples until the expected number of positive samples is obtained. In this embodiment the expectation is that the positive samples reach 10/100 of the negative samples. For example, suppose the proprietary data set has 10000 proprietary text data items and the first round of similar-vector search yields 300 positive samples against 9700 negative samples; the positive samples only reach about 3/100 of the negative samples, so the expected number has not been reached. After the vector coding model has been iterated several times, the similar-vector search in the proprietary data set yields 1000 positive samples against 9000 negative samples; the positive samples now reach about 11/100 of the negative samples, exceeding 10/100, and the expected number of positive samples is thus obtained.
According to the method for obtaining the positive samples in the NLP classification field, a public pre-training model and a special data pre-training model are obtained, and the public pre-training model and the special data pre-training model are spliced to obtain a vector coding model; encoding the seed samples and the special text data to obtain seed vectors and special text vectors, constructing indexes for the text vectors, and performing vector search to obtain an expected number of positive samples; the method can screen out various positive samples which cannot be matched in the prior art, avoids the situation that some positive samples with text data exceeding the rule range and positive semantics cannot be detected, and has higher recall rate.
The step S4 specifically includes:
step A: in the proprietary data set, searching the proprietary texts corresponding to the similar vectors by using the seed vectors, and marking the searched proprietary texts as positive samples;
and B: merging the marked positive samples into the seed sample set to serve as a new seed sample set;
and repeating the step A to the step B until the number of positive samples in the new seed sample set reaches a preset number.
In the embodiment of the application, vector search is performed with the seed vectors in the proprietary text vector index space, i.e. in the proprietary data set, to obtain the texts corresponding to the similar vectors, namely positive samples, and the found positive samples are labeled.
The step of searching for the proprietary text corresponding to the similar vector by the seed vector in the proprietary dataset comprises:
sequentially comparing the distance between the target vector and each clustering center in the proprietary data set, and selecting a plurality of clustering centers which are closest to the target vector;
acquiring all vectors in a cluster corresponding to the cluster center, sequentially calculating the distance between each vector and a target vector, and selecting a plurality of similar vectors with the closest distances;
and determining the proprietary text corresponding to the similar vector through the corresponding relation between the proprietary text vector and the proprietary text.
In the embodiment of the application, when the vector is searched, the distances between the target vector and each clustering center are sequentially compared, and a plurality of clustering centers closest to the target vector are selected. And then all vectors in the clusters corresponding to the cluster centers are obtained, the distance between each vector and the target vector is calculated in sequence, and a plurality of vectors with the closest distance are selected. The method divides the data set by adopting a clustering method, thereby eliminating vectors with low similarity with target vectors in the searching process.
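A minimal sketch of the two-stage search described above, reusing the index structure from the earlier indexing sketch; Euclidean distance and the probe/top-k counts are illustrative assumptions.

```python
import numpy as np

def search_similar(target, index, n_probe=3, top_k=5):
    # stage 1: compare the target vector with every cluster center and keep the closest clusters
    center_dist = np.linalg.norm(index["centers"] - target, axis=1)
    nearest_clusters = np.argsort(center_dist)[:n_probe]

    # stage 2: rank only the vectors belonging to those clusters and keep the closest ones
    candidates = np.where(np.isin(index["labels"], nearest_clusters))[0]
    dist = np.linalg.norm(index["vectors"][candidates] - target, axis=1)
    best = candidates[np.argsort(dist)[:top_k]]
    return [index["texts"][i] for i in best]  # proprietary texts of the similar vectors

# usage with the index built in the indexing sketch above:
# print(search_similar(vectors[0], index))
```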
The labeled positive samples are merged into the seed sample set as a new seed sample set, and vector search is then performed on the proprietary data set with the seed vectors in the new seed sample set to obtain the proprietary texts corresponding to the similar vectors; this continues until the number of positive samples in the new seed sample set reaches the preset number.
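A minimal sketch of this iterative expansion, reusing the encode and search_similar sketches above; the positive-to-negative ratio check follows the 10/100 example of this embodiment, and the round limit and counting convention are assumptions.

```python
def expand_seed_set(seed_texts, encode, index, target_ratio=0.10, max_rounds=20):
    # encode is assumed to return a vector with the same dimensionality as the indexed vectors
    seed_set = set(seed_texts)
    total = len(index["texts"])
    for _ in range(max_rounds):
        found = set()
        for text in list(seed_set):
            found.update(search_similar(encode(text), index))  # step A: search and label positives
        seed_set |= found                                       # step B: merge into a new seed sample set
        positives = len(seed_set)
        negatives = total - positives
        if negatives == 0 or positives / negatives >= target_ratio:
            break                                               # expected number of positive samples reached
    return seed_set
```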
It is emphasized that the positive samples may also be stored in a node of a blockchain in order to further ensure the privacy and security of the positive samples.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
According to the method for acquiring the positive samples in the NLP classification field, an open pre-training model and a special data pre-training model are acquired, and the open pre-training model and the special data pre-training model are spliced to obtain a vector coding model; encoding the seed samples and the special text data to obtain seed vectors and special text vectors, constructing indexes for the text vectors, and performing vector search to obtain an expected number of positive samples; the method can screen out various positive samples which cannot be matched in the prior art, avoids the situation that some positive samples with text data exceeding the rule range and positive semantics cannot be detected, and has higher recall rate.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, can include processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 2, as an implementation of the method shown in fig. 1, the present application provides an embodiment of an apparatus for obtaining NLP classification domain positive samples, which corresponds to the embodiment of the method shown in fig. 1, and which can be applied in various computer devices.
As shown in fig. 2, the apparatus 2 for acquiring NLP classification domain positive samples according to the present embodiment includes:
an acquisition module 201, a concatenation module 202, a construction module 203, and a search module 204.
The obtaining module 201 is configured to obtain a public data pre-training model and a private data pre-training model.
The obtaining module 201 includes a first obtaining sub-module and a second obtaining sub-module.
The first obtaining submodule is used for adopting an open data set to pre-train the pre-training model to obtain an open data pre-training model;
and the second acquisition submodule is used for pre-training the pre-training model by adopting a special data set to acquire the special data pre-training model.
The proprietary data pre-training model is obtained by pre-training with the proprietary data set using a Masked Language Model (MLM) objective; the proprietary data set is a proprietary text data set extracted from a preset database under a proprietary scenario, such as text data in a company's internal database.
The public data set can be a text data set which can be collected in public occasions, such as text data in databases like microblogs. Both the public and proprietary datasets consisted of positive and negative samples. In the NLP text classification, a positive sample can be specific text, e.g., abusive text, while other, non-abusive text belongs to a negative sample, or a positive sample can be other sensitive word text.
The splicing module is used for splicing the coding layers of the public data pre-training model and the proprietary data pre-training model to obtain a vector coding model.
In the embodiment of the application, the public pre-training model and the proprietary data pre-training model are loaded simultaneously, that is, the weights of the coding layers (encoders) of the public pre-training model and the proprietary data pre-training model are loaded simultaneously, the public data pre-training model and the proprietary data pre-training model are spliced, and the spliced model is a vector coding model, so that the output end of the vector coding model can simultaneously output data output by the coding layers of the public pre-training model and the proprietary data pre-training model.
The construction module 203 is configured to obtain a seed sample set and texts to be identified in a proprietary data set, encode the seed sample and the proprietary text data, determine a seed vector and a proprietary text vector, and construct an index for the proprietary text vector, where the seed sample set is composed of positive samples.
In the embodiment of the present application, the seed sample set is composed of positive samples, that is, the seed sample set only contains positive samples, and the seed sample set is composed of positive samples that are as comprehensive as possible from scenes collected manually, in the NLP text classification, the positive samples may be specific texts, such as abuse texts, and other non-abuse texts belong to the negative samples, and the positive samples may also be texts of other sensitive words.
The method comprises the steps of obtaining texts to be recognized in a seed sample set and a special data set, inputting the texts to be recognized into a vector coding model for coding, obtaining corresponding text vectors after the texts to be recognized are coded, namely coding the seed samples in the seed sample set and the special text data in the special data set, obtaining the seed vectors and the special text vectors, and then constructing a vector index for the special text vectors.
The building module 203 comprises an encoding module and a building module.
The coding module is used for acquiring a text to be identified in the seed sample and the proprietary text data, inputting the text to be identified into the vector coding model for coding, and acquiring a seed vector and a proprietary text vector;
in the embodiment of the application, the text to be recognized in the seed sample and the proprietary text data is obtained, the text to be recognized is input into the vector coding model for coding, and is output after being coded by the coding layer of the vector coding model, so that the corresponding seed vector and the proprietary text vector are obtained.
The establishing module is configured to establish a vector index for the proprietary text vectors, and store the corresponding relation between the proprietary text vectors and the proprietary texts.
In the embodiment of the application, a vector index is constructed for the exclusive text vector, and the corresponding relation between the vector and the text is reserved. The vector index is constructed by a clustering method, vectors in a vector set are divided, the proprietary text vector set can be divided into a plurality of clusters by k-means and other clustering methods, the requirement that the vector similarity in the same cluster is high and the vector similarity in different clusters is low is met, the coordinates of the center point of each cluster are recorded, and the clustering result is used as the basis for establishing the vector index.
The searching module 204 is configured to perform similar vector search in a proprietary data set based on the seed vector, and obtain a corresponding proprietary text through the vector index to update the seed sample set, so as to obtain a desired number of positive samples.
In the embodiment of the application, similar vector search is performed in a proprietary data set based on a seed vector, a corresponding proprietary text is obtained through vector indexing, the proprietary text is used as a positive sample and added to the seed sample set until an expected number of positive samples are obtained, and in the embodiment, the positive samples reach 10/100 of negative samples.
The search module 204 includes a labeling module, a merging module, and a repeating module.
The labeling module is used for searching the proprietary texts corresponding to the similar vectors by using the seed vectors in the proprietary data sets, and labeling the searched proprietary texts as positive samples;
the merging module is used for merging the marked positive samples into the seed sample set to serve as a new seed sample set;
the repeating module is used for repeating the steps of searching the proprietary texts corresponding to the similar vectors by using the seed vectors in the proprietary data sets, marking the searched proprietary texts as the positive samples, and combining the marked positive samples into the seed sample sets by using the combining module as a new seed sample set to obtain the expected number of positive samples.
In the embodiment of the application, the seed vector is used in a proprietary text vector index space, namely a proprietary data set, vector search is performed to obtain a text corresponding to the similar vector, namely a positive sample, and the searched positive sample is labeled. When searching the vector, firstly, the distances between the target vector and each clustering center are sequentially compared, and a plurality of clustering centers closest to the target vector are selected. And then all vectors in the clusters corresponding to the cluster centers are obtained, the distance between each vector and the target vector is calculated in sequence, and a plurality of vectors with the closest distance are selected. The method divides the data set by adopting a clustering method, thereby eliminating vectors with low similarity with target vectors in the searching process.
Merging the marked positive samples into the seed sample set to serve as a new seed sample set, and then carrying out vector search on the special data set by using the seed vectors in the new seed sample set to obtain the positive texts corresponding to the similar vectors until the number of the positive samples in the new seed sample set reaches a preset number.
According to the device for acquiring positive samples in the NLP classification field, a public data pre-training model and a proprietary data pre-training model are acquired and their coding layers are spliced to obtain a vector coding model; the seed samples and the proprietary text data are encoded to obtain seed vectors and proprietary text vectors, an index is constructed for the text vectors, and a vector search is performed to obtain the expected number of positive samples. The device can screen out positive samples that rule-based matching in the prior art cannot find, avoids missing positive samples whose text falls outside the rule range but whose semantics are positive, and therefore achieves a higher recall rate.
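For completeness, the spliced vector coding model mentioned in this summary can be sketched as a concatenation of the two encoders' outputs. The encoder objects below are placeholders for the public data pre-training model and the proprietary data pre-training model, each assumed to map a text to a fixed-size vector; reading "splicing" as concatenation is an assumption of this sketch.

import numpy as np

class VectorCodingModel:
    # Splices two pre-trained encoders by concatenating their coding-layer outputs.
    def __init__(self, public_encoder, proprietary_encoder):
        self.public_encoder = public_encoder
        self.proprietary_encoder = proprietary_encoder

    def encode(self, text):
        public_vec = np.asarray(self.public_encoder(text))
        proprietary_vec = np.asarray(self.proprietary_encoder(text))
        return np.concatenate([public_vec, proprietary_vec])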
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 3, fig. 3 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 3 comprises a memory 301, a processor 302 and a network interface 303 which are communicatively connected to each other through a system bus. The memory 301 stores computer readable instructions, and the processor 302 implements the steps of the method for obtaining positive samples in the NLP classification field when executing the computer readable instructions. It is noted that only the computer device 3 with components 301 to 303 is shown, but it should be understood that not all of the shown components are required, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a mobile phone, a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 301 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 3, such as a hard disk or a memory of the computer device 3. In other embodiments, the memory 301 may also be an external storage device of the computer device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 3. Of course, the memory 301 may also comprise both an internal storage unit of the computer device 3 and an external storage device thereof. In this embodiment, the memory 301 is generally used to store an operating system and various types of application software installed in the computer device 3, such as the readable instruction code of the method for obtaining positive samples in the NLP classification field. In addition, the memory 301 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 302 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip in some embodiments. The processor 302 is typically used to control the overall operation of the computer device 3. In this embodiment, the processor 302 is configured to execute the readable instruction code stored in the memory 301 or to process data, for example, to execute the readable instruction code of the method for obtaining positive samples in the NLP classification field.
The network interface 303 may comprise a wireless network interface or a wired network interface, and the network interface 303 is typically used for establishing a communication connection between the computer device 3 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer readable instructions for obtaining positive samples in the NLP classification field, the instructions being executable by at least one processor 302 to cause the at least one processor 302 to perform the steps of the method for obtaining positive samples in the NLP classification field as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments merely illustrate some, rather than all, of the embodiments of the present application, and that the appended drawings show preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their features with equivalents. Any equivalent structure made by using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, falls within the protection scope of the present application.

Claims (10)

1. A method for obtaining positive samples in the NLP classification field, characterized by comprising the following steps:
acquiring a public data pre-training model and a proprietary data pre-training model;
splicing coding layers of the public data pre-training model and the proprietary data pre-training model to obtain a vector coding model;
acquiring texts to be recognized in a seed sample set and a proprietary data set, inputting the texts to be recognized into the vector coding model for coding, determining a seed vector and a proprietary text vector, and constructing an index for the proprietary text vector, wherein the seed sample set is composed of positive samples;
and performing similar vector search in a proprietary data set based on the seed vector, and acquiring a corresponding proprietary text through the vector index so as to update the seed sample set and obtain the expected number of positive samples.
2. The method for obtaining the positive sample in the NLP classification field according to claim 1, wherein the steps of obtaining the text to be recognized in the seed sample and the proprietary text data, encoding the seed sample and the proprietary text data, determining the seed vector and the proprietary text vector, and constructing the index for the proprietary text vector specifically include:
acquiring the text to be recognized in the seed sample and the proprietary text data, inputting the text to be recognized into the vector coding model for coding, and acquiring a seed vector and a proprietary text vector;
and establishing a vector index for the proprietary text vector, and storing the correspondence between the proprietary text vector and the proprietary text.
3. The method according to claim 2, wherein the step of performing a similar vector search in a proprietary dataset based on the seed vector and obtaining a corresponding proprietary text through the vector index to update the seed sample set to obtain a desired number of positive samples specifically comprises:
step A: in the proprietary data set, searching the proprietary texts corresponding to the similar vectors by using the seed vectors, and marking the searched proprietary texts as positive samples;
Step B: merging the marked positive samples into the seed sample set to serve as a new seed sample set;
and repeating the steps A and B until the number of positive samples in the new seed sample set reaches a preset number.
4. The method according to claim 3, wherein the step of searching for the proprietary text corresponding to the similar vector with the seed vector in the proprietary data set specifically comprises:
sequentially comparing the distance between the target vector and each clustering center in the proprietary data set, and selecting a plurality of clustering centers which are closest to the target vector;
acquiring all vectors in a cluster corresponding to the cluster center, sequentially calculating the distance between each vector and a target vector, and selecting a plurality of similar vectors with the closest distances;
and determining the proprietary text corresponding to the similar vector through the corresponding relation between the proprietary text vector and the proprietary text.
5. The method for obtaining the positive sample in the NLP classification field according to claim 2, wherein the steps of obtaining the seed sample and the text to be recognized in the proprietary text data, encoding the seed sample and the proprietary text data, and obtaining the seed vector and the proprietary text vector specifically include:
acquiring a text to be recognized in a seed sample and proprietary text data, and determining a plurality of coding types corresponding to the text to be recognized;
identifying characters in a text to be identified, and determining a language used by the text to be identified;
and determining the encoding type corresponding to the text to be recognized according to the preset corresponding relation between various languages and the encoding type.
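As an illustration of the language-to-encoding-type lookup in claim 5, the short sketch below detects the language from the characters of the text and reads the encoding type from a preset correspondence table. Both the detection rules and the table contents are assumptions made for the example.

def detect_encoding_type(text, table=None):
    table = table or {"zh": "gb2312", "en": "ascii", "other": "utf-8"}  # preset language-to-encoding table
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):   # CJK unified ideographs present
        lang = "zh"
    elif all(ord(ch) < 128 for ch in text):
        lang = "en"
    else:
        lang = "other"
    return table[lang]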
6. The method for obtaining NLP classification domain positive samples according to claim 1, wherein the steps of obtaining the public data pre-training model and the proprietary data pre-training model specifically include:
pre-training the pre-training model with the public data set to obtain the public data pre-training model;
and extracting a proprietary data set under a specific scenario from a preset database, and pre-training the pre-training model with the proprietary data set to obtain the proprietary data pre-training model.
7. The method for obtaining positive samples in the NLP classification field according to claim 6, wherein the step of pre-training the pre-training model with the public data set to obtain the public data pre-training model, or pre-training the pre-training model with the proprietary data set to obtain the proprietary data pre-training model, specifically comprises:
acquiring an initial training model, an initial denoising self-coding model and an initial sequence-to-sequence model, wherein the initial denoising self-coding model and the initial sequence-to-sequence model are respectively connected with the output end of the initial training model;
acquiring a public data set and a proprietary data set as a training sample set, wherein the training sample set comprises sample data, shielded words in an original text, and original text phonological information;
inputting sample data in the public data set or the proprietary data set into the initial training model, predicting the randomly modified words in the input text through the initial denoising self-coding model, and predicting output text data containing the input text through the initial sequence-to-sequence model;
and taking the shielded words in the original text as the expected output of the initial denoising self-coding model, taking the original text phonological information as the expected output of the initial sequence-to-sequence model, respectively calculating the loss values of the initial denoising self-coding model and the initial sequence-to-sequence model, and performing weighted averaging until the weighted average value meets a preset convergence condition, so as to obtain the trained public data pre-training model or proprietary data pre-training model.
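The joint objective of claim 7, a denoising (masked-word) prediction and a sequence-to-sequence prediction whose loss values are weight-averaged until convergence, can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the shared encoder, the two linear heads, the layer sizes and the loss weights are placeholders chosen for the example, not the application's actual architecture.

import torch
import torch.nn as nn

class DualObjectivePretrainer(nn.Module):
    def __init__(self, vocab_size=30000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)    # shared initial training model
        self.denoise_head = nn.Linear(hidden, vocab_size)            # predicts the shielded words
        self.seq2seq_head = nn.Linear(hidden, vocab_size)            # predicts the sequence targets

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))
        return self.denoise_head(h), self.seq2seq_head(h)

def pretrain_step(model, optimizer, input_ids, masked_targets, seq_targets,
                  w_denoise=0.5, w_seq=0.5):
    # One training step: the weighted average of the two loss values is back-propagated.
    logits_denoise, logits_seq = model(input_ids)
    ce = nn.CrossEntropyLoss(ignore_index=-100)           # -100 marks positions without a label
    loss_denoise = ce(logits_denoise.transpose(1, 2), masked_targets)
    loss_seq = ce(logits_seq.transpose(1, 2), seq_targets)
    loss = w_denoise * loss_denoise + w_seq * loss_seq    # weighted averaging of the loss values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Training would repeat pretrain_step over the public data set or the proprietary data set until the weighted-average loss meets the chosen convergence condition.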
8. A device for obtaining positive samples in the NLP classification field, characterized by comprising:
the acquisition module is used for acquiring a public data pre-training model and a special data pre-training model;
the splicing module is used for splicing the coding layers of the public data pre-training model and the proprietary data pre-training model to obtain a vector coding model;
the construction module is used for acquiring the texts to be recognized in the seed sample and the proprietary text data, determining a seed vector and a proprietary text vector, and constructing an index for the proprietary text vector;
and the searching module is used for carrying out vector searching in the proprietary data set based on the seed vector, and acquiring the corresponding proprietary text through the vector index to obtain the positive samples with the expected quantity.
9. A computer device, comprising at least one memory in which computer readable instructions are stored and at least one processor, wherein the at least one processor, when executing the computer readable instructions, implements the steps of the method for obtaining positive samples in the NLP classification field according to any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the method of acquiring NLP classification domain positive samples as claimed in any one of claims 1 to 7.
CN202011480250.1A 2020-12-15 2020-12-15 Method for obtaining positive samples in NLP (natural language processing) classification field and related equipment Active CN112598039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011480250.1A CN112598039B (en) 2020-12-15 2020-12-15 Method for obtaining positive samples in NLP (natural language processing) classification field and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011480250.1A CN112598039B (en) 2020-12-15 2020-12-15 Method for obtaining positive samples in NLP (natural language processing) classification field and related equipment

Publications (2)

Publication Number Publication Date
CN112598039A true CN112598039A (en) 2021-04-02
CN112598039B CN112598039B (en) 2024-01-16

Family

ID=75195829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011480250.1A Active CN112598039B (en) 2020-12-15 2020-12-15 Method for obtaining positive samples in NLP (natural language processing) classification field and related equipment

Country Status (1)

Country Link
CN (1) CN112598039B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100325109A1 (en) * 2007-02-09 2010-12-23 Agency For Science, Technology And Rearch Keyword classification and determination in language modelling
CN107491518A (en) * 2017-08-15 2017-12-19 北京百度网讯科技有限公司 Method and apparatus, server, storage medium are recalled in one kind search
CN110941945A (en) * 2019-12-02 2020-03-31 百度在线网络技术(北京)有限公司 Language model pre-training method and device
CN111881936A (en) * 2020-06-19 2020-11-03 北京三快在线科技有限公司 Training sample screening method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238573A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Information pushing method and device based on text countermeasure sample
CN114238573B (en) * 2021-12-15 2023-09-22 平安科技(深圳)有限公司 Text countercheck sample-based information pushing method and device
CN114625340A (en) * 2022-05-11 2022-06-14 深圳市商用管理软件有限公司 Commercial software research and development method, device, equipment and medium based on demand analysis
CN114625340B (en) * 2022-05-11 2022-08-02 深圳市商用管理软件有限公司 Commercial software research and development method, device, equipment and medium based on demand analysis

Also Published As

Publication number Publication date
CN112598039B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN113127633B (en) Intelligent conference management method and device, computer equipment and storage medium
CN113220734A (en) Course recommendation method and device, computer equipment and storage medium
CN114357117A (en) Transaction information query method and device, computer equipment and storage medium
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN112598039B (en) Method for obtaining positive samples in NLP (non-linear liquid) classification field and related equipment
CN112328657A (en) Feature derivation method, feature derivation device, computer equipment and medium
CN113420212A (en) Deep feature learning-based recommendation method, device, equipment and storage medium
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN116303558A (en) Query statement generation method, data query method and generation model training method
CN115062134A (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN113157853B (en) Problem mining method, device, electronic equipment and storage medium
CN114358023A (en) Intelligent question-answer recall method and device, computer equipment and storage medium
CN111507108B (en) Alias generation method and device, electronic equipment and computer readable storage medium
CN116186295B (en) Attention-based knowledge graph link prediction method, attention-based knowledge graph link prediction device, attention-based knowledge graph link prediction equipment and attention-based knowledge graph link prediction medium
CN115168609A (en) Text matching method and device, computer equipment and storage medium
CN115238009A (en) Metadata management method, device and equipment based on blood vessel margin analysis and storage medium
CN115238077A (en) Text analysis method, device and equipment based on artificial intelligence and storage medium
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium
CN113420869A (en) Translation method based on omnidirectional attention and related equipment thereof
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN112396111A (en) Text intention classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231213

Address after: Room A-1591, Building 3, No. 888, Jianhai Road, Chenjia Town, Chongming District, Shanghai 200085 (Shanghai Smart Island Data Industry Park)

Applicant after: Heliang Technology (Shanghai) Co.,Ltd.

Address before: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen LIAN intellectual property service center

Effective date of registration: 20231213

Address after: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen LIAN intellectual property service center

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: PING AN PUHUI ENTERPRISE MANAGEMENT Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant