CN113486671B - Regular expression coding-based data expansion method, device, equipment and medium - Google Patents

Regular expression coding-based data expansion method, device, equipment and medium

Info

Publication number
CN113486671B
CN113486671B (application CN202110850687.8A)
Authority
CN
China
Prior art keywords
corpus
historical
regular expression
coding
expanded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110850687.8A
Other languages
Chinese (zh)
Other versions
CN113486671A (en)
Inventor
殷子墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110850687.8A
Publication of CN113486671A
Application granted
Publication of CN113486671B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a data expansion method, device, equipment and medium based on regular expression coding, relating to artificial intelligence technology. If the sample size of each corpus type in the historical corpus set is insufficient, an adversarial network model for expanding the corpus data volume can be trained on the historical corpus set, and the trained adversarial network model generates an expanded corpus subset corresponding to each historical corpus; together these form an expanded corpus set. In addition, when extracting the corpus input vector corresponding to each corpus in the expanded corpus set, not only is the regular expression coding result of the corpus extracted, but also its word embedding vector; combining the regular expression coding result and the word embedding vector into the corpus input vector preserves the manually edited expression information while retaining the capacity for data-driven learning. The problem of insufficient training data is thereby solved, and because the training data incorporates the regular coding information, the prediction accuracy of the trained model is improved.

Description

Regular expression coding-based data expansion method, device, equipment and medium
Technical Field
The invention relates to the technical field of intelligent decision making of artificial intelligence, in particular to a data expansion method, device, equipment and medium based on regular expression coding.
Background
A neural network-based text classification algorithm can achieve good results when the training data volume is sufficient. In practice, however, acquiring annotated data is often very difficult, and without labeled data, neural network-based algorithms struggle to achieve good performance.
Regular expressions are a method often used when training data are lacking, but they are inflexible: the expression patterns of real scenarios always exceed what the rule writer anticipates, so corpus expansion that relies entirely on regular expressions cannot guarantee a good expansion effect. The neural network model is then insufficiently trained because of the low volume of training data, and its prediction or classification accuracy is low.
Disclosure of Invention
The embodiments of the invention provide a data expansion method, device, equipment and medium based on regular expression coding, aiming to solve the prior-art problem that a neural network-based text classification algorithm relying on regular expressions to expand the corpus cannot guarantee a good expansion effect, so that the neural network model is insufficiently trained due to the low volume of training data.
In a first aspect, an embodiment of the present invention provides a data expansion method based on regular expression coding, which includes:
responding to a model training instruction, and acquiring a historical corpus set according to the model training instruction; wherein the historical corpus set comprises a plurality of historical corpora;
training an adversarial network model by taking the historical corpus set as training samples to obtain a trained adversarial network model;
inputting each historical corpus in the historical corpus set into the trained adversarial network model for operation to obtain an expanded corpus subset of each historical corpus, the expanded corpus subsets forming an expanded corpus set;
acquiring a preset regular expression corresponding to each historical corpus, and setting a corresponding regular expression for the expanded corpus subset of each historical corpus according to the regular expression of that historical corpus;
coding each expanded corpus according to its regular expression and a preset regular expression hit coding strategy to obtain a regular expression coding result;
performing word embedding conversion on each expanded corpus to obtain a word embedding vector;
acquiring the word embedding vector and regular expression coding result of each expanded corpus, and combining them to obtain a corpus input vector;
acquiring a preset labeling value corresponding to each historical corpus, and setting a corresponding labeling value for the expanded corpus subset of each historical corpus according to the labeling value of that historical corpus;
acquiring the corpus input vector and labeling value of each expanded corpus to form training data, and training a neural network to be trained on the training data of each expanded corpus to obtain a trained neural network; and
obtaining the corpus to be classified uploaded by the user side, and inputting the corpus to be classified into the trained neural network for operation to obtain a classification result.
In a second aspect, an embodiment of the present invention provides a data expansion apparatus based on regular expression coding, including:
a historical corpus acquisition unit, used for responding to a model training instruction and acquiring a historical corpus set according to the model training instruction; wherein the historical corpus set comprises a plurality of historical corpora;
a first model training unit, used for training an adversarial network model by taking the historical corpus set as training samples to obtain a trained adversarial network model;
an expanded corpus acquisition unit, used for inputting each historical corpus in the historical corpus set into the trained adversarial network model for operation to obtain an expanded corpus subset of each historical corpus, the expanded corpus subsets forming an expanded corpus set;
a first mapping unit, used for acquiring a preset regular expression corresponding to each historical corpus, and setting a corresponding regular expression for the expanded corpus subset of each historical corpus according to the regular expression of that historical corpus;
a regular coding unit, used for coding each expanded corpus according to its regular expression and a preset regular expression hit coding strategy to obtain a regular expression coding result;
a word embedding vector acquisition unit, used for performing word embedding conversion on each expanded corpus to obtain a word embedding vector;
a vector combination unit, used for acquiring the word embedding vector and regular expression coding result of each expanded corpus and combining them to obtain a corpus input vector;
a second mapping unit, used for acquiring a preset labeling value corresponding to each historical corpus, and setting a corresponding labeling value for the expanded corpus subset of each historical corpus according to the labeling value of that historical corpus;
a second model training unit, used for acquiring the corpus input vector and labeling value of each expanded corpus to form training data, and training a neural network to be trained on the training data of each expanded corpus to obtain a trained neural network; and
a corpus classifying unit, used for acquiring the corpus to be classified uploaded by the user terminal, inputting the corpus to be classified into the trained neural network for operation, and obtaining a classification result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the regular expression encoding-based data expansion method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by a processor causes the processor to perform the data expansion method based on regular expression encoding described in the first aspect above.
The embodiments of the invention provide a data expansion method, device, equipment and medium based on regular expression coding. If the sample size of each corpus type in the historical corpus set is insufficient, an adversarial network model for expanding the corpus data volume can be trained on the historical corpus set, and the trained adversarial network model generates an expanded corpus subset corresponding to each historical corpus, forming an expanded corpus set. In addition, when extracting the corpus input vector corresponding to each corpus in the expanded corpus set, not only is the regular expression coding result of the corpus extracted, but also its word embedding vector; combining the two into the corpus input vector means the corpus input vector carries manually edited expression information while retaining a certain data-driven learning capacity, realizing the combination of expressions and networks. The problem of insufficient training data is thereby solved, and because the training data incorporates the regular coding information, the prediction accuracy of the trained model is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of an application scenario of a regular expression encoding-based data expansion method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data expansion method based on regular expression encoding according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a regular expression encoding-based data expansion device provided by an embodiment of the present invention;
fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the protection scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic diagram of an application scenario of a data expansion method based on regular expression coding according to an embodiment of the present invention; fig. 2 is a flow chart of a data expansion method based on regular expression coding, which is provided by an embodiment of the present invention, and the data expansion method based on regular expression coding is applied to a server, and is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S110.
S101, responding to a model training instruction, and acquiring a historical corpus according to the model training instruction; the historical corpus set comprises a plurality of historical corpora.
In order to understand the technical solution of the present application more clearly, the execution subject involved is described in detail below. The technical solution is described with a server as the execution subject.
The server stores a large number of historical corpora, which form a historical corpus set. If the sample size of each corpus type in the historical corpus set is insufficient, an adversarial network model for expanding the corpus data volume can be trained on the historical corpus set, and the trained adversarial network model generates an expanded corpus subset corresponding to each historical corpus, forming an expanded corpus set. In addition, when extracting the corpus input vector corresponding to each corpus in the expanded corpus set, not only is the regular expression coding result of the corpus extracted, but also the word embedding vector (which can be understood as a semantic vector) of the corpus; combining the regular expression coding result and the word embedding vector into the corpus input vector means the corpus input vector carries manually edited expression information while retaining a certain data-driven learning capacity, realizing the combination of expressions and networks. A large number of training samples are formed in the server from the corpus input vectors and labeling values corresponding to the many expanded corpora, and the neural network to be trained can be trained on them to obtain a trained neural network. Finally, corpus classification can be performed based on the trained neural network.
At the user side, the user can edit and upload the corpus to be classified, so that it is classified by the trained neural network in the server.
When the server detects a model training instruction triggered by an operator, it acquires the historical corpus set from a designated storage area in the server (such as a corpus database in the server); the historical corpus set includes a plurality of historical corpora.
S102, training an adversarial network model by taking the historical corpus set as training samples to obtain a trained adversarial network model.
In this embodiment, the adversarial network model is a practical method for expanding a data set; for example, GAN (generative adversarial network) and cycle-GAN (cycle-consistent adversarial network) are common adversarial network models.
As a specific embodiment, a cycle-GAN model can be selected for corpus expansion. The cycle-GAN model is essentially two mirror-symmetrical GAN models forming a ring network. The cycle-GAN model includes two generators and two discriminators: the two GAN models share the two generators, and each GAN model has its own discriminator.
In one embodiment, step S102 includes:
acquiring the semantic vector of each historical corpus in the historical corpus set;
obtaining the vector similarity between the semantic vectors of the historical corpora, and grouping the historical corpora according to a preset grouping strategy to obtain a historical corpus grouping result; wherein the historical corpus grouping result comprises a plurality of historical corpus sub-groups, denoted as the 1st historical corpus sub-group to the k-th historical corpus sub-group, where k is the total number of historical corpus sub-groups included in the grouping result;
counting the total number of historical corpora included in each historical corpus sub-group, and taking the historical corpus sub-group with the largest total number of historical corpora as the target historical corpus sub-group;
according to a preset first corpus acquisition number, arbitrarily acquiring the semantic vectors of two historical corpora from the target historical corpus sub-group to train the cycle-GAN model to be trained, and when the cycle-GAN model to be trained converges, stopping acquiring semantic vectors from the target historical corpus sub-group, so as to obtain the trained cycle-GAN model as the trained adversarial network model; wherein the first corpus acquisition number is equal to 2.
In this embodiment, after the historical corpora are grouped, historical corpora with similar semantics fall into the same historical corpus sub-group. The historical corpus sub-group containing the largest total number of corpora among the sub-groups is then selected as the target historical corpus sub-group, which can be used as the screened training samples to train the cycle-GAN model to be trained.
For example, the semantic vectors of two initial corpora arbitrarily selected from the target historical corpus sub-group are denoted sample a and sample b. Two generators G_AB and G_BA and two discriminators D_A and D_B need to be trained. For sample a, the generator G_AB produces a fake sample b̂; the discriminator D_B judges whether the fake sample b̂ approximates sample b, and b̂ is then passed through the generator G_BA to produce a reconstructed sample â, which is judged against the original true sample a. Likewise, for sample b, the generator G_BA produces a fake sample â; the discriminator D_A judges whether â approximates sample a, and â is passed through the generator G_AB to produce b̂, which is judged against the original true sample b. Finally, through iteration, the discriminators can no longer tell whether a sample produced by a generator is a real sample.
The generators and discriminators are optimized in alternating training; the two generators share weights, and the two discriminators share weights. The final goal is to obtain the generators G_AB and G_BA that minimize this objective.
After the cycle-GAN model to be trained is trained through the historical corpus, the cycle-GAN model for expanding the corpus can be obtained, so that the sample size of the training set is increased, and the problem of insufficient model training caused by insufficient data size of training data is avoided.
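As an illustration of the cycle-consistency idea described above, the sketch below uses simple linear maps as stand-ins for the trained generators G_AB and G_BA (the matrices W_ab and W_ba are illustrative assumptions, not the patent's networks), and computes the reconstruction error ||G_BA(G_AB(a)) - a|| that the cycle-GAN training drives toward zero.

```python
import numpy as np

# Linear stand-ins for the two generators (W_ab and W_ba are
# illustrative assumptions, not learned weights).
W_ab = np.array([[0.0, 1.0],
                 [1.0, 0.0]])   # "generator" G_AB: domain A -> domain B
W_ba = W_ab.T                   # "generator" G_BA: domain B -> domain A

def g_ab(a):
    return W_ab @ a

def g_ba(b):
    return W_ba @ b

def cycle_loss(a):
    # L1 cycle-consistency error: how far the round trip A -> B -> A
    # lands from the original sample a.
    return float(np.abs(g_ba(g_ab(a)) - a).sum())
```

When the round trip reproduces the input exactly (here W_ba is the inverse of W_ab), the cycle loss is zero; training the real generators minimizes this quantity together with the discriminator losses.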
In one embodiment, the grouping strategy is K-means clustering; the obtaining the vector similarity between semantic vectors of each historical corpus in the historical corpus, and grouping the historical corpus according to a preset grouping strategy to obtain a historical corpus grouping result comprises the following steps:
acquiring the historical corpus set, and performing K-means clustering according to the Euclidean distance between the semantic vectors of the historical corpora to obtain the historical corpus grouping result.
In this embodiment, when the historical corpora are grouped according to a preset grouping strategy, the grouping strategy may be set to K-means clustering. After the expected number of groups is preset, K-means clustering may be performed using the Euclidean distance between the semantic vectors of the historical corpora as the vector similarity, so as to obtain the historical corpus grouping result. K-means clustering is a common prior art and will not be described in detail here.
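Since the patent only names K-means over Euclidean distances between semantic vectors, a minimal self-contained sketch may help; the deterministic initialization from the first k points is an assumption for reproducibility, and a real system would more likely use a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Group the row vectors of X into k clusters by Euclidean distance."""
    C = X[:k].astype(float).copy()   # init: first k points (assumption)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared Euclidean distance from every point to every centroid
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else C[j] for j in range(k)])
        if np.allclose(new_C, C):    # centroids stable: converged
            break
        C = new_C
    return labels, C
```

Grouping four toy "semantic vectors" into two sub-groups puts the two nearby points of each pair together, mirroring how semantically similar historical corpora end up in the same historical corpus sub-group.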
S103, inputting each historical corpus in the historical corpus set into the trained adversarial network model for operation to obtain an expanded corpus subset of each historical corpus, the expanded corpus subsets forming an expanded corpus set.
In this embodiment, after model training is completed and the trained adversarial network model is obtained, the historical corpora are again used as input data of the trained adversarial network model to generate the expanded corpus set.
S104, acquiring a preset regular expression corresponding to each historical corpus, and setting a corresponding regular expression for the expanded corpus subset of each historical corpus according to the regular expression of the historical corpus.
In this embodiment, after the expansion of the historical corpora is completed, each historical corpus is expanded into several semantically similar expanded corpora. For example, a historical corpus A and the expanded corpora obtained from it correspond to the same regular expression; that is, each expanded corpus subset can correspond to the same preset regular expression because its members are semantically similar.
For example, the regular expression corresponding to a certain historical corpus A is "(today|tomorrow|the day after tomorrow) do you have (free time|time)" (the quotation marks are not characters of the regular expression), and the expanded corpus subset corresponding to historical corpus A includes the following texts: (1) Xiao Wang, do you have free time tomorrow; (2) Xiao Wang, tomorrow then; (3) what did I eat today. At this time, the regular expression is mapped and bound to the expanded corpus subset corresponding to historical corpus A; binding a regular expression to each expanded corpus subset ensures that the subsequent regular expression hit encoding result is more accurate.
S105, coding each expanded corpus according to the regular expression and a preset regular expression hit coding strategy to obtain a regular expression coding result.
In this embodiment, in order to perform regular expression hit encoding more accurately, the following regular expression hit coding strategy needs to be preset:
a1) each character of the corpus text located before any hit on the corresponding regular expression is encoded as a first coding value (e.g., the first coding value is set to 0 in implementation);
a2) each character of the corpus text that hits the corresponding regular expression while the match is not yet complete is encoded as a second coding value (e.g., the second coding value is set to 1 in implementation);
a3) each character of the corpus text that misses the corresponding regular expression after a partial hit, i.e. breaks a partial match, is encoded as a third coding value (e.g., the third coding value is set to 2 in implementation);
a4) the character of the corpus text that completes a full hit on the corresponding regular expression is encoded as a fourth coding value (e.g., the fourth coding value is set to 3 in implementation).
For example, for the regular expression corresponding to the historical corpus A enumerated above, "(today|tomorrow|the day after tomorrow) do you have (free time|time)", the expanded corpus subset corresponding to historical corpus A includes the following texts: (1) Xiao Wang, do you have free time tomorrow; (2) Xiao Wang, tomorrow then; (3) what did I eat today; (4) what do you have. The result of encoding each corpus in this expanded corpus subset is as follows (one sub-code per character of the original corpus text):
Text 1: Xiao Wang, do you have free time tomorrow
Code 1: 0 0 0 1 1 1 1 1 3
Text 2: Xiao Wang, tomorrow then
Code 2: 0 0 0 1 1 2 2 2
Text 3: what did I eat today
Code 3: 1 1 2 2
Text 4: what do you have
Code 4: 0 0 0 0
The regular expression coding result obtained in this regular-expression-based coding manner provides more key information for each corpus, so that more features are mined from the corpus used as training data.
In an embodiment, taking a regular expression encoding result corresponding to a corpus as an example, step S105 includes:
dividing the expanded corpus by characters to obtain a character division result;
comparing each character in the character division result with the regular expression corresponding to the expanded corpus in sequence, and determining whether each character hits the regular expression;
if the character misses the regular expression and all characters before it also miss the regular expression, outputting the sub-coding result of the character as the first coding value;
if the character hits the regular expression but the characters up to and including it do not yet form a complete hit, outputting the sub-coding result of the character as the second coding value;
if the character misses the regular expression while the character immediately before it hit the regular expression, outputting the sub-coding result of the character as the third coding value;
if the character hits the regular expression and, together with all the consecutive hitting characters before it, forms a complete hit, outputting the sub-coding result of the character as the fourth coding value;
splicing the sub-coding results of the characters in sequence to obtain the regular expression coding result of the expanded corpus.
In this embodiment, referring to the above process of regular expression encoding of a text, the corpus (which can also be understood as a text) is first split into individual characters, and the regular expression is then matched character by character to determine whether each character hits it. For example, the text of corpus 1 above, "Xiao Wang, do you have free time tomorrow", splits into 9 characters: the two characters of "Xiao Wang", the comma, the two characters of "tomorrow", "you", "have", "free time", and the sentence-final particle. The first three characters miss the regular expression, so each outputs sub-code 0; the following five characters each hit the regular expression while the match is still incomplete, so each outputs sub-code 1; the final particle hits the regular expression and, together with the six consecutive hitting characters before it, forms a complete hit, so it outputs sub-code 3. Inputting this regular expression coding sequence into the neural network for training gives the training sequence flexibility, i.e., provides more key information for each corpus, so that more features are mined from the corpus used as training data.
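The per-character hit coding above can be sketched as follows. Since the patent leaves the per-character hit test to the regular expression engine, this sketch makes a simplifying assumption: the pattern is a finite alternation of literal groups (as in the patent's example), which is expanded into its full set of literal strings so that partial hits can be detected by prefix matching. Positions before any hit code 0, characters inside an incomplete match code 1, characters after a broken partial match code 2, and the character completing a full match codes 3.

```python
import itertools
import re

def expand(pattern):
    """Expand a finite alternation pattern like "(a|b)c(d|e)" into every
    literal string it can match (assumption: no other regex operators)."""
    parts = re.split(r'\(([^)]*)\)', pattern)
    choices = [part.split('|') if i % 2 else [part]
               for i, part in enumerate(parts)]
    return [''.join(combo) for combo in itertools.product(*choices)]

def encode(text, pattern):
    """Per-character hit codes: 0 before any hit, 1 inside an incomplete
    match, 2 after a broken partial match, 3 on the completing character."""
    variants = expand(pattern)
    codes = [0] * len(text)
    i = 0
    while i < len(text):
        best = 0
        hit = False
        for v in variants:
            k = 0
            while k < len(v) and i + k < len(text) and text[i + k] == v[k]:
                k += 1
            if k == len(v) and k > 0:             # full hit ending here
                codes[i:i + k] = [1] * (k - 1) + [3]
                i += k
                hit = True
                break
            best = max(best, k)
        if hit:
            continue
        if best > 0:                              # partial hit, then broken:
            codes[i:i + best] = [1] * best        # the patent's examples code
            codes[i + best:] = [2] * (len(text) - i - best)  # the rest as 2
            return codes
        i += 1
    return codes
```

For the toy pattern "(ab|cd)ef", the text "xxabefy" encodes as [0, 0, 1, 1, 1, 3, 0], and "xaby" (a partial hit that breaks) encodes as [0, 1, 1, 2], mirroring the code patterns of texts 1 and 2 above.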
S106, carrying out word embedding conversion on each expansion corpus to obtain word embedding vectors.
In this embodiment, after the regular expression coding result corresponding to each corpus in the expanded corpus set is obtained, text sequence vectorization also needs to be performed on each corpus in the expanded corpus set. The most common processing is to convert each corpus into a corresponding word embedding vector (which can also be understood as a semantic vector), so that the key semantic information in the corpus can also be effectively extracted.
In one embodiment, step S106 includes:
and sequentially carrying out word segmentation, keyword extraction, word vector conversion and word vector weighted summation on each expanded corpus according to a preset word embedding conversion strategy to obtain word embedding vectors of each expanded corpus.
In this embodiment, text word segmentation is performed on a corpus based on a probability-statistics word segmentation model to obtain a text word segmentation result, keyword extraction is performed on the text word segmentation result through a term frequency-inverse document frequency (TF-IDF) model to obtain a corresponding keyword set, each keyword in the keyword set is converted into a word vector through a Word2Vec model (an existing word vector conversion model), and weighted summation is then performed by combining the weight value corresponding to each word vector, so that the word embedding vector corresponding to the corpus is obtained. Through this semantic vector acquisition mode, the original key information in the corpus is extracted without redundant processing.
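The segmentation, keyword-extraction, vector-lookup and weighted-sum steps above can be sketched as below. The word vectors, uniform weights, and helper names are made-up stand-ins; a real system would use a trained statistical segmenter, a fitted TF-IDF model, and a trained Word2Vec model.

```python
TOY_WORD2VEC = {            # word -> 3-dim vector (a real model uses 100+ dims)
    "tomorrow": [0.9, 0.1, 0.0],
    "free":     [0.2, 0.8, 0.1],
    "time":     [0.1, 0.7, 0.3],
}

def segment(text):
    # stand-in for a probability-statistics word segmentation model
    return text.split()

def keywords_with_weights(tokens):
    # stand-in for TF-IDF keyword extraction: keep tokens known to the
    # vector table and weight them uniformly
    kept = [t for t in tokens if t in TOY_WORD2VEC]
    w = 1.0 / len(kept) if kept else 0.0
    return [(t, w) for t in kept]

def word_embedding(text):
    pairs = keywords_with_weights(segment(text))
    dim = 3
    vec = [0.0] * dim
    for word, weight in pairs:           # weighted sum of keyword vectors
        for i in range(dim):
            vec[i] += weight * TOY_WORD2VEC[word][i]
    return vec

emb = word_embedding("do you have free time tomorrow")
```

With the toy table above, `emb` is the uniform-weight average of the three keyword vectors.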
S107, acquiring word embedding vectors and regular expression coding results of each expanded corpus, and combining the word embedding vectors and the regular expression coding results of each expanded corpus to obtain corpus input vectors.
In this embodiment, after the word embedding vector and the regular expression coding result corresponding to each corpus are obtained, the word embedding vector and the regular expression coding result are combined to obtain the corpus input vector corresponding to the corpus. For example, if the word embedding vector corresponding to a corpus A is a 1×300 vector, and the regular expression coding result corresponding to the corpus A is a 1×10 vector, the two can be concatenated into a 1×310 vector. Through this splicing process, a multi-dimensional corpus input vector carrying both the original key information of the corpus and the regular expression coding information is obtained.
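The concatenation step above can be sketched as follows, matching the 1×300 plus 1×10 example; the placeholder values are illustrative assumptions.

```python
# word embedding vector (1 x 300) and regex coding result (1 x 10),
# filled with placeholder values for illustration
word_embedding_vector = [0.1] * 300
regex_coding_result = [0, 0, 1, 1, 3, 0, 0, 0, 0, 0]

# corpus input vector (1 x 310): embedding first, coding result after
corpus_input_vector = word_embedding_vector + regex_coding_result
```

The ordering (embedding before coding result) follows the splicing order described in the embodiment of step S107.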
In one embodiment, step S107 includes:
and splicing the word embedding vector and the regular expression coding result of each expanded corpus into a corpus input vector according to the sequence that the word embedding vector is positioned before the regular expression coding result.
In this embodiment, word embedding vectors and regular expression encoding results of the corpus are spliced according to the sequence, so that the input vector of each corpus has two dimensions, namely the word embedding vector and the regular expression encoding result, and a multi-dimensional corpus input vector with both original key information of the corpus and regular expression encoding information is obtained.
S108, obtaining a preset labeling value corresponding to each historical corpus, and setting a corresponding labeling value for the expanded corpus subset of each historical corpus according to the labeling value of the historical corpus.
In this embodiment, in order to label each corpus more quickly, the corpora included in the same expanded corpus subset may correspond to the same labeling value, and this labeling value inherits the labeling value of the historical corpus corresponding to the expanded corpus subset. In specific implementation, once a labeling value and an expanded corpus subset correspond to the same historical corpus, the labeling value and the expanded corpus subset are mapped and bound, so that each expanded corpus subset is labeled correspondingly and rapidly.
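The label-inheritance mapping can be sketched as below; the identifiers and labeling values are hypothetical examples.

```python
# assumed labeling values of two historical corpora
history_labels = {"hist_A": 1, "hist_B": 0}

# expanded corpus subsets generated from each historical corpus
expanded_subsets = {
    "hist_A": ["expanded A1", "expanded A2"],
    "hist_B": ["expanded B1"],
}

# every corpus in a subset inherits the label of its historical corpus
labeled = {
    corpus: history_labels[hist]
    for hist, subset in expanded_subsets.items()
    for corpus in subset
}
```

This gives every expanded corpus a training label without labeling each one individually.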
S109, acquiring corpus input vectors and labeling values of each expanded corpus to form training data of each expanded corpus, and training the training data of each expanded corpus on the neural network to be trained to obtain the trained neural network.
In this embodiment, after the corpus input vector and the labeling value corresponding to each corpus in the expanded corpus set are obtained, the training data corresponding to each corpus is composed of its corpus input vector and labeling value, and these training data form a training set to train the neural network to be trained, so as to obtain the trained neural network. The neural network to be trained may be a BERT model, and the trained neural network has the capability of predicting a classification based on an input corpus.
S110, acquiring the corpus to be classified uploaded by the user side, and inputting the corpus to be classified into the trained neural network for operation to obtain a classification result.
In this embodiment, after the training of the neural network to be trained is completed in the server to obtain the trained neural network, whether the user side uploads a corpus to be classified can be detected; once the corpus to be classified uploaded by the user side is received, the corpus to be classified is processed by the trained neural network to obtain a classification result.
In one embodiment, step S110 includes:
and obtaining the word embedding vector to be classified and the regular expression coding result to be classified of the corpus to be classified to form the corpus input vector to be classified, and inputting the corpus input vector to be classified into the trained neural network to perform operation to obtain the classification result.
In this embodiment, after the corpus to be classified uploaded by the user side is obtained, the word embedding vector to be classified and the regular expression coding result to be classified of the corpus to be classified are first obtained through the same procedures used for the historical corpora, a corresponding corpus input vector to be classified is then formed, and finally the corpus input vector to be classified is input to the trained neural network for operation to obtain the classification result. By means of this input vector integrating the word embedding vector and the regular expression coding result, the classification result can be predicted more accurately.
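The inference-time composition above can be sketched as follows. Every function here (`embed`, `regex_encode`, `trained_net`, `classify`) is a hypothetical stand-in: the real pipeline would reuse the trained Word2Vec embedding, the regular expression hit-coding, and the trained neural network.

```python
def embed(text):
    return [float(len(text))]          # toy 1-dim "word embedding"

def regex_encode(text):
    # toy 1-dim "regex coding result": 1 if a keyword hits, else 0
    return [1 if "tomorrow" in text else 0]

def trained_net(vec):
    return 1 if sum(vec) > 5 else 0    # toy stand-in for the trained network

def classify(text):
    # corpus input vector to be classified: embedding first, coding after
    vec = embed(text) + regex_encode(text)
    return trained_net(vec)

result = classify("free tomorrow?")
```

The point of the sketch is the composition order: the corpus to be classified goes through exactly the same vectorization steps as the training corpora before reaching the network.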
According to the method, an adversarial network model for expanding the data volume of corpora is trained through the historical corpus set; the trained adversarial network model generates an expanded corpus subset for each historical corpus in the historical corpus set to form the expanded corpus set; and the corpus input vector corresponding to each corpus in the expanded corpus set is extracted, including both the regular expression coding result and the word embedding vector of the corpus. This solves the problem of insufficient training data, and combining the training data with regular expression coding information improves the prediction accuracy of the trained model.
The embodiment of the invention also provides a data expansion device based on regular expression coding, which is used for executing any embodiment of the data expansion method based on regular expression coding. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a data expansion apparatus based on regular expression encoding according to an embodiment of the present invention. The regular expression encoding-based data expansion apparatus 100 may be configured in a server.
As shown in fig. 3, the regular expression encoding-based data expansion apparatus 100 includes: a historical corpus acquisition unit 101, a first model training unit 102, an expanded corpus acquisition unit 103, a first mapping unit 104, a regular coding unit 105, a word embedding vector acquisition unit 106, a vector combining unit 107, a second mapping unit 108, a second model training unit 109 and a corpus classifying unit 110.
A historical corpus acquisition unit 101, configured to respond to a model training instruction, and acquire a historical corpus according to the model training instruction; the historical corpus set comprises a plurality of historical corpora.
In this embodiment, when a model training instruction triggered by an operator is detected in the server, a historical corpus set is first obtained from a specified storage area in the server (for example, a corpus database in the server), and the historical corpus set includes a plurality of historical corpora.
The first model training unit 102 is configured to train an adversarial network model by using the historical corpus set as training samples, so as to obtain a trained adversarial network model.
In this embodiment, the adversarial network model is a practical approach for expanding a data set; for example, the GAN (generative adversarial network) and the Cycle-GAN (cycle-consistent adversarial network) are commonly used adversarial network models.
As a specific embodiment of the trained adversarial network model, a Cycle-GAN model can be selected for corpus expansion. The Cycle-GAN model is essentially two mirror-symmetric GAN models forming a ring network. The Cycle-GAN model includes two generators and two discriminators; the two GAN models share the two generators, and each GAN model has its own discriminator.
In one embodiment, the first model training unit 102 includes:
a semantic vector acquisition unit, configured to acquire the semantic vector of each historical corpus in the historical corpus set;
the historical corpus grouping unit is used for acquiring the vector similarity among the semantic vectors of the historical corpora in the historical corpus set and grouping the historical corpora according to a preset grouping strategy to obtain a historical corpus grouping result; the historical corpus grouping result comprises a plurality of historical corpus sub-groups, respectively recorded as the 1st historical corpus sub-group to the k-th historical corpus sub-group, wherein k is the total number of historical corpus sub-groups included in the historical corpus grouping result;
the sub-group acquisition unit is used for counting and acquiring the total number of the historical corpuses correspondingly included in each historical corpus sub-group, and acquiring the historical corpus sub-group with the largest total number of the historical corpuses as a target historical corpus sub-group;
the adversarial network model training unit is used for arbitrarily acquiring the semantic vectors of two historical corpora at a time from the target historical corpus sub-group according to a preset first corpus extraction number so as to train the Cycle-GAN model to be trained, and stopping the acquisition when the Cycle-GAN model to be trained converges, the converged Cycle-GAN model being taken as the trained adversarial network model; wherein the first corpus extraction number is equal to 2.
In this embodiment, after the historical corpora are grouped, the historical corpora with similar semantics are grouped into the same historical corpus sub-group. At this time, a history corpus sub-group in which the total number of corpora included in the plurality of history corpus sub-groups is the maximum is selected as a target history corpus sub-group. At this time, the target historical corpus sub-group can be used as a training sample after screening to train the cycle-GAN model to be trained.
For example, the semantic vectors of two initial corpora arbitrarily selected from the target historical corpus sub-group are recorded as sample a and sample b, and two generators G_AB, G_BA and two discriminators D_A, D_B need to be trained. For sample a, the generator G_AB generates a fake sample b̂ = G_AB(a); the discriminator D_B judges whether b̂ approximates sample b; the fake sample b̂ is then fed to the generator G_BA to generate the reconstructed sample G_BA(b̂), and it is judged whether this reconstructed sample approximates the original true sample a. Likewise, for sample b, the generator G_BA generates a fake sample â = G_BA(b); the discriminator D_A judges whether â approximates sample a; the fake sample â is then fed to the generator G_AB to generate the reconstructed sample G_AB(â), and it is judged whether this reconstructed sample approximates the original true sample b. Finally, through iteration, the discriminators can no longer distinguish whether a sample generated by the generators is a real sample.
The generators and the discriminators are optimized through training respectively, wherein the two generators share weights and the two discriminators share weights, and the final goal is to obtain the generators G_AB and G_BA that minimize the objective.
After the Cycle-GAN model to be trained is trained through the historical corpora, a Cycle-GAN model for expanding corpora is obtained, so that the sample size of the training set is increased and the problem of insufficient model training caused by an insufficient amount of training data is avoided.
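The cycle-consistency idea described above can be illustrated with a toy computation. Real Cycle-GAN generators and discriminators are neural networks trained by gradient descent; the scalar maps and names below are illustrative stand-ins only, showing how G_BA(G_AB(a)) should reconstruct the original sample a.

```python
def g_ab(x, w=2.0):
    return [w * v for v in x]          # toy generator: domain A -> domain B

def g_ba(x, w=0.5):
    return [w * v for v in x]          # toy generator: domain B -> domain A

def cycle_consistency_loss(a):
    fake_b = g_ab(a)                   # fake sample b_hat = G_AB(a)
    recon_a = g_ba(fake_b)             # reconstruction G_BA(b_hat)
    # L1 distance between the reconstruction and the original sample
    return sum(abs(r - v) for r, v in zip(recon_a, a))

sample_a = [1.0, -2.0, 0.5]
loss = cycle_consistency_loss(sample_a)   # 0.0 here, since 2.0 * 0.5 == 1
```

When the two toy maps are exact inverses the loss vanishes, which is the condition the cycle-consistency term pushes the trained generators toward.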
In an embodiment, the grouping strategy is K-means clustering, and the historical corpus grouping unit is further configured to:
and acquiring the historical corpus set, and performing K-means clustering according to the Euclidean distance between the semantic vectors of the historical corpora to obtain the historical corpus grouping result.
In this embodiment, when the historical corpora are grouped according to a preset grouping strategy, the grouping strategy may be set to K-means clustering. After the expected number of groups is preset, K-means clustering may be performed using the Euclidean distance between the semantic vectors of the historical corpora as the vector similarity, so as to obtain the historical corpus grouping result. K-means clustering is common prior art and will not be described in detail herein.
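The grouping step can be sketched with a minimal K-means implementation over toy "semantic vectors". A library implementation (e.g. scikit-learn's KMeans) would normally be used; this toy version uses deterministic initialization (evenly spaced seed points) so its behavior is easy to follow, which is an assumption for illustration rather than standard practice.

```python
def euclidean(p, q):
    # Euclidean distance used as the vector similarity
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(vectors, k, iters=10):
    # deterministic init: pick k evenly spaced vectors as initial centers
    centers = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: each vector goes to its nearest center
        assign = [min(range(k), key=lambda c: euclidean(v, centers[c]))
                  for v in vectors]
        # update step: each center moves to the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# two obvious groups of toy "semantic vectors"
semantic_vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
groups = kmeans(semantic_vectors, k=2)
```

Here semantically similar corpora (nearby vectors) end up in the same historical corpus sub-group.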
The expanded corpus acquisition unit 103 is configured to input each historical corpus in the historical corpus set to the trained countermeasure network model for operation, obtain an expanded corpus subset of each historical corpus, and form an expanded corpus set from the expanded corpus subsets.
In this embodiment, after model training is completed to obtain a trained countermeasure network model, the historical corpus may still be used as input data of the trained countermeasure network model to generate an expanded corpus.
The first mapping unit 104 is configured to obtain a preset regular expression corresponding to each historical corpus, and set the expanded corpus subset of each historical corpus to a corresponding regular expression according to the regular expression of the historical corpus.
In this embodiment, after the expansion of the historical corpora is completed, each historical corpus is expanded into a plurality of semantically similar expanded corpora; for example, a historical corpus A and the expanded corpora generated from it correspond to the same regular expression. That is, because of their similar semantics, all the corpora in an expanded corpus subset can correspond to the same preset regular expression.
For example, the regular expression corresponding to a certain historical corpus A is "(today|tomorrow|the day after tomorrow) do you have (free time|time)" (the quotation marks are not characters of the regular expression), and the expanded corpus subset corresponding to the historical corpus A includes the following texts: (1) Xiao Wang, do you have free time tomorrow; (2) Xiao Wang, where to go tomorrow; (3) what to eat today. At this time, the regular expression is mapped and bound to the expanded corpus subset corresponding to the historical corpus A; binding the corresponding regular expression to each expanded corpus subset ensures that the coding result of the subsequent regular expression hit coding is more accurate.
The regular coding unit 105 is configured to code each expansion corpus according to a regular expression and a preset regular expression hit coding strategy, so as to obtain a regular expression coding result.
In this embodiment, in order to more accurately perform regular expression hit encoding, the following regular expression hit encoding strategy needs to be preset:
a1 ) Characters of the text corresponding to the corpus that appear before any hit on the corresponding regular expression are encoded as a first coding value (e.g., the first coding value is set to 0 in implementation);
a2 ) Characters of the text corresponding to the corpus that form a partial but not yet complete hit on the corresponding regular expression are encoded as a second coding value (e.g., the second coding value is set to 1 in implementation);
a3 ) Characters of the text corresponding to the corpus that miss the corresponding regular expression after a preceding hit are encoded as a third coding value (e.g., the third coding value is set to 2 in implementation);
a4 ) The character of the text corresponding to the corpus that completes a full hit on the corresponding regular expression is encoded as a fourth coding value (e.g., the fourth coding value is set to 3 in implementation).
For example, with the regular expression "(today|tomorrow|the day after tomorrow) do you have (free time|time)" corresponding to the historical corpus A still enumerated above, the expanded corpus subset corresponding to the historical corpus A includes the following texts: (1) Xiao Wang, do you have free time tomorrow; (2) Xiao Wang, where to go tomorrow; (3) what to eat today; (4) what do you have. At this time, the result of encoding each corpus in the expanded corpus subset corresponding to the historical corpus A (one coding value per character of the original text) is as follows:
Text 1: Xiao Wang, do you have free time tomorrow
Coding 1: 0 0 0 1 1 1 1 1 3
Text 2: Xiao Wang, where to go tomorrow
Coding 2: 0 0 0 1 1 2 2 2
Text 3: what to eat today
Coding 3: 1 1 2 2
Text 4: what do you have
Coding 4: 0 0 0 0
The regular expression coding result obtained by the regular expression-based coding mode provides more key information for each corpus, so that the corpus serving as training data is mined with more features.
In an embodiment, taking an example of obtaining a regular expression encoding result corresponding to a corpus, the regular encoding unit 105 includes:
the character dividing unit is used for dividing the expanded corpus according to characters to obtain a character dividing result;
the hit comparison unit is used for sequentially comparing each character in the character division result with the regular expression corresponding to the expanded corpus and determining whether each character hits the regular expression;
the first coding unit is used for outputting the sub-coding result of the character as a first coding value if the character is determined to miss the regular expression and other characters in front of the character miss the regular expression;
the second coding unit is used for outputting the sub-coding result of the character as a second coding value if the character is determined to hit the regular expression and other characters before the character are not completely hit the regular expression;
The third coding unit is used for outputting the sub-coding result of the character as a third coding value if it is determined that the character misses the regular expression and a preceding character hit the regular expression;
the fourth coding unit is used for outputting the sub-coding result of the character as a fourth coding value if the character is determined to hit the regular expression and all other continuous characters before the character hit the regular expression;
and the coding combination unit is used for splicing the sub-coding results of each character acquired in sequence to obtain a regular expression coding result of the expanded corpus.
In this embodiment, referring to the above process of performing regular expression coding on a text, the corpus (which may also be understood as a text) is first split into individual characters, and the characters are then matched against the regular expression one by one to determine whether each character hits the regular expression. For example, the text "Xiao Wang, do you have free time tomorrow" splits into 9 characters, and the regular expression corresponding to the corpus is "(today|tomorrow|the day after tomorrow) do you have (free time|time)". The first three characters ("Xiao", "Wang" and the comma) miss the regular expression with no hit before them, so each of their sub-coding results is output as 0; the next five characters each extend a partial hit of the regular expression, so each of their sub-coding results is output as 1; the final character completes the match, and since all the consecutive characters before it hit the regular expression, its sub-coding result is output as 3. Inputting this regular expression coding sequence into the neural network for training provides more key information for each corpus, so that more features can be mined from the corpora serving as training data.
The word embedding vector obtaining unit 106 is configured to perform word embedding conversion on each expanded corpus to obtain a word embedding vector.
In this embodiment, after the regular expression coding result corresponding to each corpus in the expanded corpus set is obtained, text sequence vectorization also needs to be performed on each corpus in the expanded corpus set. The most common processing is to convert each corpus into a corresponding word embedding vector (which can also be understood as a semantic vector), so that the key semantic information in the corpus can also be effectively extracted.
In an embodiment, the word embedding vector obtaining unit 106 is further configured to:
and sequentially carrying out word segmentation, keyword extraction, word vector conversion and word vector weighted summation on each expanded corpus according to a preset word embedding conversion strategy to obtain word embedding vectors of each expanded corpus.
In this embodiment, text word segmentation is performed on a corpus based on a probability-statistics word segmentation model to obtain a text word segmentation result, keyword extraction is performed on the text word segmentation result through a term frequency-inverse document frequency (TF-IDF) model to obtain a corresponding keyword set, each keyword in the keyword set is converted into a word vector through a Word2Vec model (an existing word vector conversion model), and weighted summation is then performed by combining the weight value corresponding to each word vector, so that the word embedding vector corresponding to the corpus is obtained. Through this semantic vector acquisition mode, the original key information in the corpus is extracted without redundant processing.
The vector combining unit 107 is configured to obtain a word embedding vector and a regular expression encoding result of each expanded corpus, and combine the word embedding vector and the regular expression encoding result of each expanded corpus to obtain a corpus input vector.
In this embodiment, after the word embedding vector and the regular expression coding result corresponding to each corpus are obtained, the word embedding vector and the regular expression coding result are combined to obtain the corpus input vector corresponding to the corpus. For example, if the word embedding vector corresponding to a corpus A is a 1×300 vector, and the regular expression coding result corresponding to the corpus A is a 1×10 vector, the two can be concatenated into a 1×310 vector. Through this splicing process, a multi-dimensional corpus input vector carrying both the original key information of the corpus and the regular expression coding information is obtained.
In an embodiment, the vector combining unit 107 is further configured to:
and splicing the word embedding vector and the regular expression coding result of each expanded corpus into a corpus input vector according to the sequence that the word embedding vector is positioned before the regular expression coding result.
In this embodiment, word embedding vectors and regular expression encoding results of the corpus are spliced according to the sequence, so that the input vector of each corpus has two dimensions, namely the word embedding vector and the regular expression encoding result, and a multi-dimensional corpus input vector with both original key information of the corpus and regular expression encoding information is obtained.
The second mapping unit 108 is configured to obtain a preset labeling value corresponding to each historical corpus, and set the corresponding labeling value for the expanded corpus subset of each historical corpus according to the labeling value of the historical corpus.
In this embodiment, in order to label each corpus more quickly, the corpora included in the same expanded corpus subset may correspond to the same labeling value, and this labeling value inherits the labeling value of the historical corpus corresponding to the expanded corpus subset. In specific implementation, once a labeling value and an expanded corpus subset correspond to the same historical corpus, the labeling value and the expanded corpus subset are mapped and bound, so that each expanded corpus subset is labeled correspondingly and rapidly.
The second model training unit 109 is configured to obtain training data of each expanded corpus composed of a corpus input vector and a labeling value of each expanded corpus, and train the training data of each expanded corpus on the neural network to be trained, thereby obtaining a trained neural network.
In this embodiment, after the corpus input vector and the labeling value corresponding to each corpus in the expanded corpus set are obtained, the training data corresponding to each corpus is composed of its corpus input vector and labeling value, and these training data form a training set to train the neural network to be trained, so as to obtain the trained neural network. The neural network to be trained may be a BERT model, and the trained neural network has the capability of predicting a classification based on an input corpus.
The corpus classifying unit 110 is configured to obtain a corpus to be classified uploaded by the user, and input the corpus to be classified to the trained neural network for operation, so as to obtain a classification result.
In this embodiment, after the training of the neural network to be trained is completed in the server to obtain the trained neural network, whether the user side uploads a corpus to be classified can be detected; once the corpus to be classified uploaded by the user side is received, the corpus to be classified is processed by the trained neural network to obtain a classification result.
In an embodiment, the regular expression encoding-based data expansion apparatus 100 further includes:
the classification result acquisition unit is used for acquiring the word embedding vector to be classified and the regular expression coding result to be classified of the corpus to be classified to form the corpus input vector to be classified, and inputting the corpus input vector to be classified into the trained neural network for operation to obtain the classification result.
In this embodiment, after the corpus to be classified uploaded by the user side is obtained, the word embedding vector to be classified and the regular expression coding result to be classified of the corpus to be classified are first obtained through the same procedures used for the historical corpora, a corresponding corpus input vector to be classified is then formed, and finally the corpus input vector to be classified is input to the trained neural network for operation to obtain the classification result. By means of this input vector integrating the word embedding vector and the regular expression coding result, the classification result can be predicted more accurately.
According to the device, an adversarial network model for expanding the data volume of corpora is trained through the historical corpus set; the trained adversarial network model generates an expanded corpus subset for each historical corpus in the historical corpus set to form the expanded corpus set; and the corpus input vector corresponding to each corpus in the expanded corpus set is extracted, including both the regular expression coding result and the word embedding vector of the corpus. This solves the problem of insufficient training data, and combining the training data with regular expression coding information improves the prediction accuracy of the trained model.
The above-described regular expression encoding-based data expansion apparatus may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be a stand-alone server or a server cluster formed by a plurality of servers.
With reference to FIG. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505, connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a data expansion method based on regular expression encoding.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a data expansion method based on regular expression encoding.
The network interface 505 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting of the computer device 500 to which the present inventive arrangements may be implemented, and that a particular computer device 500 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The processor 502 is configured to execute a computer program 5032 stored in a memory, so as to implement the regular expression encoding-based data expansion method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of the computer device shown in FIG. 4 does not limit the specific construction of the computer device; in other embodiments, the computer device may include more or fewer components than those shown, certain components may be combined, or the components may be arranged differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structure and function of the memory and the processor are consistent with the embodiment shown in FIG. 4 and will not be described again.
It should be appreciated that in embodiments of the present invention, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a nonvolatile computer readable storage medium or a volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program when executed by a processor implements the regular expression encoding-based data expansion method disclosed in the embodiments of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus, device and units described above may refer to the corresponding procedures in the foregoing method embodiments, and are not repeated herein. Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a logical function division, and there may be other divisions in actual implementation; units with the same function may be integrated into one unit; multiple units or components may be combined or integrated into another system; or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, and may be electrical, mechanical or in other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (8)

1. A data expansion method based on regular expression coding, comprising:
responding to a model training instruction, and acquiring a historical corpus set according to the model training instruction; wherein the historical corpus set comprises a plurality of historical corpora;
training an adversarial network model with the historical corpus set as training samples to obtain a trained adversarial network model;
inputting each historical corpus in the historical corpus set into the trained adversarial network model for operation to obtain an expanded corpus subset of each historical corpus, the expanded corpus subsets forming an expanded corpus set; wherein each expanded corpus subset comprises a plurality of expanded corpora;
acquiring a preset regular expression corresponding to each historical corpus, and setting a corresponding regular expression for the expanded corpus subset of each historical corpus according to the regular expression of the historical corpus;
coding each expanded corpus according to the regular expression and a preset regular expression hit coding strategy to obtain a regular expression coding result;
performing word embedding conversion on each expanded corpus to obtain word embedding vectors;
acquiring the word embedding vector and the regular expression coding result of each expanded corpus, and forming the corpus input vector of each expanded corpus from the word embedding vector and the regular expression coding result;
acquiring a preset labeling value corresponding to each historical corpus, and setting a corresponding labeling value for the expanded corpus subset of each historical corpus according to the labeling value of the historical corpus;
acquiring the corpus input vector and the labeling value of each expanded corpus to form the training data of each expanded corpus, and training a to-be-trained neural network with the training data of each expanded corpus to obtain a trained neural network; and
acquiring corpus to be classified uploaded by a user terminal, inputting the corpus to be classified into a trained neural network for operation, and obtaining a classification result;
wherein training the adversarial network model with the historical corpus set as training samples to obtain the trained adversarial network model comprises:
acquiring semantic vectors of each historical corpus in the historical corpus set;
obtaining vector similarities between the semantic vectors of the historical corpora in the historical corpus set, and grouping the historical corpus set according to a preset grouping strategy to obtain a historical corpus grouping result; wherein the historical corpus grouping result comprises a plurality of historical corpus sub-groups, recorded as the 1st historical corpus sub-group to the k-th historical corpus sub-group, where k is the total number of historical corpus sub-groups included in the historical corpus grouping result;
counting the total number of historical corpora included in each historical corpus sub-group, and taking the historical corpus sub-group with the largest total number of historical corpora as a target historical corpus sub-group;
training a to-be-trained Cycle-GAN model by repeatedly acquiring, according to a preset first corpus acquisition number, the semantic vectors of two arbitrary historical corpora from the target historical corpus sub-group, and when the to-be-trained Cycle-GAN model converges, stopping the acquisition and taking the converged Cycle-GAN model as the trained adversarial network model; wherein the first corpus acquisition number is equal to 2;
wherein coding each expanded corpus according to the regular expression and the preset regular expression hit coding strategy to obtain the regular expression coding result comprises:
dividing the expanded corpus according to characters to obtain a character division result;
comparing each character in the character division result with the regular expression corresponding to the expanded corpus in sequence, and determining whether each character hits the regular expression;
if it is determined that the character misses the regular expression and all characters before it also miss the regular expression, outputting the sub-coding result of the character as a first coding value;
if it is determined that the character hits the regular expression and the characters before it have not all hit the regular expression, outputting the sub-coding result of the character as a second coding value;
if it is determined that the character misses the regular expression and the character immediately before it hits the regular expression, outputting the sub-coding result of the character as a third coding value;
if it is determined that the character hits the regular expression and all consecutive characters before it hit the regular expression, outputting the sub-coding result of the character as a fourth coding value;
and concatenating the sub-coding results of the characters in sequence to obtain the regular expression coding result of the expanded corpus.
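As an illustration, the four-case per-character hit coding described above can be sketched in Python. This is a sketch, not the patented implementation: the concrete coding values (0–3 below) are assumptions, and the four conditions are read as run-based outside/begin/after/inside states, since the claim fixes neither:

```python
import re

# Illustrative coding values; the claim names four distinct values
# but does not fix concrete numbers.
FIRST, SECOND, THIRD, FOURTH = 0, 1, 2, 3

def regex_hit_encode(corpus, pattern):
    """Mark which characters fall inside a match of `pattern`, then map
    each character to one of four sub-coding values and concatenate them."""
    hit = [False] * len(corpus)
    for m in re.finditer(pattern, corpus):
        for i in range(m.start(), m.end()):
            hit[i] = True
    codes = []
    for i, h in enumerate(hit):
        prev_hit = i > 0 and hit[i - 1]
        if h and not prev_hit:
            codes.append(SECOND)   # hit begins here
        elif h and prev_hit:
            codes.append(FOURTH)   # hit continues an unbroken run
        elif not h and prev_hit:
            codes.append(THIRD)    # character immediately after a hit run
        else:
            codes.append(FIRST)    # no hit here and none immediately before
    return codes

print(regex_hit_encode("ab12cd", r"\d+"))  # → [0, 0, 1, 3, 2, 0]
```

Concatenating the per-character values yields the regular expression coding result that is later appended to the word embedding vector.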
2. The regular expression encoding-based data expansion method of claim 1, wherein the grouping strategy is K-means clustering;
wherein obtaining the vector similarities between the semantic vectors of the historical corpora in the historical corpus set and grouping the historical corpus set according to the preset grouping strategy to obtain the historical corpus grouping result comprises:
acquiring the historical corpus set, and performing K-means clustering according to the Euclidean distances between the semantic vectors of the historical corpora to obtain the historical corpus grouping result.
3. The regular expression coding-based data expansion method according to claim 1, wherein the performing word embedding transformation on each expansion corpus to obtain word embedding vectors comprises:
sequentially performing word segmentation, keyword extraction, word vector conversion and word vector weighted summation on each expanded corpus according to a preset word embedding conversion strategy to obtain the word embedding vector of each expanded corpus.
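A sketch of that four-step pipeline, under the assumption that the tokenizer, the keyword weights (e.g. TF-IDF scores), and the word-vector table are supplied by the caller; none of these components is fixed by the claim:

```python
def word_embedding_vector(corpus, tokenizer, keyword_weights, word_vectors, dim):
    """Word segmentation -> keyword extraction -> word vector conversion
    -> weighted summation, yielding one embedding vector per corpus."""
    words = tokenizer(corpus)                              # word segmentation
    keywords = [w for w in words if w in keyword_weights]  # keyword extraction
    vec = [0.0] * dim
    total = sum(keyword_weights[w] for w in keywords) or 1.0
    for w in keywords:
        wv = word_vectors.get(w, [0.0] * dim)              # word vector conversion
        weight = keyword_weights[w] / total
        vec = [a + weight * b for a, b in zip(vec, wv)]    # weighted summation
    return vec

print(word_embedding_vector(
    "a b c", str.split,
    {"a": 1.0, "b": 1.0},
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    2,
))  # → [0.5, 0.5]
```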
4. The regular expression coding-based data expansion method according to claim 1, wherein the obtaining the word embedding vector and the regular expression coding result of each expanded corpus, and combining the word embedding vector and the regular expression coding result of each expanded corpus to obtain the corpus input vector, comprises:
concatenating the word embedding vector and the regular expression coding result of each expanded corpus into the corpus input vector, with the word embedding vector preceding the regular expression coding result.
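The splicing itself is a plain concatenation; as a sketch (the vector contents below are illustrative):

```python
def corpus_input_vector(word_vec, regex_codes):
    """Concatenate the word embedding vector and the regular expression
    coding result, embedding vector first, as claim 4 specifies."""
    return list(word_vec) + list(regex_codes)

print(corpus_input_vector([0.5, 0.5], [0, 1, 3]))  # → [0.5, 0.5, 0, 1, 3]
```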
5. The regular expression coding-based data expansion method according to claim 1, wherein the inputting the corpus to be classified into the trained neural network for operation to obtain the classification result comprises:
obtaining the to-be-classified word embedding vector and the to-be-classified regular expression coding result of the corpus to be classified to form a to-be-classified corpus input vector, and inputting the to-be-classified corpus input vector into the trained neural network for operation to obtain the classification result.
6. A regular expression encoding-based data expansion device, comprising:
the historical corpus acquisition unit is used for responding to the model training instruction and acquiring a historical corpus according to the model training instruction; wherein the historical corpus set comprises a plurality of historical corpora;
the first model training unit is used for training an adversarial network model with the historical corpus set as training samples to obtain a trained adversarial network model;
the expanded corpus acquisition unit is used for inputting each historical corpus in the historical corpus set into the trained adversarial network model for operation to obtain an expanded corpus subset of each historical corpus, the expanded corpus subsets forming an expanded corpus set;
the first mapping unit is used for acquiring a preset regular expression corresponding to each historical corpus, and setting a corresponding regular expression for the expanded corpus subset of each historical corpus according to the regular expression of the historical corpus;
the regular coding unit is used for coding each expansion corpus according to the regular expression and a preset regular expression hit coding strategy to obtain a regular expression coding result;
the word embedding vector acquisition unit is used for carrying out word embedding conversion on each expanded corpus to obtain word embedding vectors;
the vector combination unit is used for acquiring word embedding vectors and regular expression coding results of each expanded corpus and combining the word embedding vectors and the regular expression coding results of each expanded corpus to obtain corpus input vectors;
the second mapping unit is used for obtaining a preset labeling value corresponding to each historical corpus, and setting a corresponding labeling value for the expanded corpus subset of each historical corpus according to the labeling value of the historical corpus;
the second model training unit is used for acquiring the corpus input vector and the labeling value of each expanded corpus to form the training data of each expanded corpus, and training a to-be-trained neural network with the training data of each expanded corpus to obtain a trained neural network; and
the corpus classifying unit is used for acquiring the corpus to be classified uploaded by the user terminal, inputting the corpus to be classified into the trained neural network for operation, and obtaining a classifying result;
the first model training unit is specifically configured to:
acquiring semantic vectors of each historical corpus in the historical corpus set;
obtaining vector similarities between the semantic vectors of the historical corpora in the historical corpus set, and grouping the historical corpus set according to a preset grouping strategy to obtain a historical corpus grouping result; wherein the historical corpus grouping result comprises a plurality of historical corpus sub-groups, recorded as the 1st historical corpus sub-group to the k-th historical corpus sub-group, where k is the total number of historical corpus sub-groups included in the historical corpus grouping result;
counting the total number of historical corpora included in each historical corpus sub-group, and taking the historical corpus sub-group with the largest total number of historical corpora as a target historical corpus sub-group;
training a to-be-trained Cycle-GAN model by repeatedly acquiring, according to a preset first corpus acquisition number, the semantic vectors of two arbitrary historical corpora from the target historical corpus sub-group, and when the to-be-trained Cycle-GAN model converges, stopping the acquisition and taking the converged Cycle-GAN model as the trained adversarial network model; wherein the first corpus acquisition number is equal to 2;
the regular coding unit is specifically configured to:
dividing the expanded corpus according to characters to obtain a character division result;
comparing each character in the character division result with the regular expression corresponding to the expanded corpus in sequence, and determining whether each character hits the regular expression;
if it is determined that the character misses the regular expression and all characters before it also miss the regular expression, outputting the sub-coding result of the character as a first coding value;
if it is determined that the character hits the regular expression and the characters before it have not all hit the regular expression, outputting the sub-coding result of the character as a second coding value;
if it is determined that the character misses the regular expression and the character immediately before it hits the regular expression, outputting the sub-coding result of the character as a third coding value;
if it is determined that the character hits the regular expression and all consecutive characters before it hit the regular expression, outputting the sub-coding result of the character as a fourth coding value;
and concatenating the sub-coding results of the characters in sequence to obtain the regular expression coding result of the expanded corpus.
7. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the regular expression encoding-based data expansion method of any of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the regular expression encoding-based data expansion method of any of claims 1 to 5.
CN202110850687.8A 2021-07-27 2021-07-27 Regular expression coding-based data expansion method, device, equipment and medium Active CN113486671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110850687.8A CN113486671B (en) 2021-07-27 2021-07-27 Regular expression coding-based data expansion method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113486671A CN113486671A (en) 2021-10-08
CN113486671B true CN113486671B (en) 2023-06-30

Family

ID=77942869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110850687.8A Active CN113486671B (en) 2021-07-27 2021-07-27 Regular expression coding-based data expansion method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113486671B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162770A (en) * 2018-10-22 2019-08-23 腾讯科技(深圳)有限公司 A kind of word extended method, device, equipment and medium
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium
CN111354354A (en) * 2018-12-20 2020-06-30 深圳市优必选科技有限公司 Training method and device based on semantic recognition and terminal equipment

Also Published As

Publication number Publication date
CN113486671A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN110188223B (en) Image processing method and device and computer equipment
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN111259144A (en) Multi-model fusion text matching method, device, equipment and storage medium
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN112613308A (en) User intention identification method and device, terminal equipment and storage medium
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN116049412B (en) Text classification method, model training method, device and electronic equipment
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN113691542A (en) Web attack detection method based on HTTP request text and related equipment
CN115037805A (en) Unknown network protocol identification method, system, device and storage medium based on deep clustering
CN115392357A (en) Classification model training and labeled data sample spot inspection method, medium and electronic equipment
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
CN113541834B (en) Abnormal signal semi-supervised classification method and system and data processing terminal
CN111291807A (en) Fine-grained image classification method and device and storage medium
CN113704473A (en) Media false news detection method and system based on long text feature extraction optimization
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment
CN112926647A (en) Model training method, domain name detection method and device
CN113486671B (en) Regular expression coding-based data expansion method, device, equipment and medium
Farhangi et al. Informative visual words construction to improve bag of words image representation
US20230281247A1 (en) Video retrieval method and apparatus using vectorizing segmented videos
CN111581377A (en) Text classification method and device, storage medium and computer equipment
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN115688789A (en) Entity relation extraction model training method and system based on dynamic labels
CN112953914A (en) DGA domain name detection and classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant