CN109784407A - Method and apparatus for determining the type of a data field - Google Patents
Method and apparatus for determining the type of a data field Download PDF Info
- Publication number
- CN109784407A CN109784407A CN201910043827.3A CN201910043827A CN109784407A CN 109784407 A CN109784407 A CN 109784407A CN 201910043827 A CN201910043827 A CN 201910043827A CN 109784407 A CN109784407 A CN 109784407A
- Authority
- CN
- China
- Prior art keywords
- feature
- character
- deep
- field
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Abstract
The invention discloses a method and apparatus for determining the type of a data field, relating to the field of computer technology. One specific embodiment of the method includes: dividing the original features of a data field into attribute features and value features; performing feature transformation on the attribute features and value features respectively to obtain the transformed features; training a neural network on the training set of the transformed features; and performing deep learning on the test set of the transformed features with the trained neural network to determine the type of the field. The embodiment addresses the technical defects of the prior art, in which keyword matching and conventional machine-learning methods yield low recognition accuracy and recall and incur excessive labor costs; it thereby makes deep learning more targeted and makes full use of the original features of a data field, achieving the technical effect of determining the field type more accurately.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a method and apparatus for determining the type of a data field.
Background art
Before data is stored in a database, determining the type of each field and judging whether it is sensitive is extremely important for privacy protection and information security. Therefore, before a field is loaded into a database, fields involving sensitive information (including but not limited to names, ID-card numbers, mobile-phone numbers, bank-card numbers, etc.) need to be encrypted.
The prior art identifies the type of a field, and whether it is a sensitive field, using keyword matching (on keywords such as name, address, etc.) and conventional machine learning, supplemented by manual judgment.
In the process of realizing the present invention, the inventors found at least the following problems in the prior art:
1. The recognition accuracy and recall of methods based on keyword matching and conventional machine learning are both relatively low.
2. When recognition accuracy is low and manual judgment is used to assist identification, labor costs are excessive.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for determining the type of a data field, which can make deep learning more targeted and make full use of the original features of the field, achieving the technical effect of determining the field type more accurately.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for determining the type of a data field is provided, comprising:
dividing the original features of the field into attribute features and value features;
performing feature transformation on the attribute features and the value features respectively to obtain the transformed features;
training a neural network on the training set of the transformed features;
performing deep learning on the test set of the transformed features with the trained neural network to determine the type of the field.
Optionally, the attribute features include numeric features, categorical features, and text features;
the value features include text features.
Optionally, performing feature transformation on the attribute features and value features respectively to obtain the transformed features includes:
transforming the numeric features among the attribute features, the resulting transformed features being wide features;
and/or transforming the text features among the attribute features and the value features, the resulting transformed features being deep text features;
and/or transforming the categorical features among the attribute features, the resulting transformed features being deep category features and/or wide features.
Optionally, the numeric features among the attribute features are transformed into wide features by the formula:
wide_feature = min(raw_feature, max_value) / max_value
where wide_feature denotes the converted wide feature, raw_feature denotes the original feature, max_value denotes the maximum value configured for the wide feature, and min takes the smaller of the numeric feature and the maximum value of the wide feature.
Optionally, transforming the categorical features among the attribute features into wide features comprises:
encoding the categorical features using one-hot encoding;
splicing the encoded results into a vector whose entries are 0 or 1;
the spliced 0/1 vector being the wide feature.
Optionally, transforming the text features and the value features into deep text features comprises:
appending a final character after the text of each text feature or value feature;
setting a preset length for each text feature;
when the length of the text plus the final character exceeds the preset length, deleting the part beyond the maximum length, the remainder being the deep text feature;
when the length of the text plus the final character is less than the preset length, padding the shortfall with a padding character to obtain the deep text feature.
Optionally, transforming the categorical features into deep category features comprises:
splicing the categorical features;
converting the spliced result into a vector;
the vector being the deep category feature.
Optionally, training a neural network on the training set of the transformed features comprises:
using the wide features in the training set as the input of the wide network of the neural network being trained;
using the deep text features and deep category features in the training set as the input of the deep network of the neural network being trained;
determining the neural network from the wide network and the deep network.
Optionally, the wide features and the deep category features are used for deep learning with a fully connected neural network, and the deep text features are used for deep learning with a character-level convolutional neural network.
Optionally, the loss function used to train the neural network is the softmax cross-entropy loss function.
Optionally, each sample in the training set for the deep learning of the neural network is determined by matching an attribute feature with a value feature.
Optionally, performing deep learning on the transformed features to determine the type of the field comprises:
determining the prediction results of the deep learning;
determining the confidence of the prediction results;
determining the type of the field according to a voting mechanism and the maximum confidence.
According to another aspect of the embodiments of the present invention, an apparatus for determining the type of a data field is provided, comprising:
an original-feature division module, for dividing the original features of the field into attribute features and value features;
a feature-transformation module, for performing feature transformation on the attribute features and value features respectively to obtain the transformed features;
a neural-network training module, for training a neural network on the training set of the transformed features;
a field-type determination module, for performing deep learning on the test set of the transformed features with the trained neural network to determine the type of the field.
According to yet another aspect of the embodiments of the present invention, an electronic device for determining the type of a data field is provided, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for determining the type of a data field provided by the present invention.
According to still another aspect of the embodiments of the present invention, a computer-readable medium is provided, on which a computer program is stored; when the program is executed by a processor, the method for determining the type of a data field provided by the present invention is implemented.
One embodiment of the above invention has the following advantages or beneficial effects:
by performing deep learning on the attribute features and value features among the original features of a field, the invention resolves the technical defects of the prior art based on keyword matching and conventional machine learning, namely low recognition accuracy and recall and excessive labor costs, and thereby makes full use of the original features of the field to determine its type more accurately;
by transforming the attribute features and value features separately, the transformed data can be fed to different neural networks for further training, making the deep learning more targeted and further improving accuracy, which overcomes the prior art's inaccurate determination of field types.
Further effects of the above optional embodiments are explained below in conjunction with the specific embodiments.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic diagram of the main flow of the method for determining the type of a data field according to an embodiment of the present invention;
Fig. 2 is the improved wide and deep network structure according to an embodiment of the present invention;
Fig. 3 is the character-level convolutional neural network according to an embodiment of the present invention;
Fig. 4 is a detailed flowchart of the training and prediction of the method for determining the type of a data field according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of the apparatus for determining the type of a data field according to an embodiment of the present invention;
Fig. 6 is an exemplary system architecture to which an embodiment of the present invention can be applied;
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing the terminal device or server of an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; they should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the main flow of a method for determining the type of a data field according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S101: dividing the original features of the field into attribute features and value features;
Step S102: performing feature transformation on the attribute features and value features respectively to obtain the transformed features;
Step S103: training a neural network on the training set of the transformed features;
Step S104: performing deep learning on the test set of the transformed features with the trained neural network to determine the type of the field.
The type of the field includes sensitive fields and non-sensitive fields. In particular, when a field is sensitive, its handling in the data warehouse is extremely important. For example, before cleaned data is loaded into a data warehouse, fields involving sensitive information (including but not limited to names, ID-card numbers, mobile-phone numbers, bank-card numbers, etc.) need to be encrypted. Non-sensitive fields are all fields other than sensitive fields, such as year information and time information.
By performing deep learning on the attribute features and value features among the original features of a field, the present invention resolves the technical defects of the prior art based on keyword matching and conventional machine learning, namely low recognition accuracy and recall and excessive labor costs, and thereby achieves the technical effect of making full use of the original features of the field to determine its type more accurately.
The attribute features and value features are transformed separately, so that the transformed data can be fed to different neural networks for further training; this makes the deep learning more targeted, achieves the technical effect of further improving accuracy, and overcomes the prior art's inaccurate determination of field types.
Optionally, the attribute features (that is, the attribute information of a field, such as table name, table comment, table type, field name, field comment, field type, etc.) include numeric features (e.g., 1000, 1.0), categorical features (e.g., "yes" / "no"), and text features (e.g., "this is the comment of a field");
the value features include text features.
Specifically, see the original-feature examples in Table 1. For example, the database name is an attribute feature, whose feature value is the corresponding test database name and which further belongs to the text features; the text feature corresponding to a value feature may be an array, e.g., "value 1", "value 2", ..., "value n".
The other examples in Table 1 follow the same classification principle and are not repeated here.
Table 1: Examples of original features
Feature name | Feature class | Feature value | Feature value type |
Database name | Attribute feature | test database name | Text feature |
Database comment | Attribute feature | test database comment | Text feature |
Database comment length | Attribute feature | 7 | Numeric feature |
Database type | Attribute feature | MYSQL | Categorical feature |
… | … | … | … |
Field name | Attribute feature | test field name | Text feature |
Field comment | Attribute feature | test field comment | Text feature |
Field value | Value feature | ["value 1", "value 2", …, "value n"] | Text feature |
Fig. 2 is the improved wide and deep network structure according to an embodiment of the present invention. As shown in Fig. 2, the inputs of the wide and deep network may include: wide features, deep text (Deep-Text) features, and deep category (Deep-Category) features.
Optionally, performing feature transformation on the attribute features and value features respectively to obtain the transformed features includes: transforming the numeric features among the attribute features, the resulting transformed features being wide features;
and/or transforming the text features among the attribute features and the value features, the resulting transformed features being deep text (Deep-Text) features;
and/or transforming the categorical features among the attribute features, the resulting transformed features being deep category (Deep-Category) features and/or wide features.
Optionally, the numeric features among the attribute features are transformed into wide features by the formula:
wide_feature = min(raw_feature, max_value) / max_value
where wide_feature denotes the converted wide feature, raw_feature denotes the original feature, max_value denotes the maximum value configured for the wide feature, and min takes the smaller of the numeric feature and that maximum value.
Through this transformation of the numeric features, the converted result lies between 0 and 1, achieving the technical effect of feature normalization: every dimension of the transformed wide feature lies between 0 and 1. Optionally, the wide features can be used with a fully connected neural network.
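The capped normalization above can be sketched in a few lines of Python (an illustrative helper, not code from the patent; the name `to_wide_feature` is our own):

```python
def to_wide_feature(raw_feature: float, max_value: float) -> float:
    """Cap the raw numeric feature at max_value, then scale into [0, 1]."""
    return min(raw_feature, max_value) / max_value

# Values at or above the cap all map to 1.0, so every dimension of the
# resulting wide feature lies between 0 and 1, as the normalization requires.
print(to_wide_feature(3, 20))     # 0.15
print(to_wide_feature(150, 100))  # 1.0
```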
Optionally, transforming the text features and value features into deep text (Deep-Text) features comprises:
appending a final character after the text of each text feature or value feature;
setting a preset length for each text;
when the length of the text plus the final character exceeds the preset length, deleting the part beyond the maximum length;
when the length of the text plus the final character is less than the preset length, padding the shortfall with a padding character.
The determination of a deep text feature is illustrated below with a specific embodiment:
Assume the preset length is 6; the original features to be converted may include attribute features and value features.
As shown in Table 2 below, in example 1 the final character <EOS> is first appended to the original feature, giving four characters. Since this still falls short of the preset length of 6, the padding character <PAD> must also be appended after the final character until the preset length is reached.
In example 2, the original feature plus the final character exactly reaches the preset length, so no padding character needs to be appended.
In example 3, the original feature plus the final character exceeds the preset length, so the excess characters must be deleted.
Table 2: A single text feature after conversion
Because the maximum length set for this text feature is 6, the part beyond the sixth character position is deleted in the table above.
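The <EOS>/<PAD> handling described above can be sketched as follows (an illustrative helper under our own naming, not code from the patent):

```python
EOS, PAD = "<EOS>", "<PAD>"

def to_deep_text(text: str, preset_len: int) -> list:
    """Append the final character, then truncate or pad to the preset length."""
    seq = list(text) + [EOS]
    if len(seq) > preset_len:
        return seq[:preset_len]                   # delete the part beyond the maximum length
    return seq + [PAD] * (preset_len - len(seq))  # pad the shortfall

print(to_deep_text("abc", 6))  # ['a', 'b', 'c', '<EOS>', '<PAD>', '<PAD>']
```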
Optionally, when the deep text feature corresponds to multiple text features or value features, the corresponding deep text can be generated by splicing; its length (or dimension) is dim_Deep-Text = Σᵢ lenᵢ, where lenᵢ is the preset length set for the i-th text feature.
Generating the corresponding deep text by splicing is detailed below with a specific embodiment:
In this embodiment there are two text features, shown in Table 3 and Table 4, each with three examples. The preset length for Table 3 is 4 and that for Table 4 is 5. During conversion, the part from the fifth character position onward in Table 3 and the part from the sixth character position onward in Table 4 must therefore be deleted. The processed Tables 3 and 4 are then spliced to obtain the converted deep text feature in Table 5, whose columns 1-4 are the first converted text feature and whose columns 5-9 are the second.
Table 3: Deep-Text feature 1 before conversion
Index | 1 | 2 | 3 | 4 | 5 | 6 |
Example 1 | 特 | 征 | 1 | <EOS> | | |
Example 2 | 也 | 是 | 特 | 征 | <EOS> | |
Example 3 | 长 | 的 | 特 | 征 | 值 | <EOS> |
Table 4: Deep-Text feature 2 before conversion
Index | 1 | 2 | 3 | 4 | 5 | 6 |
Example 1 | 特 | 征 | 2 | <EOS> | <PAD> | |
Example 2 | 特 | 征 | <EOS> | <PAD> | <PAD> | |
Example 3 | ? | 是 | 长 | 特 | 征 | <EOS> |
Table 5: Deep-Text feature after conversion
Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Example 1 | 特 | 征 | 1 | <EOS> | 特 | 征 | 2 | <EOS> | <PAD> |
Example 2 | 也 | 是 | 特 | 征 | 特 | 征 | <EOS> | <PAD> | <PAD> |
Example 3 | 长 | 的 | 特 | 征 | ? | 是 | 长 | 特 | 征 |
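Converting each text with its own preset length and splicing the results, as in Tables 3-5, can be sketched like this (illustrative code with hypothetical names):

```python
EOS, PAD = "<EOS>", "<PAD>"

def to_deep_text(text, preset_len):
    """Append <EOS>, then truncate or pad the sequence to preset_len."""
    seq = list(text) + [EOS]
    return seq[:preset_len] if len(seq) > preset_len else seq + [PAD] * (preset_len - len(seq))

def splice_deep_text(texts, preset_lens):
    """Convert each text with its own preset length, then concatenate."""
    out = []
    for text, n in zip(texts, preset_lens):
        out.extend(to_deep_text(text, n))
    return out

# Preset lengths 4 and 5, as for Tables 3 and 4, give a 9-dimensional result.
row = splice_deep_text(["ab", "cd"], [4, 5])
print(row)  # ['a', 'b', '<EOS>', '<PAD>', 'c', 'd', '<EOS>', '<PAD>', '<PAD>']
```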
For the Deep-Text feature, whose dimension is fixed at dim_Deep-Text, a dictionary table containing all possible characters is maintained; the dictionary table maps any character to a vector of dim_Text-Embedding dimensions, e.g.:
Table 6: Lookup result of one character in the dictionary table
Index | 1 | 2 | 3 | … | … | i | … | … | dim_Text-Embedding |
Value | 0.012 | 0.231 | 0.986 | … | … | 0.123 | … | … | 0.689 |
The vector corresponding to each character in the dictionary table is randomly initialized, and its values are continuously learned and updated during training. Deep-Text is a text-type feature and corresponds to a character-level convolutional neural network, which is used to extract the internal information of the text. Fig. 3 shows the character-level convolutional neural network according to an embodiment of the present invention.
As shown in Fig. 3, the text features are first converted into feature vectors, and the vectors of the multiple texts are spliced together; multi-scale convolution is then applied, the convolution results are pooled, and finally the output layer of the neural network produces the training result.
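A minimal numpy sketch of such a character-level CNN forward pass (random weights for illustration; the patent does not specify filter sizes or channel counts, so the values here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB_DIM = 50, 8                         # assumed dictionary-table sizes
embedding = rng.normal(size=(VOCAB, EMB_DIM))  # randomly initialized, learned in training

def conv_relu(x, w):
    """Valid 1-D convolution over the character axis, followed by ReLU."""
    k = w.shape[0]
    out = np.stack([np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0] - k + 1)])
    return np.maximum(out, 0.0)

def char_cnn(char_ids, kernel_sizes=(2, 3, 4), channels=4):
    x = embedding[char_ids]                    # look each character up in the dictionary table
    pooled = []
    for k in kernel_sizes:                     # multi-scale convolution
        w = rng.normal(size=(k, EMB_DIM, channels))
        pooled.append(conv_relu(x, w).max(axis=0))  # max pooling over positions
    return np.concatenate(pooled)              # fed onward to the output layer

vec = char_cnn(np.arange(9))                   # a 9-character spliced Deep-Text input
print(vec.shape)                               # (12,)
```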
Optionally, the categorical features are encoded using one-hot encoding;
the encoded results are spliced into a vector whose entries are 0 or 1;
the spliced 0/1 vector is the wide feature.
One-hot encoding is used because it better represents the class to which a categorical feature belongs. Optionally, the maximum length (or dimension) after the feature transformation is set to dim_Wide.
Assume the feature has m possible types; the i-th type is converted into a vector containing only 0s and 1s, as shown in Table 7 below:
Table 7: One-hot encoding of the i-th of m possible types
Index | 1 | 2 | … | i | … | m |
Value | 0 | 0 | … | 1 | … | 0 |
The concrete operations for converting numeric features and categorical features are introduced below with a specific embodiment:
Specifically, the original features contain two numeric features (f1 and f2) and two categorical features (f3 and f4), with the following statistics:
1) the maximum value of f1 is 20 and its minimum value is 0;
2) the maximum value of f2 is 100 and its minimum value is 0;
3) f3 has 2 possible values: A and B;
4) f4 has 3 possible values: A, B, and C.
The wide feature example before conversion is shown in Table 8 below:
Table 8: Wide feature example before conversion
Feature | f1 | f2 | f3 | f4 |
Value | 3 | 78 | B | A |
Since f1 and f2 are the numeric features of the wide feature, they are transformed by the normalization described above; f3 and f4 are the categorical features of the wide feature and are transformed by the one-hot encoding described above. The converted wide feature example is therefore as shown in Table 9 below:
Table 9: Wide feature example after conversion
Feature | f'1 | f'2 | f'3,1 | f'3,2 | f'4,1 | f'4,2 | f'4,3 |
Value | 0.15 | 0.78 | 0 | 1 | 1 | 0 | 0 |
where f'1 is the converted f1, f'2 the converted f2, f'3,1 and f'3,2 the features converted from f3, and f'4,1, f'4,2 and f'4,3 the features converted from f4.
Optionally, transforming the categorical features into deep category (Deep-Category) features comprises:
splicing the categorical features;
converting the spliced result into a vector;
the vector being the deep category feature.
The Deep-Category feature originates from the categorical features among the attribute features. No other processing is applied to the original categorical features among the attribute features; they are simply spliced into one long vector, whose length (or dimension) is dim_Deep-Category.
Optionally, training a neural network on the training set of the transformed features comprises:
using the wide features in the training set as the input of the wide network of the neural network being trained;
using the deep text features and deep category features in the training set as the input of the deep network of the neural network being trained;
determining the neural network from the wide network and the deep network.
Deep learning is carried out by the wide network and the deep network together, so that the parameters of the whole model are jointly influenced by both. The wide network learns through a fully connected network, and the deep network learns through an embedding-based convolutional neural network; learning accuracy is therefore higher, model size and complexity can be controlled, and the overall learning effect is also markedly improved.
Optionally, the wide features and the deep category features are used for deep learning with a fully connected neural network, and the deep text features are used for deep learning with a character-level convolutional neural network.
Since all dimensions of the converted wide feature lie between 0 and 1, the wide feature can be trained with a fully connected neural network.
The text features correspond to the Deep-Text feature, whose dimension is fixed at dim_Deep-Text. Converting a text feature into a deep text feature relies on the dictionary table containing all possible characters, which maps any character to a vector of dim_Text-Embedding dimensions, e.g.:
Table 10: Lookup result of one character in the dictionary table
Index | 1 | 2 | 3 | … | … | i | … | dim_Text-Embedding |
Value | 0.012 | 0.231 | 0.986 | … | … | 0.123 | … | 0.689 |
The vector corresponding to each character in the dictionary table is randomly initialized, and its values are continuously learned and updated during training. Deep-Text is a text-type feature and can be used with a character-level convolutional neural network, which helps extract the internal information of the text.
As shown in Fig. 3, the Deep-Category feature corresponds to a dictionary table containing all types, which maps a type to a vector of dim_Category-Embedding dimensions. Unlike the Deep-Text feature, the Deep-Category feature is used with a fully connected neural network. Afterwards, the outputs of the wide feature, the Deep-Text feature, and the Deep-Category feature are spliced into one vector, and the final output is obtained through a further fully connected layer.
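Splicing the three outputs and passing them through a final fully connected layer can be sketched in numpy (random weights and dimensions assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Final-layer weights; 2 classes, e.g. sensitive vs. non-sensitive (assumed sizes).
N_WIDE, N_TEXT, N_CAT, N_CLASSES = 7, 12, 4, 2
W = rng.normal(size=(N_WIDE + N_TEXT + N_CAT, N_CLASSES))
b = np.zeros(N_CLASSES)

def wide_and_deep_head(wide_out, deep_text_out, deep_cat_out):
    joint = np.concatenate([wide_out, deep_text_out, deep_cat_out])  # splice into one vector
    return softmax(joint @ W + b)                                    # final fully connected layer

p = wide_and_deep_head(np.ones(N_WIDE), np.ones(N_TEXT), np.ones(N_CAT))
print(p.shape, round(p.sum(), 6))  # (2,) 1.0
```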
Optionally, the loss function used to train the neural network is the softmax cross-entropy loss function. That is, the softmax cross-entropy function is used as the loss function, and the model is trained using the stochastic gradient descent algorithm.
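A toy illustration of softmax cross-entropy with gradient-descent updates on a single linear layer (not the patent's model; purely to show the loss and the update rule):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Numerically stabilized softmax cross-entropy loss for one sample."""
    z = logits - logits.max()
    return np.log(np.exp(z).sum()) - z[label]

rng = np.random.default_rng(2)
W = 0.1 * rng.normal(size=(4, 2))  # tiny linear classifier
x, y, lr = rng.normal(size=4), 1, 0.5

for _ in range(200):               # gradient descent on one sample
    p = np.exp(x @ W - (x @ W).max())
    p /= p.sum()
    W -= lr * np.outer(x, p - np.eye(2)[y])  # dL/dW for softmax cross-entropy

print((x @ W).argmax())  # the correct class, 1, wins after training
```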
Optionally, the samples for the neural network are determined by matching attribute features with value features.
Since the original features necessarily include value features, each value in a value feature can be combined with the attribute features to train the neural network.
Optionally, transforming the text features and value features into deep text features comprises:
appending a final character after the text of each text feature or value feature;
setting a preset length for each text feature;
when the length of the text plus the final character exceeds the preset length, deleting the part beyond the maximum length, the remainder being the deep text feature;
when the length of the text plus the final character is less than the preset length, padding the shortfall with a padding character to obtain the deep text feature.
Optionally, performing deep learning on the transformed features to determine the type of the field comprises:
determining the prediction results of the deep learning;
determining the confidence of the prediction results;
determining the type of the field according to a voting mechanism and the maximum confidence.
Specifically, n samples can be constructed from the attribute features of a field and its n value features (taking n non-empty values). To judge whether a field is sensitive, the trained model predicts each of the n samples separately. The sensitive type of the field is then determined by a voting mechanism (i.e., the predicted type with the greatest frequency is the sensitive type of the field).
Specifically, assume the prediction results are as shown in the table below:
Table 11: Prediction results for one group of examples
Group | 1 | 2 | 3 | … | i | … | n |
Prediction | Mobile number | Mobile number | Landline | … | Mobile number | … | Mobile number |
Assume that only the prediction for group 3 is "landline" and all other groups are predicted as "mobile number"; the voting mechanism then determines the final prediction for the field to be "mobile number". Under the voting mechanism, the classification label predicted the greatest number of times is the final prediction.
Optionally, the confidence level of final prediction result, the formula of the confidence level are calculated are as follows:
Wherein, max_freq_label is the highest tag along sort of frequency of occurrence;Counter is count operator, that is, is calculated
The frequency of appearance;N is the group number of input feature vector, and conf value is bigger, indicates that the confidence level of model result is higher;Conf is the prediction
As a result confidence level.It can be weighed between accuracy rate and coverage rate by adjusting the size of conf value in use
Weighing apparatus.
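The voting and confidence computation described above can be sketched as follows; this is a minimal illustration, and the function and variable names are our own rather than the patent's:

```python
from collections import Counter

def vote(predictions):
    """Majority vote over the n per-sample predictions.

    Returns the winning label and its confidence
    conf = counter(max_freq_label) / n, as in the formula above.
    """
    counts = Counter(predictions)
    label, freq = counts.most_common(1)[0]  # most frequent label and its count
    conf = freq / len(predictions)
    return label, conf

# The scenario from the table: only group 3 predicts "landline".
preds = ["mobile", "mobile", "landline", "mobile", "mobile"]
label, conf = vote(preds)
```

With these five groups the vote returns "mobile" with conf = 4/5; raising the acceptance threshold on conf trades coverage for accuracy, as the text notes.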
A specific embodiment is used below to illustrate the detailed process of the invention.
Fig. 4 is a detailed flow chart of the training and prediction of the method for determining the type of a table field according to an embodiment of the present invention. As shown in Fig. 4:
The original features are subjected to feature transformation. The original features first need to be classified, specifically into attribute features and value features. The features obtained after conversion include wide features, Deep-Text features and Deep-Category features. Deep learning is carried out according to the wide features, Deep-Text features and Deep-Category features to obtain a Wide & Deep model.
When a user issues a prediction request, the original features to be predicted also need to undergo feature transformation, and the converted features are fed to the Wide & Deep model to determine the model output. The final prediction response is then determined according to the voting mechanism.
Fig. 5 is a schematic diagram of the main modules of the device for determining the type of a table field according to an embodiment of the present invention. As shown in the figure, a device 500 for determining the type of a table field comprises:
Module 501, an original feature division module, for dividing the original features of a table field into attribute features and value features;
Module 502, a conversion feature module, for carrying out feature transformation on the attribute features and value features respectively and determining the transformed conversion features;
Module 503, a neural network training module, for training a neural network according to the training set in the conversion features;
Module 504, a table field type determination module, for carrying out deep learning according to the test set in the conversion features and the trained neural network, and determining the type of the table field.
Optionally, the attribute features include numeric features, categorical features and text features;
the value features include text features.
Optionally, carrying out feature transformation on the attribute features and value features respectively and determining the transformed conversion features comprises:
carrying out feature transformation on the numeric features in the attribute features, the obtained conversion features being wide features;
and/or carrying out feature transformation on the text features in the attribute features and the value features, the obtained conversion features being deep text features;
and/or carrying out feature transformation on the categorical features in the attribute features, the obtained conversion features being deep category features and/or wide features.
Optionally, feature transformation is carried out on the numeric features in the attribute features, and the obtained conversion features are wide features. The transformation formula is:
wide_feature = min(raw_feature, max_value) / max_value
where wide_feature denotes the wide feature after conversion, raw_feature denotes the original feature, max_value denotes the maximum value of the wide feature, and min takes the smaller of the numeric feature and the maximum value of the wide feature.
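As a sketch of this numeric transformation — assuming, as the definitions above suggest, that the raw value is capped at max_value and then scaled by it:

```python
def to_wide_feature(raw_feature, max_value):
    """Cap the numeric feature at max_value, then scale into [0, 1].

    `min` takes the smaller of the raw value and the cap, matching the
    variable definitions in the text; the /max_value scaling is an
    assumption consistent with producing a bounded wide feature.
    """
    return min(raw_feature, max_value) / max_value

# e.g. with max_value = 100:
#   to_wide_feature(42, 100)  -> 0.42
#   to_wide_feature(250, 100) -> 1.0 (values beyond the cap saturate)
```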
Optionally, carrying out feature transformation on the categorical features in the attribute features, the obtained conversion features being wide features, comprises:
encoding the categorical features using one-hot encoding;
splicing the encoded results into a vector whose entries are 0 or 1;
the spliced 0/1 vector being the wide feature.
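A minimal sketch of this one-hot splicing, using hypothetical vocabularies for two categorical attributes of a table field (for example a column data type and a nullability flag — these example categories are not from the patent):

```python
def one_hot(value, vocab):
    # 1 at the position of the matching category, 0 everywhere else
    return [1 if value == v else 0 for v in vocab]

# Two hypothetical categorical attributes of a table field:
type_vec = one_hot("varchar", ["int", "varchar", "date"])
null_vec = one_hot("Y", ["Y", "N"])

# Splicing the encoded results yields the 0/1 wide-feature vector.
wide_vector = type_vec + null_vec
```

Here `wide_vector` is `[0, 1, 0, 1, 0]`: one block of 0/1 entries per categorical attribute, concatenated.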
Optionally, carrying out feature transformation on the text features and value features, the obtained conversion features being deep text features, comprises:
appending an end character after the text in the text feature or value feature;
setting a preset length for the text feature;
when the length of the text plus the end character is greater than the preset length, deleting the part beyond the maximum length, the remainder being the deep text feature;
when the length of the text plus the end character is less than the preset length, padding the insufficient part with a fill character to obtain the deep text feature.
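The append/truncate/pad steps above can be sketched as follows; the particular end character, fill character and preset length are illustrative assumptions, since the patent does not fix them:

```python
END_CHAR = "\x03"  # assumed end-of-text sentinel
PAD_CHAR = "\x00"  # assumed fill character
MAX_LEN = 16       # assumed preset length

def to_deep_text(text, max_len=MAX_LEN):
    """Append the end character, then truncate or pad so every
    deep text feature has exactly the preset length."""
    text = text + END_CHAR
    if len(text) > max_len:
        return text[:max_len]          # drop the part beyond the maximum length
    return text + PAD_CHAR * (max_len - len(text))  # pad the shortfall
```

For example, a short field name like `"phone_number"` is padded up to 16 characters, while a 40-character value is cut down to 16.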
Optionally, carrying out feature transformation on the categorical features, the obtained conversion features being deep category features, comprises:
splicing the categorical features;
converting the spliced result into a vector;
the vector being the deep category feature.
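One plausible reading of "splice and convert into a vector" is an integer-id vector suitable for embedding lookup in the deep network; the id scheme below is our assumption, not spelled out in the patent:

```python
def to_deep_category(values, vocab):
    """Map each categorical value to an integer id and splice the ids
    into one vector (e.g. embedding-lookup indices for the deep network).

    Id 0 is reserved for values outside the vocabulary.
    """
    index = {v: i + 1 for i, v in enumerate(vocab)}
    return [index.get(v, 0) for v in values]

vec = to_deep_category(["varchar", "date", "bool"], ["int", "varchar", "date"])
```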
Optionally, training the neural network according to the training set in the conversion features comprises:
using the wide features in the training set as the input of the wide network in the trained neural network;
using the deep text features and deep category features in the training set as the input of the deep network in the trained neural network;
determining the neural network according to the wide network and the deep network.
Optionally, the wide features and the deep category features are used for deep learning with a fully connected neural network;
the deep text features are used for deep learning with a character-level convolutional neural network.
Optionally, the loss function used to train the neural network is the softmax cross-entropy loss function.
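The softmax cross-entropy loss named above can be written out as follows; this is the standard formulation, not code from the patent:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, true_index):
    """Softmax cross-entropy: the negative log-probability the model
    assigns to the true class."""
    return -math.log(softmax(logits)[true_index])
```

With uniform logits over three classes the loss is ln 3, the maximum-uncertainty baseline for a three-way field-type classifier.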
Optionally, each sample datum in the training set for the deep learning of the neural network is determined by matching attribute features with value features.
Optionally, carrying out deep learning according to the conversion features and determining the type of the table field comprises:
determining the prediction results of the deep learning;
determining the confidence of the prediction results;
determining the type of the table field according to a voting mechanism and the maximum confidence.
Fig. 6 shows an exemplary system architecture 600 to which the method for determining the type of a table field or the device for determining the type of a table field of the embodiments of the present invention can be applied.
As shown in Fig. 6, the system architecture 600 may include terminal devices 601, 602 and 603, a network 604 and a server 605. The network 604 serves as a medium providing communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 through the network 604, so as to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (merely illustrative).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers and the like.
The server 605 may be a server providing various services, such as a back-end management server (merely illustrative) supporting shopping websites browsed by the user with the terminal devices 601, 602, 603. The back-end management server may analyze and otherwise process received data such as information query requests, and feed processing results (such as target push information or product information — merely illustrative) back to the terminal devices.
It should be noted that the method for determining the type of a table field provided by the embodiments of the present invention is generally executed by the server 605; correspondingly, the device for determining the type of a table field is generally arranged in the server 605.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 6 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 of a terminal device suitable for implementing an embodiment of the present invention. The terminal device shown in Fig. 7 is merely an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present invention.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. Various programs and data required by the operation of the system 700 are also stored in the RAM 703. The CPU 701, the ROM 702 and the RAM 703 are connected with each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, and a loudspeaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A driver 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the driver 710 as needed, so that a computer program read therefrom is installed into the storage section 708 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-mentioned functions defined in the system of the present invention are executed.
It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example — but not limited to — an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more conducting wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above. In the present invention, the computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in connection with an instruction execution system, apparatus or device. In the present invention, a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal or any appropriate combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and that computer-readable medium can send, propagate or transmit a program used by or in connection with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted with any suitable medium, including but not limited to: wireless, electric wire, optical cable, RF and the like, or any appropriate combination of the above.
The flow charts and block diagrams in the drawings illustrate the possible architectures, functions and operations of the systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flow chart or block diagram may represent a module, a program segment or a part of code, and the above module, program segment or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the boxes may also occur in an order different from that indicated in the drawings. For example, two successively indicated boxes may actually be executed substantially in parallel, or sometimes in the opposite order, depending on the functions involved. It should also be noted that each box in a block diagram or flow chart, and combinations of boxes in a block diagram or flow chart, can be implemented with a dedicated hardware-based system executing the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented by means of software, or by means of hardware. The described modules may also be arranged in a processor; for example, a processor may be described as comprising a sending module, an obtaining module, a determining module and a first processing module. The names of these modules do not, under certain circumstances, constitute a limitation on the modules themselves; for example, the sending module may also be described as "a module for sending a picture acquisition request to a connected server side".
As another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device. The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the device:
divides the original features of a table field into attribute features and value features;
carries out feature transformation on the attribute features and value features respectively, and determines the transformed conversion features;
trains a neural network according to the training set in the conversion features;
carries out deep learning according to the test set in the conversion features and the trained neural network, and determines the type of the table field.
According to the technical scheme of the embodiments of the present invention, the following beneficial effects can be achieved:
By carrying out deep learning with the attribute features and value features in the original features of a table field to determine the trained model, the present invention solves the technical defects of the prior art based on keyword matching and conventional machine learning methods — namely low recognition accuracy, low recall rate and excessive labor cost — and thus makes full use of the original features of a table field so that the determined type of the table field is more accurate;
By carrying out feature transformation on the attribute features and value features respectively, the transformed conversion data can be applied to different neural networks for further training, so that the deep learning is more targeted and the accuracy is further improved, overcoming the defect of inaccurate table field types determined by the prior art.
The above specific embodiments do not constitute a limitation on the protection scope of the present invention. It should be clear to those skilled in the art that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (15)
1. A method for determining the type of a table field, characterized by comprising:
dividing the original features of a table field into attribute features and value features;
carrying out feature transformation on the attribute features and value features respectively, and determining the transformed conversion features;
training a neural network according to the training set in the conversion features;
carrying out deep learning according to the test set in the conversion features and the trained neural network, and determining the type of the table field.
2. The method according to claim 1, characterized in that the attribute features include numeric features, categorical features and text features;
the value features include text features.
3. The method according to claim 2, characterized in that carrying out feature transformation on the attribute features and value features respectively and determining the transformed conversion features comprises:
carrying out feature transformation on the numeric features in the attribute features, the obtained conversion features being wide features;
and/or carrying out feature transformation on the text features in the attribute features and the value features, the obtained conversion features being deep text features;
and/or carrying out feature transformation on the categorical features in the attribute features, the obtained conversion features being deep category features and/or wide features.
4. The method according to claim 3, characterized in that feature transformation is carried out on the numeric features in the attribute features and the obtained conversion features are wide features, the transformation formula being:
wide_feature = min(raw_feature, max_value) / max_value
where wide_feature denotes the wide feature after conversion, raw_feature denotes the original feature, max_value denotes the maximum value of the wide feature, and min takes the smaller of the numeric feature and the maximum value of the wide feature.
5. The method according to claim 4, characterized in that carrying out feature transformation on the categorical features in the attribute features, the obtained conversion features being wide features, comprises:
encoding the categorical features using one-hot encoding;
splicing the encoded results into a vector whose entries are 0 or 1;
the spliced 0/1 vector being the wide feature.
6. The method according to claim 4, characterized in that carrying out feature transformation on the text features and value features, the obtained conversion features being deep text features, comprises:
appending an end character after the text in the text feature or value feature;
setting a preset length for the text feature;
when the length of the text plus the end character is greater than the preset length, deleting the part beyond the maximum length, the remainder being the deep text feature;
when the length of the text plus the end character is less than the preset length, padding the insufficient part with a fill character to obtain the deep text feature.
7. The method according to claim 4, characterized in that carrying out feature transformation on the categorical features, the obtained conversion features being deep category features, comprises:
splicing the categorical features;
converting the spliced result into a vector;
the vector being the deep category feature.
8. The method according to claim 3, characterized in that training the neural network according to the training set in the conversion features comprises:
using the wide features in the training set as the input of the wide network in the trained neural network;
using the deep text features and deep category features in the training set as the input of the deep network in the trained neural network;
determining the neural network according to the wide network and the deep network.
9. The method according to claim 8, characterized in that the wide features and the deep category features are used for deep learning with a fully connected neural network;
the deep text features are used for deep learning with a character-level convolutional neural network.
10. The method according to claim 3, characterized in that the loss function used to train the neural network is the softmax cross-entropy loss function.
11. The method according to claim 5, characterized in that each sample datum in the training set for the deep learning of the neural network is determined by matching attribute features with value features.
12. The method according to claim 1, characterized in that carrying out deep learning according to the conversion features and determining the type of the table field comprises:
determining the prediction results of the deep learning;
determining the confidence of the prediction results;
determining the type of the table field according to a voting mechanism and the maximum confidence.
13. A device for determining the type of a table field, characterized by comprising:
an original feature division module, for dividing the original features of a table field into attribute features and value features;
a conversion feature module, for carrying out feature transformation on the attribute features and value features respectively and determining the transformed conversion features;
a neural network training module, for training a neural network according to the training set in the conversion features;
a table field type determination module, for carrying out deep learning according to the test set in the conversion features and the trained neural network, and determining the type of the table field.
14. An electronic device for determining the type of a table field, characterized by comprising:
one or more processors;
a storage device, for storing one or more programs,
wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-12.
15. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910043827.3A CN109784407A (en) | 2019-01-17 | 2019-01-17 | The method and apparatus for determining the type of literary name section |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109784407A true CN109784407A (en) | 2019-05-21 |
Family
ID=66500986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910043827.3A Pending CN109784407A (en) | 2019-01-17 | 2019-01-17 | The method and apparatus for determining the type of literary name section |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109784407A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399912A (en) * | 2019-07-12 | 2019-11-01 | 广东浪潮大数据研究有限公司 | A kind of method of character recognition, system, equipment and computer readable storage medium |
CN112967780A (en) * | 2021-01-27 | 2021-06-15 | 安徽华米健康科技有限公司 | Physical ability age prediction method, device, equipment and storage medium based on PAI |
CN113312354A (en) * | 2021-06-10 | 2021-08-27 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105825138A (en) * | 2015-01-04 | 2016-08-03 | 北京神州泰岳软件股份有限公司 | Sensitive data identification method and device |
CN108647251A (en) * | 2018-04-20 | 2018-10-12 | 昆明理工大学 | The recommendation sort method of conjunctive model is recycled based on wide depth door |
CN109196527A (en) * | 2016-04-13 | 2019-01-11 | 谷歌有限责任公司 | Breadth and depth machine learning model |
2019-01-17: CN application CN201910043827.3A filed; patent/CN109784407A/en, active, Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105825138A (en) * | 2015-01-04 | 2016-08-03 | 北京神州泰岳软件股份有限公司 | Sensitive data identification method and device |
CN109196527A (en) * | 2016-04-13 | 2019-01-11 | 谷歌有限责任公司 | Breadth and depth machine learning model |
CN108647251A (en) * | 2018-04-20 | 2018-10-12 | 昆明理工大学 | The recommendation sort method of conjunctive model is recycled based on wide depth door |
Non-Patent Citations (3)
Title |
---|
HENG-TZE CHENG等: "Wide & Deep Learning for Recommender Systems", 《HTTPS://ARXIV.ORG/ABS/1606.07792》 * |
YOON KIM: "Convolutional Neural Networks for Sentence Classification", 《HTTPS://ARXIV.ORG/ABS/1408.5882》 * |
京东数科技术说: "基于Wide & Deep网络和TextCNN的敏感字段识别", 《HTTPS://ZHUANLAN.ZHIHU.COM/P/54967619》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399912A (en) * | 2019-07-12 | 2019-11-01 | 广东浪潮大数据研究有限公司 | A kind of method of character recognition, system, equipment and computer readable storage medium |
CN112967780A (en) * | 2021-01-27 | 2021-06-15 | 安徽华米健康科技有限公司 | Physical ability age prediction method, device, equipment and storage medium based on PAI |
CN113312354A (en) * | 2021-06-10 | 2021-08-27 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
CN113312354B (en) * | 2021-06-10 | 2023-07-28 | 北京百度网讯科技有限公司 | Data table identification method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108171276B (en) | Method and apparatus for generating information | |
CN107105031A (en) | Information-pushing method and device | |
CN107463704A (en) | Searching method and device based on artificial intelligence | |
US20190163742A1 (en) | Method and apparatus for generating information | |
CN108153901A (en) | The information-pushing method and device of knowledge based collection of illustrative plates | |
CN109697641A (en) | The method and apparatus for calculating commodity similarity | |
CN108628830A (en) | A kind of method and apparatus of semantics recognition | |
CN107908666A (en) | A kind of method and apparatus of identification equipment mark | |
CN109784407A (en) | The method and apparatus for determining the type of literary name section | |
CN107908615A (en) | A kind of method and apparatus for obtaining search term corresponding goods classification | |
CN107766492A (en) | A kind of method and apparatus of picture search | |
CN107193974A (en) | Localized information based on artificial intelligence determines method and apparatus | |
CN109697537A (en) | The method and apparatus of data audit | |
CN108121699A (en) | For the method and apparatus of output information | |
CN106919711A (en) | The method and apparatus of the markup information based on artificial intelligence | |
CN108776692A (en) | Method and apparatus for handling information | |
CN110119445A (en) | The method and apparatus for generating feature vector and text classification being carried out based on feature vector | |
CN109146152A (en) | Incident classification prediction technique and device on a kind of line | |
CN109284367A (en) | Method and apparatus for handling text | |
CN110232487A (en) | A kind of task allocating method and device | |
CN110309142A (en) | The method and apparatus of regulation management | |
CN108171275A (en) | For identifying the method and apparatus of flowers | |
CN109389660A (en) | Image generating method and device | |
CN107291835A (en) | A kind of recommendation method and apparatus of search term | |
CN109960212A (en) | Task sending method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190521 |