CN109784407A - Method and apparatus for determining the type of a table field - Google Patents

Method and apparatus for determining the type of a table field

Info

Publication number
CN109784407A
Authority
CN
China
Prior art keywords
feature
character
deep
name section
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910043827.3A
Other languages
Chinese (zh)
Inventor
范叶亮
马云龙
卢周
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN201910043827.3A priority Critical patent/CN109784407A/en
Publication of CN109784407A publication Critical patent/CN109784407A/en
Pending legal-status Critical Current


Abstract

The invention discloses a method and apparatus for determining the type of a table field, and relates to the field of computer technology. One specific embodiment of the method includes: dividing the original features of a table field into attribute features and value features; performing feature transformation on the attribute features and the value features respectively to determine the transformed features; training a neural network according to the training set in the transformed features; and performing deep learning according to the test set in the transformed features and the trained neural network to determine the type of the table field. The embodiment overcomes the technical defects of the prior art, in which methods based on keyword matching and conventional machine learning have relatively low recognition accuracy and recall and excessive labor cost, and thereby makes the deep learning more targeted and makes full use of the original features of the table field, so that the type of the table field is determined more accurately.

Description

Method and apparatus for determining the type of a table field
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for determining the type of a table field.
Background
Determining the type of a table field, and judging whether the field is sensitive, before it is stored in a database is extremely important for privacy protection and information security. Therefore, before a table field is loaded into a database, fields involving sensitive information (including but not limited to name, ID card number, mobile phone number, bank card number, etc.) need to be encrypted.
The prior art identifies the type of a table field, and whether it is a sensitive field, by keyword matching (for example: name, address, etc.) and conventional machine learning, supplemented by manual judgment.
In the course of realizing the present invention, the inventors found at least the following problems in the prior art:
1. The recognition accuracy and recall of methods based on keyword matching and conventional machine learning are both relatively low.
2. When recognition accuracy is low, manual judgment is used to assist recognition, and the labor cost is excessive.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for determining the type of a table field, which make the deep learning more targeted and make full use of the original features of the table field, so that the type of the table field is determined more accurately.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for determining the type of a table field is provided, comprising:
Dividing the original features of a table field into attribute features and value features;
Performing feature transformation on the attribute features and the value features respectively, and determining the transformed features;
Training a neural network according to the training set in the transformed features;
Performing deep learning according to the test set in the transformed features and the trained neural network, and determining the type of the table field.
Optionally, the attribute features include: numeric features, categorical features and text features;
The value features include text features.
Optionally, performing feature transformation on the attribute features and the value features respectively, and determining the transformed features, comprises:
Performing feature transformation on the numeric features in the attribute features, the resulting transformed features being wide features;
And/or performing feature transformation on the text features in the attribute features and on the value features, the resulting transformed features being deep text features;
And/or performing feature transformation on the categorical features in the attribute features, the resulting transformed features being deep category features and/or wide features.
Optionally, the transformation formula for performing feature transformation on the numeric features in the attribute features to obtain wide features is:
Where wide_feature denotes the transformed wide feature, raw_feature denotes the original feature, max_value denotes the maximum value of the wide feature, and min denotes the smaller of the numeric feature and that maximum value.
Optionally, performing feature transformation on the categorical features in the attribute features to obtain wide features comprises:
Encoding the categorical features using one-hot encoding;
Concatenating the encoded results into a vector whose values are 0 or 1;
The concatenated 0/1 vector being the wide feature.
Optionally, performing feature transformation on the text features and the value features to obtain deep text features comprises:
Appending an end-of-sequence character to the text in a text feature or value feature;
Setting a preset length for the text feature;
When the length of the text plus the end-of-sequence character is greater than the preset length, deleting the part beyond the preset length, the remainder being the deep text feature;
When the length of the text plus the end-of-sequence character is less than the preset length, filling the shortfall with padding characters to obtain the deep text feature.
Optionally, performing feature transformation on the categorical features to obtain deep category features comprises:
Concatenating the categorical features;
Converting the concatenated result into a vector;
The vector being the deep category feature.
Optionally, training a neural network according to the training set in the transformed features comprises:
Using the wide features in the training set as the input of the wide network in the neural network being trained;
Using the deep text features and deep category features in the training set as the input of the deep network in the neural network being trained;
Determining the neural network according to the wide network and the deep network.
Optionally, the wide features and the deep category features are used for deep learning by a fully connected neural network;
The deep text features are used for deep learning by a character-level convolutional neural network.
Optionally, the function used to train the neural network is a softmax cross-entropy loss function.
Optionally, each sample in the training set for the deep learning of the neural network is determined by pairing the attribute features with a value feature.
Optionally, performing deep learning according to the transformed features and determining the type of the table field comprises:
Determining the prediction results of the deep learning;
Determining the confidence of the prediction results;
Determining the type of the table field according to a voting mechanism and the maximum confidence.
According to another aspect of the embodiments of the present invention, an apparatus for determining the type of a table field is provided, comprising:
An original feature division module, configured to divide the original features of a table field into attribute features and value features;
A feature transformation module, configured to perform feature transformation on the attribute features and the value features respectively and determine the transformed features;
A neural network training module, configured to train a neural network according to the training set in the transformed features;
A table field type determination module, configured to perform deep learning according to the test set in the transformed features and the trained neural network, and determine the type of the table field.
According to another aspect of the embodiments of the present invention, an electronic device for determining the type of a table field is provided, comprising:
One or more processors;
A storage device for storing one or more programs,
When the one or more programs are executed by the one or more processors, the one or more processors implement the method for determining the type of a table field provided by the present invention.
According to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided, on which a computer program is stored; when the program is executed by a processor, it implements the method for determining the type of a table field provided by the present invention.
One embodiment of the above invention has the following advantages or beneficial effects:
By the technical means of performing deep learning using the attribute features and value features in the original features of a table field, the present invention overcomes the technical defects of the prior art, in which methods based on keyword matching and conventional machine learning have relatively low recognition accuracy and recall and excessive labor cost, and thereby makes full use of the original features of the table field so that its type is determined more accurately;
By performing feature transformation on the attribute features and the value features respectively, the transformed data can be applied to different neural networks for further training, which makes the deep learning more targeted, further improves accuracy, and overcomes the defect that the prior art determines the type of a table field inaccurately.
Further effects of the above non-conventional optional implementations are described below in conjunction with specific embodiments.
Description of the drawings
The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic diagram of the main flow of the method for determining the type of a table field according to an embodiment of the present invention;
Fig. 2 is the improved wide and deep network structure according to an embodiment of the present invention;
Fig. 3 is the character-level convolutional neural network according to an embodiment of the present invention;
Fig. 4 is a detailed flowchart of training and prediction for the method for determining the type of a table field according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of the apparatus for determining the type of a table field according to an embodiment of the present invention;
Fig. 6 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the main flow of a method for determining the type of a table field according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S101: dividing the original features of a table field into attribute features and value features;
Step S102: performing feature transformation on the attribute features and the value features respectively, and determining the transformed features;
Step S103: training a neural network according to the training set in the transformed features;
Step S104: performing deep learning according to the test set in the transformed features and the trained neural network, and determining the type of the table field.
The types of table field include sensitive fields and non-sensitive fields. In particular, when a table field is a sensitive field, its handling in a data warehouse is extremely important. For example, before plaintext data is loaded into a data warehouse, fields involving sensitive information (including but not limited to name, ID card number, mobile phone number, bank card number, etc.) need to be encrypted. Non-sensitive fields are all fields other than sensitive fields, for example year information, time information, etc.
By the technical means of performing deep learning on the attribute features and value features in the original features of a table field, the present invention overcomes the technical defects of the prior art, in which methods based on keyword matching and conventional machine learning have relatively low recognition accuracy and recall and excessive labor cost, and thereby makes full use of the original features of the table field so that its type is determined more accurately.
Feature transformation is performed on the attribute features and the value features respectively, so that the transformed data can be applied to different neural networks for further training; this makes the deep learning more targeted, further improves accuracy, and overcomes the defect that the prior art determines the type of a table field inaccurately.
Optionally, the attribute features (that is, the attribute information of the table field, for example: table name, table comment, table type, field name, field comment, field type, etc.) include: numeric features (e.g. 1000, 1.0), categorical features (e.g. "Yes" and "No") and text features (e.g. "this is the comment of a field");
The value features include text features.
Specifically, see the original feature examples in Table 1. For example, the database name is an attribute feature whose feature value is the corresponding test database name, and it is further a text feature; the text feature corresponding to a value feature may comprise an array, such as "value 1", "value 2", ..., "value n".
The other examples in Table 1 follow the same classification principle and are not described again.
Table 1: Examples of original features
Feature name | Feature category | Feature value | Feature value type
Database name | Attribute feature | Test database name | Text feature
Database comment | Attribute feature | Test database comment | Text feature
Database comment length | Attribute feature | 7 | Numeric feature
Database type | Attribute feature | MYSQL | Categorical feature
Field name | Attribute feature | Test field name | Text feature
Field comment | Attribute feature | Test field comment | Text feature
Field value | Value feature | ["value 1", "value 2", ..., "value n"] | Text feature
Fig. 2 is the improved wide and deep network structure according to an embodiment of the present invention. As shown in Fig. 2, the input of the wide and deep network may include: wide features, deep text (Deep-Text) features and deep category (Deep-Category) features.
Optionally, performing feature transformation on the attribute features and the value features respectively, and determining the transformed features, comprises: performing feature transformation on the numeric features in the attribute features, the resulting transformed features being wide features;
And/or performing feature transformation on the text features in the attribute features and on the value features, the resulting transformed features being deep text (Deep-Text) features;
And/or performing feature transformation on the categorical features in the attribute features, the resulting transformed features being deep category (Deep-Category) features and/or wide features.
Optionally, the transformation formula for performing feature transformation on the numeric features in the attribute features to obtain wide features is:
Where wide_feature denotes the transformed wide feature, raw_feature denotes the original feature, max_value denotes the maximum value of the wide feature, and min denotes the smaller of the numeric feature and that maximum value.
Through the above feature transformation of numeric features, the transformed results lie between 0 and 1, achieving feature normalization. That is, the values of all dimensions of the transformed wide features lie between 0 and 1. Optionally, the wide features can be used in a fully connected neural network.
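The transformation formula itself is not reproduced in this text. One reading that is consistent with the symbol definitions above and with the stated 0-to-1 range (for non-negative raw values) is the following; it is an assumption, not a confirmed reproduction of the patent's formula:

```latex
\text{wide\_feature} = \frac{\min(\text{raw\_feature},\ \text{max\_value})}{\text{max\_value}}
```

Under this reading, any raw value above max_value is clipped to max_value before scaling, so the result never exceeds 1.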
Optionally, performing feature transformation on the text features and the value features to obtain deep text (Deep-Text) features comprises:
Appending an end-of-sequence character to the text in a text feature or value feature;
Setting a preset length for each text;
When the length of the text plus the end-of-sequence character is greater than the preset length, deleting the part beyond the preset length;
When the length of the text plus the end-of-sequence character is less than the preset length, filling the shortfall with padding characters.
The determination of deep text features is introduced below by way of a specific embodiment:
Suppose the preset length is 6; the original features to be transformed may include attribute features and value features.
As shown in Table 2 below, in Example 1 the end-of-sequence character <EOS> is first appended to the original feature, giving four characters. Since the preset length of 6 characters is still not reached after the end-of-sequence character is appended, the padding character <PAD> is added until the preset length is reached.
In Example 2, the preset length is reached exactly once the end-of-sequence character is appended to the original feature, so no padding character needs to be added.
In Example 3, the preset length is exceeded once the end-of-sequence character is appended to the original feature, so the excess characters need to be deleted.
Table 2: Features after transformation of a single text feature
Because the maximum length set for this text feature is 6, the part beyond the 6th character position in the table above is deleted.
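The three cases above (pad, keep, truncate) can be sketched as follows. This is a minimal illustration, assuming one token per character; the function name `to_deep_text` and the sample strings are hypothetical and not taken from the patent.

```python
EOS = "<EOS>"  # end-of-sequence token named in the text
PAD = "<PAD>"  # padding token named in the text


def to_deep_text(text, preset_length):
    """Append EOS, then truncate or pad to exactly preset_length tokens."""
    tokens = list(text) + [EOS]                      # one token per character, plus EOS
    tokens = tokens[:preset_length]                  # drop anything beyond the preset length
    tokens += [PAD] * (preset_length - len(tokens))  # fill any shortfall with PAD
    return tokens


print(to_deep_text("abc", 6))       # shorter than the preset length: padded
print(to_deep_text("abcde", 6))     # exactly the preset length with EOS: unchanged
print(to_deep_text("abcdefgh", 6))  # longer than the preset length: truncated
```

Note that truncation can remove the <EOS> token itself when the text alone already fills the preset length.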
Optionally, when the deep text feature corresponds to multiple text features or value features, the corresponding deep text can be generated by concatenation; its length (or dimension) is the sum of the preset lengths set for the individual text features.
The generation of the corresponding deep text by concatenation is described in detail below with a specific embodiment:
In this embodiment there are 2 text features, shown in Table 3 and Table 4, each containing 3 examples. The preset length for Table 3 is 4 and the preset length for Table 4 is 5. Therefore, during transformation, the part from the 5th character position onward in Table 3 and the part from the 6th character position onward in Table 4 must be deleted. The processed Table 3 and Table 4 are then concatenated to obtain the transformed deep text features of Table 5, in which columns 1-4 are the first transformed text feature and columns 5-9 are the second transformed text feature.
Table 3: Deep-Text feature 1 before transformation
Index | 1 | 2 | 3 | 4 | 5 | 6
Example 1 | It is special | Sign | 1 | <EOS> | |
Example 2 | Also | It is | It is special | Sign | <EOS> |
Example 3 | It is long | 's | It is special | Sign | Value | <EOS>
Table 4: Deep-Text feature 2 before transformation
Index | 1 | 2 | 3 | 4 | 5 | 6
Example 1 | It is special | Sign | 2 | <EOS> | <PAD> |
Example 2 | It is special | Sign | <EOS> | <PAD> | <PAD> |
Example 3 | ? | It is | It is long | It is special | Sign | <EOS>
Table 5: Deep-Text feature after transformation
Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example 1 | It is special | Sign | 1 | <EOS> | It is special | Sign | 2 | <EOS> | <PAD>
Example 2 | Also | It is | It is special | Sign | It is special | Sign | <EOS> | <PAD> | <PAD>
Example 3 | It is long | 's | It is special | Sign | ? | It is | It is long | It is special | Sign
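The concatenation step can be sketched as below, assuming each text feature has already been tokenized and had <EOS> appended. The helper names and the placeholder tokens "x" and "y" are hypothetical; the preset lengths 4 and 5 are the ones used in the embodiment.

```python
def pad_or_truncate(tokens, preset_length, pad="<PAD>"):
    """Cut a token list to preset_length, or fill it up with the pad token."""
    tokens = list(tokens)[:preset_length]
    return tokens + [pad] * (preset_length - len(tokens))


def concat_deep_text(features, preset_lengths):
    """Bring each text feature to its own preset length, then join them in order."""
    row = []
    for tokens, length in zip(features, preset_lengths):
        row.extend(pad_or_truncate(tokens, length))
    return row


# Two text features of one example; "x" and "y" stand in for real characters.
feature_1 = ["x", "y", "1", "<EOS>"]
feature_2 = ["x", "y", "2", "<EOS>"]

row = concat_deep_text([feature_1, feature_2], [4, 5])
print(row)
print(len(row))  # sum of the preset lengths
```

The resulting row has 4 + 5 = 9 positions, matching the nine index columns of the concatenated table.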
For the Deep-Text feature, whose dimension has been determined as dim_Deep-Text, a dictionary table containing all possible characters is maintained; the dictionary table maps any character to a vector of dimension dim_Text-Embedding, for example:
Table 6: Lookup result of one character in the dictionary table
Index | 1 | 2 | 3 | i | dim_embedding
Value | 0.012 | 0.231 | 0.986 | 0.123 | 0.689
The vector corresponding to each character in the dictionary table is randomly initialized, and its values are continually learned and updated during training. Deep-Text is a text-type feature and corresponds to a character-level convolutional neural network, which extracts the internal information of the text. Fig. 3 shows the character-level convolutional neural network according to an embodiment of the present invention.
As shown in Fig. 3, the text features are first converted into feature vectors, and the feature vectors of multiple texts are concatenated; multi-scale convolution is then applied, the convolution results are pooled, and finally the output layer of the neural network outputs the training result.
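The lookup, multi-scale convolution and pooling steps can be sketched in plain Python as below. This is only an illustration: the embedding size, the kernel widths, the use of a single filter per width and the choice of max-pooling are all assumptions, since the text specifies none of them, and the weights are random rather than trained.

```python
import random

random.seed(0)
EMBED_DIM = 4          # hypothetical value for the embedding dimension
KERNEL_SIZES = (2, 3)  # hypothetical convolution widths (the "multi-scale" part)

vocab = ["a", "b", "c", "<EOS>", "<PAD>"]
# Dictionary table: every character maps to a randomly initialised vector.
embedding = {ch: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for ch in vocab}
# One filter per kernel size; a filter of width k has k * EMBED_DIM weights.
filters = {k: [random.uniform(-1, 1) for _ in range(k * EMBED_DIM)] for k in KERNEL_SIZES}


def char_cnn(tokens):
    """Embed the tokens, convolve at each kernel width, max-pool over positions."""
    vectors = [embedding[t] for t in tokens]
    pooled = []
    for k, weights in filters.items():
        responses = []
        for start in range(len(vectors) - k + 1):  # slide a window of k characters
            window = [x for vec in vectors[start:start + k] for x in vec]
            responses.append(sum(w * x for w, x in zip(weights, window)))
        pooled.append(max(responses))              # assumed pooling: max over positions
    return pooled


features = char_cnn(["a", "b", "c", "<EOS>", "<PAD>", "<PAD>"])
print(len(features))  # one pooled value per kernel size
```

In a trained model the embedding vectors and filter weights would be updated by gradient descent rather than left at their random initial values.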
Optionally, the categorical features are encoded using one-hot encoding;
The encoded results are concatenated into a vector whose values are 0 or 1;
The concatenated 0/1 vector is the wide feature.
One-hot encoding is used because it better reflects the category to which a categorical feature belongs. Optionally, the maximum length (or dimension) after feature transformation is set to dim_Wide.
As follows, suppose the feature has m possible types; the i-th type is converted into the following vector containing only 0 and 1, as shown by the values in Table 7:
Table 7: One-hot encoding of the i-th type among m possible types
Index | 1 | 2 | i | m
Value | 0 | 0 | 1 | 0
The following specific embodiment describes the transformation of numeric features and categorical features:
Specifically, the original features contain 2 numeric features (f1 and f2) and 2 categorical features (f3 and f4); the statistics of each feature are as follows:
1) f1: maximum value 20, minimum value 0
2) f2: maximum value 100, minimum value 0
3) f3: all possible values are A and B, 2 possibilities in total
4) f4: all possible values are A, B and C, 3 possibilities in total
An example of wide features before transformation is shown in Table 8:
Table 8: Example of wide features before transformation
Feature | f1 | f2 | f3 | f4
Value | 3 | 78 | B | A
Since f1 and f2 are the numeric features among the wide features, they are transformed by normalization; f3 and f4 are the categorical features among the wide features, so they are transformed by the one-hot encoding described above. An example of the transformed wide features is therefore shown in Table 9:
Table 9: Example of wide features after transformation
Here f'1 is the transformed feature of f1 and f'2 is the transformed feature of f2; the one-hot components derived from f3 and f4 are the transformed features of f3 and f4 respectively.
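The worked example can be reproduced with the sketch below. The normalization used here (clip to the maximum, then divide by the maximum) is an assumed reading of the formula, which is not reproduced in this text, and the function names are hypothetical; the printed vector is what this assumption yields, not a confirmed copy of the transformed-feature table.

```python
def normalize(raw_value, max_value):
    """Assumed normalization: clip to max_value, then scale into [0, 1]."""
    return min(raw_value, max_value) / max_value


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


wide = (
    [normalize(3, 20), normalize(78, 100)]  # f1 and f2, numeric
    + one_hot("B", ["A", "B"])              # f3, 2 possible values
    + one_hot("A", ["A", "B", "C"])         # f4, 3 possible values
)
print(wide)  # [0.15, 0.78, 0, 1, 1, 0, 0]
```

The wide vector has 2 + 2 + 3 = 7 dimensions, and every dimension lies between 0 and 1.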
Optionally, performing feature transformation on the categorical features to obtain deep category (Deep-Category) features comprises:
Concatenating the categorical features;
Converting the concatenated result into a vector;
The vector being the deep category feature.
The Deep-Category feature comes from the categorical type in the attribute features. The original categorical attribute features receive no other processing and are only concatenated into one long vector whose length (or dimension) is dim_Deep-Category.
Optionally, training a neural network according to the training set in the transformed features comprises:
Using the wide features in the training set as the input of the wide network in the neural network being trained;
Using the deep text features and deep category features in the training set as the input of the deep network in the neural network being trained;
Determining the neural network according to the wide network and the deep network.
Deep learning is performed through the wide network and the deep network, so that the parameters of the whole model are jointly influenced by both. The wide network learns through a fully connected network, and the deep network learns through an embedding-based convolutional neural network, so that learning accuracy is higher, model size and complexity can be controlled, and the overall learning effect is significantly improved.
Optionally, the wide features and the deep category features are used for deep learning by a fully connected neural network;
The deep text features are used for deep learning by a character-level convolutional neural network.
Since the values of all dimensions of the transformed Wide features lie between 0 and 1, the Wide features can be used to train a fully connected neural network.
Text features correspond to Deep-Text features, whose dimension has been determined as dim_Deep-Text. Converting text features into deep text features can rely on a dictionary table containing all possible characters; the dictionary table maps any character to a vector of dimension dim_Text-Embedding, for example:
Table 10: Lookup result of one character in the dictionary table
Index | 1 | 2 | 3 | i | dim_embedding
Value | 0.012 | 0.231 | 0.986 | 0.123 | 0.689
The vector corresponding to each character in the dictionary table is randomly initialized, and its values are continually learned and updated during training. Deep-Text is a text-type feature and can be used in a character-level convolutional neural network, which helps extract the internal information of the text.
As shown in Fig. 3, the Deep-Category feature corresponds to a dictionary table containing all types, which maps a type to a vector of dimension dim_Category-Embedding. Unlike the Deep-Text feature, the Deep-Category feature is used in a fully connected neural network. Afterwards, the outputs of the Wide feature, the Deep-Text feature and the Deep-Category feature are concatenated into one vector. The final output is then obtained through a fully connected layer.
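How the three branch outputs are merged and passed through a final fully connected layer can be sketched as follows. Everything concrete here is an assumption for illustration: the layer sizes, the three output classes, the random untrained weights, the embedding values, the "ORACLE" entry and the stand-in vector for the character-level CNN output.

```python
import math
import random

random.seed(0)
NUM_CLASSES = 3  # hypothetical number of field types


def make_dense(n_in, n_out):
    """Create a fully connected layer (no bias) with fixed random weights."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

    def layer(inputs):
        return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

    return layer


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical dictionary table mapping a category to its embedding vector.
category_embedding = {"MYSQL": [0.2, -0.1], "ORACLE": [0.5, 0.3]}

wide_features = [0.15, 0.78, 0, 1]
deep_text_output = [0.4, -0.2]  # stand-in for the character-level CNN output
deep_category = category_embedding["MYSQL"]

wide_branch = make_dense(len(wide_features), 2)
category_branch = make_dense(len(deep_category), 2)

# Concatenate the three branch outputs, then apply the final fully connected layer.
merged = wide_branch(wide_features) + deep_text_output + category_branch(deep_category)
output_layer = make_dense(len(merged), NUM_CLASSES)
probabilities = softmax(output_layer(merged))
print(probabilities)
```

Because the branches are joined before the last layer, a training loss computed on `probabilities` would propagate to the parameters of all three branches.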
Optionally, the function used to train the neural network is a softmax cross-entropy loss function. That is, the softmax cross-entropy function is used as the loss function, and the model is trained with the stochastic gradient descent algorithm.
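For reference, the softmax cross-entropy of a single sample can be computed as in the sketch below. The function name and the sample logits are illustrative only; the log-sum-exp form is a standard, numerically stable way to evaluate the loss and is not stated in the patent.

```python
import math


def softmax_cross_entropy(logits, true_index):
    """Return -log(softmax(logits)[true_index]) via a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[true_index]


loss = softmax_cross_entropy([2.0, 0.5, -1.0], true_index=0)
print(loss)
```

The loss is small when the logit of the true class dominates and grows as probability mass moves to other classes.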
Optionally, the samples of the neural network are determined by pairing the attribute features with a value feature.
Since the original features necessarily include value features, one value in the value features can be combined with the attribute features to train the neural network.
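The pairing of attribute features with individual values can be sketched as below. The dictionary keys, the function name `build_samples` and the rule for what counts as an empty value are hypothetical choices made for illustration.

```python
def build_samples(attribute_features, values):
    """Pair the shared attribute features with each non-empty field value."""
    return [
        dict(attribute_features, field_value=value)
        for value in values
        if value not in (None, "")  # skip empty values
    ]


attributes = {
    "database_name": "test_db",
    "database_type": "MYSQL",
    "field_name": "test_field",
}
samples = build_samples(attributes, ["value 1", "", "value 2", None])
print(len(samples))  # 2
```

Each resulting sample carries the same attribute features and exactly one field value, so one field with n non-empty values produces n samples.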
Optionally, the text feature and value tag being subjected to eigentransformation, obtained conversion characteristic is deep text feature, Include:
Final character will be added after text in text feature or value tag;
For text feature, preset length is set;
When the length plus final character of the text is greater than preset length, then deletion exceeds the part of maximum length, Remainder is deep text feature;
When the length plus final character of the text is less than preset length, then by the part benefit of insufficient preset length Character is filled to supply to obtain deep text feature.
Optionally, deep learning is carried out according to the conversion characteristic, determines the type of literary name section, comprising:
Determine the prediction result of deep learning;
Determine the confidence interval of the prediction result;
According to voting mechanism and maximum confidence interval, the type of literary name section is determined.
Specifically, n can be constructed using the attributive character of literary name section and n value tag (taking not as n empty value) Sample.For judging whether a literary name section is sensitive field, predicted respectively using trained model to this n sample It is predicted.Finally determine that (i.e. the type of the maximum probability of type of prediction is for the sensitive kind of the literary name section using voting mechanism The sensitive kind of the literary name section).
Specifically, it is assumed that prediction result is as shown in the table:
11 1 groups of example prediction results of table
Group 1 2 3 i n
Prediction Cell-phone number Cell-phone number Fixed line Cell-phone number Cell-phone number
Assuming that only the 3rd group of feature prediction result is " fixed line " in prediction result, other groups are predicted as " cell-phone number ", then Determine that the final prediction result of the literary name section is " cell-phone number " using voting mechanism.The mechanism of the ballot, i.e. type of prediction number The most tag along sort of mesh is final prediction result.
Optionally, the confidence of the final prediction result is calculated with the following formula:
$$\mathrm{conf} = \frac{\mathrm{Counter}(\mathrm{max\_freq\_label})}{n}$$
where max_freq_label is the classification label that occurs most often; Counter is the counting operator, which computes the number of occurrences; n is the number of input feature groups; and conf is the confidence of the prediction result. A larger conf value indicates higher confidence in the model result. In use, accuracy and coverage can be traded off by adjusting the conf value.
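The voting step and the confidence value can be sketched in Python as follows. This is a minimal sketch, assuming that conf is the fraction of the n groups that voted for the winning label, which matches the definitions of Counter, max_freq_label, and n given here. The function names `vote` and `decide`, the `min_conf` cutoff, and the tie-breaking rule (the label seen first wins) are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote(predictions: List[str]) -> Tuple[str, float]:
    """Return the most frequent label and the share of groups that predicted it."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    # most_common(1) yields the label with the highest count and that count.
    max_freq_label, freq = counts.most_common(1)[0]
    return max_freq_label, freq / len(predictions)


def decide(predictions: List[str], min_conf: float) -> Optional[str]:
    """Accept the voted label only if its confidence reaches min_conf (assumed cutoff)."""
    label, conf = vote(predictions)
    return label if conf >= min_conf else None


groups = ["mobile number", "mobile number", "landline",
          "mobile number", "mobile number"]
print(vote(groups))  # ('mobile number', 0.8)
```

Raising `min_conf` rejects more fields (lower coverage) but keeps only the more unanimous votes, which is one way to realize the trade-off between accuracy and coverage mentioned above.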
A specific embodiment is used below to illustrate the detailed process of the invention.
Fig. 4 is a detailed flowchart of training and prediction in the method for determining the type of a table field according to an embodiment of the present invention. As shown in Fig. 4:
Feature transformation is performed on the original features. The original features first need to be classified, specifically into attribute features and value features. The transformed features include the wide feature, the Deep-Text feature, and the Deep-Category feature. Deep learning is performed on the wide feature, the Deep-Text feature, and the Deep-Category feature to obtain a Wide&Deep model.
When a user issues a prediction request, feature transformation is likewise performed on the original features to be predicted, and the transformed features are fed to the Wide&Deep model to obtain the model output. The final prediction response is then determined according to the voting mechanism.
Fig. 5 is a schematic diagram of the main modules of the apparatus for determining the type of a table field according to an embodiment of the present invention. As shown in the figure, an apparatus 500 for determining the type of a table field includes:
Module 501, an original feature division module, configured to divide the original features of a table field into attribute features and value features;
Module 502, a feature transformation module, configured to perform feature transformation on the attribute features and the value features respectively and determine the transformed features;
Module 503, a neural network training module, configured to train a neural network according to the training set in the transformed features;
Module 504, a table field type determination module, configured to perform deep learning according to the test set in the transformed features and the trained neural network to determine the type of the table field.
Optionally, the attribute features include numerical features, categorical features, and text features;
The value features include text features.
Optionally, performing feature transformation on the attribute features and the value features respectively and determining the transformed features includes:
Performing feature transformation on the numerical features among the attribute features, the resulting transformed feature being a wide feature;
And/or performing feature transformation on the text features among the attribute features and on the value features, the resulting transformed feature being a deep text feature;
And/or performing feature transformation on the categorical features among the attribute features, the resulting transformed feature being a deep category feature and/or a wide feature.
Optionally, feature transformation is performed on the numerical features among the attribute features to obtain wide features, and the transformation formula is:
where wide_feature denotes the transformed wide feature, raw_feature denotes the original feature, max_value denotes the maximum value of the wide feature, and min denotes taking the smaller of the numerical feature and the maximum value of the wide feature.
Optionally, performing feature transformation on the categorical features among the attribute features to obtain wide features includes:
Encoding the categorical features with one-hot encoding;
Concatenating the encoded results into a single vector whose elements are 0 or 1;
The concatenated 0/1 vector is the wide feature.
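The one-hot encoding and concatenation steps above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the example attributes (a column data type and a nullable flag), and the choice to encode an unseen category as all zeros are assumptions that do not come from the text.

```python
from typing import List, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Encode one categorical value as a 0/1 list over its known categories."""
    return [1 if value == category else 0 for category in categories]


def encode_categorical(values: Sequence[str],
                       vocabularies: Sequence[Sequence[str]]) -> List[int]:
    """One-hot encode each categorical feature and concatenate the results."""
    if len(values) != len(vocabularies):
        raise ValueError("one vocabulary is required per categorical feature")
    wide: List[int] = []
    for value, categories in zip(values, vocabularies):
        wide.extend(one_hot(value, categories))
    return wide


# Hypothetical attributes: column data type and whether the column is nullable.
vocabularies = [["varchar", "int", "datetime"], ["yes", "no"]]
print(encode_categorical(["int", "no"], vocabularies))  # [0, 1, 0, 0, 1]
```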
Optionally, performing feature transformation on the text feature and the value feature to obtain a deep text feature includes:
Appending a terminator character to the text in the text feature or the value feature;
Setting a preset length for the text feature;
When the length of the text plus the terminator character is greater than the preset length, deleting the part that exceeds this maximum length, the remainder being the deep text feature;
When the length of the text plus the terminator character is less than the preset length, filling the missing part with padding characters to obtain the deep text feature.
Optionally, performing feature transformation on the categorical features to obtain a deep category feature includes:
Concatenating the categorical features;
Converting the concatenated result into a vector;
The vector is the deep category feature.
Optionally, training the neural network according to the training set in the transformed features includes:
Using the wide features in the training set as the input of the wide network in the neural network being trained;
Using the deep text features and deep category features in the training set as the input of the deep network in the neural network being trained;
Determining the neural network according to the wide network and the deep network.
Optionally, the wide features and the deep category features are used for deep learning with a fully connected neural network;
The deep text features are used for deep learning with a character-level convolutional neural network.
Optionally, the loss function used to train the neural network is the softmax cross-entropy loss function.
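For readers unfamiliar with the softmax cross-entropy loss, the following Python sketch computes it for a single sample. It is a generic textbook formulation, not code from the patent; the function name and the use of plain Python lists for the logits are assumptions.

```python
import math
from typing import Sequence


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy between softmax(logits) and a one-hot target at index label.

    Computed as log(sum(exp(z))) - z[label], with the maximum logit subtracted
    first so that exp() cannot overflow.
    """
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label]


# Four equally likely field types: the loss equals log(4).
print(softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], 2))
```

The loss approaches zero as the logit of the correct field type grows relative to the others.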
Optionally, each sample in the training set for the deep learning of the neural network is determined by pairing an attribute feature with a value feature.
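The pairing of one attribute feature with each value of a field can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dictionary layout, the key names `attribute` and `value`, and the rule that whitespace-only strings count as empty are illustrative choices, not details given in the text.

```python
from typing import Dict, List, Optional


def build_samples(attribute_feature: Dict[str, str],
                  values: List[Optional[str]]) -> List[dict]:
    """Pair the field's attribute feature with each non-empty value."""
    samples = []
    for value in values:
        # Skip missing or blank values; only non-empty values form samples.
        if value is None or value.strip() == "":
            continue
        samples.append({"attribute": attribute_feature, "value": value})
    return samples


# Hypothetical field: one attribute description and five stored values.
samples = build_samples({"column_name": "mobile"},
                        ["13800000000", "", None, "  ", "13900000000"])
print(len(samples))  # 2
```

Each field therefore contributes as many samples as it has non-empty values, and all of them share the same attribute feature.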
Optionally, performing deep learning according to the transformed features to determine the type of the table field includes:
Determining the prediction results of the deep learning;
Determining the confidence interval of the prediction results;
Determining the type of the table field according to a voting mechanism and the maximum confidence interval.
Fig. 6 shows an exemplary system architecture 600 to which the method or apparatus for determining the type of a table field according to an embodiment of the present invention can be applied.
As shown in Fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves as the medium providing communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 through the network 604 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (examples only).
The terminal devices 601, 602, 603 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
The server 605 may be a server that provides various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 601, 602, 603. The back-end management server may analyze and otherwise process received data such as information query requests, and feed the processing results (for example, targeted push information or product information; examples only) back to the terminal devices.
It should be noted that the method for determining the type of a table field provided by the embodiments of the present invention is generally executed by the server 605; accordingly, the apparatus for determining the type of a table field is generally disposed in the server 605.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 6 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in Fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to the embodiments disclosed by the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments disclosed by the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-described functions defined in the system of the present invention are executed.
It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, it may be described as: a processor includes a sending module, an obtaining module, a determining module, and a first processing module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the sending module may also be described as "a module that sends a picture acquisition request to a connected server".
In another aspect, the present invention further provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to perform the following:
Dividing the original features of a table field into attribute features and value features;
Performing feature transformation on the attribute features and the value features respectively, and determining the transformed features;
Training a neural network according to the training set in the transformed features;
Performing deep learning according to the test set in the transformed features and the trained neural network, and determining the type of the table field.
The technical solutions according to the embodiments of the present invention can achieve the following beneficial effects:
The present invention performs deep learning on the attribute features and value features within the original features of a table field to determine a trained model. This technical means overcomes the defects of the prior art based on keyword matching and conventional machine learning methods, namely relatively low recognition accuracy and recall and excessive labor cost, and in turn makes full use of the original features of the table field so that the determined type of the table field is more accurate;
By performing feature transformation on the attribute features and value features separately, the transformed data can be applied to different neural networks for further training, which makes the deep learning more targeted, further improves accuracy, and overcomes the defect that the table field type determined by the prior art is inaccurate.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (15)

1. A method for determining the type of a table field, characterized by comprising:
Dividing the original features of a table field into attribute features and value features;
Performing feature transformation on the attribute features and the value features respectively, and determining the transformed features;
Training a neural network according to the training set in the transformed features;
Performing deep learning according to the test set in the transformed features and the trained neural network, and determining the type of the table field.
2. The method according to claim 1, characterized in that the attribute features include numerical features, categorical features, and text features;
The value features include text features.
3. The method according to claim 2, characterized in that performing feature transformation on the attribute features and the value features respectively and determining the transformed features comprises:
Performing feature transformation on the numerical features among the attribute features, the resulting transformed feature being a wide feature;
And/or performing feature transformation on the text features among the attribute features and on the value features, the resulting transformed feature being a deep text feature;
And/or performing feature transformation on the categorical features among the attribute features, the resulting transformed feature being a deep category feature and/or a wide feature.
4. The method according to claim 3, characterized in that feature transformation is performed on the numerical features among the attribute features to obtain wide features, and the transformation formula is:
where wide_feature denotes the transformed wide feature, raw_feature denotes the original feature, max_value denotes the maximum value of the wide feature, and min denotes taking the smaller of the numerical feature and the maximum value of the wide feature.
5. The method according to claim 4, characterized in that performing feature transformation on the categorical features among the attribute features to obtain wide features comprises:
Encoding the categorical features with one-hot encoding;
Concatenating the encoded results into a single vector whose elements are 0 or 1;
The concatenated 0/1 vector is the wide feature.
6. The method according to claim 4, characterized in that performing feature transformation on the text feature and the value feature to obtain a deep text feature comprises:
Appending a terminator character to the text in the text feature or the value feature;
Setting a preset length for the text feature;
When the length of the text plus the terminator character is greater than the preset length, deleting the part that exceeds this maximum length, the remainder being the deep text feature;
When the length of the text plus the terminator character is less than the preset length, filling the missing part with padding characters to obtain the deep text feature.
7. The method according to claim 4, characterized in that performing feature transformation on the categorical features to obtain a deep category feature comprises:
Concatenating the categorical features;
Converting the concatenated result into a vector;
The vector is the deep category feature.
8. The method according to claim 3, characterized in that training the neural network according to the training set in the transformed features comprises:
Using the wide features in the training set as the input of the wide network in the neural network being trained;
Using the deep text features and deep category features in the training set as the input of the deep network in the neural network being trained;
Determining the neural network according to the wide network and the deep network.
9. The method according to claim 8, characterized in that the wide features and the deep category features are used for deep learning with a fully connected neural network;
The deep text features are used for deep learning with a character-level convolutional neural network.
10. The method according to claim 3, characterized in that the loss function used to train the neural network is the softmax cross-entropy loss function.
11. The method according to claim 5, characterized in that each sample in the training set for the deep learning of the neural network is determined by pairing an attribute feature with a value feature.
12. The method according to claim 1, characterized in that performing deep learning according to the transformed features to determine the type of the table field comprises:
Determining the prediction results of the deep learning;
Determining the confidence interval of the prediction results;
Determining the type of the table field according to a voting mechanism and the maximum confidence interval.
13. An apparatus for determining the type of a table field, characterized by comprising:
An original feature division module, configured to divide the original features of a table field into attribute features and value features;
A feature transformation module, configured to perform feature transformation on the attribute features and the value features respectively and determine the transformed features;
A neural network training module, configured to train a neural network according to the training set in the transformed features;
A table field type determination module, configured to perform deep learning according to the test set in the transformed features and the trained neural network to determine the type of the table field.
14. An electronic device for determining the type of a table field, characterized by comprising:
One or more processors;
A storage device for storing one or more programs,
Wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-12.
15. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-12.
CN201910043827.3A 2019-01-17 2019-01-17 The method and apparatus for determining the type of literary name section Pending CN109784407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910043827.3A CN109784407A (en) 2019-01-17 2019-01-17 The method and apparatus for determining the type of literary name section

Publications (1)

Publication Number Publication Date
CN109784407A true CN109784407A (en) 2019-05-21

Family

ID=66500986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910043827.3A Pending CN109784407A (en) 2019-01-17 2019-01-17 The method and apparatus for determining the type of literary name section

Country Status (1)

Country Link
CN (1) CN109784407A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825138A (en) * 2015-01-04 2016-08-03 北京神州泰岳软件股份有限公司 Sensitive data identification method and device
CN109196527A (en) * 2016-04-13 2019-01-11 谷歌有限责任公司 Breadth and depth machine learning model
CN108647251A (en) * 2018-04-20 2018-10-12 昆明理工大学 The recommendation sort method of conjunctive model is recycled based on wide depth door

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HENG-TZE CHENG等: "Wide & Deep Learning for Recommender Systems", 《HTTPS://ARXIV.ORG/ABS/1606.07792》 *
YOON KIM: "Convolutional Neural Networks for Sentence Classification", 《HTTPS://ARXIV.ORG/ABS/1408.5882》 *
京东数科技术说: "基于Wide & Deep网络和TextCNN的敏感字段识别", 《HTTPS://ZHUANLAN.ZHIHU.COM/P/54967619》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399912A (en) * 2019-07-12 2019-11-01 广东浪潮大数据研究有限公司 A kind of method of character recognition, system, equipment and computer readable storage medium
CN112967780A (en) * 2021-01-27 2021-06-15 安徽华米健康科技有限公司 Physical ability age prediction method, device, equipment and storage medium based on PAI
CN113312354A (en) * 2021-06-10 2021-08-27 北京百度网讯科技有限公司 Data table identification method, device, equipment and storage medium
CN113312354B (en) * 2021-06-10 2023-07-28 北京百度网讯科技有限公司 Data table identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190521