CN107992594A - Text attribute division method, apparatus, server and storage medium - Google Patents
Text attribute division method, apparatus, server and storage medium
- Publication number
- CN107992594A CN107992594A CN201711316678.0A CN201711316678A CN107992594A CN 107992594 A CN107992594 A CN 107992594A CN 201711316678 A CN201711316678 A CN 201711316678A CN 107992594 A CN107992594 A CN 107992594A
- Authority
- CN
- China
- Prior art keywords
- data
- text
- feature
- text data
- classification model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/358—Browsing; Visualisation therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present invention disclose a text attribute division method, apparatus, server and storage medium. The method includes: converting text data into vector data; feeding the vector data as input to a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data; clustering the feature data corresponding to the text data, and performing attribute division on the text data according to the clustering result. The technical solution of the embodiments of the present invention can divide text attributes automatically based on the features of the text data itself, improving efficiency.
Description
Technical field
The embodiments of the present invention relate to the field of computer technology, and in particular to a text attribute division method, apparatus, server and storage medium.
Background technology
With the rapid development of deep learning technology, deep learning algorithms have been widely applied in the field of text recognition, for example in specialized applications such as automatically recognizing "license plate numbers", "ID card numbers", "bank card numbers" and "telephone numbers" in text collections. However, these applications are essentially closed problems: a specific range of target categories is fixed in advance, and on the premise that the salient features of every target category are known, supervised training and prediction are used to decide which attribute category a new piece of text data belongs to.
However, with the rapid development of the Internet, new kinds of text data emerge endlessly. In Internet text, new attribute types may appear at any time. Mining these new attributes is particularly important for user information, but they cannot be extracted by closed-form text recognition methods, which leads to loss and waste of information. For example, if a system can only recognize the two text attributes "license plate number" and "mobile phone number", then when a new text type such as "handset identification code" appears, the system is forced to assign it to "license plate number" or "mobile phone number" with low confidence, and cannot produce an additional new attribute category. To solve this problem, the prior art generally relies on manual work: based on human judgment, the attribute set used for text recognition is expanded and modified from time to time, and labeled data sets are created for the new attributes. This approach is inefficient and wastes a great deal of manpower.
Summary of the invention
The embodiments of the present invention provide a text attribute division method, apparatus, server and storage medium that can divide text attributes automatically based on the features of the text data itself, improving efficiency.
In a first aspect, an embodiment of the present invention provides a text attribute division method, including:
converting text data into vector data;
feeding the vector data as input to a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
clustering the feature data corresponding to the text data, and performing attribute division on the text data according to the clustering result.
In a second aspect, an embodiment of the present invention further provides a text attribute division apparatus, including:
a vector data module, configured to convert text data into vector data;
a feature transformation module, configured to feed the vector data as input to a classification model built in advance on a deep neural network, and to take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
an attribute division module, configured to cluster the feature data corresponding to the text data and to perform attribute division on the text data according to the clustering result.
In a third aspect, an embodiment of the present invention further provides a server, including:
one or more processors;
a storage device configured to store one or more programs;
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the text attribute division method described above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the text attribute division method described above.
The embodiments of the present invention convert text data into vector data, feed the vector data as input to a classification model built in advance on a deep neural network, take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, cluster that feature data, and perform attribute division on the text data according to the clustering result. The technical solution of the embodiments of the present invention can divide text attributes automatically based on the features of the text data itself, improving efficiency; in addition, the parameters of the classification model are trained through the feature reconstruction part, reducing the error of attribute division.
Brief description of the drawings
Fig. 1 is a flowchart of the text attribute division method in Embodiment one of the present invention;
Fig. 2 is a flowchart of the text attribute division method in Embodiment two of the present invention;
Fig. 3 is a structural diagram of the text attribute division apparatus in Embodiment three of the present invention;
Fig. 4 is a structural diagram of the server in Embodiment four of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a flowchart of the text attribute division method in Embodiment one of the present invention. This embodiment is applicable to the division of text. The method can be performed by a text attribute division apparatus, which can be implemented in software and/or hardware, for example configured in a server. As shown in Fig. 1, the method may specifically include:
Step 110: convert text data into vector data.
In this embodiment, the text data may be a set of unstructured short texts obtained from the Internet, or a set of short text segments obtained by segmenting long texts.
Specifically, converting text data into vector data may proceed as follows. All short texts are deduplicated, and the characters within each short text are separated by spaces to form a corpus. The corpus is trained with the CBOW or Skip-gram algorithm so that each character is converted into a k-dimensional vector (k can be set according to the actual situation), and a normalization function is used to adjust the coordinates of every character in the k-dimensional space uniformly into the range 0 to 1. Then the k-dimensional vectors of the first s characters of each text are taken; texts with fewer than s characters are padded with k-dimensional zero vectors. Finally, every short text is converted into an s × k two-dimensional matrix.
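As an illustration only (not part of the patent text), the [0, 1] normalization, truncation and zero-padding described above can be sketched as follows, using randomly initialized character vectors as a stand-in for vectors trained with CBOW or Skip-gram:

```python
import numpy as np

def normalize_vectors(char_vecs):
    """Min-max normalize each of the k coordinates into [0, 1] across all characters."""
    chars = list(char_vecs)
    mat = np.array([char_vecs[c] for c in chars])          # (num_chars, k)
    lo, hi = mat.min(axis=0), mat.max(axis=0)
    mat = (mat - lo) / np.where(hi - lo == 0, 1, hi - lo)
    return {c: mat[i] for i, c in enumerate(chars)}

def text_to_matrix(text, char_vecs, s, k):
    """Convert one short text into an s x k matrix: first s characters, zero-padded."""
    rows = [char_vecs[c] for c in text[:s] if c in char_vecs]
    while len(rows) < s:
        rows.append(np.zeros(k))                           # pad short texts with zero vectors
    return np.stack(rows[:s])

# Stand-in for CBOW/Skip-gram output: a random k-dim vector per character.
rng = np.random.default_rng(0)
k, s = 4, 6
char_vecs = normalize_vectors({c: rng.normal(size=k) for c in "abc123"})
m = text_to_matrix("abc1", char_vecs, s, k)
print(m.shape)  # (6, 4)
```

In a real pipeline the character vectors would come from a trained CBOW or Skip-gram model rather than a random generator; only the shaping into s × k matrices is shown here.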
Step 120: feed the vector data as input to a classification model built in advance on a deep neural network, and take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data.
The classification model may be built on a deep neural network and may include a feature transformation part and a feature reconstruction part. Optionally, the feature transformation part consists, in order, of an input layer, a convolutional layer, a pooling layer and a long short-term memory (LSTM) network layer, and the feature reconstruction part consists, in order, of a repeat layer, an LSTM network layer, an upsampling layer and a convolutional layer. The LSTM layers may be replaced with other types of recurrent layer.
The feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data. The feature data is not limited by the existing text attribute types.
Specifically, the vector data from step 110 can be used as the input of the pre-built classification model, and the parameters of the feature transformation part and the feature reconstruction part of the classification model are trained on the deep neural network until the output and input of the feature reconstruction part satisfy an iteration stopping condition; the output of the feature transformation part is then taken as the feature data corresponding to the text data. The iteration stopping condition may be that the output of the feature reconstruction part approaches the input of the feature transformation part (the vector data) arbitrarily closely. During training of the classification model, the vectorized text data serves both as the input data and as the supervision label, which constitutes self-supervised learning without external labels.
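The self-supervised idea — the vectorized text acting as both input and target — can be illustrated with a minimal linear autoencoder (an illustration only; the patent's model uses convolutional and LSTM layers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 64, 8, 3            # samples, input dimension, feature (bottleneck) dimension
X = rng.normal(size=(n, d))   # stand-in for vectorized text data

W_enc = rng.normal(size=(d, h)) * 0.1   # "feature transformation" weights
W_dec = rng.normal(size=(h, d)) * 0.1   # "feature reconstruction" weights

def loss(W_enc, W_dec):
    R = X @ W_enc @ W_dec     # reconstruct the input from the abstracted features
    return float(np.mean((R - X) ** 2))

initial = loss(W_enc, W_dec)
lr = 0.1
for _ in range(300):
    F = X @ W_enc             # feature data: output of the transformation part
    R = F @ W_dec             # output of the reconstruction part
    G = 2.0 * (R - X) / X.size
    grad_dec = F.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final = loss(W_enc, W_dec)
print(initial > final)        # reconstruction error shrinks; no external labels used
```

The training signal is purely the reconstruction error against the input itself, which is the "self-supervision without external labels" the paragraph above describes.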
Step 130: cluster the feature data corresponding to the text data, and perform attribute division on the text data according to the clustering result.
Specifically, for the feature data corresponding to unstructured text data, the number of clusters can be determined from the magnitude of the mean standard deviation σ of the feature data across the k dimensions of the feature space: the cluster number is c = f(σ, p), where f is the function that computes c from σ and p is a tuning parameter. The function f can take many forms, such as a linear, logarithmic or exponential function. Given c, the K-Means algorithm can be used to perform unsupervised clustering of the feature data in the feature space, gathering the feature data into c clusters.
According to the cluster labels, the original short text data can be annotated, realizing the attribute division of the text data. The results can be presented as tables and/or figures, and the true meaning of each cluster, and whether it is a newly discovered attribute, can be finally confirmed manually.
This embodiment converts text data into vector data, feeds the vector data as input to a classification model built in advance on a deep neural network, takes the output of the feature transformation part of the classification model as the feature data corresponding to the text data, clusters that feature data, and performs attribute division on the text data according to the clustering result. The technical solution of this embodiment can divide text attributes automatically based on the features of the text data itself, improving efficiency; it also trains the parameters of the classification model through the feature reconstruction part, reducing the error of attribute division.
Embodiment two
Fig. 2 is a flowchart of the text attribute division method in Embodiment two of the present invention. On the basis of the above embodiment, this embodiment further describes the attribute division of structured text data. Accordingly, the method of this embodiment may include:
Step 210: organize the structured text data into two forms — data without field labels and data with field labels.
The data without field labels is obtained by removing the field names, combining the data under all fields and deduplicating it; the data with field labels retains the original field information and deduplicates the text data within each field.
Specifically, the structured text data obtained from the Internet can be organized into these two forms: data without field labels and data with field labels.
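A minimal sketch of producing the two forms from record-style structured data (the field names and values below are invented for illustration):

```python
records = [
    {"plate": "A123", "phone": "555-0100"},
    {"plate": "B456", "phone": "555-0100"},
    {"plate": "A123", "phone": "555-0199"},
]

# Form 1: no field labels -- field names dropped, all values pooled and deduplicated.
no_label = sorted({v for rec in records for v in rec.values()})

# Form 2: with field labels -- original field kept, values deduplicated per field.
with_label = {}
for rec in records:
    for field, v in rec.items():
        with_label.setdefault(field, set()).add(v)
with_label = {f: sorted(vs) for f, vs in with_label.items()}

print(no_label)
print(with_label)
```

Form 1 feeds the corpus-building of step 220, while form 2 is what gets vectorized in step 230.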
Step 220: perform character separation on the data without field labels, and use the resulting characters as the corpus.
Specifically, character separation means inserting a space between every pair of adjacent characters in each text of the data without field labels; the resulting characters can serve as the corpus.
Step 230: convert the data with field labels into vector data according to the corpus.
Specifically, the corpus obtained in step 220 is trained with the CBOW or Skip-gram algorithm so that each character is converted into a k-dimensional vector (k can be set according to the actual situation), and a normalization function is used to adjust the coordinates of every character in the k-dimensional space uniformly into the range 0 to 1. Then the k-dimensional vectors of the first s characters of each text are taken; texts with fewer than s characters are padded with k-dimensional zero vectors. All the data with field labels is thus converted into s × k two-dimensional matrices.
Step 240: feed the vector data as input to the classification model built in advance on a deep neural network, and take the output of the feature transformation part of the classification model as the feature data corresponding to the text data.
The classification model built on a deep neural network may include a feature transformation part and a feature reconstruction part. Optionally, the classification model consists, in order, of:
layer 1: an s × k input layer, used to pass the data into the neural network;
layer 2: a convolutional layer, which may be a one-dimensional convolutional layer with kernel size 1 × 3, same-padding, and k convolution kernels; this layer extracts local features in the text;
layer 3: a pooling layer, which may be a one-dimensional max-pooling layer with pool size 1 × 2; this layer resamples the data;
layer 4: an LSTM network layer with k neurons, used to understand the contextual meaning of the text in the k dimensions;
layer 5: a repeat layer, which repeats the output of the previous layer s/2 times;
layer 6: an LSTM network layer whose neuron count is set according to the size of k and which outputs the result of every step;
layer 7: an upsampling layer, which may be a one-dimensional upsampling layer with sample size 1 × 2;
layer 8: a convolutional layer, which may be a one-dimensional convolutional layer with kernel size 1 × 3, same-padding, and a number of convolution kernels set according to the size of k.
In the structure of the above classification model, layers 1 to 4 form the feature transformation part and layers 5 to 8 form the feature reconstruction part; the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data. Different activation functions and loss functions can be used in the classification model to realize the training.
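Tracking only tensor shapes (an illustration, not a full implementation), the eight layers take the s × k input down to a k-dimensional feature vector and back up to s × k, which is what allows the reconstruction to be compared against the input:

```python
s, k = 32, 16          # sequence length and character-vector dimension (example values)

def shape_flow(s, k):
    """Shape of the data after each of the eight layers described above."""
    return [
        ("input",                               (s, k)),
        ("conv 1x3, same padding, k kernels",   (s, k)),
        ("max-pool 1x2",                        (s // 2, k)),
        ("LSTM, k units, last step only",       (k,)),       # the feature data
        ("repeat s/2 times",                    (s // 2, k)),
        ("LSTM, all steps output",              (s // 2, k)),
        ("upsample 1x2",                        (s, k)),
        ("conv 1x3, same padding",              (s, k)),     # the reconstruction
    ]

flow = shape_flow(s, k)
for name, shape in flow:
    print(f"{name:40s} {shape}")
```

The layer-4 output, taken alone, is the k-dimensional feature data used for clustering; the remaining four layers exist only to force that bottleneck to retain enough information to rebuild the input.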
Specifically, the vector data from step 230 can be used as the input of the pre-built classification model, and the parameters of the feature transformation part and the feature reconstruction part of the classification model are trained on the deep neural network until the output and input of the feature reconstruction part satisfy the iteration stopping condition; the output of layer 4 is then taken as the feature data corresponding to the text data.
Step 250: cluster the feature data corresponding to the text data, and perform attribute division on the text data according to the clustering result.
For the feature data of structured text data, the data can be divided into clean data and mixed data according to whether the mean standard deviation of the feature data across the k dimensions of the feature space exceeds a preset threshold, where the threshold can be set freely: feature data below the threshold is defined as clean data, and feature data above the threshold is defined as mixed data. The clean data can be taken as one cluster. The number of clusters for the mixed data is determined from the variance of the mixed data in the feature space, and the mixed data is clustered according to that number. Finally, cluster merging is performed according to the mutual distances in the feature space between the cluster of clean data and the clusters of mixed data, yielding the clustering result of the feature data of the structured text data.
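A minimal sketch of the clean/mixed split and the distance-based merge (the thresholds, field names, and the two-way split of each mixed field are illustrative assumptions, not values fixed by the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature data per field (illustrative): "plate" is homogeneous, "misc" mixes two types.
fields = {
    "plate": rng.normal(0.0, 0.05, (40, 3)),
    "misc":  np.vstack([rng.normal(0.0, 0.05, (20, 3)),   # looks like "plate"
                        rng.normal(3.0, 0.05, (20, 3))]), # a different attribute
}

def mean_std(F):
    return float(F.std(axis=0).mean())

THRESH = 0.5   # preset threshold on the mean per-dimension std (assumed value)
clean = {f: F for f, F in fields.items() if mean_std(F) <= THRESH}
mixed = {f: F for f, F in fields.items() if mean_std(F) > THRESH}

# Each clean field forms one cluster; each mixed field is split in two here
# (the patent derives the number of clusters from the variance instead).
centers = [F.mean(axis=0) for F in clean.values()]
for F in mixed.values():
    lo, hi = F[F[:, 0] < F[:, 0].mean()], F[F[:, 0] >= F[:, 0].mean()]
    centers += [lo.mean(axis=0), hi.mean(axis=0)]

# Merge clusters whose centers lie within a distance threshold of each other.
MERGE_DIST = 1.0
merged = []
for c in centers:
    for group in merged:
        if min(np.linalg.norm(c - g) for g in group) < MERGE_DIST:
            group.append(c)
            break
    else:
        merged.append([c])
print(len(centers), len(merged))   # 3 clusters before merging, 2 after
```

The merge step is what lets a portion of a mixed field that coincides with an existing clean attribute collapse back into that attribute's cluster, leaving only genuinely new attributes as separate clusters.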
Specifically, according to the above clustering result, the original structured text data can be annotated, realizing the attribute division of the structured text data. The preset format of the structured text data may include at least one of the key-value format, the multipart format, the JSON format and the XML format.
This embodiment organizes the structured text data into two forms — data without field labels and data with field labels — performs character separation on the data without field labels to obtain the corpus, converts the data with field labels into vector data according to the corpus, feeds the vector data as input to the classification model built in advance on a deep neural network, takes the output of the feature transformation part of the classification model as the feature data corresponding to the text data, clusters that feature data, and performs attribute division on the text data according to the clustering result. The technical solution of this embodiment can divide text attributes automatically based on the features of the text data itself, improving efficiency; it also trains the parameters of the classification model through the feature reconstruction part, reducing the error of attribute division, and realizes attribute division while preserving the original information of the structured text data.
Embodiment three
Fig. 3 is a structural diagram of the text attribute division apparatus in Embodiment three of the present invention. The apparatus may include:
a vector data module 310, configured to convert text data into vector data;
a feature transformation module 320, configured to feed the vector data as input to a classification model built in advance on a deep neural network, and to take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
an attribute division module 330, configured to cluster the feature data corresponding to the text data and to perform attribute division on the text data according to the clustering result.
Further, the vector data module 310 may be specifically configured to:
organize the structured text data into two forms, data without field labels and data with field labels;
perform character separation on the data without field labels, and use the resulting characters as the corpus;
convert the data with field labels into vector data according to the corpus.
Further, the apparatus may also include a classification model building module, specifically configured to:
take the vector data corresponding to the text data as input, and train the parameters of the feature transformation part and the feature reconstruction part of the classification model on the deep neural network until the output and input of the feature reconstruction part satisfy the iteration stopping condition.
Exemplarily, the feature transformation part may consist, in order, of an input layer, a convolutional layer, a pooling layer and an LSTM network layer; the feature reconstruction part consists, in order, of a repeat layer, an LSTM network layer, an upsampling layer and a convolutional layer.
Further, the attribute division module 330 may be specifically configured to:
for the feature data corresponding to structured text data, divide it into clean data and mixed data according to its distribution characteristics in the feature space; take the clean data as one cluster; determine the number of clusters for the mixed data from its variance in the feature space, and cluster the mixed data according to that number; perform cluster merging according to the mutual distances in the feature space between the cluster of clean data and the clusters of mixed data, obtaining the clustering result of the feature data corresponding to the structured text data.
The text attribute division apparatus provided by this embodiment of the present invention can perform the text attribute division method provided by any embodiment of the present invention, and possesses the corresponding function modules and beneficial effects for executing the method.
Embodiment four
Fig. 4 is a structural diagram of the server in Embodiment four of the present invention. Fig. 4 shows a block diagram of an exemplary server 412 suitable for implementing the embodiments of the present invention. The server 412 shown in Fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 4, the server 412 takes the form of a general-purpose computing device. The components of the server 412 may include, but are not limited to: one or more processors 416, a system memory 428, and a bus 418 connecting the different system components (including the system memory 428 and the processors 416).
The bus 418 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The server 412 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the server 412, including volatile and non-volatile media, removable and non-removable media.
The system memory 428 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. The server 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 434 may be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a disk drive for reading and writing removable non-volatile magnetic disks (such as "floppy disks") and an optical disk drive for reading and writing removable non-volatile optical disks (such as CD-ROM, DVD-ROM or other optical media) may also be provided. In these cases, each drive can be connected to the bus 418 through one or more data media interfaces. The memory 428 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 440 having a set of (at least one) program modules 442 may be stored, for example, in the memory 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methods of the embodiments described in the present invention.
The server 412 may also communicate with one or more external devices 414 (such as a keyboard, pointing device or display 424), with one or more devices that enable a user to interact with the server 412, and/or with any device (such as a network card or modem) that enables the server 412 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interfaces 422. Moreover, the server 412 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 420. As shown in the figure, the network adapter 420 communicates with the other modules of the server 412 through the bus 418. It should be understood that, although not shown in the drawings, other hardware and/or software modules can be used in conjunction with the server 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
The processor 416 executes various functional applications and data processing by running programs stored in the system memory 428, for example implementing the text attribute division method provided by the embodiments of the present invention, which includes:
converting text data into vector data;
feeding the vector data as input to a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
clustering the feature data corresponding to the text data, and performing attribute division on the text data according to the clustering result.
Embodiment five
Embodiment five of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the text attribute division method provided by the embodiments of the present invention, which includes:
converting text data into vector data;
feeding the vector data as input to a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
clustering the feature data corresponding to the text data, and performing attribute division on the text data according to the clustering result.
The computer storage medium of this embodiment of the present invention may use any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it may send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described here; various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments. Without departing from the inventive concept, it may also include other equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
- 1. A method for dividing text attributes, characterized by comprising:
converting text data into vector data;
taking the vector data as the input of a classification model built in advance based on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, wherein the classification model further comprises a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
clustering the feature data corresponding to the text data, and dividing the attributes of the text data according to the clustering result.
- 2. The method according to claim 1, wherein converting text data into vector data comprises:
organizing structured text data into two forms: data without field labels and data with field labels;
performing a character-separation operation on the data without field labels, and using the resulting characters as a corpus;
converting the data with field labels into vector data according to the corpus.
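As a non-authoritative illustration of the steps recited in claim 2, the sketch below builds a character corpus from rows without field labels and converts labeled rows into vectors indexed by that corpus. The bag-of-characters encoding is an assumption for the sketch; the claim does not fix a particular vectorization.

```python
def build_corpus(unlabeled_rows):
    # Character-separate the data without field labels; the distinct
    # characters, in order of first appearance, form the corpus.
    chars = []
    for row in unlabeled_rows:
        for ch in row:
            if ch not in chars:
                chars.append(ch)
    return chars

def vectorize(labeled_rows, corpus):
    # Convert each (field_label, text) pair to a bag-of-characters
    # vector indexed by the corpus; out-of-corpus characters are skipped.
    index = {ch: i for i, ch in enumerate(corpus)}
    vectors = {}
    for label, text in labeled_rows:
        vec = [0] * len(corpus)
        for ch in text:
            if ch in index:
                vec[index[ch]] += 1
        vectors[label] = vec
    return vectors

corpus = build_corpus(["abc123", "bcd"])            # data without field labels
vecs = vectorize([("name", "abc"), ("id", "123")],  # data with field labels
                 corpus)
```

The field labels are kept alongside the vectors so that later clustering results can be mapped back to the structured text data.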
- 3. The method according to claim 1, wherein building the classification model comprises:
taking the vector data corresponding to the text data as input, and training the parameters of the feature transformation part and the feature reconstruction part of the classification model based on a deep neural network, until the output of the feature reconstruction part and the input satisfy an iteration-stopping condition.
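The training loop of claim 3 can be illustrated with a toy linear encoder/decoder pair trained by gradient descent on the reconstruction error. Everything concrete here is an assumption for the sketch: the linear layers stand in for the deep network, and "loss has stopped improving" stands in for the claim's unspecified iteration-stopping condition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 6))                 # vector data for 32 texts (toy)
W_enc = rng.normal(scale=0.1, size=(6, 3))   # feature transformation part
W_dec = rng.normal(scale=0.1, size=(3, 6))   # feature reconstruction part

lr, losses = 1.0, []
for step in range(1000):
    H = X @ W_enc                            # feature data
    X_hat = H @ W_dec                        # reconstructed vector data
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    # Assumed iteration-stopping condition: the reconstruction loss has
    # stopped improving between iterations.
    if step > 0 and losses[-2] - losses[-1] < 1e-9:
        break
    G = 2.0 * err / err.size                 # d(loss) / d(X_hat)
    grad_dec = H.T @ G                       # gradient for the decoder
    grad_enc = X.T @ (G @ W_dec.T)           # gradient for the encoder
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

After training, only the encoder (`W_enc` here) is kept for feature extraction; the reconstruction part exists to force the features to preserve the information in the vector data.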
- 4. The method according to claim 1, wherein the feature transformation part comprises, in order, an input layer, a convolutional layer, a pooling layer and a long short-term memory (LSTM) network layer; and the feature reconstruction part comprises, in order, a repeat layer, an LSTM network layer, an upsampling layer and a convolutional layer.
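A shape-level sketch of the layer ordering in claim 4, using simplified numpy stand-ins: a plain tanh recurrence replaces the LSTM, all weights are random and untrained, and the only point is to show how tensor shapes move through input, convolution, pooling and LSTM on the way in, and repeat, LSTM, upsampling and convolution on the way back out.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    # 'valid' 1-D convolution: x is (T, Cin), w is (K, Cin, Cout).
    K = w.shape[0]
    return np.stack([np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0] - K + 1)])

def max_pool(x, size=2):
    # Halve the time dimension by max-pooling.
    T = x.shape[0] // size * size
    return x[:T].reshape(-1, size, x.shape[1]).max(axis=1)

def toy_rnn(x, W, U, return_seq=False):
    # Plain tanh recurrence standing in for an LSTM layer.
    h, out = np.zeros(U.shape[0]), []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W + h @ U)
        out.append(h)
    return np.stack(out) if return_seq else h

T, Cin, Cconv, D = 8, 5, 4, 3
x = rng.normal(size=(T, Cin))                         # input layer: (8, 5)
h = conv1d(x, rng.normal(size=(3, Cin, Cconv)))       # convolutional layer: (6, 4)
h = max_pool(h)                                       # pooling layer: (3, 4)
feat = toy_rnn(h, rng.normal(size=(Cconv, D)),
               rng.normal(size=(D, D)))               # "LSTM" layer -> feature data: (3,)

r = np.tile(feat, (3, 1))                             # repeat layer: (3, 3)
r = toy_rnn(r, rng.normal(size=(D, D)),
            rng.normal(size=(D, D)), return_seq=True)  # "LSTM" layer: (3, 3)
r = np.repeat(r, 2, axis=0)                           # upsampling layer: (6, 3)
r = np.vstack([np.zeros((2, D)), r, np.zeros((2, D))])  # zero-pad for 'valid' conv
x_hat = conv1d(r, rng.normal(size=(3, D, Cin)))       # convolutional layer: (8, 5)
```

The repeat layer tiles the fixed-size feature vector back into a sequence so the decoder can rebuild a tensor with the input's shape, mirroring the encoder's pooling with upsampling.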
- 5. The method according to claim 1, wherein clustering the feature data corresponding to the text data comprises:
for the feature data corresponding to structured text data, dividing the feature data into clean data and mixed data according to the distribution characteristics of the feature data in the feature space;
taking the clean data as one cluster;
determining the number of clusters of the mixed data according to the variance of the mixed data in the feature space, and clustering the mixed data according to that number of clusters;
merging clusters according to the mutual distances, in the feature space, between the cluster of clean data and the clusters of mixed data, to obtain the clustering result of the feature data corresponding to the structured text data.
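Claim 5's clustering flow can be sketched under stated assumptions: "clean" feature points are taken to be those near the global mean, the number of mixed clusters is chosen from the variance of the mixed points, and clusters whose centres fall within an assumed distance threshold are merged. None of these concrete criteria come from the claim itself.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy feature data: a dense "clean" mass near the origin plus two
# outlying "mixed" groups.
feats = np.vstack([rng.normal(0.0, 0.2, (20, 2)),
                   rng.normal((4.0, 4.0), 0.2, (6, 2)),
                   rng.normal((-4.0, -4.0), 0.2, (6, 2))])

# Split by distribution in feature space (assumed criterion: distance
# to the global mean below the average distance marks "clean" points).
dist = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
clean = feats[dist <= dist.mean()]
mixed = feats[dist > dist.mean()]

# Choose the number of mixed clusters from their variance (assumption).
k = 2 if mixed.var() > 1.0 else 1

# Cluster the mixed points with a tiny k-means.
centers = mixed[np.linspace(0, len(mixed) - 1, k).astype(int)].copy()
for _ in range(10):
    d = np.linalg.norm(mixed[:, None] - centers[None], axis=2)
    lab = d.argmin(axis=1)
    centers = np.array([mixed[lab == j].mean(axis=0) for j in range(k)])

# Merge clusters whose centres lie within an assumed distance threshold.
groups = [clean] + [mixed[lab == j] for j in range(k)]
centres = np.array([g.mean(axis=0) for g in groups])
final, used = [], set()
for i in range(len(groups)):
    if i in used:
        continue
    g = groups[i]
    for j in range(i + 1, len(groups)):
        if j not in used and np.linalg.norm(centres[i] - centres[j]) < 1.0:
            g = np.vstack([g, groups[j]])
            used.add(j)
    final.append(g)
```

On this toy data the clean mass stays as one cluster, the mixed points split into two, and no pair of centres is close enough to merge, so three clusters survive.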
- 6. A device for dividing text attributes, characterized by comprising:
a vector data module, configured to convert text data into vector data;
a feature transformation module, configured to take the vector data as the input of a classification model built in advance based on a deep neural network, and to take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, wherein the classification model further comprises a feature reconstruction part, the feature transformation part is used to abstract feature data from the vector data, and the feature reconstruction part is used to reconstruct vector data from the feature data;
an attribute division module, configured to cluster the feature data corresponding to the text data, and to divide the attributes of the text data according to the clustering result.
- 7. The device according to claim 6, wherein the vector data module is specifically configured to:
organize structured text data into two forms: data without field labels and data with field labels;
perform a character-separation operation on the data without field labels, and use the resulting characters as a corpus;
convert the data with field labels into vector data according to the corpus.
- 8. The device according to claim 6, further comprising:
a classification model building module, specifically configured to take the vector data corresponding to the text data as input, and to train the parameters of the feature transformation part and the feature reconstruction part of the classification model based on a deep neural network, until the output of the feature reconstruction part and the input satisfy an iteration-stopping condition.
- 9. A server, characterized in that the server comprises:
one or more processors; and
a storage device, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for dividing text attributes according to any one of claims 1-5.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the method for dividing text attributes according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711316678.0A CN107992594A (en) | 2017-12-12 | 2017-12-12 | A kind of division methods of text attribute, device, server and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107992594A true CN107992594A (en) | 2018-05-04 |
Family
ID=62035893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711316678.0A Pending CN107992594A (en) | 2017-12-12 | 2017-12-12 | A kind of division methods of text attribute, device, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107992594A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104035431A (en) * | 2014-05-22 | 2014-09-10 | 清华大学 | Obtaining method and system for kernel function parameters applied to nonlinear process monitoring |
WO2016105803A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Hybrid technique for sentiment analysis |
AU2016256753A1 (en) * | 2016-01-13 | 2017-07-27 | Adobe Inc. | Image captioning using weak supervision and semantic natural language vector space |
CN107180023A (en) * | 2016-03-11 | 2017-09-19 | 科大讯飞股份有限公司 | A kind of file classification method and system |
CN106778853A (en) * | 2016-12-07 | 2017-05-31 | 中南大学 | Unbalanced data sorting technique based on weight cluster and sub- sampling |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413986A (en) * | 2019-04-12 | 2019-11-05 | 上海晏鼠计算机技术股份有限公司 | A kind of text cluster multi-document auto-abstracting method and system improving term vector model |
CN110413986B (en) * | 2019-04-12 | 2023-08-29 | 上海晏鼠计算机技术股份有限公司 | Text clustering multi-document automatic summarization method and system for improving word vector model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180504 |