CN107992594A - Text attribute division method, apparatus, server and storage medium - Google Patents

Text attribute division method, apparatus, server and storage medium Download PDF

Info

Publication number
CN107992594A
CN107992594A (application CN201711316678.0A)
Authority
CN
China
Prior art keywords
data
text
feature
text data
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711316678.0A
Other languages
Chinese (zh)
Inventor
谢永恒
冯宇波
火一莽
董清风
万月亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201711316678.0A priority Critical patent/CN107992594A/en
Publication of CN107992594A publication Critical patent/CN107992594A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/358Browsing; Visualisation therefor


Abstract

Embodiments of the present invention disclose a text attribute division method, apparatus, server and storage medium. The method includes: converting text data into vector data; feeding the vector data into a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data; clustering the feature data corresponding to the text data, and dividing the text data into attributes according to the clustering result. The technical solution of the embodiments can divide text attributes automatically based on the features of the text data itself, improving efficiency.

Description

Text attribute division method, apparatus, server and storage medium
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a text attribute division method, apparatus, server and storage medium.
Background technology
With the rapid development of deep learning technology, deep learning algorithms have been widely applied to text recognition, for example in task-specific applications that automatically recognize license plate numbers, ID card numbers, bank card numbers or telephone numbers in a text collection. However, these applications are essentially closed problems: the range of target categories is fixed in advance, and, on the premise that the dominant features of every target category are known, supervised training and prediction are used to decide which attribute category a new piece of text data belongs to.
However, with the rapid development of the Internet, new kinds of text data emerge endlessly. In Internet text, new attribute types may appear at any time. Mining these new attributes is particularly important for user information, yet they cannot be extracted by closed-form text recognition methods, which leads to loss and waste of information. For example, if a system can only recognize the two text attributes "license plate number" and "mobile phone number", then when a new text type such as "handset identification code" appears, the system is forced to assign it to "license plate number" or "mobile phone number" with low confidence, and cannot produce an additional new attribute category.
To solve this problem, the prior art generally uses a manual approach: based on human judgment, the set of attributes recognized from text is irregularly expanded and modified, and labeled data sets are created for the new attributes. This approach is inefficient and wastes a great deal of manpower.
Summary of the invention
Embodiments of the present invention provide a text attribute division method, apparatus, server and storage medium that can divide text attributes automatically based on the features of the text data itself, improving efficiency.
In a first aspect, an embodiment of the present invention provides a text attribute division method, including:
converting text data into vector data;
feeding the vector data into a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data;
clustering the feature data corresponding to the text data, and dividing the text data into attributes according to the clustering result.
In a second aspect, an embodiment of the present invention further provides a text attribute division apparatus, including:
a vector data module, configured to convert text data into vector data;
a feature transformation module, configured to feed the vector data into a classification model built in advance on a deep neural network and to take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data;
an attribute division module, configured to cluster the feature data corresponding to the text data and to divide the text data into attributes according to the clustering result.
In a third aspect, an embodiment of the present invention further provides a server, including:
one or more processors;
a storage device for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the text attribute division method described above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the text attribute division method described above.
In the embodiments of the present invention, text data is converted into vector data, the vector data is fed into a classification model built in advance on a deep neural network, the output of the feature transformation part of the classification model is taken as the feature data corresponding to the text data, the feature data is clustered, and the text data is divided into attributes according to the clustering result. This technical solution can divide text attributes automatically based on the features of the text data itself, improving efficiency; it also trains the parameters of the classification model through the feature reconstruction part, reducing the error of attribute division.
Brief description of the drawings
Fig. 1 is a flow chart of the text attribute division method in embodiment one of the present invention;
Fig. 2 is a flow chart of the text attribute division method in embodiment two of the present invention;
Fig. 3 is a structural schematic diagram of the text attribute division apparatus in embodiment three of the present invention;
Fig. 4 is a structural schematic diagram of the server in embodiment four of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a flow chart of the text attribute division method in embodiment one of the present invention. This embodiment is applicable to the division of text. The method can be performed by a text attribute division apparatus, which can be implemented in software and/or hardware, for example an apparatus configured in a server. As shown in Fig. 1, the method may specifically include:
Step 110: convert the text data into vector data.
In this embodiment, the text data may be a set of unstructured short texts obtained from the Internet, or a set of short text fragments obtained by segmenting long texts.
Specifically, the process of converting text data into vector data may be as follows. After de-duplicating all short texts, the characters of each short text are separated by spaces and used as the corpus. Training with the CBOW or Skip-gram algorithm converts each character into a k-dimensional vector (k can be set according to the actual situation), and a normalization function uniformly adjusts the coordinates of each character in the k-dimensional space into the range 0 to 1. Then the k-dimensional vectors of the first s characters of each text are taken; texts shorter than s characters are padded with k-dimensional zero vectors. Finally, every short text is converted into an s × k two-dimensional matrix.
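The de-duplication, normalization and padding steps above can be sketched as follows. This is a minimal illustration: the character embeddings are assumed to be already trained (in the patent they come from CBOW or Skip-gram), and `texts_to_matrices` is a hypothetical helper name, not from the patent.

```python
import numpy as np

def texts_to_matrices(texts, char_vectors, s, k):
    """Convert short texts into s x k matrices.

    char_vectors: dict mapping each character to a k-dim vector
    (assumed already trained with CBOW/Skip-gram). Coordinates are
    min-max normalized into [0, 1] per dimension, texts are truncated
    to the first s characters, and shorter texts are zero-padded.
    """
    # De-duplicate the short texts while preserving order.
    texts = list(dict.fromkeys(texts))

    # Min-max normalize every embedding dimension into [0, 1].
    all_vecs = np.array(list(char_vectors.values()), dtype=float)
    lo, hi = all_vecs.min(axis=0), all_vecs.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    norm = {c: (np.asarray(v, dtype=float) - lo) / span
            for c, v in char_vectors.items()}

    matrices = []
    for text in texts:
        rows = [norm[c] for c in text[:s] if c in norm]
        while len(rows) < s:              # zero-pad texts shorter than s
            rows.append(np.zeros(k))
        matrices.append(np.array(rows))
    return texts, matrices
```

Every text, regardless of length, then enters the classification model as a uniform s × k matrix.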
Step 120: feed the vector data into a classification model built in advance on a deep neural network, and take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data.
The classification model may be built on a deep neural network and may include a feature transformation part and a feature reconstruction part. Optionally, the feature transformation part consists, in order, of an input layer, a convolutional layer, a pooling layer and a long short-term memory (LSTM) layer, and the feature reconstruction part consists, in order, of a repeat layer, an LSTM layer, an upsampling layer and a convolutional layer. The LSTM layers may be replaced by other types of recurrent layers.
The feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data. The feature data is not limited by the existing text attribute types.
Specifically, the vector data of step 110 can be used as the input of the pre-built classification model, and the parameters of the feature transformation part and the feature reconstruction part are trained on the deep neural network until the output of the feature reconstruction part and its input satisfy an iteration stopping condition; the output of the feature transformation part is then taken as the feature data corresponding to the text data. The iteration stopping condition may be that the output of the feature reconstruction part approaches the input of the feature transformation part (the vector data) arbitrarily closely. During training, the vectorized text data serves both as the input data and as the supervision target, which constitutes self-supervised learning without external labels.
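The self-supervised reconstruction objective can be illustrated with a deliberately simplified model. The patent's network uses convolutional, pooling and LSTM layers; the sketch below substitutes a single linear encoder W (standing in for the feature transformation part) and decoder V (the feature reconstruction part), trained by gradient descent until the reconstruction approaches the input. The function name and the linear form are illustrative assumptions, not the patent's architecture.

```python
import numpy as np

def train_autoencoder(X, d, steps=4000, lr=0.1, tol=1e-4):
    """Self-supervised reconstruction training: the vectorized text X
    (n x k) is both the input and the supervision target, so no
    external labels are needed. W plays the role of the feature
    transformation part, V the feature reconstruction part."""
    rng = np.random.default_rng(0)
    n, k = X.shape
    W = rng.normal(scale=0.3, size=(k, d))  # encoder: vector data -> feature data
    V = rng.normal(scale=0.3, size=(d, k))  # decoder: feature data -> vector data
    loss = float("inf")
    for _ in range(steps):
        H = X @ W                 # abstracted feature data
        R = H @ V                 # reconstructed vector data
        E = R - X                 # reconstruction error
        loss = float((E ** 2).mean())
        if loss < tol:            # iteration stopping condition
            break
        gV = 2.0 * H.T @ E / E.size
        gW = 2.0 * X.T @ (E @ V.T) / E.size
        W -= lr * gW
        V -= lr * gV
    return W, V, loss
```

After training, `X @ W` gives the feature data that is later clustered; the decoder V exists only to force W to preserve the information in the input.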
Step 130: cluster the feature data corresponding to the text data, and divide the text data into attributes according to the clustering result.
Specifically, for the feature data corresponding to unstructured text data, the number of clusters can be determined from the magnitude of the k-dimensional standard deviation σ of the feature data in the feature space: the cluster number c = f(σ, p), where f is the function that computes c from σ and p is a tuning parameter. The function f may take many forms, such as a linear, logarithmic or exponential function. Given c, the K-Means algorithm can perform unsupervised clustering of the feature data in the feature space, gathering it into c clusters.
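A sketch of this step, under stated assumptions: the patent leaves f open, so a linear f(σ, p) = p·σ is assumed here, and a minimal K-Means stands in for any standard implementation. Both function names are illustrative.

```python
import numpy as np

def cluster_count(features, p=2.0):
    """c = f(sigma, p): sigma is the k-dimensional standard deviation
    of the feature data, averaged over dimensions; a linear f is
    assumed (the patent also allows log or exponential forms)."""
    sigma = float(features.std(axis=0).mean())
    return max(2, int(round(p * sigma)))

def kmeans(features, c, iters=50, seed=0):
    """Minimal K-Means over the feature space."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), c, replace=False)]
    for _ in range(iters):
        # squared distance of every point to every center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(c):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(0)
    return labels, centers
```

The cluster labels returned here are what step 130 maps back onto the original short texts as attribute labels.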
According to the cluster labels, the original short text data can be annotated, realizing the attribute division of the text data. The result can be presented in tables and/or figures, and the actual meaning of each cluster, and whether it is a newly discovered attribute, can be finally confirmed manually.
In this embodiment, text data is converted into vector data, the vector data is fed into a classification model built in advance on a deep neural network, the output of the feature transformation part of the classification model is taken as the feature data corresponding to the text data, the feature data is clustered, and the text data is divided into attributes according to the clustering result. This technical solution can divide text attributes automatically based on the features of the text data itself, improving efficiency; it also trains the parameters of the classification model through the feature reconstruction part, reducing the error of attribute division.
Embodiment two
Fig. 2 is a flow chart of the text attribute division method in embodiment two of the present invention. On the basis of the above embodiment, this embodiment further illustrates the attribute division of structured text data. Accordingly, the method of this embodiment may include:
Step 210: organize the structured text data into data of two forms, without field labels and with field labels.
The data without field labels may be the data obtained by removing the field names, combining the data under all fields and de-duplicating it; the data with field labels may be the data that retains the original field information, with the text data within each field de-duplicated.
Specifically, the structured text data obtained from the Internet can be organized into data of these two forms: data without field labels and data with field labels.
Step 220: perform a character separation operation on the data without field labels, and use the resulting characters as the corpus.
Specifically, a character separation operation can be performed on the data without field labels, i.e. the characters of each text are separated by spaces, and the resulting characters serve as the corpus.
Step 230: convert the data with field labels into vector data according to the corpus.
Specifically, the corpus obtained in step 220 can be trained with the CBOW or Skip-gram algorithm, so that each character is converted into a k-dimensional vector (k can be set according to the actual situation), and a normalization function uniformly adjusts the coordinates of each character in the k-dimensional space into the range 0 to 1. Then the k-dimensional vectors of the first s characters of each text are taken; texts shorter than s characters are padded with k-dimensional zero vectors. All the data with field labels is thus converted into s × k two-dimensional matrices.
Step 240: feed the vector data into a classification model built in advance on a deep neural network, and take the output of the feature transformation part of the classification model as the feature data corresponding to the text data.
The classification model built on a deep neural network may include a feature transformation part and a feature reconstruction part. Optionally, the classification model consists, in order, of:
layer 1, an s × k input layer, which passes the data to the neural network;
layer 2, a convolutional layer, which may be a one-dimensional convolutional layer with kernel size 1 × 3 and "same" padding, with the number of kernels set to k; this layer extracts local features from the text;
layer 3, a pooling layer, which may be a one-dimensional max-pooling layer with pool size 1 × 2; this layer resamples the data;
layer 4, an LSTM layer with the number of neurons set to k, which captures the contextual meaning of the text in the k dimensions;
layer 5, a repeat layer, which repeats the output of the previous layer s/2 times;
layer 6, an LSTM layer with the number of neurons set according to k, which outputs the result of every step;
layer 7, an upsampling layer, which may be a one-dimensional upsampling layer with sample size 1 × 2;
layer 8, a convolutional layer, which may be a one-dimensional convolutional layer with kernel size 1 × 3 and "same" padding, with the number of kernels set according to k.
In the above structure, layers 1 to 4 form the feature transformation part and layers 5 to 8 form the feature reconstruction part. The feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data. Different activation functions and loss functions can be used to train the classification model.
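The tensor shapes implied by this eight-layer description can be traced explicitly. The patent does not state the shapes; they are inferred here from "same" convolution padding, 1 × 2 pooling/upsampling, and the repeat count of s/2, and the function name is a hypothetical helper.

```python
def autoencoder_shapes(s, k):
    """Trace per-layer output shapes through the 8 layers above
    (shapes inferred from the layer descriptions, not stated in
    the patent)."""
    shapes = [("input", (s, k))]                 # layer 1: s x k input
    shapes.append(("conv", (s, k)))              # layer 2: 1-d conv, same padding, k kernels
    shapes.append(("pool", (s // 2, k)))         # layer 3: 1x2 max pooling halves the length
    shapes.append(("lstm_last", (k,)))           # layer 4: LSTM final state = feature data
    shapes.append(("repeat", (s // 2, k)))       # layer 5: repeat the k-vector s/2 times
    shapes.append(("lstm_all", (s // 2, k)))     # layer 6: LSTM outputting every step
    shapes.append(("upsample", (s, k)))          # layer 7: 1x2 upsampling restores the length
    shapes.append(("conv_out", (s, k)))          # layer 8: conv back to the input shape
    return shapes
```

The symmetry is the point: layer 4 compresses each s × k matrix to a k-dimensional feature vector, and layers 5 to 8 expand it back to s × k so that reconstruction error can drive training.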
Specifically, the vector data of step 230 can be used as the input of the pre-built classification model, and the parameters of the feature transformation part and the feature reconstruction part are trained on the deep neural network until the output of the feature reconstruction part and its input satisfy the iteration stopping condition; the output of layer 4 is then taken as the feature data corresponding to the text data.
Step 250: cluster the feature data corresponding to the text data, and divide the text data into attributes according to the clustering result.
For the feature data of structured text data, the data can be divided into clean data and mixed data according to whether the k-dimensional standard deviation of the feature data in the feature space exceeds a preset threshold, where the threshold can be set freely: feature data below the threshold is defined as clean data, and feature data above the threshold is defined as mixed data. The clean data can be taken as one cluster; the number of clusters of the mixed data is determined from the variance of the mixed data in the feature space, and the mixed data is clustered according to that number. After clustering, clusters are merged according to the mutual distances between the cluster of clean data and the clusters of mixed data in the feature space, yielding the clustering result of the feature data of the structured text data.
Specifically, according to the above clustering result, the original structured text data can be annotated, realizing the attribute division of the structured text data. The preset format of the structured text data may include at least one of the key-value format, the multipart format, the json format and the xml format.
In this embodiment, structured text data is organized into data of two forms, without field labels and with field labels; a character separation operation is performed on the data without field labels and the resulting characters are used as the corpus; the data with field labels is converted into vector data according to the corpus; the vector data is fed into a classification model built in advance on a deep neural network, and the output of the feature transformation part of the classification model is taken as the feature data corresponding to the text data; the feature data is clustered, and the text data is divided into attributes according to the clustering result. This technical solution can divide text attributes automatically based on the features of the text data itself, improving efficiency; it also trains the parameters of the classification model through the feature reconstruction part, reducing the error of attribute division, and realizes attribute division while preserving the original information of the structured text data.
Embodiment three
Fig. 3 is a structural schematic diagram of the text attribute division apparatus in embodiment three of the present invention. The apparatus may include:
a vector data module 310, configured to convert text data into vector data;
a feature transformation module 320, configured to feed the vector data into a classification model built in advance on a deep neural network and to take the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data;
an attribute division module 330, configured to cluster the feature data corresponding to the text data and to divide the text data into attributes according to the clustering result.
Further, the vector data module 310 may specifically be configured to:
organize structured text data into data of two forms, without field labels and with field labels;
perform a character separation operation on the data without field labels, and use the resulting characters as the corpus;
convert the data with field labels into vector data according to the corpus.
Further, the apparatus may also include a classification model building module, specifically configured to:
take the vector data corresponding to the text data as input, and train the parameters of the feature transformation part and the feature reconstruction part of the classification model on the deep neural network until the output and the input of the feature reconstruction part satisfy the iteration stopping condition.
Exemplarily, the feature transformation part may consist, in order, of an input layer, a convolutional layer, a pooling layer and an LSTM layer; the feature reconstruction part consists, in order, of a repeat layer, an LSTM layer, an upsampling layer and a convolutional layer.
Further, the attribute division module 330 may specifically be configured to:
for the feature data corresponding to structured text data, divide the feature data into clean data and mixed data according to its distribution in the feature space; take the clean data as one cluster; determine the number of clusters of the mixed data from the variance of the mixed data in the feature space, and cluster the mixed data according to that number; and merge clusters according to the mutual distances between the cluster of clean data and the clusters of mixed data in the feature space, obtaining the clustering result of the feature data corresponding to the structured text data.
The text attribute division apparatus provided by the embodiment of the present invention can perform the text attribute division method provided by any embodiment of the present invention, and has the corresponding function modules and beneficial effects for performing the method.
Example IV
Fig. 4 is a structural schematic diagram of the server in embodiment four of the present invention, showing a block diagram of an exemplary server 412 suitable for implementing embodiments of the present invention. The server 412 shown in Fig. 4 is only an example and should not impose any restriction on the function and scope of use of the embodiments of the present invention.
As shown in Fig. 4, the server 412 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors 416, a system memory 428, and a bus 418 connecting the different system components (including the system memory 428 and the processors 416).
The bus 418 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The server 412 typically includes a variety of computer-system-readable media. These media can be any usable media accessible by the server 412, including volatile and non-volatile media, and removable and non-removable media.
The system memory 428 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. The server 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 434 may be used for reading from and writing to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media), can be provided. In these cases, each drive can be connected to the bus 418 through one or more data media interfaces. The memory 428 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 440 having a set of (at least one) program modules 442 may be stored, for example, in the memory 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methods of the embodiments described in the present invention.
The server 412 may also communicate with one or more external devices 414 (such as a keyboard, a pointing device, a display 424, etc.), with one or more devices that enable a user to interact with the server 412, and/or with any device (such as a network card, a modem, etc.) that enables the server 412 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interfaces 422. Moreover, the server 412 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 420. As shown in the figure, the network adapter 420 communicates with the other modules of the server 412 through the bus 418. It should be understood that, although not shown in the drawings, other hardware and/or software modules can be used in conjunction with the server 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems, etc.
The processor 416 runs the programs stored in the system memory 428, thereby performing various function applications and data processing, such as implementing the text attribute division method provided by the embodiments of the present invention, which includes:
converting text data into vector data;
feeding the vector data into a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data;
clustering the feature data corresponding to the text data, and dividing the text data into attributes according to the clustering result.
Embodiment five
Embodiment five of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the text attribute division method provided by the embodiments of the present invention, which includes:
converting text data into vector data;
feeding the vector data into a classification model built in advance on a deep neural network, and taking the output of the feature transformation part of the classification model as the feature data corresponding to the text data, where the classification model further includes a feature reconstruction part, the feature transformation part abstracts feature data from the vector data, and the feature reconstruction part reconstructs vector data from the feature data;
clustering the feature data corresponding to the text data, and dividing the text data into attributes according to the clustering result.
The computer storage medium of the embodiments of the present invention may use any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only the preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here; various obvious changes, readjustments, and substitutions may be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to the above embodiments; without departing from the inventive concept, it may also include other equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

  1. A method for dividing text attributes, characterized by comprising:
    converting text data into vector data;
    feeding the vector data as input to a classification model built in advance on a deep neural network, and taking the output of the feature transform part of the classification model as the characteristic data corresponding to the text data, wherein the classification model further comprises a feature reconstruction part, the feature transform part is used to abstract characteristic data from the vector data, and the feature reconstruction part is used to reconstruct the vector data from the characteristic data;
    clustering the characteristic data corresponding to the text data, and dividing the attributes of the text data according to the clustering result.
  2. The method according to claim 1, characterized in that converting the text data into vector data comprises:
    organizing structured text data into data of two forms: data without field labels and data with field labels;
    performing a character separation operation on the data without field labels, and using the resulting characters as a corpus;
    converting the data with field labels into vector data according to the corpus.
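As a rough illustration of claim 2 (the row formats, variable names, and padding scheme below are invented for the example, not taken from the patent), character separation over the unlabelled rows can build a corpus that then drives the vectorization of the labelled rows:

```python
import numpy as np

unlabeled = ["u1:alice@mail.com", "u2:13901234567"]        # no field labels
labeled = [("mail", "alice@mail.com"), ("phone", "13901234567")]

# Character separation over the unlabelled data yields the corpus.
corpus = sorted(set("".join(unlabeled)))
char_to_id = {ch: i + 1 for i, ch in enumerate(corpus)}    # 0 reserved for padding

def to_ids(text, max_len=16):
    # Map each character through the corpus index, padding to a fixed length.
    ids = [char_to_id.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Only the field-labelled data is converted into vector data.
vectors = np.array([to_ids(value) for _, value in labeled])
print(vectors.shape)  # (2, 16)
```

A real system would likely replace the integer ids with learned character embeddings before feeding the classification model.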
  3. The method according to claim 1, characterized in that building the classification model comprises:
    taking the vector data corresponding to the text data as input, and training the parameters of the feature transform part and the feature reconstruction part of the classification model based on the deep neural network, until the output of the feature reconstruction part and the input meet an iteration stop condition.
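The training described in claim 3 is an autoencoder objective: keep adjusting both parts until the reconstruction matches the input closely enough. A minimal sketch, using a linear toy model in place of the conv/LSTM network of the claim (the rank-3 synthetic data, learning rate, and stop threshold are all assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in vector data with low intrinsic dimension (rank 3), so that a
# 3-unit bottleneck can reconstruct it well.
latent = rng.normal(size=(64, 3))
mixing = rng.normal(size=(3, 8))
data = latent @ mixing

# Linear stand-ins for the two parts: encoder w_enc, decoder w_dec.
w_enc = rng.normal(scale=0.1, size=(8, 3))
w_dec = rng.normal(scale=0.1, size=(3, 8))

def reconstruction_loss():
    return (((data @ w_enc) @ w_dec - data) ** 2).mean()

initial_loss = reconstruction_loss()
lr, tol = 0.01, 1e-3
for step in range(2000):
    feats = data @ w_enc                 # feature-transform part
    recon = feats @ w_dec                # feature-reconstruction part
    err = recon - data
    if (err ** 2).mean() < tol:          # iteration-stop condition
        break
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = feats.T @ err / len(data)
    grad_enc = data.T @ (err @ w_dec.T) / len(data)
    w_dec -= lr * grad_dec
    w_enc -= lr * grad_enc

final_loss = reconstruction_loss()
```

Once training stops, the decoder is discarded and `data @ w_enc` plays the role of the characteristic data used for clustering.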
  4. The method according to claim 1, characterized in that the feature transform part comprises, in sequence, an input layer, a convolutional layer, a pooling layer, and a long short-term memory network layer; and the feature reconstruction part comprises, in sequence, a repeat layer, a long short-term memory network layer, an upsampling layer, and a convolutional layer.
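The layer sequence in claim 4 is easiest to follow as a shape walk-through. The numpy sketch below is only a shape illustration: the kernels are random, and the LSTM layers are replaced by crude stand-ins (a mean over time in the encoder, an identity in the decoder), so it shows how tensors flow, not how the real network computes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 1))        # input layer: (batch, time, channels)

def conv1d(t, out_ch, width=3):
    # "Valid" 1-D convolution with a random (untrained) kernel.
    k = rng.normal(size=(width, t.shape[2], out_ch))
    steps = t.shape[1] - width + 1
    return np.stack([np.tensordot(t[:, i:i + width], k, axes=([1, 2], [0, 1]))
                     for i in range(steps)], axis=1)

def maxpool(t, size=2):
    # Max over non-overlapping windows along the time axis.
    trimmed = t[:, :t.shape[1] // size * size]
    return trimmed.reshape(t.shape[0], -1, size, t.shape[2]).max(axis=2)

# Feature transform part: convolutional layer -> pooling layer -> "LSTM"
# (stand-in: mean over time, yielding one feature vector per sample).
h = conv1d(x, out_ch=8)                # (4, 14, 8)
h = maxpool(h)                         # (4, 7, 8)
feature = h.mean(axis=1)               # (4, 8): the characteristic data

# Feature reconstruction part: repeat layer -> "LSTM" (identity stand-in)
# -> upsampling layer -> convolutional layer.
d = np.repeat(feature[:, None, :], 7, axis=1)   # repeat: (4, 7, 8)
d = np.repeat(d, 2, axis=1)                     # upsample: (4, 14, 8)
recon = conv1d(d, out_ch=1)                     # (4, 12, 1)
```

The repeat layer broadcasts the single feature vector back across the time axis so the decoder mirrors the encoder, which is what lets the reconstruction loss of claim 3 be computed against the original sequence.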
  5. The method according to claim 1, characterized in that clustering the characteristic data corresponding to the text data comprises:
    for characteristic data corresponding to structured text data, dividing the characteristic data into clean data and mixed data according to the distribution characteristics of the characteristic data in the feature space; taking the clean data as one cluster; determining the number of clusters of the mixed data according to the variance of the mixed data in the feature space, and clustering the mixed data according to that number of clusters; and merging clusters according to the mutual distance in the feature space between the clean-data cluster and the mixed-data clusters, to obtain the clustering result of the characteristic data corresponding to the structured text data.
  6. A device for dividing text attributes, characterized by comprising:
    a vector data module, configured to convert text data into vector data;
    a feature conversion module, configured to feed the vector data as input to a classification model built in advance on a deep neural network, and to take the output of the feature transform part of the classification model as the characteristic data corresponding to the text data, wherein the classification model further comprises a feature reconstruction part, the feature transform part is used to abstract characteristic data from the vector data, and the feature reconstruction part is used to reconstruct the vector data from the characteristic data; and
    an attribute division module, configured to cluster the characteristic data corresponding to the text data and to divide the attributes of the text data according to the clustering result.
  7. The device according to claim 6, characterized in that the vector data module is specifically configured to:
    organize structured text data into data of two forms: data without field labels and data with field labels;
    perform a character separation operation on the data without field labels, and use the resulting characters as a corpus;
    convert the data with field labels into vector data according to the corpus.
  8. The device according to claim 6, characterized by further comprising:
    a classification model building module, specifically configured to: take the vector data corresponding to the text data as input, and train the parameters of the feature transform part and the feature reconstruction part of the classification model based on the deep neural network, until the output of the feature reconstruction part and the input meet an iteration stop condition.
  9. A server, characterized in that the server comprises:
    one or more processors; and
    a storage device for storing one or more programs;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for dividing text attributes according to any one of claims 1-5.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the method for dividing text attributes according to any one of claims 1-5.
CN201711316678.0A 2017-12-12 2017-12-12 A kind of division methods of text attribute, device, server and storage medium Pending CN107992594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711316678.0A CN107992594A (en) 2017-12-12 2017-12-12 A kind of division methods of text attribute, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711316678.0A CN107992594A (en) 2017-12-12 2017-12-12 A kind of division methods of text attribute, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN107992594A true CN107992594A (en) 2018-05-04

Family

ID=62035893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711316678.0A Pending CN107992594A (en) 2017-12-12 2017-12-12 A kind of division methods of text attribute, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN107992594A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035431A (en) * 2014-05-22 2014-09-10 清华大学 Obtaining method and system for kernel function parameters applied to nonlinear process monitoring
WO2016105803A1 (en) * 2014-12-24 2016-06-30 Intel Corporation Hybrid technique for sentiment analysis
CN106778853A (en) * 2016-12-07 2017-05-31 中南大学 Unbalanced data sorting technique based on weight cluster and sub- sampling
AU2016256753A1 (en) * 2016-01-13 2017-07-27 Adobe Inc. Image captioning using weak supervision and semantic natural language vector space
CN107180023A (en) * 2016-03-11 2017-09-19 科大讯飞股份有限公司 A kind of file classification method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413986A (en) * 2019-04-12 2019-11-05 上海晏鼠计算机技术股份有限公司 A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN110413986B (en) * 2019-04-12 2023-08-29 上海晏鼠计算机技术股份有限公司 Text clustering multi-document automatic summarization method and system for improving word vector model

Similar Documents

Publication Publication Date Title
US20220019855A1 (en) Image generation method, neural network compression method, and related apparatus and device
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
WO2022012407A1 (en) Neural network training method and related device
GB2571825A (en) Semantic class localization digital environment
EP3989109A1 (en) Image identification method and device, identification model training method and device, and storage medium
WO2021164317A1 (en) Sequence mining model training method, sequence data processing method and device
CN111338897A (en) Identification method of abnormal node in application host, monitoring equipment and electronic equipment
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN107832794A (en) A kind of convolutional neural networks generation method, the recognition methods of car system and computing device
CN112364933B (en) Image classification method, device, electronic equipment and storage medium
CN111401156A (en) Image identification method based on Gabor convolution neural network
JP2024508867A (en) Image clustering method, device, computer equipment and computer program
CN115223662A (en) Data processing method, device, equipment and storage medium
Zhao et al. Hybrid generative/discriminative scene classification strategy based on latent Dirichlet allocation for high spatial resolution remote sensing imagery
CN116824677B (en) Expression recognition method and device, electronic equipment and storage medium
CN107992594A (en) A kind of division methods of text attribute, device, server and storage medium
CN113177118A (en) Text classification model, text classification method and device
CN112906652A (en) Face image recognition method and device, electronic equipment and storage medium
CN108460335A (en) The recognition methods of video fine granularity, device, computer equipment and storage medium
CN113139490B (en) Image feature matching method and device, computer equipment and storage medium
CN115661472A (en) Image duplicate checking method and device, computer equipment and storage medium
CN113139617B (en) Power transmission line autonomous positioning method and device and terminal equipment
CN115116080A (en) Table analysis method and device, electronic equipment and storage medium
CN116861226A (en) Data processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180504