CN109597888A - Method and apparatus for establishing a text field identification model - Google Patents

Publication number
CN109597888A
Authority
CN
China
Prior art keywords
text
field
domain classification
classification template
generalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811376081.XA
Other languages
Chinese (zh)
Inventor
梁川
梁一川
凌光
林英展
徐威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811376081.XA
Publication of CN109597888A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The present invention provides a method and apparatus for establishing a text field identification model. The method comprises: obtaining texts that have not been classified by field; labeling the field to which each text belongs using domain classification templates; and training a classification model with each text as input and the field labeled for that text as output, to obtain a text field identification model, which can then be used to identify the field of a text input by a user. The invention addresses the overfitting that arises in the prior art when template classification or model classification is used alone, and thereby improves the accuracy with which the text field identification model identifies the field of a text.

Description

Method and apparatus for establishing a text field identification model
[Technical Field]
The present invention relates to the technical field of natural language processing, and in particular to a method, apparatus, device, and computer storage medium for establishing a text field identification model.
[Background]
In some query-handling systems, after the query text input by a user is obtained, the correct field module must be selected from field modules such as recommendation, question answering, and chat, so that the user's query request can be dispatched to it. An urgent problem to be solved is therefore how to identify the field of the query text input by the user.
In the prior art, the field to which a text belongs is generally identified using either templates alone or a learning model alone, and each approach has limitations when used by itself. When a learning model is used alone, the main drawback is that, without sufficient labeled data, the trained classification model overfits severely and cannot identify the field of a text accurately. When templates are used alone, the main drawback is that accurate identification requires a large number of manually configured classification templates, which is very labor-intensive; if only a few classification templates are available, a serious overfitting problem likewise arises.
[Summary of the Invention]
In view of this, the present invention provides a method, apparatus, device, and computer storage medium for establishing a text field identification model, to solve the overfitting caused in the prior art by using template classification or model classification alone, and to improve the accuracy with which the text field identification model identifies the field of a text.
To solve the technical problem, the present invention provides a method for establishing a text field identification model, the method comprising: obtaining texts that have not been classified by field; labeling the field to which each text belongs using domain classification templates; and training a classification model with each text as input and the field labeled for each text as output, to obtain a text field identification model, wherein the text field identification model can be used to identify the field of a text input by a user.
According to a preferred embodiment of the present invention, the domain classification templates are obtained as follows: obtaining common texts of each field; performing word segmentation on each common text to obtain the semantics of each word in it; generalizing the common text according to the semantics of its words; and using the generalization result of the common text as a domain classification template for the field to which the common text belongs.
According to a preferred embodiment of the present invention, labeling the field to which the text belongs using the domain classification templates comprises: performing word segmentation on the text to obtain the semantics of each word in the text; generalizing the text according to the semantics of its words to obtain a generalization result of the text; determining whether the generalization result hits a domain classification template; if it does, labeling the field corresponding to the hit domain classification template as the field to which the text belongs; and if it does not, labeling the field to which the text belongs as a preset field.
According to a preferred embodiment of the present invention, determining whether the generalization result of the text hits a domain classification template comprises: calculating the text similarity between the generalization result and the domain classification template; and if the calculated text similarity is greater than a preset threshold, determining that the generalization result hits the domain classification template, and otherwise determining a miss.
According to a preferred embodiment of the present invention, after the field to which the text belongs is labeled using the domain classification templates, the method further comprises: using the generalization result of the text as a domain classification template for the field to which the text belongs.
To solve the technical problem, the present invention further provides an apparatus for establishing a text field identification model, the apparatus comprising: an acquiring unit, configured to obtain texts that have not been classified by field; a labeling unit, configured to label the field to which each text belongs using domain classification templates; and a training unit, configured to train a classification model with each text as input and the field labeled for each text as output, to obtain a text field identification model.
According to a preferred embodiment of the present invention, the labeling unit obtains the domain classification templates as follows: obtaining common texts of each field; performing word segmentation on each common text to obtain the semantics of each word in it; generalizing the common text according to the semantics of its words; and using the generalization result of the common text as a domain classification template for the field to which the common text belongs.
According to a preferred embodiment of the present invention, when labeling the field to which the text belongs using the domain classification templates, the labeling unit specifically: performs word segmentation on the text to obtain the semantics of each word in the text; generalizes the text according to the semantics of its words to obtain a generalization result of the text; determines whether the generalization result hits a domain classification template; if it does, labels the field corresponding to the hit domain classification template as the field to which the text belongs; and if it does not, labels the field to which the text belongs as a preset field.
According to a preferred embodiment of the present invention, when determining whether the generalization result of the text hits a domain classification template, the labeling unit specifically: calculates the text similarity between the generalization result and the domain classification template; and if the calculated text similarity is greater than a preset threshold, determines that the generalization result hits the domain classification template, and otherwise determines a miss.
According to a preferred embodiment of the present invention, after labeling the field to which the text belongs using the domain classification templates, the labeling unit further: uses the generalization result of the text as a domain classification template for the field to which the text belongs.
As can be seen from the above technical solutions, the present invention obtains the text field identification model by fusing template classification with model classification. This alleviates the limitations of existing text field identification approaches and effectively avoids the overfitting that arises when classification templates or a classification model are used alone, thereby achieving a better identification result.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of a method for establishing a text field identification model according to an embodiment of the present invention;
Fig. 2 is a structural diagram of an apparatus for establishing a text field identification model according to an embodiment of the present invention;
Fig. 3 is a block diagram of a computer system/server according to an embodiment of the present invention.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said", and "the" used in the embodiments and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein can be interpreted as "when", "upon", "in response to determining", or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" can be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a method for establishing a text field identification model according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
In 101, texts that have not been classified by field are obtained.
In this step, texts that have not been classified by field, that is, texts whose fields have not been labeled, are obtained.
It will be understood that a large number of unclassified texts can be obtained from the Internet by data mining, for example by mining the search queries input by users from web search logs; they can also be collected manually. The present invention does not limit the manner in which unclassified texts are obtained.
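As a minimal sketch of the log-mining option, the snippet below pulls distinct queries out of search-log lines. The tab-separated `timestamp, user, query` layout and the function name `extract_queries` are assumptions made for illustration; the patent does not describe a log format.

```python
def extract_queries(log_lines):
    """Collect distinct user queries from log lines, keeping first-seen order.

    Assumes a hypothetical layout: timestamp <TAB> user id <TAB> query text.
    """
    seen = set()
    queries = []
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        query = parts[2].strip()
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


log = [
    "t1\tu1\tweather in beijing",
    "t2\tu2\ttell me a joke",
    "t3\tu1\tweather in beijing",
    "malformed line",
]
unlabeled_texts = extract_queries(log)
print(unlabeled_texts)
```

The resulting list plays the role of the unclassified texts that step 102 goes on to label.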
In 102, the field to which each text belongs is labeled using domain classification templates.
In this step, domain classification templates are used to label the field of each unclassified text obtained in step 101, thereby determining the field to which each text belongs.
In this step, the domain classification templates can be obtained as follows: obtain the common texts of each field, that is, texts that are common and representative in that field; perform word segmentation on each common text to obtain the semantics of each word in it; generalize the common text according to the semantics of its words; and use the generalization result as a domain classification template for the field to which the common text belongs. Alternatively, the domain classification templates may already exist, in which case the existing templates are obtained directly and used to label the fields of the texts.
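The template-building steps can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: whitespace splitting stands in for a real word segmenter, and `SEMANTIC_LEXICON`, the bracketed category labels, and the function names are all hypothetical.

```python
# Hypothetical lexicon mapping a word to a coarse semantic category.
SEMANTIC_LEXICON = {
    "beijing": "[city]",
    "shanghai": "[city]",
    "today": "[time]",
    "tomorrow": "[time]",
}


def segment(text):
    """Word segmentation; whitespace splitting stands in for a real segmenter."""
    return text.lower().split()


def generalize(text):
    """Replace every word that has a known semantic category with that category."""
    return " ".join(SEMANTIC_LEXICON.get(word, word) for word in segment(text))


def build_templates(common_texts_by_field):
    """Map each generalized common text (a template) to the field it came from."""
    templates = {}
    for field, texts in common_texts_by_field.items():
        for text in texts:
            templates[generalize(text)] = field
    return templates


common_texts = {
    "weather": ["weather in Beijing today"],
    "chat": ["tell me a joke"],
}
templates = build_templates(common_texts)
print(templates)
```

Storing templates as a mapping from generalized text to field keeps each template tied to exactly one field while still allowing many templates per field.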
It will be understood that texts whose frequency of occurrence in a field exceeds a preset frequency can be taken as the common texts of that field; N texts can also be selected at random from the texts of each field as its common texts, where N is a positive integer greater than or equal to 1; the common texts of each field can also be collected manually. The present invention does not limit the manner in which the common texts of each field are obtained.
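The first two selection options might look like the sketch below. Treating "frequency" as a raw occurrence count, the fixed random seed, and the function names are assumptions for illustration only.

```python
import random
from collections import Counter


def common_by_frequency(texts, min_count):
    """Keep texts that occur more than min_count times (count used as frequency)."""
    counts = Counter(texts)
    return [text for text, count in counts.items() if count > min_count]


def common_by_sampling(texts, n, seed=0):
    """Pick up to n distinct texts at random; the seed only makes runs repeatable."""
    unique = sorted(set(texts))
    return random.Random(seed).sample(unique, min(n, len(unique)))


field_texts = ["a", "b", "a", "c", "a", "b"]
frequent = common_by_frequency(field_texts, min_count=1)
sampled = common_by_sampling(field_texts, n=2)
print(frequent, sampled)
```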
A domain classification template is used to classify the field to which a text belongs. Each domain classification template corresponds to exactly one field, but a field can have multiple different domain classification templates.
Specifically, when labeling the field to which a text belongs using the domain classification templates, this step can proceed as follows: perform word segmentation on the text to obtain the semantics of each word in the text; generalize the text according to the semantics of its words to obtain a generalization result of the text; determine whether the generalization result hits a domain classification template; if it does, label the field corresponding to the hit domain classification template as the field to which the text belongs; and if it does not, label the field to which the text belongs as a preset field. The preset field can be a field set separately by the user, or any one of the fields.
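A minimal sketch of this labeling flow is given below. It assumes whitespace segmentation, a hypothetical lexicon and template set, a hypothetical preset field named `"other"`, and exact string equality as the hit test, which is the simplest possible choice and not something the patent prescribes.

```python
# Hypothetical lexicon mapping words to semantic categories (not from the patent).
SEMANTIC_LEXICON = {"beijing": "[city]", "shanghai": "[city]",
                    "today": "[time]", "tomorrow": "[time]"}

# Hypothetical templates: generalized text -> the single field it belongs to.
TEMPLATES = {"weather in [city] [time]": "weather", "tell me a joke": "chat"}

PRESET_FIELD = "other"  # hypothetical preset field used when no template is hit


def generalize(text):
    """Segment on whitespace, then swap known words for their semantic category."""
    return " ".join(SEMANTIC_LEXICON.get(w, w) for w in text.lower().split())


def label_text(text, templates=TEMPLATES, preset_field=PRESET_FIELD):
    """Return the field of the template hit by the generalized text, else the preset field."""
    generalized = generalize(text)
    # Exact match as the hit test is an assumption of this sketch.
    return templates.get(generalized, preset_field)


labeled = [(t, label_text(t)) for t in
           ["weather in Shanghai tomorrow", "tell me a joke", "play some music"]]
print(labeled)
```

Because generalization maps both "Beijing today" and "Shanghai tomorrow" to the same slots, one template covers many surface forms.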
It will be understood that whether the generalization result of a text hits a domain classification template can be determined as follows: calculate the text similarity between the generalization result of the text and the domain classification template; if the calculated text similarity is greater than a preset threshold, the generalization result hits the domain classification template; otherwise, it misses.
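The patent does not name a similarity measure, so the sketch below uses Jaccard similarity over word sets purely as a stand-in. The threshold of 0.5, the choice to return the best-scoring template when several exceed the threshold, and the names `jaccard` and `best_hit` are likewise assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two space-separated strings."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def best_hit(generalized, templates, threshold=0.5):
    """Return the field of the most similar template scoring above threshold, else None."""
    best_field, best_score = None, threshold
    for template, field in templates.items():
        score = jaccard(generalized, template)
        if score > best_score:  # strictly greater than the preset threshold
            best_field, best_score = field, score
    return best_field


templates = {"weather in [city] [time]": "weather", "tell me a joke": "chat"}
print(best_hit("weather [city] [time]", templates))  # similarity 3/4 with the first template
print(best_hit("tell me", templates))                # similarity exactly 2/4, so a miss
```

A similarity test lets a text hit a template it does not match word for word, which exact matching cannot do.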
In addition, after labeling the field to which a text belongs using the domain classification templates, this step can further include: using the generalization result of the text as a domain classification template for the field to which the text belongs. That is, once the field of a text has been labeled, its generalization result becomes a domain classification template of the corresponding field and is then used to label the fields of other texts, and so on in a loop.
In other words, after labeling the field of a text, this step adds the text's generalization result to the domain classification templates of that field. The number of templates in each field therefore keeps growing, which further improves the classification ability of each field's templates and allows the fields of texts to be labeled more accurately.
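The loop described above can be sketched as follows. Several choices here are assumptions rather than the patent's method: Jaccard similarity with a 0.5 threshold as the hit test, the hypothetical lexicon, the preset field `"other"`, and the decision to add new templates only for texts that hit an existing template (so that preset-field guesses are not propagated).

```python
SEMANTIC_LEXICON = {"beijing": "[city]", "shanghai": "[city]",
                    "today": "[time]", "tomorrow": "[time]"}  # hypothetical


def generalize(text):
    """Segment on whitespace, then swap known words for their semantic category."""
    return " ".join(SEMANTIC_LEXICON.get(w, w) for w in text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the word sets of two space-separated strings."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def match(generalized, templates, threshold):
    """Field of the most similar template scoring above threshold, else None."""
    best_field, best_score = None, threshold
    for template, field in templates.items():
        score = jaccard(generalized, template)
        if score > best_score:
            best_field, best_score = field, score
    return best_field


def label_and_grow(texts, templates, preset_field="other", threshold=0.5):
    """Label texts in order; each hit adds its generalization as a new template."""
    labeled = []
    for text in texts:
        generalized = generalize(text)
        field = match(generalized, templates, threshold)
        if field is None:
            field = preset_field  # miss: fall back, do not create a template
        else:
            templates.setdefault(generalized, field)  # grow the template set
        labeled.append((text, field))
    return labeled


templates = {"weather in [city] [time]": "weather"}
labeled = label_and_grow(["weather Shanghai tomorrow", "weather Beijing"], templates)
print(labeled)
print(sorted(templates))
```

In this run the second text scores only 0.5 against the seed template and would miss on its own; it is labeled "weather" because the first text's generalization was added as a template before it was processed.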
It will be understood that, besides a large number of field-labeled texts, this step also yields a large number of domain classification templates capable of identifying the field of a text. The present invention can therefore also identify the field of a text input by a user using only the full set of domain classification templates obtained, without training a classification model and then using the resulting text field identification model.
Because this step labels the fields of texts using domain classification templates, a large number of field-labeled texts can be obtained without manual labeling, and these labeled texts are then used to train the text field identification model.
In 103, a classification model is trained with each text as input and the field labeled for each text as output, to obtain the text field identification model.
In this step, each text and the field labeled for it in step 102 form a training sample: the text is the input and its labeled field is the output used to train the classification model, yielding the text field identification model. With the trained text field identification model, the field to which a text belongs can be obtained from the text input by a user.
The classification model can be a support vector machine, a neural network model, a deep learning model, or the like; the present invention does not limit the type of the classification model.
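To make the training step concrete without external dependencies, the sketch below trains a multiclass perceptron over bag-of-words features. The perceptron is a swapped-in stand-in for the model families the patent lists, chosen only because it fits in a few lines; the feature extraction, epoch count, and sample data are all assumptions.

```python
from collections import defaultdict


def featurize(text):
    """Bag-of-words features via whitespace splitting (a stand-in for real features)."""
    return text.lower().split()


def predict(weights, text):
    """Return the field whose weight vector scores the text highest."""
    words = featurize(text)

    def score(field):
        return sum(weights[field].get(word, 0.0) for word in words)

    # Sorting the fields makes tie-breaking deterministic.
    return max(sorted(weights), key=score)


def train_perceptron(samples, epochs=10):
    """Train on (text, labeled field) pairs: text is the input, field is the output."""
    fields = sorted({field for _, field in samples})
    weights = {field: defaultdict(float) for field in fields}
    for _ in range(epochs):
        for text, gold in samples:
            guess = predict(weights, text)
            if guess != gold:
                # Standard perceptron update: reward the gold field, penalize the guess.
                for word in featurize(text):
                    weights[gold][word] += 1.0
                    weights[guess][word] -= 1.0
    return weights


samples = [
    ("weather in beijing today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("tell me a joke", "chat"),
    ("how are you", "chat"),
]
weights = train_perceptron(samples)
print(predict(weights, "rain in beijing"))
```

In practice the `samples` list would be the template-labeled texts produced in step 102, and a library model could replace the perceptron without changing the input and output roles.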
Since step 102 can produce a large number of field-labeled texts, the text field identification model trained in this step on sufficient labeled data can identify the field of a text more accurately.
By fusing template classification with model classification to identify the field of a text, the present invention can, in the early stage of training, obtain a large number of field-labeled texts from a small number of domain classification templates and correspondingly obtain a large number of domain classification templates, so that many templates no longer need to be configured manually and labor costs are reduced. In the later stage of training, the classification model can be trained on ample field-labeled text. This alleviates the limitations of existing classification approaches in different training periods and effectively avoids the overfitting that arises when classification templates or a classification model are used alone, so that the resulting text field identification model achieves a better identification result.
Fig. 2 is a structural diagram of an apparatus for establishing a text field identification model according to an embodiment of the present invention. As shown in Fig. 2, the apparatus comprises an acquiring unit 21, a labeling unit 22, and a training unit 23.
The acquiring unit 21 is configured to obtain texts that have not been classified by field.
The acquiring unit 21 obtains texts that have not been classified by field, that is, texts whose fields have not been labeled.
It will be understood that the acquiring unit 21 can obtain a large number of unclassified texts from the Internet by data mining, for example by mining the search queries input by users from web search logs; it can also obtain them through manual collection. The present invention does not limit the manner in which unclassified texts are obtained.
The labeling unit 22 is configured to label the field to which each text belongs using domain classification templates.
The labeling unit 22 uses domain classification templates to label the field of each unclassified text obtained by the acquiring unit 21, thereby determining the field to which each text belongs.
The labeling unit 22 can obtain the domain classification templates as follows: obtain the common texts of each field, that is, texts that are common and representative in that field; perform word segmentation on each common text to obtain the semantics of each word in it; generalize the common text according to the semantics of its words; and use the generalization result as a domain classification template for the field to which the common text belongs.
It will be understood that the labeling unit 22 can take texts whose frequency of occurrence in a field exceeds a preset frequency as the common texts of that field; it can also select N texts at random from the texts of each field as its common texts, where N is a positive integer greater than or equal to 1; it can also obtain the common texts of each field through manual collection. The present invention does not limit the manner in which the common texts of each field are obtained.
A domain classification template is used to classify the field to which a text belongs. Each domain classification template corresponds to exactly one field, but a field can have multiple different domain classification templates.
Specifically, when labeling the field to which a text belongs using the domain classification templates, the labeling unit 22 can proceed as follows: perform word segmentation on the text to obtain the semantics of each word in the text; generalize the text according to the semantics of its words to obtain a generalization result of the text; determine whether the generalization result hits a domain classification template; if it does, label the field corresponding to the hit domain classification template as the field to which the text belongs; and if it does not, label the field to which the text belongs as a preset field. The preset field can be a field set separately by the user, or any one of the fields.
It will be understood that the labeling unit 22 can determine whether the generalization result of a text hits a domain classification template as follows: calculate the text similarity between the generalization result of the text and the domain classification template; if the calculated text similarity is greater than a preset threshold, the generalization result hits the domain classification template; otherwise, it misses.
In addition, after labeling the field to which a text belongs using the domain classification templates, the labeling unit 22 can further: use the generalization result of the text as a domain classification template for the field to which the text belongs. That is, once the field of a text has been labeled, the labeling unit 22 treats its generalization result as a domain classification template of the corresponding field and then uses it to label the fields of other texts, and so on in a loop, so that a large number of domain classification templates can be obtained without manual effort.
In other words, after labeling the field of a text, the labeling unit 22 adds the text's generalization result to the domain classification templates of that field. The number of templates in each field therefore keeps growing, which further improves the classification ability of each field's templates and allows the fields of texts to be labeled more accurately.
It will be understood that, besides a large number of field-labeled texts, the labeling unit 22 also yields a large number of domain classification templates capable of identifying the field of a text. The present invention can therefore also identify the field of a text input by a user using only the full set of domain classification templates obtained, without training a classification model and then using the resulting text field identification model.
Because the labeling unit 22 labels the fields of texts using domain classification templates, a large number of field-labeled texts can be obtained without manual labeling, and these labeled texts are then used to train the text field identification model.
The training unit 23 is configured to train a classification model with each text as input and the field labeled for each text as output, to obtain the text field identification model.
The training unit 23 takes each text and the field labeled for it by the labeling unit 22 as a training sample: the text is the input and its labeled field is the output used to train the classification model, yielding the text field identification model. With the text field identification model trained by the training unit 23, the field to which a text belongs can be obtained from the text input by a user.
The classification model can be a support vector machine, a neural network model, a deep learning model, or the like; the present invention does not limit the type of the classification model.
Since the labeling unit 22 can produce a large number of field-labeled texts, the text field identification model that the training unit 23 trains on sufficient labeled data can effectively avoid overfitting and thus identify the field of a text more accurately.
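The way the three units hand data to one another can be sketched with three small classes. Everything below is illustrative: the class and method names are hypothetical, the labeling unit uses exact lowercase matching instead of generalization and similarity, and the "model" is a toy word-vote classifier standing in for whichever classification model is actually trained.

```python
from collections import Counter, defaultdict


class AcquiringUnit:
    """Unit 21: supplies texts that have not been classified by field."""

    def __init__(self, source):
        self.source = source

    def acquire(self):
        return list(self.source)


class LabelingUnit:
    """Unit 22: labels each text via templates, falling back to a preset field."""

    def __init__(self, templates, preset_field):
        self.templates = templates
        self.preset_field = preset_field

    def label(self, texts):
        # Simplification: exact lowercase match instead of generalization + similarity.
        return [(t, self.templates.get(t.lower(), self.preset_field)) for t in texts]


class TrainingUnit:
    """Unit 23: trains a toy word-vote classifier on (text, field) pairs."""

    def train(self, labeled):
        votes = defaultdict(Counter)
        for text, field in labeled:
            for word in text.lower().split():
                votes[word][field] += 1

        def model(text):
            total = Counter()
            for word in text.lower().split():
                total.update(votes.get(word, {}))
            return total.most_common(1)[0][0] if total else None

        return model


acquiring = AcquiringUnit(["weather in beijing", "tell me a joke", "sing a song"])
labeling = LabelingUnit({"weather in beijing": "weather", "tell me a joke": "chat"}, "chat")
labeled = labeling.label(acquiring.acquire())
model = TrainingUnit().train(labeled)
print(labeled)
print(model("weather in shanghai"), model("a joke"))
```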
Fig. 3 shows the frame for being suitable for the exemplary computer system/server 012 for being used to realize embodiment of the present invention Figure.The computer system/server 012 that Fig. 3 is shown is only an example, should not function and use to the embodiment of the present invention Range band carrys out any restrictions.
As shown in figure 3, computer system/server 012 is showed in the form of universal computing device.Computer system/clothes The component of business device 012 can include but is not limited to: one or more processor or processing unit 016, system storage 028, connect the bus 018 of different system components (including system storage 028 and processing unit 016).
Bus 018 indicates one of a few class bus structures or a variety of, including memory bus or Memory Controller, Peripheral bus, graphics acceleration port, processor or the local bus using any bus structures in a variety of bus structures.It lifts For example, these architectures include but is not limited to industry standard architecture (ISA) bus, microchannel architecture (MAC) Bus, enhanced isa bus, Video Electronics Standards Association (VESA) local bus and peripheral component interconnection (PCI) bus.
Computer system/server 012 typically comprises a variety of computer system readable media.These media, which can be, appoints The usable medium what can be accessed by computer system/server 012, including volatile and non-volatile media, movably With immovable medium.
System storage 028 may include the computer system readable media of form of volatile memory, such as deposit at random Access to memory (RAM) 030 and/or cache memory 032.Computer system/server 012 may further include other Removable/nonremovable, volatile/non-volatile computer system storage medium.Only as an example, storage system 034 can For reading and writing immovable, non-volatile magnetic media (Fig. 3 do not show, commonly referred to as " hard disk drive ").Although in Fig. 3 It is not shown, the disc driver for reading and writing to removable non-volatile magnetic disk (such as " floppy disk ") can be provided, and to can The CD drive of mobile anonvolatile optical disk (such as CD-ROM, DVD-ROM or other optical mediums) read-write.In these situations Under, each driver can be connected by one or more data media interfaces with bus 018.Memory 028 may include At least one program product, the program product have one group of (for example, at least one) program module, these program modules are configured To execute the function of various embodiments of the present invention.
Program/utility 040 with one group of (at least one) program module 042, can store in such as memory In 028, such program module 042 includes --- but being not limited to --- operating system, one or more application program, other It may include the realization of network environment in program module and program data, each of these examples or certain combination.Journey Sequence module 042 usually executes function and/or method in embodiment described in the invention.
Computer system/server 012 can also with one or more external equipments 014 (such as keyboard, sensing equipment, Display 024 etc.) communication, in the present invention, computer system/server 012 is communicated with outside radar equipment, can also be with One or more enable a user to the equipment interacted with the computer system/server 012 communication, and/or with make the meter Any equipment (such as network interface card, the modulation that calculation machine systems/servers 012 can be communicated with one or more of the other calculating equipment Demodulator etc.) communication.This communication can be carried out by input/output (I/O) interface 022.Also, computer system/clothes Being engaged in device 012 can also be by network adapter 020 and one or more network (such as local area network (LAN), wide area network (WAN) And/or public network, such as internet) communication.As shown, network adapter 020 by bus 018 and computer system/ Other modules of server 012 communicate.It should be understood that computer system/server 012 can be combined although being not shown in Fig. 3 Using other hardware and/or software module, including but not limited to: microcode, device driver, redundant processing unit, external magnetic Dish driving array, RAID system, tape drive and data backup storage system etc..
The processing unit 016 runs programs stored in the system memory 028 to execute various functional applications and data processing, for example to implement the method flow provided by the embodiments of the present invention.
The above computer program may be provided in a computer storage medium; that is, the computer storage medium is encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above embodiments of the present invention. For example, the method flow provided by the embodiments of the present invention is executed by the one or more processors described above.
With the passage of time and the development of technology, the meaning of "medium" has become increasingly broad: the distribution of computer programs is no longer limited to tangible media, and programs may also be downloaded directly from a network. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted over any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the text field identification model is obtained by fusing template classification with model classification. This alleviates the limitations of existing text field identification methods and effectively avoids the overfitting problem that arises when a classification template or a classification model is used alone for text field identification, thereby improving the accuracy with which the text field identification model identifies the field of a text.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for establishing a text field identification model, characterized in that the method comprises:
obtaining texts that have not been classified by field;
labeling the field to which each text belongs using a domain classification template;
training a classification model with each text as input and the field labeled for each text as output, to obtain a text field identification model;
wherein the text field identification model can be used to identify the field to which a text input by a user belongs.
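By way of illustration only, the training step of claim 1 might be sketched as follows. The claim does not name a particular classification model; the multinomial naive Bayes classifier, the whitespace tokenizer, the function names `train_classifier` and `predict_field`, and the sample texts are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def train_classifier(texts, fields):
    """Fit a multinomial naive Bayes model on (text, field) pairs."""
    word_counts = defaultdict(Counter)  # field -> word frequencies
    field_counts = Counter(fields)      # field -> number of training texts
    vocabulary = set()
    for text, field in zip(texts, fields):
        words = text.lower().split()
        word_counts[field].update(words)
        vocabulary.update(words)
    return {
        "word_counts": word_counts,
        "field_counts": field_counts,
        "vocabulary": vocabulary,
        "total": len(fields),
    }


def predict_field(model, text):
    """Return the field with the highest Laplace-smoothed log-probability."""
    words = text.lower().split()
    vocab_size = len(model["vocabulary"])
    best_field, best_score = None, -math.inf
    for field, count in model["field_counts"].items():
        counts = model["word_counts"][field]
        total_words = sum(counts.values())
        score = math.log(count / model["total"])
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_field, best_score = field, score
    return best_field


# Hypothetical texts whose fields were labeled by the template step.
texts = ["weather in beijing tomorrow", "will it rain today",
         "book a flight to shanghai", "cheap flight tickets"]
fields = ["weather", "weather", "travel", "travel"]
model = train_classifier(texts, fields)
```

In this sketch, `predict_field(model, "will it rain")` selects the weather field, because every query word was seen only in weather-labeled training texts.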
2. The method according to claim 1, characterized in that the domain classification template is obtained as follows:
obtaining common texts of each field;
performing word segmentation on the common text to obtain the semantics of each word in the common text;
generalizing the common text according to the semantics of each word in the common text;
using the generalization result of the common text as a domain classification template for the field to which the common text belongs.
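As a minimal sketch of how such templates might be built, the example below segments each common text, replaces words that have a known semantic category with a slot label, and stores the generalized string as a template of its field. The semantic lexicon `SEMANTIC_SLOTS`, the slot labels, the whitespace segmenter and the sample texts are hypothetical; the claim does not specify them.

```python
# Hypothetical semantic lexicon: maps a word to a semantic slot label.
SEMANTIC_SLOTS = {
    "beijing": "[city]",
    "shanghai": "[city]",
    "tomorrow": "[date]",
    "today": "[date]",
}


def segment(text):
    """Stand-in word segmenter: lowercase and split on whitespace."""
    return text.lower().split()


def generalize(text):
    """Replace each word that has a known semantic slot with the slot label."""
    return " ".join(SEMANTIC_SLOTS.get(word, word) for word in segment(text))


def build_templates(common_texts_by_field):
    """Map each generalized common text to the field it came from."""
    templates = {}
    for field, texts in common_texts_by_field.items():
        for text in texts:
            templates[generalize(text)] = field
    return templates


templates = build_templates({
    "weather": ["weather in Beijing tomorrow"],
    "travel": ["book a flight to Shanghai"],
})
```

Because the city and date words are replaced by slots, a single template such as `weather in [city] [date]` covers every text that differs only in the city or date mentioned.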
3. The method according to claim 1, characterized in that labeling the field to which the text belongs using a domain classification template comprises:
performing word segmentation on the text to obtain the semantics of each word in the text;
generalizing the text according to the semantics of each word in the text, to obtain a generalization result of the text;
determining whether the generalization result of the text hits the domain classification template;
if the generalization result of the text hits the domain classification template, labeling the field corresponding to the hit domain classification template as the field to which the text belongs;
if the generalization result of the text misses the domain classification template, labeling the field to which the text belongs as a preset field.
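The labeling steps above can be sketched as follows, assuming for simplicity that a "hit" means the generalized text is identical to a template. The slot lexicon `SLOTS`, the template table `TEMPLATES` and the preset field name `"other"` are hypothetical values chosen for illustration.

```python
DEFAULT_FIELD = "other"  # hypothetical preset field used on a miss

# Hypothetical slot lexicon and template table.
SLOTS = {"beijing": "[city]", "shanghai": "[city]",
         "today": "[date]", "tomorrow": "[date]"}
TEMPLATES = {
    "weather in [city] [date]": "weather",
    "book a flight to [city]": "travel",
}


def generalize(text):
    """Segment on whitespace and replace known words with their slot label."""
    return " ".join(SLOTS.get(word, word) for word in text.lower().split())


def label_field(text, templates=TEMPLATES, default=DEFAULT_FIELD):
    """Return the field of the template hit by the text, or the preset field."""
    return templates.get(generalize(text), default)
```

A text such as "Weather in Shanghai today" generalizes to an existing template and is labeled with that template's field, while a text that matches no template receives the preset field.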
4. The method according to claim 3, characterized in that determining whether the generalization result of the text hits the domain classification template comprises:
calculating the text similarity between the generalization result of the text and the domain classification template;
if the calculated text similarity is greater than a preset threshold, determining that the generalization result of the text hits the domain classification template; otherwise, determining a miss.
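One possible form of the similarity test is sketched below. The claim does not say which text similarity measure is used; token-set Jaccard similarity and the threshold value 0.5 are assumptions, as are the function names.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two whitespace-separated strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_template(generalized, templates, threshold=0.5):
    """Return the field of the most similar template if its similarity is
    strictly greater than the threshold; return None on a miss."""
    best_field, best_score = None, threshold
    for template, field in templates.items():
        score = jaccard(generalized, template)
        if score > best_score:
            best_field, best_score = field, score
    return best_field


templates = {"weather in [city] [date]": "weather"}
hit = match_template("weather in [city] [date] please", templates)
miss = match_template("play some jazz", templates)
```

The comparison is strict, following the wording "greater than a preset threshold": a similarity exactly equal to the threshold counts as a miss.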
5. The method according to claim 1, characterized in that, after labeling the field to which the text belongs using the domain classification template, the method further comprises:
using the generalization result of the text as a domain classification template for the field to which the text belongs.
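The effect of this step can be illustrated with a small sketch in which every labeled generalization result is added back to the template set. The Jaccard similarity, the 0.5 threshold, the preset field `"other"` and the function name `label_and_extend` are assumptions; the sketch follows the claim literally and registers the result whichever field was labeled.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two whitespace-separated strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def label_and_extend(generalized, templates, threshold=0.5, default="other"):
    """Label a generalized text, then register it as a template of that field."""
    field, best = default, threshold
    for template, template_field in templates.items():
        score = jaccard(generalized, template)
        if score > best:
            field, best = template_field, score
    # Register the generalization result as a new template (after the loop,
    # so the dictionary is not modified while it is being iterated).
    templates[generalized] = field
    return field


templates = {"weather in [city] [date]": "weather"}
first = label_and_extend("what is the weather in [city] [date]", templates)
second = label_and_extend("what is the weather", templates)
```

Here the second text is too dissimilar to the original template, but it is labeled correctly because it is similar to the template registered from the first text, which shows how the template set can grow as more texts are labeled.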
6. An apparatus for establishing a text field identification model, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain texts that have not been classified by field;
a labeling unit, configured to label the field to which each text belongs using a domain classification template;
a training unit, configured to train a classification model with each text as input and the field labeled for each text as output, to obtain a text field identification model.
7. The apparatus according to claim 6, characterized in that the labeling unit obtains the domain classification template as follows:
obtaining common texts of each field;
performing word segmentation on the common text to obtain the semantics of each word in the common text;
generalizing the common text according to the semantics of each word in the common text;
using the generalization result of the common text as a domain classification template for the field to which the common text belongs.
8. The apparatus according to claim 6, characterized in that, when labeling the field to which the text belongs using a domain classification template, the labeling unit specifically performs:
performing word segmentation on the text to obtain the semantics of each word in the text;
generalizing the text according to the semantics of each word in the text, to obtain a generalization result of the text;
determining whether the generalization result of the text hits the domain classification template;
if the generalization result of the text hits the domain classification template, labeling the field corresponding to the hit domain classification template as the field to which the text belongs;
if the generalization result of the text misses the domain classification template, labeling the field to which the text belongs as a preset field.
9. The apparatus according to claim 8, characterized in that, when determining whether the generalization result of the text hits the domain classification template, the labeling unit specifically performs:
calculating the text similarity between the generalization result of the text and the domain classification template;
if the calculated text similarity is greater than a preset threshold, determining that the generalization result of the text hits the domain classification template; otherwise, determining a miss.
10. The apparatus according to claim 6, characterized in that, after labeling the field to which the text belongs using the domain classification template, the labeling unit further performs:
using the generalization result of the text as a domain classification template for the field to which the text belongs.
11. A device, characterized in that the device comprises:
one or more processors;
a storage apparatus, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 5.
12. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the method according to any one of claims 1 to 5.
CN201811376081.XA 2018-11-19 2018-11-19 Establish the method, apparatus of text field identification model Pending CN109597888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811376081.XA CN109597888A (en) 2018-11-19 2018-11-19 Establish the method, apparatus of text field identification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811376081.XA CN109597888A (en) 2018-11-19 2018-11-19 Establish the method, apparatus of text field identification model

Publications (1)

Publication Number Publication Date
CN109597888A true CN109597888A (en) 2019-04-09

Family

ID=65958765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811376081.XA Pending CN109597888A (en) 2018-11-19 2018-11-19 Establish the method, apparatus of text field identification model

Country Status (1)

Country Link
CN (1) CN109597888A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445897A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Method, system, device and storage medium for large-scale classification and labeling of text data

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015194052A1 (en) * 2014-06-20 2015-12-23 Nec Corporation Feature weighting for naive bayes classifiers using a generative model
US20160124933A1 (en) * 2014-10-30 2016-05-05 International Business Machines Corporation Generation apparatus, generation method, and program
CN105677873A (en) * 2016-01-11 2016-06-15 中国电子科技集团公司第十研究所 Text information associating and clustering collecting processing method based on domain knowledge model
CN106156766A (en) * 2015-03-25 2016-11-23 阿里巴巴集团控股有限公司 The generation method and device of line of text grader
CN107506434A (en) * 2017-08-23 2017-12-22 北京百度网讯科技有限公司 Method and apparatus based on artificial intelligence classification phonetic entry text
CN107680579A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
CN107908635A (en) * 2017-09-26 2018-04-13 百度在线网络技术(北京)有限公司 Establish textual classification model and the method, apparatus of text classification
CN108090127A (en) * 2017-11-15 2018-05-29 北京百度网讯科技有限公司 Question and answer text evaluation model is established with evaluating the method, apparatus of question and answer text
CN108228758A (en) * 2017-12-22 2018-06-29 北京奇艺世纪科技有限公司 A kind of file classification method and device
CN108304442A (en) * 2017-11-20 2018-07-20 腾讯科技(深圳)有限公司 A kind of text message processing method, device and storage medium
CN108647325A (en) * 2018-05-11 2018-10-12 吉林大学 A kind of Text Classification System of avoidable over-fitting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190409)