CN109597888A - Method and apparatus for establishing a text domain recognition model - Google Patents
- Publication number
- CN109597888A (CN201811376081.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- field
- domain classification
- classification template
- generalization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The present invention provides a method and apparatus for establishing a text domain recognition model. The method comprises: obtaining texts that have not been classified by domain; labeling the domain to which each text belongs using domain classification templates; and training a classification model with each text as input and the domain labeled for that text as output, to obtain a text domain recognition model, where the text domain recognition model can identify the domain of a text input by a user. The invention addresses the overfitting problem that arises in the prior art when template classification or model classification is used alone, and thereby improves the accuracy with which the text domain recognition model identifies the domain of a text.
Description
[technical field]
The present invention relates to the field of natural language processing, and in particular to a method, apparatus, device and computer storage medium for establishing a text domain recognition model.
[background art]
In some information systems, after the query text input by a user is obtained, the correct domain module must be selected from among domain modules such as recommendation, question answering and chat to handle the user's query request. An urgent problem is therefore how to identify the domain of the query text input by the user.
When identifying the domain to which a text belongs, the prior art generally uses either templates alone or a learning model alone, and each of these two approaches has limitations when used on its own. When a learning model is used alone, the main drawback is that, if sufficient labeled data cannot be obtained, the trained classification model suffers from serious overfitting and cannot identify the domain of a text accurately. When templates are used alone, the main drawback is that accurate identification requires a large number of manually configured classification templates, which is very labor-intensive; if the number of classification templates is small, a serious overfitting problem likewise arises.
[summary of the invention]
In view of this, the present invention provides a method, apparatus, device and computer storage medium for establishing a text domain recognition model, to solve the overfitting problem caused in the prior art by using template classification or model classification alone, and to improve the accuracy with which the text domain recognition model identifies the domain of a text.
The technical solution adopted by the present invention to solve the technical problem is to provide a method for establishing a text domain recognition model, the method comprising: obtaining texts that have not been classified by domain; labeling the domain to which each text belongs using domain classification templates; and training a classification model with each text as input and the domain labeled for that text as output, to obtain a text domain recognition model; wherein the text domain recognition model can identify the domain of a text input by a user.
According to a preferred embodiment of the present invention, the domain classification templates are obtained as follows: obtaining common texts of each domain; performing word segmentation on each common text to obtain the semantics of each word in the common text; generalizing the common text according to the semantics of its words; and using the generalization result of the common text as a domain classification template of the domain to which the common text belongs.
According to a preferred embodiment of the present invention, labeling the domain to which the text belongs using domain classification templates comprises: performing word segmentation on the text to obtain the semantics of each word in the text; generalizing the text according to the semantics of its words to obtain a generalization result of the text; judging whether the generalization result of the text hits a domain classification template; if it does, labeling the domain corresponding to the hit template as the domain to which the text belongs; and if it does not, labeling the domain to which the text belongs as a preset domain.
According to a preferred embodiment of the present invention, judging whether the generalization result of the text hits a domain classification template comprises: calculating the text similarity between the generalization result of the text and the domain classification template; and if the calculated text similarity is greater than a preset threshold, determining that the generalization result hits the template, and otherwise determining a miss.
According to a preferred embodiment of the present invention, after the domain to which the text belongs is labeled using the domain classification templates, the method further comprises: using the generalization result of the text as a domain classification template of the domain to which the text belongs.
The technical solution adopted by the present invention to solve the technical problem is to provide an apparatus for establishing a text domain recognition model, the apparatus comprising: an acquiring unit, configured to obtain texts that have not been classified by domain; a labeling unit, configured to label the domain to which each text belongs using domain classification templates; and a training unit, configured to train a classification model with each text as input and the domain labeled for that text as output, to obtain a text domain recognition model.
According to a preferred embodiment of the present invention, the labeling unit obtains the domain classification templates as follows: obtaining common texts of each domain; performing word segmentation on each common text to obtain the semantics of each word in the common text; generalizing the common text according to the semantics of its words; and using the generalization result of the common text as a domain classification template of the domain to which the common text belongs.
According to a preferred embodiment of the present invention, when labeling the domain to which the text belongs using domain classification templates, the labeling unit specifically: performs word segmentation on the text to obtain the semantics of each word in the text; generalizes the text according to the semantics of its words to obtain a generalization result of the text; judges whether the generalization result of the text hits a domain classification template; if it does, labels the domain corresponding to the hit template as the domain to which the text belongs; and if it does not, labels the domain to which the text belongs as a preset domain.
According to a preferred embodiment of the present invention, when judging whether the generalization result of the text hits a domain classification template, the labeling unit specifically: calculates the text similarity between the generalization result of the text and the domain classification template; and if the calculated text similarity is greater than a preset threshold, determines that the generalization result hits the template, and otherwise determines a miss.
According to a preferred embodiment of the present invention, after labeling the domain to which the text belongs using the domain classification templates, the labeling unit further: uses the generalization result of the text as a domain classification template of the domain to which the text belongs.
As can be seen from the above technical solutions, the present invention obtains the text domain recognition model by fusing template classification with model classification. This alleviates the limitations of existing text domain recognition approaches and effectively avoids the overfitting problem that arises when classification templates or a classification model are used alone, thereby achieving better recognition.
[description of the drawings]
Fig. 1 is a flowchart of a method for establishing a text domain recognition model provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of an apparatus for establishing a text domain recognition model provided by an embodiment of the present invention;
Fig. 3 is a block diagram of a computer system/server provided by an embodiment of the present invention.
[detailed description of the embodiments]
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the invention. The singular forms "a", "said" and "the" used in the embodiments and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein can be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" can be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a method for establishing a text domain recognition model provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises:
In 101, texts that have not been classified by domain are obtained.
In this step, texts that have not been classified by domain, that is, texts whose domain has not been labeled, are obtained.
It will be understood that a large number of such texts can be obtained from the internet by data mining, for example by mining the search queries entered by users from web search logs; they can also be collected manually. The present invention does not limit how the texts that have not been classified by domain are obtained.
In 102, the domain to which each text belongs is labeled using domain classification templates.
In this step, domain classification templates are used to label the domain of each unclassified text obtained in step 101, thereby determining the domain to which each text belongs.
This step can obtain the domain classification templates as follows: obtain the common texts of each domain, that is, texts that are common and representative in each domain; perform word segmentation on each common text to obtain the semantics of each word in it; generalize the common text according to the semantics of its words; and use the generalization result as a domain classification template of the domain to which the common text belongs. Alternatively, the domain classification templates may already exist, in which case the pre-existing templates are obtained directly and used to label the domain of the text.
It will be understood that texts whose frequency of occurrence in a domain exceeds a preset frequency can be used as the common texts of that domain; N texts can also be selected at random from the texts of each domain as its common texts, where N is a positive integer greater than or equal to 1; the common texts of each domain can also be collected manually. The present invention does not limit how the common texts of each domain are obtained.
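The two automatic selection strategies mentioned above might look as follows. This is a sketch only: the function names, the use of relative frequency as the "frequency of occurrence", and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import Counter


def common_by_frequency(texts, min_frequency):
    """Keep the texts whose relative frequency within the domain exceeds min_frequency."""
    counts = Counter(texts)
    total = len(texts)
    return [text for text, count in counts.items() if count / total > min_frequency]


def common_by_sampling(texts, n, seed=0):
    """Pick n distinct texts at random (n >= 1); the seed only makes the sketch repeatable."""
    unique = sorted(set(texts))
    rng = random.Random(seed)
    return rng.sample(unique, min(n, len(unique)))


weather_log = [
    "weather today", "weather today", "weather today",
    "is it raining", "umbrella needed",
]
frequent = common_by_frequency(weather_log, min_frequency=0.5)
sampled = common_by_sampling(weather_log, n=2)
```

In this toy log only "weather today" (3 of 5 occurrences) exceeds the 0.5 threshold, so it is the only text selected by frequency.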
A domain classification template is used to classify the domain to which a text belongs. Each domain classification template corresponds to exactly one domain, but one domain can have multiple different domain classification templates.
Specifically, this step can label the domain of a text using domain classification templates as follows: perform word segmentation on the text to obtain the semantics of each word in it; generalize the text according to the semantics of its words to obtain the generalization result of the text; judge whether the generalization result hits a domain classification template; if it does, label the domain corresponding to the hit template as the domain to which the text belongs; if it does not, label the domain to which the text belongs as a preset domain. The preset domain can be a domain set separately by the user, or any one of the existing domains.
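The labeling flow above, including the fallback to a preset domain, can be sketched as follows. The lexicon, the templates and the preset domain name "other" are hypothetical, and an exact string match is used as the simplest possible "hit" test; the text itself leaves the hit criterion to a similarity comparison.

```python
# Hypothetical data for illustration only.
LEXICON = {"beijing": "[CITY]", "shanghai": "[CITY]", "today": "[DATE]", "tomorrow": "[DATE]"}
TEMPLATES = {"weather in [CITY] [DATE]": "weather", "tell me a joke": "chat"}
PRESET_DOMAIN = "other"  # assumed name for the preset (fallback) domain


def generalize(text, lexicon=LEXICON):
    """Whitespace segmentation plus category substitution (a stand-in for real segmentation)."""
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


def label_text(text, templates=TEMPLATES, preset_domain=PRESET_DOMAIN):
    """Return the domain of the template hit by the generalized text, else the preset domain."""
    generalized = generalize(text)
    # Exact match is the "hit" test in this sketch.
    return templates.get(generalized, preset_domain)


labels = [label_text(t) for t in ["Weather in Shanghai tomorrow", "play some music"]]
```

The first query generalizes to an existing template and is labeled "weather"; the second hits nothing and receives the preset domain.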
It will be understood that whether the generalization result of a text hits a domain classification template can be judged as follows: calculate the text similarity between the generalization result and the domain classification template, and judge whether the calculated similarity is greater than a preset threshold. If it is, the generalization result is determined to hit the template; otherwise it misses the template.
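A threshold-based hit test of this kind can be sketched as below. The text does not say which similarity measure is used, so Jaccard similarity over word sets is an assumption chosen for simplicity; the function names and the threshold value are likewise illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two whitespace-separated strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_hit(generalized, templates, threshold):
    """Return the domain of the most similar template whose score exceeds threshold, else None."""
    best_domain, best_score = None, threshold
    for template, domain in templates.items():
        score = jaccard(generalized, template)
        if score > best_score:  # strictly greater than the preset threshold counts as a hit
            best_domain, best_score = domain, score
    return best_domain


TEMPLATES = {"weather in [CITY] [DATE]": "weather", "tell me a joke": "chat"}
hit = best_hit("weather for [CITY] [DATE]", TEMPLATES, threshold=0.5)
miss = best_hit("play some music", TEMPLATES, threshold=0.5)
```

The first generalized text shares three of five distinct words with the weather template (similarity 0.6), so it hits; the second shares none with any template and misses, in which case the caller would assign the preset domain.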
In addition, after labeling the domain of a text using the domain classification templates, this step can further include: using the generalization result of the text as a domain classification template of the domain to which the text belongs. That is, once the domain of a text has been labeled, its generalization result is added as a domain classification template of the corresponding domain and is then used to label the domains of other texts, and the cycle repeats.
In other words, after the domain of a text has been labeled, the generalization result of the text serves as a domain classification template of that domain, so that the number of templates in each domain keeps growing. This further improves the classification ability of the templates in each domain, so that the domains of texts are labeled more accurately.
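The growing-template cycle can be illustrated with a short sketch. Several choices here are assumptions rather than details from the text: the inputs are already generalized, Jaccard similarity is the hit test, and only texts that hit a template are added as new templates (texts that fall back to the preset domain are not).

```python
def similarity(a, b):
    """Jaccard similarity over word sets (an assumed similarity measure)."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b) if set_a | set_b else 1.0


def label_and_expand(generalized_texts, seed_templates, threshold=0.5, preset_domain="other"):
    """Label each generalized text, adding every hit as a new template for later texts."""
    templates = dict(seed_templates)  # copy so the seed templates are left untouched
    labels = []
    for text in generalized_texts:
        scored = [(similarity(text, template), domain) for template, domain in templates.items()]
        score, domain = max(scored, default=(0.0, preset_domain))
        if score > threshold:
            templates[text] = domain  # the labeled text becomes a template of its domain
        else:
            domain = preset_domain
        labels.append(domain)
    return labels, templates


seed = {"weather in [CITY] [DATE]": "weather"}
labels, grown = label_and_expand(
    ["weather for [CITY] [DATE]", "forecast for [CITY] [DATE]"], seed
)
```

The second text is too far from the seed template (similarity 1/3) to hit it directly, but it does hit the template contributed by the first text (similarity 0.6), which is the effect the cycle is meant to produce.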
It will be understood that, besides a large number of domain-labeled texts, this step also yields a large number of domain classification templates capable of identifying the domain of a text. The present invention can therefore also identify the domain of a text input by a user using the full set of obtained domain classification templates alone, without first training a classification model and then using the trained text domain recognition model to obtain the domain of the text.
By labeling the domains of texts with domain classification templates, this step obtains a large number of domain-labeled texts without manual labeling; the labeled texts are then used to train the text domain recognition model.
In 103, a classification model is trained with each text as input and the domain labeled for that text as output, to obtain the text domain recognition model.
In this step, each text and the domain labeled for it in step 102 form a training sample. The classification model is trained with each text as input and its labeled domain as output, yielding the text domain recognition model. With the trained model, the domain of a text can be obtained from the text input by a user.
The classification model can be a support vector machine, a neural network model, a deep learning model, or the like; the present invention does not limit the type of classification model.
Because step 102 yields a large number of domain-labeled texts, the text domain recognition model trained in this step on sufficient labeled data can identify the domain of a text more accurately.
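As a rough illustration of step 103, the sketch below trains a multi-class perceptron on bag-of-words features using template-labeled samples. The perceptron is only a small, dependency-free stand-in for the support vector machine or neural network named in the text, and the sample data and function names are invented for the example.

```python
def featurize(text):
    """Bag-of-words features: the lowercase, whitespace-separated words of the text."""
    return text.lower().split()


def predict(weights, text):
    """Return the domain with the highest summed word weight (ties go to the first name in sorted order)."""
    words = featurize(text)
    return max(sorted(weights), key=lambda d: sum(weights[d].get(w, 0.0) for w in words))


def train_perceptron(samples, epochs=10):
    """Train one weight table per domain from (text, labeled_domain) pairs."""
    weights = {domain: {} for _, domain in samples}
    for _ in range(epochs):
        for text, gold in samples:
            guess = predict(weights, text)
            if guess != gold:
                # Standard perceptron update: reward the labeled domain, penalize the wrong guess.
                for word in featurize(text):
                    weights[gold][word] = weights[gold].get(word, 0.0) + 1.0
                    weights[guess][word] = weights[guess].get(word, 0.0) - 1.0
    return weights


# Texts paired with the domains that template labeling would have assigned to them.
samples = [
    ("weather in beijing today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("tell me a joke", "chat"),
    ("how are you today", "chat"),
]
model = train_perceptron(samples)
```

After training, `predict(model, "rain in beijing")` returns "weather" and `predict(model, "how are you")` returns "chat". In practice the training set would be the much larger collection of texts labeled in step 102.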
By fusing template classification and model classification to identify the domain of a text, the present invention can, in the early stage of training, obtain a large number of domain-labeled texts from a small number of domain classification templates, and correspondingly obtain a large number of domain classification templates, so that these no longer need to be configured manually and labor costs are reduced. In the later stage of training, the classification model can be trained on the ample domain-labeled texts. This alleviates the limitations of existing classification approaches at different stages of training and effectively avoids the overfitting problem that arises when classification templates or a classification model are used alone for text domain recognition, so that the resulting text domain recognition model recognizes domains better.
Fig. 2 is a structural diagram of an apparatus for establishing a text domain recognition model provided by an embodiment of the present invention. As shown in Fig. 2, the apparatus comprises an acquiring unit 21, a labeling unit 22 and a training unit 23.
The acquiring unit 21 is configured to obtain texts that have not been classified by domain.
The acquiring unit 21 obtains texts that have not been classified by domain, that is, texts whose domain has not been labeled.
It will be understood that the acquiring unit 21 can obtain a large number of such texts from the internet by data mining, for example by mining the search queries entered by users from web search logs; the acquiring unit 21 can also obtain them through manual collection. The present invention does not limit how the texts that have not been classified by domain are obtained.
The labeling unit 22 is configured to label the domain to which each text belongs using domain classification templates.
The labeling unit 22 uses domain classification templates to label the domain of each unclassified text obtained by the acquiring unit 21, thereby determining the domain to which each text belongs.
The labeling unit 22 can obtain the domain classification templates as follows: obtain the common texts of each domain, that is, texts that are common and representative in each domain; perform word segmentation on each common text to obtain the semantics of each word in it; generalize the common text according to the semantics of its words; and use the generalization result as a domain classification template of the domain to which the common text belongs.
It will be understood that the labeling unit 22 can use texts whose frequency of occurrence in a domain exceeds a preset frequency as the common texts of that domain; the labeling unit 22 can also select N texts at random from the texts of each domain as its common texts, where N is a positive integer greater than or equal to 1; the labeling unit 22 can also obtain the common texts of each domain through manual collection. The present invention does not limit how the common texts of each domain are obtained.
A domain classification template is used to classify the domain to which a text belongs. Each domain classification template corresponds to exactly one domain, but one domain can have multiple different domain classification templates.
Specifically, the labeling unit 22 can label the domain of a text using domain classification templates as follows: perform word segmentation on the text to obtain the semantics of each word in it; generalize the text according to the semantics of its words to obtain the generalization result of the text; judge whether the generalization result hits a domain classification template; if it does, label the domain corresponding to the hit template as the domain to which the text belongs; if it does not, label the domain to which the text belongs as a preset domain. The preset domain can be a domain set separately by the user, or any one of the existing domains.
It will be understood that the labeling unit 22 can judge whether the generalization result of a text hits a domain classification template as follows: calculate the text similarity between the generalization result and the domain classification template, and judge whether the calculated similarity is greater than a preset threshold. If it is, the generalization result is determined to hit the template; otherwise it misses the template.
In addition, after labeling the domain of a text using the domain classification templates, the labeling unit 22 can further: use the generalization result of the text as a domain classification template of the domain to which the text belongs. That is, once the domain of a text has been labeled, the labeling unit 22 adds its generalization result as a domain classification template of the corresponding domain and then uses it to label the domains of other texts, and the cycle repeats, so that a large number of domain classification templates can be obtained without manual effort.
In other words, after the labeling unit 22 has labeled the domain of a text, the generalization result of the text serves as a domain classification template of that domain, so that the number of templates in each domain keeps growing. This further improves the classification ability of the templates in each domain, so that the domains of texts are labeled more accurately.
It will be understood that, besides a large number of domain-labeled texts, the labeling unit 22 also yields a large number of domain classification templates capable of identifying the domain of a text. The present invention can therefore also identify the domain of a text input by a user using the full set of obtained domain classification templates alone, without first training a classification model and then using the trained text domain recognition model to obtain the domain of the text.
By labeling the domains of texts with domain classification templates, the labeling unit 22 obtains a large number of domain-labeled texts without manual labeling; the labeled texts are then used to train the text domain recognition model.
The training unit 23 is configured to train a classification model with each text as input and the domain labeled for that text as output, to obtain the text domain recognition model.
The training unit 23 uses each text and the domain labeled for it by the labeling unit 22 as a training sample, and trains the classification model with each text as input and its labeled domain as output, yielding the text domain recognition model. With the model trained by the training unit 23, the domain of a text can be obtained from the text input by a user.
The classification model can be a support vector machine, a neural network model, a deep learning model, or the like; the present invention does not limit the type of classification model.
Because the labeling unit 22 yields a large number of domain-labeled texts, the text domain recognition model that the training unit 23 trains on sufficient labeled data can effectively avoid the overfitting problem and thus identify the domain of a text more accurately.
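The division of the apparatus into three units can be pictured as a simple composition of three callables. The class name ModelBuilder and the toy stand-ins passed to it are hypothetical; the sketch only shows how the acquiring, labeling and training units hand data to one another.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class ModelBuilder:
    """Hypothetical composition mirroring units 21, 22 and 23 of the apparatus."""
    acquire: Callable[[], Iterable[str]]             # acquiring unit 21: yields unlabeled texts
    label: Callable[[str], str]                      # labeling unit 22: text -> domain
    train: Callable[[List[Tuple[str, str]]], Any]    # training unit 23: samples -> model

    def build(self) -> Any:
        texts = list(self.acquire())
        samples = [(text, self.label(text)) for text in texts]
        return self.train(samples)


# Toy stand-ins: a fixed corpus, a keyword rule, and a lookup table used as the "model".
builder = ModelBuilder(
    acquire=lambda: ["weather in beijing", "tell me a joke"],
    label=lambda text: "weather" if "weather" in text else "chat",
    train=lambda samples: dict(samples),
)
model = builder.build()
```

Any real template-based labeler or classifier trainer with the same call shapes could be substituted for the lambdas.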
Fig. 3 shows a block diagram of an exemplary computer system/server 012 suitable for implementing embodiments of the present invention. The computer system/server 012 shown in Fig. 3 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 3, the computer system/server 012 takes the form of a general-purpose computing device. Its components can include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 connecting the different system components (including the system memory 028 and the processing unit 016).
The bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 012 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the computer system/server 012, including volatile and non-volatile media and removable and non-removable media.
The system memory 028 can include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. The computer system/server 012 can further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 034 can be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 3, commonly called a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media) can be provided. In these cases, each drive can be connected to the bus 018 through one or more data media interfaces. The memory 028 can include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 040 having a set of (at least one) program modules 042 can be stored in, for example, the memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 042 generally perform the functions and/or methods of the embodiments described in the present invention.
Computer system/server 012 may also communicate with one or more external devices 014 (such as a keyboard, a pointing device, display 024, etc.); in the present invention, computer system/server 012 communicates with an external radar device. It may also communicate with one or more devices that enable a user to interact with computer system/server 012, and/or with any device (such as a network card, a modem, etc.) that enables computer system/server 012 to communicate with one or more other computing devices. This communication can take place through input/output (I/O) interface 022. In addition, computer system/server 012 can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through network adapter 020. As shown, network adapter 020 communicates with the other modules of computer system/server 012 through bus 018. It should be understood that, although not shown in Fig. 3, other hardware and/or software modules may be used in conjunction with computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Processing unit 016 runs the programs stored in system memory 028 to perform various functional applications and data processing, for example to implement the method flow provided by the embodiments of the present invention.
The above computer program may be provided in a computer storage medium; that is, the computer storage medium is encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above embodiments of the present invention. For example, the method flow provided by the embodiments of the present invention is executed by the one or more processors described above.
With the passage of time and the development of technology, the meaning of "medium" has become increasingly broad: the distribution path of a computer program is no longer limited to tangible media, and a program can also, for example, be downloaded directly from a network. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted over any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the text field identification model is obtained by fusing template classification with model classification. This can alleviate the limitations of existing text field identification methods and effectively avoid the overfitting problem that arises when a classification template or a classification model is used alone for text field identification, thereby improving the accuracy with which the text field identification model identifies the field of a text.
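As a rough illustration of fusing the two approaches, the sketch below labels unclassified texts with templates and then trains a classifier on those labels. Everything named here is an assumption made for the example: the keyword templates, the preset field name `other`, whitespace tokenization, and the tiny naive Bayes classifier, which merely stands in for whatever classification model an implementation would actually train.

```python
import math
from collections import Counter, defaultdict

DEFAULT_FIELD = "other"  # hypothetical preset field for texts no template matches

# Hypothetical toy templates: field -> keywords that indicate the field.
TEMPLATES = {
    "music": {"play", "song", "album"},
    "weather": {"weather", "rain", "forecast"},
}


def label_with_templates(text):
    """Return the first field whose template keywords overlap the text."""
    tokens = set(text.lower().split())
    for field, keywords in TEMPLATES.items():
        if tokens & keywords:
            return field
    return DEFAULT_FIELD


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the field
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Texts that have not undergone domain classification.
unlabeled = [
    "play a song by the beatles",
    "play the new album",
    "will it rain tomorrow",
    "weather forecast for beijing",
]
labels = [label_with_templates(t) for t in unlabeled]  # template labeling
model = NaiveBayes().fit(unlabeled, labels)            # text in, field out
```

In this toy setup, `label_with_templates("tomorrow in beijing")` falls back to the preset field because no keyword matches, while `model.predict("tomorrow in beijing")` can still pick a field from the words it saw during training, which is the kind of complementarity the paragraph above describes.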
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (12)
1. A method for establishing a text field identification model, characterized in that the method comprises:
obtaining texts that have not undergone domain classification;
labeling the field to which each text belongs using a domain classification template;
training a classification model with each text as input and the field labeled for that text as output, to obtain a text field identification model;
wherein the text field identification model can be used to identify the field to which a text input by a user belongs.
2. The method according to claim 1, characterized in that the domain classification template is obtained in the following manner:
obtaining common texts of each field;
performing word segmentation on the common text to obtain the semantics of each word in the common text;
generalizing the common text according to the semantics of each word in the common text;
using the generalized result of the common text as the domain classification template of the field to which the common text belongs.
3. The method according to claim 1, characterized in that labeling the field to which the text belongs using the domain classification template comprises:
performing word segmentation on the text to obtain the semantics of each word in the text;
generalizing the text according to the semantics of each word in the text, to obtain a generalized result of the text;
judging whether the generalized result of the text hits the domain classification template;
if the generalized result of the text hits the domain classification template, labeling the field corresponding to the hit domain classification template as the field to which the text belongs;
if the generalized result of the text misses the domain classification template, labeling the field to which the text belongs as a preset field.
4. The method according to claim 3, characterized in that judging whether the generalized result of the text hits the domain classification template comprises:
calculating the text similarity between the generalized result of the text and the domain classification template;
if the calculated text similarity is greater than a preset threshold, determining that the generalized result of the text hits the domain classification template, and otherwise determining a miss.
5. The method according to claim 1, characterized in that, after labeling the field to which the text belongs using the domain classification template, the method further comprises:
using the generalized result of the text as a domain classification template of the field to which the text belongs.
6. An apparatus for establishing a text field identification model, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain texts that have not undergone domain classification;
a labeling unit, configured to label the field to which each text belongs using a domain classification template;
a training unit, configured to train a classification model with each text as input and the field labeled for that text as output, to obtain a text field identification model.
7. The apparatus according to claim 6, characterized in that the labeling unit obtains the domain classification template in the following manner:
obtaining common texts of each field;
performing word segmentation on the common text to obtain the semantics of each word in the common text;
generalizing the common text according to the semantics of each word in the common text;
using the generalized result of the common text as the domain classification template of the field to which the common text belongs.
8. The apparatus according to claim 6, characterized in that, when labeling the field to which the text belongs using the domain classification template, the labeling unit specifically performs:
performing word segmentation on the text to obtain the semantics of each word in the text;
generalizing the text according to the semantics of each word in the text, to obtain a generalized result of the text;
judging whether the generalized result of the text hits the domain classification template;
if the generalized result of the text hits the domain classification template, labeling the field corresponding to the hit domain classification template as the field to which the text belongs;
if the generalized result of the text misses the domain classification template, labeling the field to which the text belongs as a preset field.
9. The apparatus according to claim 8, characterized in that, when judging whether the generalized result of the text hits the domain classification template, the labeling unit specifically performs:
calculating the text similarity between the generalized result of the text and the domain classification template;
if the calculated text similarity is greater than a preset threshold, determining that the generalized result of the text hits the domain classification template, and otherwise determining a miss.
10. The apparatus according to claim 6, characterized in that, after labeling the field to which the text belongs using the domain classification template, the labeling unit further performs:
using the generalized result of the text as a domain classification template of the field to which the text belongs.
11. A device, characterized in that the device comprises:
one or more processors;
a storage apparatus for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 5.
12. A storage medium containing computer-executable instructions, the computer-executable instructions being used, when executed by a computer processor, to perform the method according to any one of claims 1 to 5.
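The template steps recited in claims 2 to 5 can be pictured with a small sketch. It is only one possible reading, under several assumptions that the claims do not fix: words are segmented on whitespace (Chinese text would need a real segmenter), word semantics come from a hypothetical lexicon `SEMANTICS`, text similarity is Jaccard similarity over tokens, the preset threshold is 0.5, and the preset field is named `other`.

```python
# Hypothetical lexicon mapping a word to its semantic category (slot).
SEMANTICS = {
    "beatles": "[singer]", "adele": "[singer]",
    "beijing": "[city]", "shanghai": "[city]",
}
PRESET_FIELD = "other"   # assumed name for the preset field used on a miss
THRESHOLD = 0.5          # assumed preset similarity threshold


def generalize(text):
    """Segment on whitespace, then replace each known word by its category."""
    return tuple(SEMANTICS.get(word, word) for word in text.lower().split())


def similarity(a, b):
    """Jaccard similarity between the token sets of two generalized results."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_templates(common_texts):
    """Build {field: set of generalized results} from (field, text) pairs."""
    templates = {}
    for field, text in common_texts:
        templates.setdefault(field, set()).add(generalize(text))
    return templates


def label(text, templates, grow=False):
    """Return the field of the best-matching template, or PRESET_FIELD on a miss."""
    generalized = generalize(text)
    best_field, best_score = PRESET_FIELD, THRESHOLD
    for field, patterns in templates.items():
        for pattern in patterns:
            score = similarity(generalized, pattern)
            if score > best_score:  # a hit requires similarity above the threshold
                best_field, best_score = field, score
    if grow and best_field != PRESET_FIELD:
        # Optionally keep the generalized result as a new template of that field.
        templates[best_field].add(generalized)
    return best_field


templates = build_templates([
    ("music", "play a song by beatles"),
    ("weather", "weather in beijing today"),
])
```

Because generalization replaces "beatles" and "adele" with the same slot, `label("play a song by adele", templates)` matches the music template exactly, whereas a text with no overlapping tokens falls back to the preset field. Passing `grow=True` adds the newly matched pattern to the template set, in the spirit of claim 5.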
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811376081.XA CN109597888A (en) | 2018-11-19 | 2018-11-19 | Establish the method, apparatus of text field identification model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109597888A (en) | 2019-04-09 |
Family
ID=65958765
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811376081.XA (Pending) CN109597888A (en) | 2018-11-19 | 2018-11-19 | Establish the method, apparatus of text field identification model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109597888A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112445897A (en) * | 2021-01-28 | 2021-03-05 | 京华信息科技股份有限公司 | Method, system, device and storage medium for large-scale classification and labeling of text data |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015194052A1 (en) * | 2014-06-20 | 2015-12-23 | Nec Corporation | Feature weighting for naive bayes classifiers using a generative model |
US20160124933A1 (en) * | 2014-10-30 | 2016-05-05 | International Business Machines Corporation | Generation apparatus, generation method, and program |
CN105677873A (en) * | 2016-01-11 | 2016-06-15 | 中国电子科技集团公司第十研究所 | Text information associating and clustering collecting processing method based on domain knowledge model |
CN106156766A (en) * | 2015-03-25 | 2016-11-23 | 阿里巴巴集团控股有限公司 | The generation method and device of line of text grader |
CN107506434A (en) * | 2017-08-23 | 2017-12-22 | 北京百度网讯科技有限公司 | Method and apparatus based on artificial intelligence classification phonetic entry text |
CN107680579A (en) * | 2017-09-29 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Text regularization model training method and device, text regularization method and device |
CN107908635A (en) * | 2017-09-26 | 2018-04-13 | 百度在线网络技术(北京)有限公司 | Establish textual classification model and the method, apparatus of text classification |
CN108090127A (en) * | 2017-11-15 | 2018-05-29 | 北京百度网讯科技有限公司 | Question and answer text evaluation model is established with evaluating the method, apparatus of question and answer text |
CN108228758A (en) * | 2017-12-22 | 2018-06-29 | 北京奇艺世纪科技有限公司 | A kind of file classification method and device |
CN108304442A (en) * | 2017-11-20 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of text message processing method, device and storage medium |
CN108647325A (en) * | 2018-05-11 | 2018-10-12 | 吉林大学 | A kind of Text Classification System of avoidable over-fitting |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190409 |