CN110046254A - Method and apparatus for generating model - Google Patents
- Publication number
- CN110046254A (application number CN201910312916.3A)
- Authority
- CN
- China
- Prior art keywords
- training sample
- sample
- text
- training
- sample set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
Embodiments of the present application disclose a method and apparatus for generating a model. One specific embodiment of the method includes: obtaining a first training sample set, where the training samples in the first training sample set include sample texts; counting, for each text length, the proportion of sample texts of that length in the first training sample set; extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and using a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set. This embodiment provides a model training mechanism that extracts training samples based on a concrete scenario, improving the accuracy of the model's output.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, and more particularly to a method and apparatus for generating a model.
Background art
With the development of AI (Artificial Intelligence) technology, various tasks can be implemented with machine learning models, such as speech classification tasks and text classification tasks. Before an actual task is executed, the model must first be trained to obtain a machine learning model with the corresponding capability.
Summary of the invention
Embodiments of the present application propose a method and apparatus for generating a model.
In a first aspect, an embodiment of the present application provides a method for generating a model, the method comprising: obtaining a first training sample set, where the training samples in the first training sample set include sample texts; counting, for each text length, the proportion of sample texts of that length in the first training sample set; extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and using a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
In some embodiments, extracting training samples from the first training sample set according to the counted proportions to obtain the second training sample set comprises: extracting training samples from the first training sample set according to the counted proportions; and removing, from the extracted training samples, any training sample that contains a predetermined keyword, to obtain the second training sample set.
In some embodiments, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some embodiments, counting the proportion of sample texts of each length in the first training sample set comprises: counting the proportion of sample texts of each length among the positive samples of the first training sample set, and counting the proportion of sample texts of each length among the negative samples; and extracting training samples from the first training sample set according to the counted proportions to obtain the second training sample set comprises: extracting positive samples from the positive samples included in the first training sample set according to the counted per-length proportions among positive samples, and extracting negative samples from the negative samples included in the first training sample set according to the counted per-length proportions among negative samples, to obtain the second training sample set.
In some embodiments, removing the training samples that contain a predetermined keyword from the extracted training samples to obtain the second training sample set comprises: removing, from the extracted negative samples, any training sample that contains a keyword of the positive samples, to obtain the second training sample set.
In a second aspect, an embodiment of the present application provides an apparatus for generating a model, the apparatus comprising: an acquiring unit configured to obtain a first training sample set, where the training samples in the first training sample set include sample texts; a statistics unit configured to count, for each text length, the proportion of sample texts of that length in the first training sample set; an extracting unit configured to extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and a training unit configured to use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
In some embodiments, the extracting unit comprises: an extracting subunit configured to extract training samples from the first training sample set according to the counted proportions; and a removing unit configured to remove, from the extracted training samples, any training sample that contains a predetermined keyword, to obtain the second training sample set.
In some embodiments, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some embodiments, the statistics unit is further configured to: count the proportion of sample texts of each length among the positive samples of the first training sample set, and count the proportion of sample texts of each length among the negative samples; and the extracting unit is further configured to: extract positive samples from the positive samples included in the first training sample set according to the counted per-length proportions among positive samples, and extract negative samples from the negative samples included in the first training sample set according to the counted per-length proportions among negative samples, to obtain the second training sample set.
In some embodiments, the removing unit is further configured to: remove, from the extracted negative samples, any training sample that contains a keyword of the positive samples, to obtain the second training sample set.
In a third aspect, an embodiment of the present application provides a device, comprising: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method of the first aspect.
The method and apparatus for generating a model provided by embodiments of the present application obtain a first training sample set whose training samples include sample texts; count, for each text length, the proportion of sample texts of that length in the first training sample set; extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set. This provides a model training mechanism that extracts training samples based on a concrete scenario, improving the accuracy of the model's output.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent by reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating a model according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for generating a model according to the present application;
Fig. 4 is a flowchart of another embodiment of the method for generating a model according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the apparatus for generating a model according to the present application;
Fig. 6 is a structural schematic diagram of a computer system adapted to implement a server or terminal of embodiments of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating a model or the apparatus for generating a model of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as text-processing applications, speech-processing applications, map applications and search applications.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, smart speakers, laptop computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background server supporting the applications installed on the terminal devices 101, 102, 103. The server 105 may obtain a first training sample set whose training samples include sample texts; count, for each text length, the proportion of sample texts of that length in the first training sample set; extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
It should be noted that the method for generating a model provided by embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; correspondingly, the apparatus for generating a model may be arranged in the server 105 or in the terminal devices 101, 102, 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating a model according to the present application is shown. The method for generating a model comprises the following steps:
Step 201: obtain a first training sample set.
In this embodiment, the executing body of the method for generating a model (such as the server or terminal shown in Fig. 1) may first obtain a first training sample set, where the training samples in the first training sample set include sample texts. The sample texts may come from a specific scenario, for example query statements received by a search engine, text converted from voice instructions received by a smart speaker, or text converted from voice instructions received by an in-vehicle smart device. Text collected online over a period of time may be used as the sample texts.
Step 202: count, for each text length, the proportion of sample texts of that length in the first training sample set.
In this embodiment, the executing body may count the proportion of sample texts of each length in the first training sample set obtained in step 201. For example, there may be 300,000 sample texts of length 3, accounting for 15%; 600,000 of length 4, accounting for 30%; 600,000 of length 5, accounting for 30%; 300,000 of length 6, accounting for 15%; and 200,000 of other lengths, accounting for 10%.
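The per-length statistics of step 202 can be sketched as follows. This is a minimal illustration; the function name and the data layout (a list of `(text, label)` pairs) are assumptions of this sketch, not part of the patent:

```python
from collections import Counter

def length_distribution(samples):
    # samples: list of (sample_text, annotation) pairs.
    # Returns {text length: share of the first training sample set}.
    counts = Counter(len(text) for text, _ in samples)
    total = len(samples)
    return {length: n / total for length, n in counts.items()}

# Toy first training sample set: three texts of length 3, one of length 5.
first_set = [("abc", 1), ("def", 0), ("ghi", 1), ("hello", 0)]
dist = length_distribution(first_set)  # {3: 0.75, 5: 0.25}
```

On the example of step 202, such a function would return shares of 0.15, 0.30, 0.30, 0.15 and 0.10 for lengths 3 through 6 and the remainder.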
Step 203: extract training samples from the first training sample set according to the counted proportions, to obtain a second training sample set.
In this embodiment, the executing body may extract training samples from the first training sample set according to the proportions counted in step 202, to obtain the second training sample set. As an example, suppose step 202 counted 300,000 sample texts of length 3 (15%), 600,000 of length 4 (30%), 600,000 of length 5 (30%), 300,000 of length 6 (15%) and 200,000 of other lengths (10%). The executing body may then extract from the first training sample set 3,000 sample texts of length 3 (15%), 6,000 of length 4 (30%), 6,000 of length 5 (30%), 3,000 of length 6 (15%) and 2,000 of other lengths (10%). A deviation between the actual extraction proportions and the counted proportions is acceptable within a preset range.
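The proportional extraction of step 203 can be sketched as a stratified draw over text lengths. A minimal sketch, assuming the same `(text, label)` pair layout as before; the rounding introduces exactly the kind of small deviation from the counted proportions that the embodiment tolerates:

```python
import random
from collections import defaultdict

def extract_by_proportion(first_set, distribution, k, seed=0):
    # Draw about k samples so that each text length's share in the
    # result matches the counted distribution (step 203).
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for sample in first_set:
        by_length[len(sample[0])].append(sample)
    second_set = []
    for length, share in distribution.items():
        n = round(k * share)
        pool = by_length.get(length, [])
        second_set.extend(rng.sample(pool, min(n, len(pool))))
    return second_set

first_set = [("abc", 1)] * 8 + [("hello", 0)] * 2
second_set = extract_by_proportion(first_set, {3: 0.8, 5: 0.2}, k=5)
# 4 texts of length 3 and 1 of length 5, matching the 0.8 / 0.2 shares.
```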
In addition, the extracted sample texts may be annotated after extraction from the first training sample set; compared with annotating all the sample texts in the first training sample set, this saves system resources. The annotation information corresponding to a sample text may be produced manually or by machine, and differs with the specific model: for example, if the model to be trained is a model for recognizing mood, the annotation information may include emotional information such as happy or unhappy.
Step 204: using a machine learning algorithm, train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output.
In this embodiment, the executing body may use a machine learning algorithm to train a text-processing model for the target text, taking the sample texts included in the second training sample set obtained in step 203 as input and the annotation information corresponding to the input sample texts as desired output. The text-processing model may be a model for recognizing mood, a model for judging text intent, and so on. Here, the target text has the same source as the sample texts in the first training sample set; having the same source may mean that the data come from the same application scenario, for example all from a smart speaker, all from a search engine, or all from an in-vehicle smart device.
Specifically, the executing body may use the machine learning algorithm to train an initial model (such as a recurrent neural network or a convolutional neural network), taking the sample texts of the second training sample set obtained in step 203 as input and the annotation information corresponding to the input sample texts as desired output, obtaining an actual output for each input sample text. Here, the actual output is what the initial model actually outputs. The executing body may then use methods such as gradient descent to adjust the parameters of the initial model based on the actual output and the desired output, take the model obtained after each parameter adjustment as the initial model for the next round of training, and end training when a preset training termination condition is met, thereby obtaining the trained text-processing model. In addition, the process may also include preprocessing steps such as word segmentation and stop-word removal on the sample texts, and the preprocessed words may be converted into word vectors based on machine learning models such as convolutional neural networks, recurrent neural networks or bag-of-words models.
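As a concrete stand-in for the training loop of step 204 — gradient descent driven by the gap between actual and desired output — the following sketch trains a bag-of-words logistic regression instead of the recurrent or convolutional network the embodiment mentions; the model choice, vocabulary and all names are illustrative assumptions:

```python
import math

def featurize(text, vocab):
    # Bag-of-words counts over a fixed vocabulary.
    words = text.split()
    return [words.count(v) for v in vocab]

def train_logreg(samples, vocab, epochs=200, lr=0.5):
    # Sample text as input, annotation as desired output; parameters
    # adjusted by gradient descent from the actual vs. desired output.
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text, vocab)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # actual output
            err = p - label                  # mismatch with desired output
            for i in range(len(w)):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

def predict(text, vocab, w, b):
    z = sum(wi * xi for wi, xi in zip(w, featurize(text, vocab))) + b
    return 1 if z > 0 else 0

vocab = ["happy", "sad"]
second_set = [("happy today", 1), ("so sad", 0),
              ("happy happy", 1), ("sad news", 0)]
w, b = train_logreg(second_set, vocab)
```

In the patent's terms, the loop plays the role of the initial model plus parameter adjustment; a fixed epoch count stands in for the preset training termination condition.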
In some optional implementations of this embodiment, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some optional implementations of this embodiment, counting the proportion of sample texts of each length in the first training sample set comprises: counting the proportion of sample texts of each length among the positive samples of the first training sample set, and counting the proportion of sample texts of each length among the negative samples; and extracting training samples from the first training sample set according to the counted proportions to obtain the second training sample set comprises: extracting positive samples from the positive samples included in the first training sample set according to the counted per-length proportions among positive samples, and extracting negative samples from the negative samples included in the first training sample set according to the counted per-length proportions among negative samples, to obtain the second training sample set.
In this implementation, when training the binary classification model, the per-length proportions of sample texts are adjusted separately within the positive and negative samples, so that the length distribution of texts in the positive and negative samples is closer to the length distribution of texts in the actual scenario, further improving the accuracy of the trained model's output.
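The per-class statistics of this implementation can be sketched by computing the length distribution separately for positive and negative samples; the layout and names are again illustrative assumptions:

```python
from collections import Counter

def length_shares_per_class(samples):
    # {label: {text length: share within that label's samples}}.
    # Positive (1) and negative (0) samples are counted separately,
    # so each class can then be extracted with its own proportions.
    shares = {}
    for label in sorted({l for _, l in samples}):
        texts = [t for t, l in samples if l == label]
        counts = Counter(len(t) for t in texts)
        shares[label] = {n: c / len(texts) for n, c in counts.items()}
    return shares

samples = [("abc", 1), ("hello", 1), ("hi", 0), ("no", 0)]
shares = length_shares_per_class(samples)
# {0: {2: 1.0}, 1: {3: 0.5, 5: 0.5}}
```

Each class's distribution would then drive its own proportional draw, as in the step 203 sketch.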
The method provided by the above embodiment of the present application obtains a first training sample set whose training samples include sample texts; counts, for each text length, the proportion of sample texts of that length in the first training sample set; extracts training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and uses a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set. This provides a model training mechanism that extracts training samples based on a concrete scenario, improving the accuracy of the model's output.
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for generating a model according to this embodiment. In the application scenario of Fig. 3, a server 301 may obtain, as training samples, texts converted from voice information collected by in-vehicle smart devices 302 and 303 together with the corresponding annotation information, thereby obtaining a first training sample set 304. It may then count, for each text length, the proportion of sample texts of that length in the first training sample set 304; extract training samples from the first training sample set 304 according to the counted proportions to obtain a second training sample set; and finally use a machine learning algorithm to train, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, a text-processing model for texts converted from voice information collected by the in-vehicle smart device 302, the in-vehicle smart device 303 or other in-vehicle smart devices.
With further reference to Fig. 4, a flow 400 of another embodiment of the method for generating a model is illustrated. The flow 400 of the method for generating a model comprises the following steps:
Step 401: obtain a first training sample set.
In this embodiment, the executing body of the method for generating a model (such as the server or terminal shown in Fig. 1) may first obtain a first training sample set.
Step 402: count, for each text length, the proportion of sample texts of that length in the first training sample set.
In this embodiment, the executing body may count the proportion of sample texts of each length in the first training sample set obtained in step 401.
Step 403: extract training samples from the first training sample set according to the counted proportions.
In this embodiment, the executing body may extract training samples from the first training sample set according to the proportions counted in step 402.
Step 404: remove, from the extracted training samples, any training sample that contains a predetermined keyword, to obtain the second training sample set.
In this embodiment, the executing body may remove, from the training samples extracted in step 403, any training sample that contains a predetermined keyword, to obtain the second training sample set. The keywords may include words likely to cause large deviations in the model being trained, for example words close in meaning to the annotation information.
In some optional implementations of this embodiment, removing the training samples that contain a predetermined keyword from the extracted training samples to obtain the second training sample set comprises: removing, from the extracted negative samples, any training sample that contains a keyword of the positive samples, to obtain the second training sample set. A negative sample containing a keyword of the positive samples causes high similarity between some positive and negative samples, which easily leads to large deviations in the model being trained; removing such samples reduces the interference and further improves the accuracy of the output of the generated model. As an example, suppose the model to be trained judges whether the mood in a text is happy or unhappy, where happy texts are positive samples and unhappy texts are negative samples; negative samples containing the word "happy" can then be removed.
In some optional implementations of this embodiment, training samples in the extracted positive samples that contain a keyword of the negative samples may likewise be removed.
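Step 404's keyword filter, applied in the direction this implementation describes (dropping negatives that contain a positive-class keyword such as "happy"), might look like the following; the function name and pair layout are assumptions of this sketch:

```python
def remove_conflicting_negatives(samples, positive_keywords):
    # Keep all positive samples (label 1); drop any negative sample
    # (label 0) whose text contains a keyword of the positive class.
    kept = []
    for text, label in samples:
        if label == 0 and any(kw in text for kw in positive_keywords):
            continue
        kept.append((text, label))
    return kept

extracted = [("happy day", 0), ("bad day", 0), ("happy news", 1)]
second_set = remove_conflicting_negatives(extracted, ["happy"])
# → [("bad day", 0), ("happy news", 1)]
```

The symmetric case (removing positives that contain negative-class keywords) would swap the label test.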
Step 405: using a machine learning algorithm, train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output.
In this embodiment, the executing body may use a machine learning algorithm to train a text-processing model for the target text, taking the sample texts included in the second training sample set obtained in step 404 as input and the annotation information corresponding to the input sample texts as desired output.
In this embodiment, the operations of steps 401, 402, 403 and 405 are substantially the same as the operations of steps 201, 202, 203 and 204, and are not described again here.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for generating a model in this embodiment removes the training samples containing a predetermined keyword from the extracted training samples. The scheme described in this embodiment thus further improves the quality of the training samples and, as a result, further improves the accuracy of the output of the generated model.
With further reference to Fig. 5, as an implementation of the methods shown in the figures above, the present application provides an embodiment of an apparatus for generating a model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating a model of this embodiment comprises: an acquiring unit 501, a statistics unit 502, an extracting unit 503 and a training unit 504. The acquiring unit is configured to obtain a first training sample set, where the training samples in the first training sample set include sample texts; the statistics unit is configured to count, for each text length, the proportion of sample texts of that length in the first training sample set; the extracting unit is configured to extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and the training unit is configured to use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
In the present embodiment, the specific processing of the acquiring unit 501, the statistic unit 502, the extracting unit 503, and the training unit 504 of the apparatus 500 for generating a model may refer to step 201, step 202, step 203, and step 204 in the embodiment corresponding to Fig. 2.
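The counting performed by the statistic unit can be sketched as follows: group the sample texts of the first training sample set by length and compute each group's share of the whole set. The coarse short/medium/long bucketing and the helper names are assumptions, since the embodiment does not fix a particular length granularity.

```python
# Sketch of the statistic unit: count the proportion each text-length
# group occupies in the first training sample set. Bucket thresholds
# are illustrative assumptions, not taken from the patent.
from collections import Counter

def length_bucket(text):
    """Coarse length grouping by word count (illustrative thresholds)."""
    n = len(text.split())
    if n <= 3:
        return "short"
    if n <= 10:
        return "medium"
    return "long"

def count_length_proportions(first_training_set):
    """first_training_set: list of (sample_text, markup) pairs."""
    buckets = Counter(length_bucket(text) for text, _ in first_training_set)
    total = sum(buckets.values())
    return {bucket: count / total for bucket, count in buckets.items()}

proportions = count_length_proportions([
    ("ok", 1),
    ("fine by me", 0),
    ("this one is a bit longer than that", 1),
    ("no", 0),
])
print(proportions)  # -> {'short': 0.75, 'medium': 0.25}
```

The resulting proportions are exactly what the extracting unit needs in the next step to draw a second set that mirrors the first set's length distribution.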
In some optional implementations of the present embodiment, the extracting unit comprises: an extracting subunit, configured to extract training samples from the first training sample set according to the counted proportions; and a removal unit, configured to remove, from the extracted training samples, the training samples containing predetermined keywords, to obtain the second training sample set.
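A sketch of the removal unit's filtering step, under the assumptions that a sample is a (text, markup) pair and that "contains a predetermined keyword" means simple substring matching; the keyword list itself is purely illustrative.

```python
# Sketch of the removal step: after extraction, drop any training
# sample whose text contains a predetermined keyword. The keyword set
# and matching rule (substring test) are illustrative assumptions.
PREDETERMINED_KEYWORDS = {"advertisement", "spam"}

def remove_keyword_samples(extracted_samples, keywords=PREDETERMINED_KEYWORDS):
    """Keep only the samples whose text contains none of the keywords."""
    return [
        (text, markup)
        for text, markup in extracted_samples
        if not any(keyword in text for keyword in keywords)
    ]

second_set = remove_keyword_samples([
    ("buy now limited spam offer", 0),
    ("the weather is nice today", 1),
])
print(second_set)  # only the keyword-free sample remains
```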
In some optional implementations of the present embodiment, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some optional implementations of the present embodiment, the statistic unit is further configured to: count the proportions of sample texts of different lengths among the positive samples of the first training sample set, and count the proportions of sample texts of different lengths among the negative samples of the first training sample set. The extracting unit is further configured to: extract positive samples from the positive samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the positive samples, and extract negative samples from the negative samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the negative samples, to obtain the second training sample set.
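The class-wise extraction described above can be sketched as stratified sampling: length proportions are taken separately over the positive and the negative samples, and each class is then drawn per length group so that the second set preserves that class's own length distribution. The bucketing, the quota rounding, and the function names are assumptions, not the patented procedure.

```python
# Sketch of class-wise, length-preserving extraction. Positives and
# negatives are sampled separately; within each class, per-length-group
# quotas are proportional to that group's share. Illustrative only.
import random
from collections import defaultdict

def extract_preserving_lengths(samples, n_out, bucket, rng):
    """Draw about n_out samples while keeping the length distribution."""
    by_bucket = defaultdict(list)
    for sample in samples:
        by_bucket[bucket(sample[0])].append(sample)
    total = len(samples)
    out = []
    for group in by_bucket.values():
        quota = round(n_out * len(group) / total)  # proportional quota
        out.extend(rng.sample(group, min(quota, len(group))))
    return out

def build_second_set(first_set, n_pos, n_neg, bucket, seed=0):
    """Extract positives and negatives separately, per the text above."""
    rng = random.Random(seed)
    positives = [s for s in first_set if s[1] == 1]
    negatives = [s for s in first_set if s[1] == 0]
    return (extract_preserving_lengths(positives, n_pos, bucket, rng)
            + extract_preserving_lengths(negatives, n_neg, bucket, rng))

# Hypothetical first training sample set: (sample_text, label) pairs.
first_set = [
    ("a b", 1), ("c d", 1),
    ("one two three four five", 1), ("six seven eight nine ten", 1),
    ("e f", 0), ("g h", 0),
    ("p q r s t", 0), ("u v w x y", 0),
]
bucket = lambda text: "short" if len(text.split()) <= 3 else "long"
second_set = build_second_set(first_set, n_pos=2, n_neg=2, bucket=bucket)
```

With two short and two long samples per class and a quota of two per class, the sketch keeps one short and one long sample from each class, mirroring each class's original half-and-half length split.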
In some optional implementations of the present embodiment, the removal unit is further configured to: remove, from the extracted negative samples, the training samples containing keywords that appear in the positive samples, to obtain the second training sample set.
The apparatus provided by the above embodiment of the present application acquires a first training sample set, the training samples in which include sample texts; counts the proportions of sample texts of different lengths in the first training sample set; extracts training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and, using a machine learning algorithm, takes the sample texts included in the second training sample set as input and the markup information corresponding to the input sample texts as the desired output to train a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set. This provides a model training mechanism that extracts training samples based on the concrete scenario, thereby improving the accuracy of the model's output.
Referring now to Fig. 6, a schematic structural diagram of a computer system 600 suitable for implementing a server or terminal of the embodiments of the present application is shown. The server or terminal shown in Fig. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required by the operations of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processes via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, may be mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which comprises a computer program carried on a computer-readable medium. The computer program comprises program code for executing the method as illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are executed. It should be noted that the computer-readable medium described herein may be a computer-readable signal medium, a computer-readable medium, or any combination of the two. The computer-readable medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, the computer-readable medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable medium described above, and may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the C language or similar programming languages. The program code may be executed entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In a scenario involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may, in fact, be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software or by means of hardware. The described units may also be provided in a processor; for example, a processor may be described as comprising an acquiring unit, a statistic unit, an extracting unit, and a training unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit configured to acquire a first training sample set".
In another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the apparatus, the apparatus is caused to: acquire a first training sample set, the training samples in the first training sample set including sample texts; count the proportions of sample texts of different lengths in the first training sample set; extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and, using a machine learning algorithm, take the sample texts included in the second training sample set as input and the markup information corresponding to the input sample texts as the desired output to train a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should appreciate that the scope of the invention involved in the present application is not limited to the technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.
Claims (12)
1. A method for generating a model, comprising:
acquiring a first training sample set, the training samples in the first training sample set including sample texts;
counting proportions of sample texts of different lengths in the first training sample set;
extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and
using a machine learning algorithm, taking the sample texts included in the second training sample set as input and markup information corresponding to the input sample texts as desired output, training a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set.
2. The method according to claim 1, wherein the extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set comprises:
extracting training samples from the first training sample set according to the counted proportions; and
removing, from the extracted training samples, the training samples containing predetermined keywords, to obtain the second training sample set.
3. The method according to claim 2, wherein the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
4. The method according to claim 3, wherein the counting proportions of sample texts of different lengths in the first training sample set comprises:
counting proportions of sample texts of different lengths among the positive samples of the first training sample set, and counting proportions of sample texts of different lengths among the negative samples of the first training sample set; and
the extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set comprises:
extracting positive samples from the positive samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the positive samples, and extracting negative samples from the negative samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the negative samples, to obtain the second training sample set.
5. The method according to claim 3 or 4, wherein the removing, from the extracted training samples, the training samples containing predetermined keywords to obtain the second training sample set comprises:
removing, from the extracted negative samples, the training samples containing keywords that appear in the positive samples, to obtain the second training sample set.
6. An apparatus for generating a model, comprising:
an acquiring unit, configured to acquire a first training sample set, the training samples in the first training sample set including sample texts;
a statistic unit, configured to count proportions of sample texts of different lengths in the first training sample set;
an extracting unit, configured to extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and
a training unit, configured to use a machine learning algorithm, taking the sample texts included in the second training sample set as input and markup information corresponding to the input sample texts as desired output, to train a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set.
7. The apparatus according to claim 6, wherein the extracting unit comprises:
an extracting subunit, configured to extract training samples from the first training sample set according to the counted proportions; and
a removal unit, configured to remove, from the extracted training samples, the training samples containing predetermined keywords, to obtain the second training sample set.
8. The apparatus according to claim 7, wherein the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
9. The apparatus according to claim 8, wherein the statistic unit is further configured to:
count proportions of sample texts of different lengths among the positive samples of the first training sample set, and count proportions of sample texts of different lengths among the negative samples of the first training sample set; and
the extracting unit is further configured to:
extract positive samples from the positive samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the positive samples, and extract negative samples from the negative samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the negative samples, to obtain the second training sample set.
10. The apparatus according to claim 8 or 9, wherein the removal unit is further configured to:
remove, from the extracted negative samples, the training samples containing keywords that appear in the positive samples, to obtain the second training sample set.
11. An electronic device, comprising:
one or more processors; and
a storage device on which one or more programs are stored;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.
12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910312916.3A CN110046254B (en) | 2019-04-18 | 2019-04-18 | Method and apparatus for generating a model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110046254A true CN110046254A (en) | 2019-07-23 |
CN110046254B CN110046254B (en) | 2022-03-08 |
Family
ID=67277799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910312916.3A Active CN110046254B (en) | 2019-04-18 | 2019-04-18 | Method and apparatus for generating a model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110046254B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202177A (en) * | 2016-06-27 | 2016-12-07 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
US20170004269A1 (en) * | 2015-06-30 | 2017-01-05 | BWW Holdings, Ltd. | Systems and methods for estimating mental health assessment results |
CN106780552A (en) * | 2016-11-08 | 2017-05-31 | 西安电子科技大学 | Anti-shelter target tracking based on regional area joint tracing detection study |
CN107562742A (en) * | 2016-06-30 | 2018-01-09 | 苏宁云商集团股份有限公司 | A kind of image processing method and device |
JP2018025956A (en) * | 2016-08-09 | 2018-02-15 | 日本電信電話株式会社 | Model creation device, estimation device, method, and program |
CN108062563A (en) * | 2017-12-12 | 2018-05-22 | 华东理工大学 | A kind of representative sample based on classification equilibrium finds method |
CN108287816A (en) * | 2017-01-10 | 2018-07-17 | 腾讯科技(深圳)有限公司 | Point of interest on-line checking, Machine learning classifiers training method and device |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109165658A (en) * | 2018-08-28 | 2019-01-08 | 哈尔滨工业大学(威海) | A kind of strong negative sample underwater target detection method based on Faster-RCNN |
CN109492764A (en) * | 2018-10-24 | 2019-03-19 | 平安科技(深圳)有限公司 | Training method, relevant device and the medium of production confrontation network |
- 2019-04-18 CN CN201910312916.3A patent/CN110046254B/en active Active
Non-Patent Citations (3)
Title |
---|
KIICHI TAGO et al.: "Influence analysis of emotional behaviors and user relationships based on Twitter data", Tsinghua Science and Technology *
ZHOU Yongzhang et al.: "Big Data Mining and Machine Learning in Geosciences", Sun Yat-sen University Press, 30 September 2018 *
LI Long: "Research on Methods for Identifying Suicidal Tendencies in Online Text", China Masters' Theses Full-text Database (Information Science and Technology Series) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851708A (en) * | 2019-10-16 | 2020-02-28 | 中国平安人寿保险股份有限公司 | Negative sample extraction method and device, computer equipment and storage medium |
CN110851708B (en) * | 2019-10-16 | 2023-11-03 | 中国平安人寿保险股份有限公司 | Negative sample extraction method, device, computer equipment and storage medium |
CN111143514A (en) * | 2019-12-27 | 2020-05-12 | 北京百度网讯科技有限公司 | Method and apparatus for generating information |
CN111143514B (en) * | 2019-12-27 | 2023-03-21 | 北京百度网讯科技有限公司 | Method and apparatus for generating information |
CN113111173A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm receiving warning condition category determination method and device |
CN113111234A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm condition category determination method and device |
CN111709247B (en) * | 2020-05-20 | 2023-04-07 | 北京百度网讯科技有限公司 | Data set processing method and device, electronic equipment and storage medium |
CN111709247A (en) * | 2020-05-20 | 2020-09-25 | 北京百度网讯科技有限公司 | Data set processing method and device, electronic equipment and storage medium |
CN112396047A (en) * | 2020-10-30 | 2021-02-23 | 北京文思海辉金信软件有限公司 | Training sample generation method and device, computer equipment and storage medium |
CN112613572A (en) * | 2020-12-30 | 2021-04-06 | 北京奇艺世纪科技有限公司 | Sample data obtaining method and device, electronic equipment and storage medium |
CN112613572B (en) * | 2020-12-30 | 2024-01-23 | 北京奇艺世纪科技有限公司 | Sample data obtaining method and device, electronic equipment and storage medium |
CN112836013A (en) * | 2021-01-29 | 2021-05-25 | 北京大米科技有限公司 | Data labeling method and device, readable storage medium and electronic equipment |
CN114091427A (en) * | 2021-11-19 | 2022-02-25 | 海信电子科技(武汉)有限公司 | Image text similarity model training method and display equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110046254B (en) | 2022-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110046254A (en) | Method and apparatus for generating model | |
CN108764487A (en) | For generating the method and apparatus of model, the method and apparatus of information for identification | |
CN108831505B (en) | Method and device for identifying use scenes of application | |
CN107491534A (en) | Information processing method and device | |
CN108986805B (en) | Method and apparatus for sending information | |
CN110019742B (en) | Method and device for processing information | |
CN109299477A (en) | Method and apparatus for generating text header | |
CN109635095A (en) | Method and apparatus for optimizing dialog model | |
CN108121800A (en) | Information generating method and device based on artificial intelligence | |
CN108933730A (en) | Information-pushing method and device | |
CN112650841A (en) | Information processing method and device and electronic equipment | |
CN109829164A (en) | Method and apparatus for generating text | |
CN108897853A (en) | The method and apparatus for generating pushed information | |
CN109582954A (en) | Method and apparatus for output information | |
CN109214501A (en) | The method and apparatus of information for identification | |
CN108959087A (en) | test method and device | |
CN110516261A (en) | Resume appraisal procedure, device, electronic equipment and computer storage medium | |
CN109543068A (en) | Method and apparatus for generating the comment information of video | |
CN109284367A (en) | Method and apparatus for handling text | |
CN108877779A (en) | Method and apparatus for detecting voice tail point | |
CN109117758A (en) | Method and apparatus for generating information | |
CN110245334A (en) | Method and apparatus for output information | |
CN109919220A (en) | Method and apparatus for generating the feature vector of video | |
CN109829431A (en) | Method and apparatus for generating information | |
CN110675865B (en) | Method and apparatus for training hybrid language recognition models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | | Effective date of registration: 20211012. Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, Economic and Technological Development Zone, Daxing District, Beijing. Applicant after: Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing. Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd. |
GR01 | Patent grant | ||