CN110046254A - Method and apparatus for generating model - Google Patents
- Publication number
- CN110046254A (application number CN201910312916.3A)
- Authority
- CN
- China
- Prior art keywords
- training sample
- sample
- text
- training
- sample set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
Embodiments of the present application disclose a method and apparatus for generating a model. One specific embodiment of the method includes: obtaining a first training sample set, where the training samples in the first training sample set include sample texts; counting, for each text length, the proportion of sample texts of that length in the first training sample set; extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and using a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set. This embodiment provides a model training mechanism that extracts training samples based on a concrete scenario, improving the accuracy of the model's output.
Description
Technical field
Embodiments of the present application relate to the field of computer technology, and more particularly to a method and apparatus for generating a model.
Background art
With the development of AI (Artificial Intelligence) technology, various tasks can be implemented with machine learning models, such as speech classification tasks and text classification tasks. Before an actual task is executed, the model must first be trained to obtain a machine learning model with the corresponding capability.
Summary of the invention
Embodiments of the present application propose a method and apparatus for generating a model.
In a first aspect, an embodiment of the present application provides a method for generating a model, the method comprising: obtaining a first training sample set, where the training samples in the first training sample set include sample texts; counting, for each text length, the proportion of sample texts of that length in the first training sample set; extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and using a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
In some embodiments, extracting training samples from the first training sample set according to the counted proportions to obtain the second training sample set comprises: extracting training samples from the first training sample set according to the counted proportions; and removing, from the extracted training samples, any training sample that contains a predetermined keyword, to obtain the second training sample set.
In some embodiments, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some embodiments, counting the proportion of sample texts of each length in the first training sample set comprises: counting the proportion of sample texts of each length among the positive samples of the first training sample set, and counting the proportion of sample texts of each length among the negative samples; and extracting training samples from the first training sample set according to the counted proportions to obtain the second training sample set comprises: extracting positive samples from the positive samples included in the first training sample set according to the counted per-length proportions among positive samples, and extracting negative samples from the negative samples included in the first training sample set according to the counted per-length proportions among negative samples, to obtain the second training sample set.
In some embodiments, removing the training samples that contain a predetermined keyword from the extracted training samples to obtain the second training sample set comprises: removing, from the extracted negative samples, any training sample that contains a keyword of the positive samples, to obtain the second training sample set.
In a second aspect, an embodiment of the present application provides an apparatus for generating a model, the apparatus comprising: an acquiring unit configured to obtain a first training sample set, where the training samples in the first training sample set include sample texts; a statistics unit configured to count, for each text length, the proportion of sample texts of that length in the first training sample set; an extracting unit configured to extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and a training unit configured to use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
In some embodiments, the extracting unit comprises: an extracting subunit configured to extract training samples from the first training sample set according to the counted proportions; and a removing unit configured to remove, from the extracted training samples, any training sample that contains a predetermined keyword, to obtain the second training sample set.
In some embodiments, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some embodiments, the statistics unit is further configured to: count the proportion of sample texts of each length among the positive samples of the first training sample set, and count the proportion of sample texts of each length among the negative samples; and the extracting unit is further configured to: extract positive samples from the positive samples included in the first training sample set according to the counted per-length proportions among positive samples, and extract negative samples from the negative samples included in the first training sample set according to the counted per-length proportions among negative samples, to obtain the second training sample set.
In some embodiments, the removing unit is further configured to: remove, from the extracted negative samples, any training sample that contains a keyword of the positive samples, to obtain the second training sample set.
In a third aspect, an embodiment of the present application provides a device, comprising: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method of the first aspect.
The method and apparatus for generating a model provided by embodiments of the present application obtain a first training sample set whose training samples include sample texts; count, for each text length, the proportion of sample texts of that length in the first training sample set; extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set. This provides a model training mechanism that extracts training samples based on a concrete scenario, improving the accuracy of the model's output.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent by reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating a model according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the method for generating a model according to the present application;
Fig. 4 is a flowchart of another embodiment of the method for generating a model according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the apparatus for generating a model according to the present application;
Fig. 6 is a structural schematic diagram of a computer system adapted to implement a server or terminal of embodiments of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating a model or the apparatus for generating a model of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as text-processing applications, speech-processing applications, map applications and search applications.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, smart speakers, laptop computers and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background server supporting the applications installed on the terminal devices 101, 102, 103. The server 105 may obtain a first training sample set whose training samples include sample texts; count, for each text length, the proportion of sample texts of that length in the first training sample set; extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
It should be noted that the method for generating a model provided by embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; correspondingly, the apparatus for generating a model may be arranged in the server 105 or in the terminal devices 101, 102, 103.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for generating a model according to the present application is shown. The method for generating a model comprises the following steps:
Step 201: obtain a first training sample set.
In this embodiment, the executing body of the method for generating a model (such as the server or terminal shown in Fig. 1) may first obtain a first training sample set, where the training samples in the first training sample set include sample texts. The sample texts may come from a specific scenario, for example query statements received by a search engine, text converted from voice instructions received by a smart speaker, or text converted from voice instructions received by an in-vehicle smart device. Text collected online over a period of time may be used as the sample texts.
Step 202: count, for each text length, the proportion of sample texts of that length in the first training sample set.
In this embodiment, the executing body may count the proportion of sample texts of each length in the first training sample set obtained in step 201. For example, there may be 300,000 sample texts of length 3, accounting for 15%; 600,000 of length 4, accounting for 30%; 600,000 of length 5, accounting for 30%; 300,000 of length 6, accounting for 15%; and 200,000 of other lengths, accounting for 10%.
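The per-length statistics of step 202 can be sketched as follows. This is a minimal illustration; the function name and the data layout (a list of `(text, label)` pairs) are assumptions of this sketch, not part of the patent:

```python
from collections import Counter

def length_distribution(samples):
    # samples: list of (sample_text, annotation) pairs.
    # Returns {text length: share of the first training sample set}.
    counts = Counter(len(text) for text, _ in samples)
    total = len(samples)
    return {length: n / total for length, n in counts.items()}

# Toy first training sample set: three texts of length 3, one of length 5.
first_set = [("abc", 1), ("def", 0), ("ghi", 1), ("hello", 0)]
dist = length_distribution(first_set)  # {3: 0.75, 5: 0.25}
```

On the example of step 202, such a function would return shares of 0.15, 0.30, 0.30, 0.15 and 0.10 for lengths 3 through 6 and the remainder.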
Step 203: extract training samples from the first training sample set according to the counted proportions, to obtain a second training sample set.
In this embodiment, the executing body may extract training samples from the first training sample set according to the proportions counted in step 202, to obtain the second training sample set. As an example, suppose step 202 counted 300,000 sample texts of length 3 (15%), 600,000 of length 4 (30%), 600,000 of length 5 (30%), 300,000 of length 6 (15%) and 200,000 of other lengths (10%). The executing body may then extract from the first training sample set 3,000 sample texts of length 3 (15%), 6,000 of length 4 (30%), 6,000 of length 5 (30%), 3,000 of length 6 (15%) and 2,000 of other lengths (10%). A deviation between the actual extraction proportions and the counted proportions is acceptable within a preset range.
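The proportional extraction of step 203 can be sketched as a stratified draw over text lengths. A minimal sketch, assuming the same `(text, label)` pair layout as before; the rounding introduces exactly the kind of small deviation from the counted proportions that the embodiment tolerates:

```python
import random
from collections import defaultdict

def extract_by_proportion(first_set, distribution, k, seed=0):
    # Draw about k samples so that each text length's share in the
    # result matches the counted distribution (step 203).
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for sample in first_set:
        by_length[len(sample[0])].append(sample)
    second_set = []
    for length, share in distribution.items():
        n = round(k * share)
        pool = by_length.get(length, [])
        second_set.extend(rng.sample(pool, min(n, len(pool))))
    return second_set

first_set = [("abc", 1)] * 8 + [("hello", 0)] * 2
second_set = extract_by_proportion(first_set, {3: 0.8, 5: 0.2}, k=5)
# 4 texts of length 3 and 1 of length 5, matching the 0.8 / 0.2 shares.
```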
In addition, the extracted sample texts may be annotated after extraction from the first training sample set; compared with annotating all the sample texts in the first training sample set, this saves system resources. The annotation information corresponding to a sample text may be produced manually or by machine, and differs with the specific model: for example, if the model to be trained is a model for recognizing mood, the annotation information may include emotional information such as happy or unhappy.
Step 204: using a machine learning algorithm, train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output.
In this embodiment, the executing body may use a machine learning algorithm to train a text-processing model for the target text, taking the sample texts included in the second training sample set obtained in step 203 as input and the annotation information corresponding to the input sample texts as desired output. The text-processing model may be a model for recognizing mood, a model for judging text intent, and so on. Here, the target text has the same source as the sample texts in the first training sample set; having the same source may mean that the data come from the same application scenario, for example all from a smart speaker, all from a search engine, or all from an in-vehicle smart device.
Specifically, the executing body may use the machine learning algorithm to train an initial model (such as a recurrent neural network or a convolutional neural network), taking the sample texts of the second training sample set obtained in step 203 as input and the annotation information corresponding to the input sample texts as desired output, obtaining an actual output for each input sample text. Here, the actual output is what the initial model actually outputs. The executing body may then use methods such as gradient descent to adjust the parameters of the initial model based on the actual output and the desired output, take the model obtained after each parameter adjustment as the initial model for the next round of training, and end training when a preset training termination condition is met, thereby obtaining the trained text-processing model. In addition, the process may also include preprocessing steps such as word segmentation and stop-word removal on the sample texts, and the preprocessed words may be converted into word vectors based on machine learning models such as convolutional neural networks, recurrent neural networks or bag-of-words models.
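As a concrete stand-in for the training loop of step 204 — gradient descent driven by the gap between actual and desired output — the following sketch trains a bag-of-words logistic regression instead of the recurrent or convolutional network the embodiment mentions; the model choice, vocabulary and all names are illustrative assumptions:

```python
import math

def featurize(text, vocab):
    # Bag-of-words counts over a fixed vocabulary.
    words = text.split()
    return [words.count(v) for v in vocab]

def train_logreg(samples, vocab, epochs=200, lr=0.5):
    # Sample text as input, annotation as desired output; parameters
    # adjusted by gradient descent from the actual vs. desired output.
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text, vocab)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # actual output
            err = p - label                  # mismatch with desired output
            for i in range(len(w)):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

def predict(text, vocab, w, b):
    z = sum(wi * xi for wi, xi in zip(w, featurize(text, vocab))) + b
    return 1 if z > 0 else 0

vocab = ["happy", "sad"]
second_set = [("happy today", 1), ("so sad", 0),
              ("happy happy", 1), ("sad news", 0)]
w, b = train_logreg(second_set, vocab)
```

In the patent's terms, the loop plays the role of the initial model plus parameter adjustment; a fixed epoch count stands in for the preset training termination condition.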
In some optional implementations of this embodiment, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some optional implementations of this embodiment, counting the proportion of sample texts of each length in the first training sample set comprises: counting the proportion of sample texts of each length among the positive samples of the first training sample set, and counting the proportion of sample texts of each length among the negative samples; and extracting training samples from the first training sample set according to the counted proportions to obtain the second training sample set comprises: extracting positive samples from the positive samples included in the first training sample set according to the counted per-length proportions among positive samples, and extracting negative samples from the negative samples included in the first training sample set according to the counted per-length proportions among negative samples, to obtain the second training sample set.
In this implementation, when training the binary classification model, the per-length proportions of sample texts are adjusted separately within the positive and negative samples, so that the length distribution of texts in the positive and negative samples is closer to the length distribution of texts in the actual scenario, further improving the accuracy of the trained model's output.
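The per-class statistics of this implementation can be sketched by computing the length distribution separately for positive and negative samples; the layout and names are again illustrative assumptions:

```python
from collections import Counter

def length_shares_per_class(samples):
    # {label: {text length: share within that label's samples}}.
    # Positive (1) and negative (0) samples are counted separately,
    # so each class can then be extracted with its own proportions.
    shares = {}
    for label in sorted({l for _, l in samples}):
        texts = [t for t, l in samples if l == label]
        counts = Counter(len(t) for t in texts)
        shares[label] = {n: c / len(texts) for n, c in counts.items()}
    return shares

samples = [("abc", 1), ("hello", 1), ("hi", 0), ("no", 0)]
shares = length_shares_per_class(samples)
# {0: {2: 1.0}, 1: {3: 0.5, 5: 0.5}}
```

Each class's distribution would then drive its own proportional draw, as in the step 203 sketch.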
The method provided by the above embodiment of the present application obtains a first training sample set whose training samples include sample texts; counts, for each text length, the proportion of sample texts of that length in the first training sample set; extracts training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and uses a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set. This provides a model training mechanism that extracts training samples based on a concrete scenario, improving the accuracy of the model's output.
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for generating a model according to this embodiment. In the application scenario of Fig. 3, a server 301 may obtain, as training samples, texts converted from voice information collected by in-vehicle smart devices 302 and 303 together with the corresponding annotation information, thereby obtaining a first training sample set 304. It may then count, for each text length, the proportion of sample texts of that length in the first training sample set 304; extract training samples from the first training sample set 304 according to the counted proportions to obtain a second training sample set; and finally use a machine learning algorithm to train, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, a text-processing model for texts converted from voice information collected by the in-vehicle smart device 302, the in-vehicle smart device 303 or other in-vehicle smart devices.
With further reference to Fig. 4, a flow 400 of another embodiment of the method for generating a model is illustrated. The flow 400 of the method for generating a model comprises the following steps:
Step 401: obtain a first training sample set.
In this embodiment, the executing body of the method for generating a model (such as the server or terminal shown in Fig. 1) may first obtain a first training sample set.
Step 402: count, for each text length, the proportion of sample texts of that length in the first training sample set.
In this embodiment, the executing body may count the proportion of sample texts of each length in the first training sample set obtained in step 401.
Step 403: extract training samples from the first training sample set according to the counted proportions.
In this embodiment, the executing body may extract training samples from the first training sample set according to the proportions counted in step 402.
Step 404: remove, from the extracted training samples, any training sample that contains a predetermined keyword, to obtain the second training sample set.
In this embodiment, the executing body may remove, from the training samples extracted in step 403, any training sample that contains a predetermined keyword, to obtain the second training sample set. The keywords may include words likely to cause large deviations in the model being trained, for example words close in meaning to the annotation information.
In some optional implementations of this embodiment, removing the training samples that contain a predetermined keyword from the extracted training samples to obtain the second training sample set comprises: removing, from the extracted negative samples, any training sample that contains a keyword of the positive samples, to obtain the second training sample set. A negative sample containing a keyword of the positive samples causes high similarity between some positive and negative samples, which easily leads to large deviations in the model being trained; removing such samples reduces the interference and further improves the accuracy of the output of the generated model. As an example, suppose the model to be trained judges whether the mood in a text is happy or unhappy, where happy texts are positive samples and unhappy texts are negative samples; negative samples containing the word "happy" can then be removed.
In some optional implementations of this embodiment, training samples in the extracted positive samples that contain a keyword of the negative samples may likewise be removed.
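Step 404's keyword filter, applied in the direction this implementation describes (dropping negatives that contain a positive-class keyword such as "happy"), might look like the following; the function name and pair layout are assumptions of this sketch:

```python
def remove_conflicting_negatives(samples, positive_keywords):
    # Keep all positive samples (label 1); drop any negative sample
    # (label 0) whose text contains a keyword of the positive class.
    kept = []
    for text, label in samples:
        if label == 0 and any(kw in text for kw in positive_keywords):
            continue
        kept.append((text, label))
    return kept

extracted = [("happy day", 0), ("bad day", 0), ("happy news", 1)]
second_set = remove_conflicting_negatives(extracted, ["happy"])
# → [("bad day", 0), ("happy news", 1)]
```

The symmetric case (removing positives that contain negative-class keywords) would swap the label test.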
Step 405: using a machine learning algorithm, train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output.
In this embodiment, the executing body may use a machine learning algorithm to train a text-processing model for the target text, taking the sample texts included in the second training sample set obtained in step 404 as input and the annotation information corresponding to the input sample texts as desired output.
In this embodiment, the operations of steps 401, 402, 403 and 405 are substantially the same as the operations of steps 201, 202, 203 and 204, and are not described again here.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for generating a model in this embodiment removes the training samples containing a predetermined keyword from the extracted training samples. The scheme described in this embodiment thus further improves the quality of the training samples and, as a result, further improves the accuracy of the output of the generated model.
With further reference to Fig. 5, as an implementation of the methods shown in the figures above, the present application provides an embodiment of an apparatus for generating a model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating a model of this embodiment comprises: an acquiring unit 501, a statistics unit 502, an extracting unit 503 and a training unit 504. The acquiring unit is configured to obtain a first training sample set, where the training samples in the first training sample set include sample texts; the statistics unit is configured to count, for each text length, the proportion of sample texts of that length in the first training sample set; the extracting unit is configured to extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and the training unit is configured to use a machine learning algorithm to train a text-processing model for a target text, taking the sample texts included in the second training sample set as input and the annotation information corresponding to the input sample texts as desired output, where the target text has the same source as the sample texts in the first training sample set.
In the present embodiment, the specific processing of the acquiring unit 501, the statistic unit 502, the extracting unit 503, and the training unit 504 of the apparatus 500 for generating a model may refer to step 201, step 202, step 203, and step 204 in the embodiment corresponding to Fig. 2.
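The counting performed by the statistic unit can be sketched as follows: group the sample texts of the first training sample set by length and compute each group's share of the whole set. The coarse short/medium/long bucketing and the helper names are assumptions, since the embodiment does not fix a particular length granularity.

```python
# Sketch of the statistic unit: count the proportion each text-length
# group occupies in the first training sample set. Bucket thresholds
# are illustrative assumptions, not taken from the patent.
from collections import Counter

def length_bucket(text):
    """Coarse length grouping by word count (illustrative thresholds)."""
    n = len(text.split())
    if n <= 3:
        return "short"
    if n <= 10:
        return "medium"
    return "long"

def count_length_proportions(first_training_set):
    """first_training_set: list of (sample_text, markup) pairs."""
    buckets = Counter(length_bucket(text) for text, _ in first_training_set)
    total = sum(buckets.values())
    return {bucket: count / total for bucket, count in buckets.items()}

proportions = count_length_proportions([
    ("ok", 1),
    ("fine by me", 0),
    ("this one is a bit longer than that", 1),
    ("no", 0),
])
print(proportions)  # -> {'short': 0.75, 'medium': 0.25}
```

The resulting proportions are exactly what the extracting unit needs in the next step to draw a second set that mirrors the first set's length distribution.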
In some optional implementations of the present embodiment, the extracting unit comprises: an extracting subunit, configured to extract training samples from the first training sample set according to the counted proportions; and a removal unit, configured to remove, from the extracted training samples, the training samples containing predetermined keywords, to obtain the second training sample set.
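A sketch of the removal unit's filtering step, under the assumptions that a sample is a (text, markup) pair and that "contains a predetermined keyword" means simple substring matching; the keyword list itself is purely illustrative.

```python
# Sketch of the removal step: after extraction, drop any training
# sample whose text contains a predetermined keyword. The keyword set
# and matching rule (substring test) are illustrative assumptions.
PREDETERMINED_KEYWORDS = {"advertisement", "spam"}

def remove_keyword_samples(extracted_samples, keywords=PREDETERMINED_KEYWORDS):
    """Keep only the samples whose text contains none of the keywords."""
    return [
        (text, markup)
        for text, markup in extracted_samples
        if not any(keyword in text for keyword in keywords)
    ]

second_set = remove_keyword_samples([
    ("buy now limited spam offer", 0),
    ("the weather is nice today", 1),
])
print(second_set)  # only the keyword-free sample remains
```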
In some optional implementations of the present embodiment, the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
In some optional implementations of the present embodiment, the statistic unit is further configured to: count the proportions of sample texts of different lengths among the positive samples of the first training sample set, and count the proportions of sample texts of different lengths among the negative samples of the first training sample set. The extracting unit is further configured to: extract positive samples from the positive samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the positive samples, and extract negative samples from the negative samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the negative samples, to obtain the second training sample set.
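The class-wise extraction described above can be sketched as stratified sampling: length proportions are taken separately over the positive and the negative samples, and each class is then drawn per length group so that the second set preserves that class's own length distribution. The bucketing, the quota rounding, and the function names are assumptions, not the patented procedure.

```python
# Sketch of class-wise, length-preserving extraction. Positives and
# negatives are sampled separately; within each class, per-length-group
# quotas are proportional to that group's share. Illustrative only.
import random
from collections import defaultdict

def extract_preserving_lengths(samples, n_out, bucket, rng):
    """Draw about n_out samples while keeping the length distribution."""
    by_bucket = defaultdict(list)
    for sample in samples:
        by_bucket[bucket(sample[0])].append(sample)
    total = len(samples)
    out = []
    for group in by_bucket.values():
        quota = round(n_out * len(group) / total)  # proportional quota
        out.extend(rng.sample(group, min(quota, len(group))))
    return out

def build_second_set(first_set, n_pos, n_neg, bucket, seed=0):
    """Extract positives and negatives separately, per the text above."""
    rng = random.Random(seed)
    positives = [s for s in first_set if s[1] == 1]
    negatives = [s for s in first_set if s[1] == 0]
    return (extract_preserving_lengths(positives, n_pos, bucket, rng)
            + extract_preserving_lengths(negatives, n_neg, bucket, rng))

# Hypothetical first training sample set: (sample_text, label) pairs.
first_set = [
    ("a b", 1), ("c d", 1),
    ("one two three four five", 1), ("six seven eight nine ten", 1),
    ("e f", 0), ("g h", 0),
    ("p q r s t", 0), ("u v w x y", 0),
]
bucket = lambda text: "short" if len(text.split()) <= 3 else "long"
second_set = build_second_set(first_set, n_pos=2, n_neg=2, bucket=bucket)
```

With two short and two long samples per class and a quota of two per class, the sketch keeps one short and one long sample from each class, mirroring each class's original half-and-half length split.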
In some optional implementations of the present embodiment, the removal unit is further configured to: remove, from the extracted negative samples, the training samples containing keywords that appear in the positive samples, to obtain the second training sample set.
The apparatus provided by the above embodiment of the present application acquires a first training sample set, the training samples in which include sample texts; counts the proportions of sample texts of different lengths in the first training sample set; extracts training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and, using a machine learning algorithm, takes the sample texts included in the second training sample set as input and the markup information corresponding to the input sample texts as the desired output to train a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set. This provides a model training mechanism that extracts training samples based on the concrete scenario, thereby improving the accuracy of the model's output.
Referring now to Fig. 6, a schematic structural diagram of a computer system 600 suitable for implementing a server or terminal of the embodiments of the present application is shown. The server or terminal shown in Fig. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required by the operations of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processes via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, may be mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which comprises a computer program carried on a computer-readable medium. The computer program comprises program code for executing the method as illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are executed. It should be noted that the computer-readable medium described herein may be a computer-readable signal medium, a computer-readable medium, or any combination of the two. The computer-readable medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, the computer-readable medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable medium described above, and may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the C language or similar programming languages. The program code may be executed entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In a scenario involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may, in fact, be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software or by means of hardware. The described units may also be provided in a processor; for example, a processor may be described as comprising an acquiring unit, a statistic unit, an extracting unit, and a training unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit configured to acquire a first training sample set".
In another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the apparatus, the apparatus is caused to: acquire a first training sample set, the training samples in the first training sample set including sample texts; count the proportions of sample texts of different lengths in the first training sample set; extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and, using a machine learning algorithm, take the sample texts included in the second training sample set as input and the markup information corresponding to the input sample texts as the desired output to train a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should appreciate that the scope of the invention involved in the present application is not limited to the technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.
Claims (12)
1. A method for generating a model, comprising:
acquiring a first training sample set, the training samples in the first training sample set including sample texts;
counting proportions of sample texts of different lengths in the first training sample set;
extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and
using a machine learning algorithm, taking the sample texts included in the second training sample set as input and markup information corresponding to the input sample texts as desired output, training a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set.
2. The method according to claim 1, wherein the extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set comprises:
extracting training samples from the first training sample set according to the counted proportions; and
removing, from the extracted training samples, the training samples containing predetermined keywords, to obtain the second training sample set.
3. The method according to claim 2, wherein the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
4. The method according to claim 3, wherein the counting proportions of sample texts of different lengths in the first training sample set comprises:
counting proportions of sample texts of different lengths among the positive samples of the first training sample set, and counting proportions of sample texts of different lengths among the negative samples of the first training sample set; and
the extracting training samples from the first training sample set according to the counted proportions to obtain a second training sample set comprises:
extracting positive samples from the positive samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the positive samples, and extracting negative samples from the negative samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the negative samples, to obtain the second training sample set.
5. The method according to claim 3 or 4, wherein the removing, from the extracted training samples, the training samples containing predetermined keywords to obtain the second training sample set comprises:
removing, from the extracted negative samples, the training samples containing keywords that appear in the positive samples, to obtain the second training sample set.
6. An apparatus for generating a model, comprising:
an acquiring unit, configured to acquire a first training sample set, the training samples in the first training sample set including sample texts;
a statistic unit, configured to count proportions of sample texts of different lengths in the first training sample set;
an extracting unit, configured to extract training samples from the first training sample set according to the counted proportions to obtain a second training sample set; and
a training unit, configured to use a machine learning algorithm, taking the sample texts included in the second training sample set as input and markup information corresponding to the input sample texts as desired output, to train a text-processing model for a target text, the target text having the same source as the sample texts in the first training sample set.
7. The apparatus according to claim 6, wherein the extracting unit comprises:
an extracting subunit, configured to extract training samples from the first training sample set according to the counted proportions; and
a removal unit, configured to remove, from the extracted training samples, the training samples containing predetermined keywords, to obtain the second training sample set.
8. The apparatus according to claim 7, wherein the text-processing model includes a binary classification model, and the training samples in the first training sample set include positive samples and negative samples.
9. The apparatus according to claim 8, wherein the statistic unit is further configured to:
count proportions of sample texts of different lengths among the positive samples of the first training sample set, and count proportions of sample texts of different lengths among the negative samples of the first training sample set; and
the extracting unit is further configured to:
extract positive samples from the positive samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the positive samples, and extract negative samples from the negative samples included in the first training sample set according to the counted proportions of sample texts of different lengths among the negative samples, to obtain the second training sample set.
10. The apparatus according to claim 8 or 9, wherein the removal unit is further configured to:
remove, from the extracted negative samples, the training samples containing keywords that appear in the positive samples, to obtain the second training sample set.
11. An electronic device, comprising:
one or more processors; and
a storage device on which one or more programs are stored;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.
12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910312916.3A CN110046254B (en) | 2019-04-18 | 2019-04-18 | Method and apparatus for generating a model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110046254A true CN110046254A (en) | 2019-07-23 |
CN110046254B CN110046254B (en) | 2022-03-08 |
Family
ID=67277799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910312916.3A Active CN110046254B (en) | 2019-04-18 | 2019-04-18 | Method and apparatus for generating a model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110046254B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202177A (en) * | 2016-06-27 | 2016-12-07 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
US20170004269A1 (en) * | 2015-06-30 | 2017-01-05 | BWW Holdings, Ltd. | Systems and methods for estimating mental health assessment results |
CN106780552A (en) * | 2016-11-08 | 2017-05-31 | 西安电子科技大学 | Anti-shelter target tracking based on regional area joint tracing detection study |
CN107562742A (en) * | 2016-06-30 | 2018-01-09 | 苏宁云商集团股份有限公司 | A kind of image processing method and device |
JP2018025956A (en) * | 2016-08-09 | 2018-02-15 | 日本電信電話株式会社 | Model creation device, estimation device, method, and program |
CN108062563A (en) * | 2017-12-12 | 2018-05-22 | 华东理工大学 | A kind of representative sample based on classification equilibrium finds method |
CN108287816A (en) * | 2017-01-10 | 2018-07-17 | 腾讯科技(深圳)有限公司 | Point of interest on-line checking, Machine learning classifiers training method and device |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109165658A (en) * | 2018-08-28 | 2019-01-08 | 哈尔滨工业大学(威海) | A kind of strong negative sample underwater target detection method based on Faster-RCNN |
CN109492764A (en) * | 2018-10-24 | 2019-03-19 | 平安科技(深圳)有限公司 | Training method, relevant device and the medium of production confrontation network |
- 2019-04-18 CN CN201910312916.3A patent/CN110046254B/en active Active
Non-Patent Citations (3)
Title |
---|
KIICHI TAGO et al.: "Influence analysis of emotional behaviors and user relationships based on Twitter data", Tsinghua Science and Technology *
ZHOU Yongzhang et al.: "Big Data Mining and Machine Learning in Geosciences", Sun Yat-sen University Press, 30 September 2018 *
LI Long: "Research on Methods for Identifying Suicidal Tendencies in Online Text", China Masters' Theses Full-text Database (Information Science and Technology Series) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851708A (en) * | 2019-10-16 | 2020-02-28 | 中国平安人寿保险股份有限公司 | Negative sample extraction method and device, computer equipment and storage medium |
CN110851708B (en) * | 2019-10-16 | 2023-11-03 | 中国平安人寿保险股份有限公司 | Negative sample extraction method, device, computer equipment and storage medium |
CN111143514A (en) * | 2019-12-27 | 2020-05-12 | 北京百度网讯科技有限公司 | Method and apparatus for generating information |
CN111143514B (en) * | 2019-12-27 | 2023-03-21 | 北京百度网讯科技有限公司 | Method and apparatus for generating information |
CN113111173A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm receiving warning condition category determination method and device |
CN113111234A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm condition category determination method and device |
CN111709247B (en) * | 2020-05-20 | 2023-04-07 | 北京百度网讯科技有限公司 | Data set processing method and device, electronic equipment and storage medium |
CN111709247A (en) * | 2020-05-20 | 2020-09-25 | 北京百度网讯科技有限公司 | Data set processing method and device, electronic equipment and storage medium |
CN112396047A (en) * | 2020-10-30 | 2021-02-23 | 北京文思海辉金信软件有限公司 | Training sample generation method and device, computer equipment and storage medium |
CN112613572A (en) * | 2020-12-30 | 2021-04-06 | 北京奇艺世纪科技有限公司 | Sample data obtaining method and device, electronic equipment and storage medium |
CN112613572B (en) * | 2020-12-30 | 2024-01-23 | 北京奇艺世纪科技有限公司 | Sample data obtaining method and device, electronic equipment and storage medium |
CN112836013A (en) * | 2021-01-29 | 2021-05-25 | 北京大米科技有限公司 | Data labeling method and device, readable storage medium and electronic equipment |
CN114091427A (en) * | 2021-11-19 | 2022-02-25 | 海信电子科技(武汉)有限公司 | Image text similarity model training method and display equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110046254B (en) | 2022-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110046254A (en) | Method and apparatus for generating model | |
CN108764487A (en) | For generating the method and apparatus of model, the method and apparatus of information for identification | |
CN108831505B (en) | Method and device for identifying use scenes of application | |
CN107491534A (en) | Information processing method and device | |
CN108986805B (en) | Method and apparatus for sending information | |
CN110019742B (en) | Method and device for processing information | |
CN109299477A (en) | Method and apparatus for generating text header | |
CN109635095A (en) | Method and apparatus for optimizing dialog model | |
CN108121800A (en) | Information generating method and device based on artificial intelligence | |
CN108933730A (en) | Information-pushing method and device | |
CN112650841A (en) | Information processing method and device and electronic equipment | |
CN109829164A (en) | Method and apparatus for generating text | |
CN108897853A (en) | The method and apparatus for generating pushed information | |
CN109582954A (en) | Method and apparatus for output information | |
CN109214501A (en) | The method and apparatus of information for identification | |
CN108959087A (en) | test method and device | |
CN110516261A (en) | Resume appraisal procedure, device, electronic equipment and computer storage medium | |
CN109543068A (en) | Method and apparatus for generating the comment information of video | |
CN109284367A (en) | Method and apparatus for handling text | |
CN108877779A (en) | Method and apparatus for detecting voice tail point | |
CN109117758A (en) | Method and apparatus for generating information | |
CN110245334A (en) | Method and apparatus for output information | |
CN109919220A (en) | Method and apparatus for generating the feature vector of video | |
CN109829431A (en) | Method and apparatus for generating information | |
CN110675865B (en) | Method and apparatus for training hybrid language recognition models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | | Effective date of registration: 20211012. Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, Economic and Technological Development Zone, Daxing District, Beijing. Applicant after: Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing. Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd. |
GR01 | Patent grant | ||