CN109284763A - Method and server for generating word segmentation training data - Google Patents
Method and server for generating word segmentation training data
- Publication number
- CN109284763A (application CN201710589616.0A)
- Authority
- CN
- China
- Prior art keywords
- mark
- text
- identifier
- processed
- cutting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
This application provides a method and server for generating word segmentation training data. The method comprises: performing word segmentation on a text to be processed to determine the fields in the text that have segmentation ambiguity; labeling each character in a field that has segmentation ambiguity with multiple segmentation position identifiers; and using the text labeled with segmentation position identifiers as training data for a word segmentation model. The scheme addresses the problem that existing approaches also apply complete labeling to fields with segmentation ambiguity and therefore require manual labeling. It can thus effectively save labor cost and generate word segmentation training data efficiently while keeping the training data valid.
Description
Technical field
This application belongs to the field of computer technology, and in particular relates to a method and server for generating word segmentation training data.
Background
At present, word segmentation is usually performed by training a word segmentation model and applying it to the text to be segmented. Training such a model requires a large amount of word segmentation training data, and each training text must be labeled in advance with its correct segmentation. For example, if the text "Space No.1 basketball shoes" is to be used as training data, its correct segmentation must be labeled before the text can serve as valid training data.
Some fairly simple texts can be labeled automatically by machine, but other texts are ambiguous. For example, in "female's birthday gift" (女生日礼物), the character "生" can be grouped with "女" to form "schoolgirl" (女生), or with "日礼物" to form "birthday gift" (生日礼物). A machine can hardly label such text, so texts with segmentation ambiguity can only be labeled manually.
When the amount of data is very large, manual labeling is very costly. No effective solution to this problem has yet been proposed.
Summary of the invention
The purpose of this application is to provide a method and server for generating word segmentation training data, so that training data can be produced without manual labeling while the validity of the training data is guaranteed.
The method and server for generating word segmentation training data provided by this application are implemented as follows:
A method for generating word segmentation training data, the method comprising:
performing word segmentation on a text to be processed to determine the fields in the text that have segmentation ambiguity;
labeling each character in a field that has segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, once labeled with segmentation position identifiers, as training data for a word segmentation model.
A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor implements the following steps when executing the instructions:
performing word segmentation on a text to be processed to determine the fields in the text that have segmentation ambiguity;
labeling each character in a field that has segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, once labeled with segmentation position identifiers, as training data for a word segmentation model.
A method for generating word segmentation training data, the method comprising:
labeling, based on one or more of a user query dictionary and a product dictionary, each character in a field that has segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, once labeled with segmentation position identifiers, as training data for a word segmentation model.
A computer-readable storage medium storing computer instructions that, when executed, implement the steps of the above method.
With the method and server provided by this application, after a text to be processed is obtained, if the text contains a field with segmentation ambiguity, the characters in that field are labeled with multiple segmentation position identifiers instead of a single complete label. This solves the problem that existing approaches also apply complete labeling to fields with segmentation ambiguity and therefore require manual labeling. Labor cost is effectively saved, and word segmentation training data is generated efficiently while its validity is guaranteed.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for generating word segmentation training data provided by this application;
Fig. 2 is a schematic diagram of the labeling result obtained by labeling texts to be processed based on space information, as provided by this application;
Fig. 3 is a schematic diagram of the labeling result obtained by labeling texts to be processed using partial labeling, as provided by this application;
Fig. 4 is a flowchart comparing the method for obtaining word segmentation training data provided by this application with the existing method;
Fig. 5 is a schematic structural diagram of an embodiment of a terminal provided by this application.
Specific embodiment
To help those skilled in the art better understand the technical solutions in this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort shall fall within the scope of protection of this application.
In existing approaches to generating word segmentation training data, places with overlapping ambiguity must be labeled manually. Manual labeling requires annotators who understand the segmentation standard very clearly and who apply it consistently throughout the labeling process. It also takes a great deal of manpower and time, so the labeling cost is relatively high.
Existing labeling mainly uses a complete labeling method with the following identifiers:
1) start identifier: identifies the first character of a word obtained by segmentation; it can be written as Begin, abbreviated B;
2) end identifier: identifies the last character of a word obtained by segmentation; it can be written as End, abbreviated E;
3) intermediate identifier: identifies a middle character of a word obtained by segmentation; it can be written as Internal, abbreviated I;
4) single-character identifier: identifies a single character that forms a word by itself; it can be written as Single, abbreviated S.
For example, the correct segmentation of "red one-piece dress summer" (红色连衣裙夏) is "red + one-piece dress + summer" (红色 + 连衣裙 + 夏). Under this segmentation the characters are labeled as follows: "红" (red) B, "色" (color) E, "连" B, "衣" I, "裙" E, "夏" (summer) S.
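The complete labeling scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `complete_labels` is hypothetical, and the Chinese string is an assumed reconstruction of the example that the translated text renders as "red one-piece dress summer".

```python
def complete_labels(words):
    """Give every character exactly one of B, E, I, S for a known segmentation."""
    labeled = []
    for word in words:
        if len(word) == 1:
            # A single character that forms a word by itself.
            labeled.append((word, "S"))
            continue
        labeled.append((word[0], "B"))       # first character of the word
        for ch in word[1:-1]:
            labeled.append((ch, "I"))        # characters in the middle
        labeled.append((word[-1], "E"))      # last character of the word
    return labeled


# Assumed original of "red + one-piece dress + summer".
print(complete_labels(["红色", "连衣裙", "夏"]))
```

Each character receives exactly one label, which is why this scheme needs the correct segmentation to be known in advance.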
In the complete labeling method above, each character in the generated training data has exactly one label. That is, every character of the training data must be labeled with exactly one of B, E, I and S. Under this requirement, when overlapping ambiguity occurs, as in "female's birthday gift", the machine cannot reliably determine the segmentation from its segmentation procedure and therefore cannot label the text completely. Such content must be picked out and labeled manually.
A word segmentation model is then trained on the completely labeled training data.
This example starts from the observation that manual labeling is needed mainly because the segmentation of some strings is hard to determine while every character is still required to carry one specific, unique label. This example therefore provides a labeling method that no longer requires every character to be labeled completely: characters in a string with segmentation ambiguity can be labeled incompletely, that is, one character can carry two or more labels.
Fig. 1 is a flowchart of one embodiment of the method for generating word segmentation training data described in this application. Although this application provides the method steps or device structures shown in the following embodiments or drawings, the method or device may include more or fewer steps or module units based on routine work or without creative effort. For steps or structures that have no logically necessary causal relationship, the execution order of the steps or the module structure of the device is not limited to what is described in the embodiments or shown in the drawings. When applied in an actual device or end product, the method or module structure can be executed sequentially or in parallel according to the embodiments or drawings (for example, in an environment with parallel processors or multithreading, or even in a distributed processing environment).
Specifically, as shown in Fig. 1, a method for generating word segmentation training data provided by one embodiment of this application may include:
S1: determine the fields in the text to be processed that have segmentation ambiguity;
A field with segmentation ambiguity is a field that has several possible segmentation results in which the characters of different candidate words cross each other.
For example, suppose the text to be processed is "female's birthday gift intention is practical" (女生日礼物创意实用), and the dictionary contains the following words related to this text: schoolgirl (女生), birthday (生日), birthday gift (生日礼物), present (礼物), intention (创意), practical (实用). Segmenting the text with dictionary-based forward maximum matching can proceed as follows:
1) Starting from the first character of the text and moving forward (that is, toward the second character, from "女" toward "生"), find the longest string that is in the dictionary and record it. Starting from "女" (female), forward maximum matching identifies "schoolgirl" (女生);
2) Starting from the second character "生" (life), apply the same procedure as for "女". Both "birthday" and "birthday gift" are in the dictionary, so by the longest-match principle the result is "birthday gift";
3) Starting from the third character "日" (day), apply the same procedure. No string starting here is in the dictionary, so this position is skipped and nothing is recorded;
4) The subsequent characters are handled in the same way, which identifies in turn: present, intention, practical.
The words identified in the text are therefore: schoolgirl, birthday gift, intention, practical. In "female's birthday gift" (女生日礼物), the words "schoolgirl" (女生) and "birthday gift" (生日礼物) found by dictionary-based forward maximum matching cross each other, so the machine cannot directly decide how to segment and label this part. "female's birthday gift" is therefore a field with segmentation ambiguity. By contrast, "intention" and "practical" each have only one possible segmentation and cross no other word, so they are fields without segmentation ambiguity.
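The detection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `longest_matches` and `ambiguous_fields` are hypothetical, the dictionary is assumed to be a non-empty set of strings, and the Chinese text is an assumed reconstruction of the translated example. Following the steps as written, the sketch records the longest dictionary word at every start position, then treats two recorded words as ambiguous when they cross, meaning they overlap without one containing the other.

```python
def longest_matches(text, dictionary):
    """For each start index, record the longest dictionary word beginning there."""
    max_len = max(map(len, dictionary))
    spans = []
    for start in range(len(text)):
        # Try the longest candidate first and stop at the first hit.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary:
                spans.append((start, end))
                break
    return spans


def ambiguous_fields(spans):
    """Merge crossing word spans into (start, end) fields with ambiguity."""
    crossing = [
        (s1, e2)
        for i, (s1, e1) in enumerate(spans)
        for s2, e2 in spans[i + 1:]
        if s1 < s2 < e1 < e2  # overlap without containment
    ]
    merged = []
    for start, end in sorted(crossing):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Assumed original of "female's birthday gift intention is practical".
text = "女生日礼物创意实用"
words = {"女生", "生日", "生日礼物", "礼物", "创意", "实用"}
spans = longest_matches(text, words)
for start, end in ambiguous_fields(spans):
    print(text[start:end])
```

Note that "present" (礼物) lies entirely inside "birthday gift" (生日礼物), so the pair is not counted as crossing; only "schoolgirl" and "birthday gift" are.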
In one embodiment, the text to be processed can be segmented using dictionary-based forward maximum matching to determine whether it contains segments with segmentation ambiguity. The dictionary used for forward maximum matching can be the dictionary of an e-commerce platform, of a search engine, of the news domain, and so on. So that the trained model segments text of a given domain reasonably, the dictionary of that domain can be chosen. Dictionaries from several domains can also be combined. The choice can be made according to actual needs, and this application does not limit it.
S2: label each character in a field that has segmentation ambiguity with multiple segmentation position identifiers;
Fields determined to have no segmentation ambiguity can be labeled in the conventional way. For example, "intention" and "practical" in the example above have no segmentation ambiguity, so they are labeled directly in the normal way. Assume the segmentation position identifiers are:
1) B, identifying the first character of a word;
2) E, identifying the last character of a word;
3) I, identifying a character in the middle of a word;
4) S, identifying a single character that forms a word by itself.
Then, in "intention" (创意), "创" is labeled B and "意" is labeled E; in "practical" (实用), "实" is labeled B and "用" is labeled E. As another example, in "flat-bottomed shoes" (平底鞋), "平" is labeled B, "底" is labeled I and "鞋" is labeled E.
Fields determined to have segmentation ambiguity can be labeled according to a preset labeling scheme. Because a machine cannot label such fields accurately, partial labeling can be used: each character in a field with segmentation ambiguity can be given more than one segmentation position identifier, so that a word segmentation model can still be trained effectively when the text is later used as training data.
In one embodiment, each character in a field with segmentation ambiguity can be labeled with multiple segmentation position identifiers as follows:
1) the first character of the field is labeled with the start identifier or the single-character identifier;
2) the last character of the field is labeled with the end identifier or the single-character identifier;
3) every other character of the field is labeled with the start identifier, the end identifier, the intermediate identifier or the single-character identifier.
For example, in "female's birthday gift" (女生日礼物), "schoolgirl" and "birthday gift" cross each other, so this is a field with segmentation ambiguity. Under the scheme above, "女" (female) is labeled B/S, "生" (life) is labeled B/E/I/S, "日" (day) is labeled B/E/I/S, "礼" (gift) is labeled B/E/I/S, and "物" (object) is labeled E/S.
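The three rules above map directly onto a small function. The sketch below is illustrative only; `partial_labels` is a hypothetical name, and the Chinese field is an assumed reconstruction of "female's birthday gift".

```python
def partial_labels(field):
    """Give each character of an ambiguous field its set of allowed labels."""
    last = len(field) - 1
    labeled = []
    for i, ch in enumerate(field):
        if i == 0:
            allowed = {"B", "S"}            # first character: start or single
        elif i == last:
            allowed = {"E", "S"}            # last character: end or single
        else:
            allowed = {"B", "E", "I", "S"}  # anything is possible in between
        labeled.append((ch, allowed))
    return labeled


# Assumed original of the ambiguous field "female's birthday gift".
for ch, allowed in partial_labels("女生日礼物"):
    print(ch, "/".join(sorted(allowed)))
```

The outer characters are constrained because the field boundaries are known word boundaries; only the interior remains fully open.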
The above is only one illustrative way of labeling a field with segmentation ambiguity; other schemes can be used in practice. For example, each character in the field can be labeled with the positions it takes in all possible segmentation schemes. The possible segmentation schemes of "female's birthday gift" (女生日礼物) are: female / birthday gift (女 / 生日礼物) and schoolgirl / day / present (女生 / 日 / 礼物). Accordingly, "女" is labeled B/S, "生" is labeled B/E, "日" is labeled I/S, "礼" is labeled B/I, and "物" is labeled E.
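This alternative scheme amounts to labeling each candidate segmentation completely and taking the union of labels per character. A minimal sketch, assuming the candidate segmentations are already known and cover the same string; `tag_words` and `union_labels` are hypothetical names, and the Chinese strings are assumed reconstructions of the translated example.

```python
def tag_words(words):
    """Return one B/E/I/S label per character for a single segmentation."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def union_labels(segmentations):
    """Collect, per character position, every label seen in any segmentation."""
    per_char = None
    for words in segmentations:
        tags = tag_words(words)
        if per_char is None:
            per_char = [set() for _ in tags]
        for allowed, tag in zip(per_char, tags):
            allowed.add(tag)
    return per_char


# The two candidate segmentations named in the text (assumed originals).
candidates = [["女", "生日礼物"], ["女生", "日", "礼物"]]
for ch, allowed in zip("女生日礼物", union_labels(candidates)):
    print(ch, "/".join(sorted(allowed)))
```

Compared with the first-character/last-character rule, this produces tighter label sets, at the cost of having to enumerate the candidate segmentations.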
Note, however, that the labeling schemes listed above are only illustrative. In practice, a suitable scheme for labeling fields with segmentation ambiguity can be chosen as needed; this application does not limit it.
S3: use the text to be processed, once labeled with segmentation position identifiers, as training data for a word segmentation model.
Labeling the text to be processed in the way described above yields the labeled text, which serves as training data for the word segmentation model. For example, the labeled text can be used as training data for a CRF model, and training that model yields the word segmentation model.
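The text does not say how a CRF is trained when a character carries several allowed labels. One common option, shown here purely as an assumption and not as the patent's method, is to maximize the total probability of all label sequences that stay inside the allowed sets. The sketch below computes that loss for a linear-chain CRF in pure Python; all names (`constrained_log_z`, `partial_label_loss`, the score containers) are hypothetical, and every allowed set is assumed to be non-empty.

```python
import math

LABELS = ["B", "E", "I", "S"]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def constrained_log_z(emissions, transitions, allowed):
    """Forward algorithm restricted to label sequences inside `allowed`.

    emissions[t][y] is the score of label y at position t,
    transitions[p][y] is the score of moving from label p to label y.
    """
    alpha = {y: emissions[0][y] for y in allowed[0]}
    for t in range(1, len(emissions)):
        alpha = {
            y: emissions[t][y]
            + logsumexp([alpha[p] + transitions[p][y] for p in alpha])
            for y in allowed[t]
        }
    return logsumexp(list(alpha.values()))


def partial_label_loss(emissions, transitions, allowed):
    """Negative log of the probability mass on sequences that fit the partial labels."""
    full = [set(LABELS)] * len(emissions)
    return constrained_log_z(emissions, transitions, full) - constrained_log_z(
        emissions, transitions, allowed
    )
```

With complete labels (one label per character) this reduces to the usual CRF negative log-likelihood, and with B/E/I/S allowed everywhere the loss is zero, so fully unconstrained characters contribute no training signal of their own.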
In one embodiment, the texts to be processed can be obtained by taking the user search log and extracting multiple search requests (queries) from it. These search requests are then used as texts to be processed, which yields multiple pieces of training data.
When entering a search request, users sometimes type a space or similar separator to indicate that the text should be split at that position. For example, when a user enters "summer women's dress" with a space, this generally means that "summer" and "women's dress" are separate in the text and should be split there. The character immediately before the separator can be taken to be the last character of a word or a single-character word, and the character immediately after the separator can be taken to be the first character of a word or a single-character word. Using this deliberate user behavior as a basis for segmentation labeling can effectively improve the accuracy of automatic machine labeling.
Therefore, to improve the accuracy of automatic machine labeling, before the fields with and without segmentation ambiguity are determined, it can be determined whether the text to be processed contains a separator entered by the user. If it does, the two characters immediately before and after the separator are labeled with segmentation position identifiers.
In one embodiment, labeling the two characters before and after the separator may include:
1) labeling the first character after the separator with the start identifier or the single-character identifier;
2) labeling the last character before the separator with the end identifier or the single-character identifier.
Characters that are not adjacent to a separator can all be labeled B/E/I/S. The separator-based labels can then serve as a reference for splitting when forward maximum matching is performed. For example, if the text "female's birthday gift intention is practical" is entered with a space after "female", the space-based labeling shows that "female" is a single-character word or the last character of a word. Segmentation can then proceed automatically and give "female / birthday gift / intention / practical". This reduces the chance of segmentation ambiguity and makes the labeled training data more accurate.
The separator can include, but is not limited to, at least one of: space, underscore, hyphen, comma, semicolon. Any symbol that splits a string can serve as a separator.
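Separator-based labeling can be sketched as below. The names `SEPARATORS` and `separator_labels` are hypothetical, the separator set covers only the ASCII forms of the symbols listed above, and the query string is an assumed reconstruction of the example with a space after "female". One detail is an assumption of this sketch rather than something the text states: a character that sits between two separators gets the intersection of both rules, which leaves only S.

```python
SEPARATORS = set(" _-,;")  # space, underscore, hyphen, comma, semicolon


def separator_labels(query):
    """Label each non-separator character with its allowed B/E/I/S labels."""
    chars, labels = [], []
    after_separator = False
    for ch in query:
        if ch in SEPARATORS:
            if labels:
                # The character just before a separator ends a word.
                labels[-1] = labels[-1] & {"E", "S"}
            after_separator = True
            continue
        allowed = {"B", "E", "I", "S"}
        if after_separator:
            # The character just after a separator starts a word.
            allowed &= {"B", "S"}
        chars.append(ch)
        labels.append(allowed)
        after_separator = False
    return list(zip(chars, labels))


# Assumed form of the query with a user-typed space after the first character.
for ch, allowed in separator_labels("女 生日礼物创意实用"):
    print(ch, "/".join(sorted(allowed)))
```

The separators themselves are dropped from the output, so the result lines up character by character with the text that is later matched against the dictionary.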
In the e-commerce search domain, the available dictionaries include a user query dictionary and a product dictionary. So that the trained word segmentation model can be applied to e-commerce, when generating training data, each character in a field with segmentation ambiguity can be labeled with multiple segmentation position identifiers based on one or more of the user query dictionary and the product dictionary; the labeled text is then used as training data for the word segmentation model.
The user query dictionary can be built from the search terms users enter in the search box. The product dictionary can be built from product titles, product categories, product descriptions and similar content.
The method for generating word segmentation training data is illustrated below with a concrete scenario. Note that this specific embodiment serves only to explain the application better and does not unduly limit it.
The three user search requests shown in Table 1 are taken as texts to be processed:
Table 1
Type | Data |
query | Female's birthday gift intention is practical |
query | Sandals female's summer is flat |
query | The grey autumn and winter |
The texts to be processed in Table 1 can be handled in the following steps to generate word segmentation training data:
S1: perform partial labeling using spaces:
In this example, the following identifiers are used:
1) start identifier: identifies the first character of a word obtained by segmentation; it can be written as Begin, abbreviated B;
2) end identifier: identifies the last character of a word obtained by segmentation; it can be written as End, abbreviated E;
3) intermediate identifier: identifies a middle character of a word obtained by segmentation; it can be written as Internal, abbreviated I;
4) single-character identifier: identifies a single character that forms a word by itself; it can be written as Single, abbreviated S.
Labeling a character whose position can be determined exactly is called complete labeling; labeling a character with all of its possible positions is called partial labeling.
Space-based labeling can proceed as follows:
1) the first character after the separator is labeled with the start identifier or the single-character identifier (B/S);
2) the last character before the separator is labeled with the end identifier or the single-character identifier (E/S);
3) all other characters are labeled B/E/I/S.
Applying space-based labeling to the three texts to be processed in Table 1 gives the labeling result shown in Fig. 2.
S2: label the segments without segmentation ambiguity using an existing e-commerce dictionary:
The existing e-commerce dictionary is used to identify all known words in the user search query. In one embodiment, words can be identified by forward maximum matching. If an identified word has no segmentation ambiguity in the query, its space-based labels are revised to complete labels, while the remaining ambiguous segments keep their partial labels.
As for the choice of dictionary, a general dictionary can be used as the base and combined with the dictionary of the target domain (for example an e-commerce dictionary, which mainly contains product categories, brands, attribute words and the like).
Forward maximum matching is illustrated with "female's birthday gift intention is practical" (女生日礼物创意实用):
1) The dictionary contains the following words related to the text: {"schoolgirl", "birthday", "birthday gift", "present", "intention", "practical"};
2) Starting from the first character of the text and moving forward (toward the second character), find the longest segment that is in the dictionary and record it. Starting from "female", "schoolgirl" is identified;
3) Starting from the second character "life", the same step is applied. Both "birthday" and "birthday gift" are in the dictionary, so by the longest-match principle "birthday gift" is selected;
4) Starting from the third character "day", no segment is in the dictionary, so this position is skipped and nothing is recorded. In the same way, "present", "intention" and "practical" are identified and recorded in turn.
As to whether "birthday gift" should be split into "birthday + present" or kept as "birthday gift", the uniform standard under the forward maximum principle is to keep "birthday gift". The segmentation ambiguity in "female's birthday gift" refers to the fact that "schoolgirl" and "birthday gift", both identified by forward maximum matching, cross each other, so it cannot be decided how to segment.
S3: perform partial labeling on the segments with segmentation ambiguity.
Specifically, the first character of a field with segmentation ambiguity is labeled as the first character of a word or a single-character word; the last character of the field is labeled as the last character of a word or a single-character word; every other character of the field is labeled as the first character of a word, the last character of a word, a character in the middle of a word, or a single-character word.
The labeling in S2 and S3 above gives the labeling result shown in Fig. 3.
Module 3: train the CRF model on the partially labeled data.
Fig. 4 shows the existing method flow for obtaining word segmentation training data alongside the method flow of this example. As Fig. 4 shows, the existing approach obtains training data by manually applying complete labels to segments with segmentation ambiguity, which takes a great deal of manpower. In this example, by contrast, the machine labels segments with segmentation ambiguity incompletely, so training data is obtained without manual labeling. Applying the above operations to a large number of user search queries yields enough word segmentation training data, and feeding this data to the word segmentation model trains the final model used for segmentation.
In the example above, search queries are partially labeled using the spaces that users enter spontaneously. Because users type these spaces in order to find the products they want, this information is relatively accurate. Furthermore, characters in fields with segmentation ambiguity are labeled with multiple segmentation position identifiers, and characters in fields without segmentation ambiguity are labeled completely. Labeling is thus completed automatically while the quality of the training data is guaranteed, no manual labeling of ambiguous fields is needed, and labeling cost is saved.
The method embodiments provided by the above embodiments of this application can be executed in a server, a terminal or a similar computing device. Taking execution on a computer terminal as an example, Fig. 5 is a hardware block diagram of a terminal for the method for generating word segmentation training data according to an embodiment of this application. As shown in Fig. 5, the terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 can include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication. A person of ordinary skill in the art will understand that the structure shown in Fig. 5 is only illustrative and does not limit the structure of the above electronic device. For example, the terminal 10 may include more or fewer components than shown in Fig. 5, or have a configuration different from that shown in Fig. 5.
The memory 104 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the method for generating word segmentation training data in the embodiments of this application. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory can be connected to the terminal 10 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
The transmission module 106 is used to receive or send data via a network. A specific example of the above network may include a wireless network provided by the communication provider of the terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
The above terminal may be a terminal device or software that a customer operates and uses. Specifically, it may be a terminal device such as a smartphone, tablet computer, laptop, desktop computer, smartwatch, or other wearable device. It may also be software that runs on such a terminal device, for example application software such as Mobile Taobao, Alipay, or a browser.
When executing the instructions, the processor implements the following steps:
S1: performing word segmentation on the text to be processed to determine the fields in the text that have segmentation ambiguity;
S2: labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
S3: using the text to be processed, labeled with the segmentation position identifiers, as training data for a word segmentation model.
In one embodiment, the segmentation position identifiers may include, but are not limited to, at least one of the following: a start identifier, an end identifier, a middle identifier, and a single-character identifier.
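As an illustration of the four identifiers, the following minimal Python sketch labels an already segmented, unambiguous sentence with one identifier per character. The letters B, M, E and S (start, middle, end, single character) and the function name tag_words are assumptions made for this example; the application does not prescribe concrete symbols.

```python
from typing import List, Tuple


def tag_words(words: List[str]) -> List[Tuple[str, str]]:
    """Return (character, identifier) pairs for an already segmented sentence.

    Assumed shorthand: B = start, M = middle, E = end, S = single character.
    """
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty segments
        if len(word) == 1:
            tagged.append((word, "S"))
            continue
        tagged.append((word[0], "B"))
        for ch in word[1:-1]:
            tagged.append((ch, "M"))
        tagged.append((word[-1], "E"))
    return tagged


# "连衣裙" (dress) is a three-character word, "女" (women) a single character.
print(tag_words(["连衣裙", "女"]))
# [('连', 'B'), ('衣', 'M'), ('裙', 'E'), ('女', 'S')]
```

Here every character receives exactly one identifier, which corresponds to complete labeling as opposed to labeling a character with several identifiers.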
In one embodiment, the processor may specifically be configured to label each character in a field that has segmentation ambiguity with multiple segmentation position identifiers as follows:
the first character of the field that has segmentation ambiguity is labeled with the start identifier or the single-character identifier;
the last character of the field that has segmentation ambiguity is labeled with the end identifier or the single-character identifier;
each character of the field other than the first character and the last character is labeled with the start identifier, the end identifier, the middle identifier, or the single-character identifier.
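The three rules above can be sketched as a function that returns, for each character of an ambiguous field, the set of identifiers it is labeled with. This is only a sketch: reading "or" as a set of candidate identifiers, the letters B/M/E/S, the name candidate_tags, and the handling of a one-character field are assumptions, not details stated in the application.

```python
from typing import List, Set

# Assumed shorthand: B = start, M = middle, E = end, S = single character.
ALL_TAGS = frozenset({"B", "M", "E", "S"})


def candidate_tags(field: str) -> List[Set[str]]:
    """Return the candidate identifier set for each character of the field."""
    n = len(field)
    labels: List[Set[str]] = []
    for i in range(n):
        tags = set(ALL_TAGS)
        if i == 0:
            tags &= {"B", "S"}  # first character: start or single
        if i == n - 1:
            tags &= {"E", "S"}  # last character: end or single
        labels.append(tags)
    return labels


# Four-character field: only the two inner characters keep all four candidates.
for ch, tags in zip("研究生命", candidate_tags("研究生命")):
    print(ch, sorted(tags))
```

For a one-character field both boundary rules apply at once, so under this sketch only the single-character identifier remains.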
In one embodiment, the processor may specifically be configured to determine the fields in the text to be processed that have segmentation ambiguity by means of dictionary-based forward maximum matching.
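Dictionary-based forward maximum matching scans the text from left to right and, at each position, takes the longest dictionary word that starts there. The sketch below implements that scan and then flags a span as ambiguous when another dictionary word starts inside a matched word and runs past its end. The overlap test is one assumed heuristic for illustration only; this passage does not state how ambiguity is derived from the matching result. The function names, the max_len parameter and the toy dictionary are likewise hypothetical.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 5) -> List[str]:
    """Segment text greedily, trying the longest dictionary word first."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def ambiguous_spans(text: str, dictionary: Set[str], max_len: int = 5) -> List[Tuple[int, int]]:
    """Assumed heuristic: a matched word is ambiguous if a dictionary word
    starts inside it and extends beyond its end (overlapping ambiguity)."""
    spans: List[Tuple[int, int]] = []
    start = 0
    for word in forward_max_match(text, dictionary, max_len):
        end = start + len(word)
        farthest = end
        for inner in range(start + 1, end):
            for stop in range(end + 1, min(len(text), inner + max_len) + 1):
                if text[inner:stop] in dictionary:
                    farthest = max(farthest, stop)
        if farthest > end:
            spans.append((start, farthest))  # half-open character span
        start = end
    return spans


toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))  # ['研究生', '命', '起源']
print(ambiguous_spans("研究生命起源", toy_dictionary))    # [(0, 4)]
```

In the toy example the greedy match picks "研究生" (graduate student), but "生命" (life) overlaps its last character, so the span covering "研究生命" is reported as ambiguous.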
In one embodiment, the processor may further be configured to determine, before performing word segmentation on the text to be processed, whether the text to be processed contains a separator entered by the user and, if such a separator is present, to label the two characters immediately before and after the separator with segmentation position identifiers.
In one embodiment, the separator may include, but is not limited to, at least one of the following: a space, an underscore, a hyphen, a comma, and a semicolon.
In one embodiment, the processor may specifically be configured to label the two characters before and after the separator with segmentation position identifiers as follows:
the first character after the separator is labeled with the start identifier or the single-character identifier;
the character immediately before the separator is labeled with the end identifier or the single-character identifier.
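A minimal sketch of this pre-labeling step follows. It scans the text for separators and records, for the character immediately before and immediately after each one, the identifiers allowed by the two rules above. The ASCII separator set, the letters B/M/E/S, the function name separator_constraints, and the intersection applied when a character sits between two separators are assumptions made for the example.

```python
from typing import Dict, Set

# Assumed ASCII forms of: space, underscore, hyphen, comma, semicolon.
SEPARATORS = {" ", "_", "-", ",", ";"}
# Assumed shorthand: B = start, M = middle, E = end, S = single character.
ALL_TAGS = {"B", "M", "E", "S"}


def separator_constraints(text: str) -> Dict[int, Set[str]]:
    """Map character index -> identifiers allowed next to a separator."""
    constraints: Dict[int, Set[str]] = {}

    def restrict(index: int, allowed: Set[str]) -> None:
        # Skip positions outside the text and neighbouring separators.
        if 0 <= index < len(text) and text[index] not in SEPARATORS:
            current = constraints.get(index, set(ALL_TAGS))
            constraints[index] = current & allowed

    for i, ch in enumerate(text):
        if ch in SEPARATORS:
            restrict(i - 1, {"E", "S"})  # character before: end or single
            restrict(i + 1, {"B", "S"})  # character after: start or single
    return constraints


result = separator_constraints("nike air_max")
print({i: sorted(tags) for i, tags in result.items()})
# {3: ['E', 'S'], 5: ['B', 'S'], 7: ['E', 'S'], 9: ['B', 'S']}
```

Characters that are not adjacent to a separator are left out of the mapping, so they can be labeled later by the word segmentation step.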
With the method and server for generating word segmentation training data provided by the present application, after the text to be processed is obtained, if the text contains a field with segmentation ambiguity, the characters in that field are labeled with multiple segmentation position identifiers rather than being labeled completely. This solves the problem that existing approaches, which also label fields with segmentation ambiguity completely, require manual labeling. Labor cost is therefore effectively saved, and word segmentation training data is generated efficiently while the validity of the training data is guaranteed.
Although the present application provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive effort. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual device or client product executes the method, the steps may be executed sequentially or in parallel according to the embodiments or the method order shown in the drawings (for example, in a parallel-processor or multi-threaded environment).
The devices or modules described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. For convenience of description, the above device is described with its functions divided into various modules. When implementing the present application, the functions of the modules may be implemented in one or more pieces of software and/or hardware. A module that implements a given function may also be implemented by a combination of multiple submodules or subunits.
The methods, devices, or modules described herein may be implemented by a controller in the form of computer-readable program code, and the controller may be implemented in any appropriate manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices it contains for implementing various functions can also be regarded as structures within the hardware component. Indeed, the devices for implementing various functions can be regarded both as software modules that implement the method and as structures within the hardware component.
Some of the modules in the device described in the present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary hardware. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product, or can be embodied in the course of a data migration. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. The present application, in whole or in part, can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Although the present application has been described by way of embodiments, those of ordinary skill in the art will appreciate that the present application admits many variations and changes without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of the present application.
Claims (19)
1. A method for generating word segmentation training data, characterized in that the method comprises:
performing word segmentation on a text to be processed to determine fields in the text to be processed that have segmentation ambiguity;
labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, labeled with the segmentation position identifiers, as training data for a word segmentation model.
2. The method according to claim 1, characterized in that the segmentation position identifiers include at least one of the following: a start identifier, an end identifier, a middle identifier, and a single-character identifier.
3. The method according to claim 2, characterized in that labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers comprises:
labeling the first character of the field that has segmentation ambiguity with the start identifier or the single-character identifier;
labeling the last character of the field that has segmentation ambiguity with the end identifier or the single-character identifier;
labeling each character of the field other than the first character and the last character with the start identifier, the end identifier, the middle identifier, or the single-character identifier.
4. The method according to claim 1, characterized in that, after the fields in the text to be processed that have segmentation ambiguity are determined by performing word segmentation on the text to be processed, the method further comprises:
labeling each character in fields that have no segmentation ambiguity with the corresponding segmentation position identifier.
5. The method according to claim 1, characterized in that performing word segmentation on the text to be processed to determine the fields in the text to be processed that have segmentation ambiguity comprises:
performing word segmentation on the text to be processed by means of dictionary-based forward maximum matching, so as to determine the fields in the text to be processed that have segmentation ambiguity.
6. The method according to claim 1, characterized in that, before word segmentation is performed on the text to be processed, the method further comprises:
determining whether the text to be processed contains a separator entered by a user;
if it is determined that a separator entered by the user is present, labeling the two characters immediately before and after the separator with segmentation position identifiers.
7. The method according to claim 6, characterized in that the separator includes at least one of the following: a space, an underscore, a hyphen, a comma, and a semicolon.
8. The method according to claim 7, characterized in that labeling the two characters before and after the separator with segmentation position identifiers comprises:
labeling the first character after the separator with the start identifier or the single-character identifier;
labeling the character immediately before the separator with the end identifier or the single-character identifier.
9. The method according to claim 1, characterized in that the text to be processed includes: a search request on an e-commerce platform.
10. A method for generating word segmentation training data, characterized in that the method comprises:
labeling, based on one or more of a user query dictionary and a product dictionary, each character in fields that have segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, labeled with the segmentation position identifiers, as training data for a word segmentation model.
11. A server, comprising a processor and a memory for storing instructions executable by the processor, wherein the processor, when executing the instructions, implements the following steps:
performing word segmentation on a text to be processed to determine fields in the text to be processed that have segmentation ambiguity;
labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, labeled with the segmentation position identifiers, as training data for a word segmentation model.
12. The server according to claim 11, characterized in that the segmentation position identifiers include at least one of the following: a start identifier, an end identifier, a middle identifier, and a single-character identifier.
13. The server according to claim 12, characterized in that the processor is specifically configured to label each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers as follows:
labeling the first character of the field that has segmentation ambiguity with the start identifier or the single-character identifier;
labeling the last character of the field that has segmentation ambiguity with the end identifier or the single-character identifier;
labeling each character of the field other than the first character and the last character with the start identifier, the end identifier, the middle identifier, or the single-character identifier.
14. The server according to claim 11, characterized in that the processor is further configured to, after the fields in the text to be processed that have segmentation ambiguity are determined by performing word segmentation on the text to be processed, label each character in fields that have no segmentation ambiguity with the corresponding segmentation position identifier.
15. The server according to claim 11, characterized in that the processor is specifically configured to determine the fields in the text to be processed that have segmentation ambiguity by means of dictionary-based forward maximum matching.
16. The server according to claim 11, characterized in that the processor is further configured to determine, before word segmentation is performed on the text to be processed, whether the text to be processed contains a separator entered by a user and, if it is determined that a separator entered by the user is present, to label the two characters immediately before and after the separator with segmentation position identifiers.
17. The server according to claim 16, characterized in that the separator includes at least one of the following: a space, an underscore, a hyphen, a comma, and a semicolon.
18. The server according to claim 16, characterized in that the processor is specifically configured to label the two characters before and after the separator with segmentation position identifiers as follows:
labeling the first character after the separator with the start identifier or the single-character identifier;
labeling the character immediately before the separator with the end identifier or the single-character identifier.
19. A computer-readable storage medium storing computer instructions which, when executed, implement the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710589616.0A CN109284763A (en) | 2017-07-19 | 2017-07-19 | A kind of method and server generating participle training data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710589616.0A CN109284763A (en) | 2017-07-19 | 2017-07-19 | A kind of method and server generating participle training data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109284763A true CN109284763A (en) | 2019-01-29 |
Family
ID=65184825
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710589616.0A Pending CN109284763A (en) | 2017-07-19 | 2017-07-19 | A kind of method and server generating participle training data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109284763A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110457436A (en) * | 2019-07-30 | 2019-11-15 | 腾讯科技(深圳)有限公司 | Information labeling method, apparatus, computer readable storage medium and electronic equipment |
CN110728137A (en) * | 2019-10-10 | 2020-01-24 | 京东数字科技控股有限公司 | Method and device for word segmentation |
CN111563399A (en) * | 2019-02-14 | 2020-08-21 | 阿里巴巴集团控股有限公司 | Method and device for acquiring structured information of electronic medical record |
CN111797626A (en) * | 2019-03-21 | 2020-10-20 | 阿里巴巴集团控股有限公司 | Named entity identification method and device |
CN114548103A (en) * | 2020-11-25 | 2022-05-27 | 马上消费金融股份有限公司 | Training method of named entity recognition model and recognition method of named entity |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101114282A (en) * | 2007-07-12 | 2008-01-30 | 华为技术有限公司 | Participle processing method and equipment |
CN101261623A (en) * | 2007-03-07 | 2008-09-10 | 国际商业机器公司 | Word splitting method and device for word border-free mark language based on search |
CN101950284A (en) * | 2010-09-27 | 2011-01-19 | 北京新媒传信科技有限公司 | Chinese word segmentation method and system |
CN102402502A (en) * | 2011-11-24 | 2012-04-04 | 北京趣拿信息技术有限公司 | Word segmentation processing method and device for search engine |
CN103020034A (en) * | 2011-09-26 | 2013-04-03 | 北京大学 | Chinese words segmentation method and device |
CN103077164A (en) * | 2012-12-27 | 2013-05-01 | 新浪网技术(中国)有限公司 | Text analysis method and text analyzer |
CN103324612A (en) * | 2012-03-22 | 2013-09-25 | 北京百度网讯科技有限公司 | Method and device for segmenting word |
CN103778161A (en) * | 2012-10-26 | 2014-05-07 | 同程网络科技股份有限公司 | Word segmentation ambiguity elimination method applicable to Chinese word bank |
CN103902521A (en) * | 2012-12-24 | 2014-07-02 | 高德软件有限公司 | Chinese statement identification method and device |
CN104933023A (en) * | 2015-05-12 | 2015-09-23 | 深圳市华傲数据技术有限公司 | Chinese address word segmentation and annotation method |
CN105138514A (en) * | 2015-08-24 | 2015-12-09 | 昆明理工大学 | Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction |
US20160027433A1 (en) * | 2014-07-24 | 2016-01-28 | International Business Machines Corporation | Method of selecting training text for language model, and method of training language model using the training text, and computer and computer program for executing the methods |
CN105718586A (en) * | 2016-01-26 | 2016-06-29 | 中国人民解放军国防科学技术大学 | Word division method and device |
CN106202039A (en) * | 2016-06-30 | 2016-12-07 | 昆明理工大学 | Vietnamese portmanteau word disambiguation method based on condition random field |
CN106407186A (en) * | 2016-10-09 | 2017-02-15 | 新译信息科技(深圳)有限公司 | Word segmentation model building method and apparatus |
CN106708807A (en) * | 2017-02-10 | 2017-05-24 | 深圳市空谷幽兰人工智能科技有限公司 | Non-supervision word segmentation mode training method and device |
CN106778887A (en) * | 2016-12-27 | 2017-05-31 | 努比亚技术有限公司 | The terminal and method of sentence flag sequence are determined based on condition random field |
Non-Patent Citations (3)
Title |
---|
周祺: "基于统计与词典相结合的中文分词的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王靖等: "一种优化的用于中文分词的CRF机器学习模型", 《软件时空》 * |
许高建等: "一种改进的中文分词歧义消除算法研究", 《合肥工业大学学报(自然科学版)》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111563399A (en) * | 2019-02-14 | 2020-08-21 | 阿里巴巴集团控股有限公司 | Method and device for acquiring structured information of electronic medical record |
CN111563399B (en) * | 2019-02-14 | 2023-04-28 | 阿里巴巴集团控股有限公司 | Method and device for obtaining structured information of electronic medical record |
CN111797626A (en) * | 2019-03-21 | 2020-10-20 | 阿里巴巴集团控股有限公司 | Named entity identification method and device |
CN110457436A (en) * | 2019-07-30 | 2019-11-15 | 腾讯科技(深圳)有限公司 | Information labeling method, apparatus, computer readable storage medium and electronic equipment |
CN110457436B (en) * | 2019-07-30 | 2022-12-27 | 腾讯科技(深圳)有限公司 | Information labeling method and device, computer readable storage medium and electronic equipment |
CN110728137A (en) * | 2019-10-10 | 2020-01-24 | 京东数字科技控股有限公司 | Method and device for word segmentation |
CN114548103A (en) * | 2020-11-25 | 2022-05-27 | 马上消费金融股份有限公司 | Training method of named entity recognition model and recognition method of named entity |
CN114548103B (en) * | 2020-11-25 | 2024-03-29 | 马上消费金融股份有限公司 | Named entity recognition model training method and named entity recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190129 ||