CN109284763A - Method and server for generating word segmentation training data - Google Patents

Method and server for generating word segmentation training data

Info

Publication number
CN109284763A
Authority
CN
China
Prior art keywords
mark
text
identifier
processed
cutting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710589616.0A
Other languages
Chinese (zh)
Inventor
徐光伟
李林琳
谢朋峻
马春平
郎君
司罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710589616.0A
Publication of CN109284763A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

This application provides a method and a server for generating word segmentation training data. The method comprises: performing word segmentation on a text to be processed to determine the fields of the text that have segmentation ambiguity; marking each character in a field that has segmentation ambiguity with multiple segmentation position tags; and using the text to be processed, once marked with segmentation position tags, as training data for a word segmentation model. Existing approaches apply complete (one tag per character) annotation even to fields with segmentation ambiguity, which forces those fields to be annotated manually. The scheme of this application removes that requirement, so it can effectively save labor cost and produce word segmentation training data efficiently while keeping the training data valid.

Description

Method and server for generating word segmentation training data
Technical field
This application belongs to the field of computer technology, and in particular relates to a method and a server for generating word segmentation training data.
Background
Currently, word segmentation is usually performed by training a word segmentation model and applying it to the text to be segmented. Training such a model requires a large amount of word segmentation training data, and each training item must be annotated in advance with its correct segmentation. For example, for the text "space No.1 basketball shoes" to serve as effective training data, its correct segmentation must first be annotated.
Some relatively simple texts can be annotated automatically by machine, but other texts are ambiguous. Take "female's birthday gift" (女生日礼物): when it is segmented, "female" (女) may be joined with "life" (生) to form "schoolgirl" (女生), or "life" (生) may be joined with the following characters to form "birthday gift" (生日礼物). A machine has difficulty annotating such text, so text with segmentation ambiguity has so far had to be annotated manually.
When the amount of data is very large, manual annotation is very costly. No effective solution to this problem has yet been proposed.
Summary of the invention
The purpose of this application is to provide a method and a server for generating word segmentation training data that can produce such data without manual annotation while keeping the training data valid.
The method and server for generating word segmentation training data provided by this application are implemented as follows:
A method for generating word segmentation training data, the method comprising:
performing word segmentation on a text to be processed to determine the fields of the text to be processed that have segmentation ambiguity;
marking each character in a field that has segmentation ambiguity with multiple segmentation position tags;
using the text to be processed, after it has been marked with segmentation position tags, as training data for a word segmentation model.
A server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor implements the following steps when executing the instructions:
performing word segmentation on a text to be processed to determine the fields of the text to be processed that have segmentation ambiguity;
marking each character in a field that has segmentation ambiguity with multiple segmentation position tags;
using the text to be processed, after it has been marked with segmentation position tags, as training data for a word segmentation model.
A method for generating word segmentation training data, the method comprising:
based on one or more of a user query dictionary and a product dictionary, marking each character in a field that has segmentation ambiguity with multiple segmentation position tags;
using the text to be processed, after it has been marked with segmentation position tags, as training data for a word segmentation model.
A computer-readable storage medium storing computer instructions which, when executed, implement the steps of the above method.
With the method and server for generating word segmentation training data provided by this application, once a text to be processed has been obtained, any field of the text that has segmentation ambiguity has its characters marked with multiple segmentation position tags instead of a complete annotation. This solves the existing problem that fields with segmentation ambiguity are also annotated completely and therefore have to be annotated manually. Labor cost can thus be saved effectively, and word segmentation training data can be produced efficiently while the training data remains valid.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some of the embodiments recorded in this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for generating word segmentation training data provided by this application;
Fig. 2 is a schematic diagram of annotation results obtained by annotating the text to be processed based on space information, provided by this application;
Fig. 3 is a schematic diagram of annotation results obtained by annotating the text to be processed using partial annotation, provided by this application;
Fig. 4 is a flow chart comparing the method provided by this application for obtaining word segmentation training data with an existing method for obtaining such data;
Fig. 5 is a schematic structural diagram of one embodiment of a terminal provided by this application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of this application without creative effort fall within the scope of protection of this application.
When existing methods generate word segmentation training data, places with overlapping ambiguity must be annotated manually. Manual annotation requires annotators to have a very clear understanding of the segmentation standard and to stay consistent throughout the annotation process. It also takes a great deal of labor and time to complete, so the annotation cost is relatively high.
Existing annotation mainly uses the complete annotation method, which relies on the following tags:
1) a start identifier, which marks the first character of a word obtained by segmentation; it can be written Begin, abbreviated B;
2) an end identifier, which marks the last character of a word obtained by segmentation; it can be written End, abbreviated E;
3) an intermediate identifier, which marks a middle character of a word obtained by segmentation; it can be written Internal, abbreviated I;
4) a single-character identifier, which marks a character that forms a word on its own; it can be written Single, abbreviated S.
For example, the correct segmentation of "red one-piece dress summer" (红色连衣裙夏) is "red + one-piece dress + summer" (红色 + 连衣裙 + 夏). Under this segmentation, the characters are tagged 红-B, 色-E, 连-B, 衣-I, 裙-E, 夏-S.
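As a minimal sketch of complete annotation, the Python function below turns an already-segmented word list into one B/I/E/S tag per character. The function name `complete_tags` is hypothetical, and the string 红色连衣裙夏 is an assumed reconstruction of "red one-piece dress summer" from the per-character glosses in the example.

```python
def complete_tags(words):
    """Return (character, tag) pairs for a list of already-segmented words."""
    tagged = []
    for word in words:
        if not word:            # ignore empty strings
            continue
        if len(word) == 1:      # a character that is a word on its own
            tagged.append((word, "S"))
            continue
        tagged.append((word[0], "B"))       # first character of the word
        for ch in word[1:-1]:
            tagged.append((ch, "I"))        # middle characters
        tagged.append((word[-1], "E"))      # last character of the word
    return tagged


# Assumed example: "red + one-piece dress + summer"
for ch, tag in complete_tags(["红色", "连衣裙", "夏"]):
    print(ch, tag)
```

Each character receives exactly one tag here, which is what distinguishes complete annotation from the partial annotation introduced later.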
In the complete annotation method, each character in the generated word segmentation training data has exactly one tag. That is, each character of the training data must be marked with exactly one of B, E, I and S. If an overlapping ambiguity such as "female's birthday gift" (女生日礼物) is encountered, the machine cannot effectively determine how to segment the text, and therefore cannot annotate it under the complete annotation method. Such content has to be picked out and annotated manually.
A word segmentation model is then trained on the completely annotated training data, yielding the model used for segmentation.
This example starts from the observation that manual annotation is needed mainly because the segmentation of some strings in the text is hard to determine, yet every character must be given one specific, unique tag. This example therefore provides an annotation method that no longer requires every character to be completely annotated: characters in a string with segmentation ambiguity may be annotated incompletely, that is, one character may correspond to two or more tags.
Fig. 1 is a flow chart of one embodiment of the method for generating word segmentation training data described in this application. Although this application provides the method steps or apparatus structures shown in the following embodiments or drawings, the method or apparatus may include more or fewer steps or modular units based on routine work that requires no creative effort. Where steps or structures have no logically necessary causal relationship, the execution order of the steps and the modular structure of the apparatus are not limited to those described in the embodiments or shown in the drawings. When the method or modular structure is applied in an actual apparatus or end product, it can be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment) according to the embodiments or the connections shown in the drawings.
Specifically, as shown in Fig. 1, a method for generating word segmentation training data provided by one embodiment of this application may include:
S1: determine the fields of the text to be processed that have segmentation ambiguity;
Here, a field with segmentation ambiguity is a field that has more than one possible segmentation result and in which the words of the different segmentations cross, that is, share characters.
For example, suppose the text to be processed is "female's birthday gift intention practical" (女生日礼物创意实用), and the dictionary contains the following related words: schoolgirl (女生), birthday (生日), birthday gift (生日礼物), present (礼物), intention (创意) and practical (实用). Dictionary-based forward maximum matching can then segment the text to be processed as follows:
1) Starting from the first character of the text to be processed and scanning forward (toward the second character, that is, from 女 toward 生), find the longest string that is in the dictionary and record it. Starting from 女, dictionary-based forward maximum matching identifies "schoolgirl" (女生);
2) Repeat the same matching starting from the second character, 生. Both "birthday" (生日) and "birthday gift" (生日礼物) are in the dictionary, so under forward maximum matching, that is, the longest-match principle, the result is "birthday gift" (生日礼物);
3) Repeat the same matching starting from the third character, 日. No string starting there is in the dictionary, so this position is skipped and nothing is recorded;
4) The remaining characters are processed in the same way, which identifies "present" (礼物), "intention" (创意) and "practical" (实用).
The segmentation result identified for the text to be processed is: schoolgirl (女生), birthday gift (生日礼物), intention (创意), practical (实用). Within "female's birthday gift" (女生日礼物), the words "schoolgirl" and "birthday gift" found by dictionary-based forward maximum matching cross, since both contain 生. The machine therefore cannot directly determine how to segment and annotate this string, and "female's birthday gift" is a field with segmentation ambiguity. "Intention" and "practical", by contrast, each have only one possible segmentation and cross nothing, so they are fields without segmentation ambiguity.
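The matching steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, the Chinese strings are reconstructed from the glosses in the example, and dropping a match that lies entirely inside a longer match (such as "present" inside "birthday gift") is one interpretation of why "present" is absent from the final result.

```python
def longest_matches(text, lexicon):
    """For each start index, record the longest lexicon word beginning there."""
    max_len = max(map(len, lexicon), default=0)
    matches = []
    for start in range(len(text)):
        # Try the longest candidate first and stop at the first hit.
        for length in range(min(max_len, len(text) - start), 0, -1):
            if text[start:start + length] in lexicon:
                matches.append((start, start + length))
                break
    return matches


def ambiguous_spans(text, lexicon):
    """Return (start, end) spans in which recorded words cross each other."""
    matches = longest_matches(text, lexicon)
    # Assumption: a match nested inside another match is not a crossing.
    outer = [m for m in matches
             if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                        for o in matches)]
    spans = []
    for start, end in sorted(outer):
        if spans and start < spans[-1][1]:
            # This word shares characters with the previous one: merge them.
            spans[-1] = (spans[-1][0], max(end, spans[-1][1]), True)
        else:
            spans.append((start, end, False))
    return [(s, e) for s, e, crossed in spans if crossed]


lexicon = {"女生", "生日", "生日礼物", "礼物", "创意", "实用"}
text = "女生日礼物创意实用"
for start, end in ambiguous_spans(text, lexicon):
    print(text[start:end])
```

On this input the only span reported is the five-character field 女生日礼物; 创意 and 实用 are matched but cross nothing, so they are not reported.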
In one embodiment, dictionary-based forward maximum matching can be used to segment the text to be processed and to determine whether it contains segments with segmentation ambiguity. The dictionary used for forward maximum matching may be the dictionary of an e-commerce platform, of a search engine, of the news domain, and so on. So that the trained word segmentation model segments text of a given domain reasonably, the dictionary of that domain can be chosen. Dictionaries of several domains can also be combined. The choice can be made according to actual needs and is not limited by this application.
S2: mark each character in the field with segmentation ambiguity with multiple segmentation position tags;
Fields determined to have no segmentation ambiguity can be annotated in the conventional way. For example, "intention" (创意) and "practical" (实用) in the example above have no segmentation ambiguity, so they are annotated directly in the normal way. Suppose the segmentation position tags are:
1) B, which marks the first character of a word;
2) E, which marks the last character of a word;
3) I, which marks a character in the middle of a word;
4) S, which marks a character that forms a word on its own.
Then 创 is tagged B and 意 is tagged E ("intention", 创意), and 实 is tagged B and 用 is tagged E ("practical", 实用). As another example, in "flat shoes" (平底鞋), 平 is tagged B, 底 is tagged I and 鞋 is tagged E.
Fields determined to have segmentation ambiguity can be annotated in a preset way. Because a machine cannot annotate such fields exactly, partial annotation can be used: each character in a field with segmentation ambiguity may be marked with more than one segmentation position tag, so that a word segmentation model can still be trained effectively when the text to be processed is later used as training data.
In one embodiment, each character in a field with segmentation ambiguity can be marked with multiple segmentation position tags as follows:
1) mark the first character of the field with segmentation ambiguity with the start identifier or the single-character identifier;
2) mark the last character of the field with segmentation ambiguity with the end identifier or the single-character identifier;
3) mark every other character of the field with segmentation ambiguity with the start identifier, the end identifier, the intermediate identifier or the single-character identifier.
For example, "female's birthday gift" (女生日礼物) is a field with segmentation ambiguity because "schoolgirl" (女生) and "birthday gift" (生日礼物) cross. Under the above annotation rules, 女 is tagged B/S, 生 is tagged B/E/I/S, 日 is tagged B/E/I/S, 礼 is tagged B/E/I/S and 物 is tagged E/S.
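The three position rules above can be sketched in a few lines of Python. The function name `partial_tags` is hypothetical, and the example string is the assumed reconstruction of "female's birthday gift"; the sketch also assumes the ambiguous field has at least two characters, which holds for any field formed by crossing words.

```python
ALL_TAGS = frozenset("BEIS")


def partial_tags(field):
    """Return (character, candidate tag set) pairs for an ambiguous field."""
    last = len(field) - 1
    tagged = []
    for i, ch in enumerate(field):
        if i == 0:
            allowed = frozenset("BS")   # first character: start or single
        elif i == last:
            allowed = frozenset("ES")   # last character: end or single
        else:
            allowed = ALL_TAGS          # inner characters: any tag
        tagged.append((ch, allowed))
    return tagged


for ch, allowed in partial_tags("女生日礼物"):
    print(ch, "/".join(sorted(allowed)))
```

The only constraint this rule set encodes is that the ambiguous field begins and ends on word boundaries; everything inside is left open for the model to learn.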
The above is only an illustrative way of annotating a field with segmentation ambiguity; other ways are possible in practice. For example, each character of the field can be marked with the positions it takes in all possible segmentations. The possible segmentations of "female's birthday gift" (女生日礼物) are 女 / 生日礼物 (female / birthday gift) and 女生 / 日 / 礼物 (schoolgirl / day / present). Accordingly, 女 is tagged B/S, 生 is tagged B/E, 日 is tagged I/S, 礼 is tagged B/I and 物 is tagged E.
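This alternative amounts to taking, for each character, the union of its tags over the candidate segmentations. A minimal sketch, assuming the candidate segmentations are already known and given as word lists (the function names and the reconstructed Chinese strings are assumptions):

```python
def tags_of(words):
    """Complete B/I/E/S tag list for one segmentation (a list of words)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend("B" + "I" * (len(word) - 2) + "E")
    return tags


def union_tags(candidates):
    """Per-character union of tags over several segmentations of one field."""
    per_candidate = [tags_of(words) for words in candidates]
    # zip(*...) groups the tags that each candidate assigns to one position.
    return [set(column) for column in zip(*per_candidate)]


candidates = [["女", "生日礼物"], ["女生", "日", "礼物"]]
for ch, allowed in zip("女生日礼物", union_tags(candidates)):
    print(ch, "/".join(sorted(allowed)))
```

Compared with tagging every inner character B/E/I/S, this variant keeps only the tags that some candidate segmentation actually uses, so the resulting annotation is tighter.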
Note, however, that the annotation methods listed above are only illustrative. In practice, a suitable way of annotating fields with segmentation ambiguity can be chosen as needed; this application does not limit it.
S3: use the text to be processed, after it has been marked with segmentation position tags, as training data for a word segmentation model.
The text to be processed is annotated in the above way, and the annotated text serves as training data for the word segmentation model. For example, it can be used as training data for a CRF model, and training the CRF model on it yields the word segmentation model.
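The text does not describe how a CRF consumes characters that carry several tags. As an illustration only of what a partial annotation encodes, the sketch below enumerates the complete B/I/E/S sequences that are consistent with it; the transition table and the function name are assumptions introduced for this example, not part of the described method.

```python
from itertools import product

# Which tag may follow which under the B/I/E/S scheme (assumed constraints).
NEXT = {"B": "IE", "I": "IE", "E": "BS", "S": "BS"}


def consistent_sequences(tag_sets):
    """List every well-formed tag sequence allowed by per-character tag sets."""
    if not tag_sets:
        return []
    result = []
    for seq in product(*[sorted(s) for s in tag_sets]):
        # A text must start at a word start and end at a word end.
        if seq[0] not in "BS" or seq[-1] not in "ES":
            continue
        if all(b in NEXT[a] for a, b in zip(seq, seq[1:])):
            result.append("".join(seq))
    return result


# Tag sets for the assumed example 女生日礼物 under the union-of-segmentations rule.
tag_sets = [{"B", "S"}, {"B", "E"}, {"I", "S"}, {"B", "I"}, {"E"}]
print(consistent_sequences(tag_sets))
```

For this input exactly two sequences survive, BESBE (女生 / 日 / 礼物) and SBIIE (女 / 生日礼物), which are the two candidate segmentations the annotation was built from.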
In one embodiment, the text to be processed can be obtained from user search logs by extracting multiple search requests (queries) from them. These search requests are then used as texts to be processed, which yields multiple items of training data.
When entering a search request, users sometimes type a space or another separator to indicate that the text should be split at that position. For example, when a user enters "summer women's dress" with a separator between "summer" and "women's dress", the two parts are generally meant to be separate, so the text needs to be split there. The character before the separator can be regarded as the last character of a word or as a single-character word, and the character after the separator can be regarded as the first character of a word or as a single-character word. Using this deliberate user behavior as a basis for segmentation and annotation can effectively improve the accuracy of automatic machine annotation.
Therefore, to effectively improve the accuracy of automatic machine annotation, before determining which fields of the text to be processed have segmentation ambiguity and which do not, it can be determined whether the text to be processed contains a separator entered by the user. If it does, the two characters immediately before and after the separator are marked with segmentation position tags.
In one embodiment, marking the two characters immediately before and after the separator with segmentation position tags may include:
1) marking the first character after the separator with the start identifier or the single-character identifier;
2) marking the character immediately before the separator with the end identifier or the single-character identifier.
Characters that are not adjacent to a separator can all be uniformly tagged B/E/I/S. The separator-based tags can then serve as a reference when forward maximum matching is performed. Take "female's birthday gift intention practical": if the text to be processed is entered with a space after 女 (女 生日礼物创意实用), the space-based annotation shows that 女 is either a single-character word or the last character of a word. Segmentation can then proceed automatically, giving 女 / 生日礼物 / 创意 / 实用 (female / birthday gift / intention / practical). This reduces the likelihood of segmentation ambiguity, so the training data generated after annotation is more accurate.
The separator may include, but is not limited to, at least one of: a space, an underscore, a hyphen, a comma and a semicolon. Any symbol that can split a string can serve as a separator.
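The separator rules can be sketched as follows. The separator set below contains only the ASCII forms of the listed symbols, and the function name, the handling of a character that sits between two separators (it keeps only S), and the example query with a space after 女 are assumptions made for illustration.

```python
SEPARATORS = set(" _-,;")   # space, underscore, hyphen, comma, semicolon


def tag_by_separators(query):
    """Return (character, candidate tag set) pairs, dropping the separators."""
    chars, tags = [], []
    after_separator = False
    for ch in query:
        if ch in SEPARATORS:
            if tags:
                tags[-1] &= {"E", "S"}   # character just before the separator
            after_separator = True
            continue
        allowed = set("BEIS")            # default: every tag is possible
        if after_separator:
            allowed &= {"B", "S"}        # character just after the separator
        chars.append(ch)
        tags.append(allowed)
        after_separator = False
    return list(zip(chars, tags))


for ch, allowed in tag_by_separators("女 生日礼物创意实用"):
    print(ch, "/".join(sorted(allowed)))
```

Intersecting the tag sets, rather than overwriting them, means a character with separators on both sides ends up tagged S only, which is the one tag compatible with both rules.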
In e-commerce search, the available dictionaries include a user query dictionary and a product dictionary. So that the trained word segmentation model can be applied to the e-commerce domain, when generating word segmentation training data, each character in a field with segmentation ambiguity can be marked with multiple segmentation position tags based on one or more of the user query dictionary and the product dictionary; the text to be processed, once marked, is then used as training data for the word segmentation model.
The user query dictionary can be a dictionary built from the search terms users enter in the search box. The product dictionary can be a dictionary built from content such as product titles, product categories and product descriptions.
The above method for generating word segmentation training data is described below with reference to a concrete scenario. Note that this specific embodiment is only intended to illustrate the application better and does not unduly limit it.
Three user search requests as shown in table 1 below are obtained as text to be processed:
Table 1
Type     Data
query    Female's birthday gift intention is practical
query    Sandals female's summer is flat
query    The grey autumn and winter
The texts to be processed shown in Table 1 can be handled in the following steps to generate word segmentation training data:
S1: perform partial annotation using spaces:
In this example, the following tags are used for annotation:
1) a start identifier, which marks the first character of a word obtained by segmentation; it can be written Begin, abbreviated B;
2) an end identifier, which marks the last character of a word obtained by segmentation; it can be written End, abbreviated E;
3) an intermediate identifier, which marks a middle character of a word obtained by segmentation; it can be written Internal, abbreviated I;
4) a single-character identifier, which marks a character that forms a word on its own; it can be written Single, abbreviated S.
An annotation in which the position of each character is determined exactly is called complete annotation; an annotation that marks out all the possibilities is called partial annotation.
Space-based annotation can be carried out as follows:
1) mark the first character after the separator with the start identifier or the single-character identifier (B/S);
2) mark the character immediately before the separator with the end identifier or the single-character identifier (E/S);
3) mark all other characters B/E/I/S.
Applying space-based annotation to the three texts to be processed in Table 1 gives the annotation results shown in Fig. 2.
S2: annotate segments without segmentation ambiguity using an existing e-commerce dictionary:
Using an existing e-commerce dictionary, identify all known words in the user search queries. In one embodiment, this can be done by forward maximum matching. For an identified word that has no segmentation ambiguity in the query, the space-based annotation can be changed to complete annotation; the remaining ambiguous segments keep their partial annotation.
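One way to picture this refinement step is the sketch below, which narrows the space-based candidate tags to a single tag inside each unambiguous dictionary word. The function name, the (start, end) span representation, and the choice to skip a word whose complete tags contradict the space-based tags are all assumptions made for illustration; the text does not specify them.

```python
def refine_with_words(tag_sets, word_spans):
    """Replace candidate tag sets with complete tags inside unambiguous words."""
    refined = [set(tags) for tags in tag_sets]
    for start, end in word_spans:
        length = end - start
        complete = "S" if length == 1 else "B" + "I" * (length - 2) + "E"
        # Assumption: only overwrite when the word agrees with the space hints.
        if all(tag in refined[start + i] for i, tag in enumerate(complete)):
            for i, tag in enumerate(complete):
                refined[start + i] = {tag}
    return refined


text = "女生日礼物创意实用"
# Space-based tags for the assumed query "女 生日礼物创意实用".
tag_sets = [{"E", "S"}, {"B", "S"}] + [set("BEIS") for _ in range(7)]
# Spans of the unambiguous words 创意 and 实用 (hypothetical matcher output).
refined = refine_with_words(tag_sets, [(5, 7), (7, 9)])
for ch, allowed in zip(text, refined):
    print(ch, "/".join(sorted(allowed)))
```

After this step the characters of 创意 and 实用 carry one tag each, while the characters of the ambiguous field keep their candidate sets.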
As for the choice of dictionary, a general-purpose dictionary can be used as the base and combined with the dictionary of the target domain (for example, an e-commerce dictionary, which mainly contains product categories, brands, attribute words and so on).
Forward maximum matching is illustrated using "female's birthday gift intention practical" (女生日礼物创意实用):
1) The dictionary contains the following words related to "female's birthday gift intention practical": {"schoolgirl" (女生), "birthday" (生日), "birthday gift" (生日礼物), "present" (礼物), "intention" (创意), "practical" (实用)};
2) Starting from the first character of the text and scanning forward (toward the second character), find the longest segment that appears in the dictionary and record it. Starting from 女, "schoolgirl" (女生) is identified;
3) Repeat the same step starting from the second character, 生. Both "birthday" (生日) and "birthday gift" (生日礼物) are in the dictionary, so under the longest-match principle "birthday gift" is selected;
4) Starting from the third character, 日, no segment appears in the dictionary, so the position is skipped and nothing is recorded. In the same way, "present" (礼物), "intention" (创意) and "practical" (实用) are identified and recorded in turn.
Whether "birthday gift" (生日礼物) should be split into "birthday + present" or kept as "birthday gift" is settled by the forward maximum principle: the unified standard is to keep "birthday gift". Segmentation ambiguity here refers to the fact that "schoolgirl" (女生) and "birthday gift" (生日礼物), both identified in "female's birthday gift" by forward maximum matching, cross, so the segmentation cannot be determined.
S3: perform partial annotation on segments with segmentation ambiguity.
Specifically, the first character of a field with segmentation ambiguity is marked as the first character of a word or as a single-character word; the last character of the field is marked as the last character of a word or as a single-character word; every other character of the field is marked as the first character of a word, the last character of a word, a middle character of a word, or a single-character word.
The annotation in S2 and S3 above yields the annotation results shown in Fig. 3.
Module 3: train the CRF model on the partially annotated data.
Fig. 4 shows the existing method flow for obtaining word segmentation training data alongside the method flow of this example. As Fig. 4 shows, the existing method obtains training data by having people completely annotate the segments with segmentation ambiguity, which consumes a great deal of labor. In this example, by contrast, the machine annotates the segments with segmentation ambiguity incompletely, so word segmentation training data can be obtained without manual annotation. Applying the above operations to a large number of user search queries yields enough word segmentation training data; using this data as the input of the word segmentation model, the model finally used for segmentation can be trained.
In the above example, search queries are partially annotated using the space information that users enter spontaneously. Because users enter these spaces in order to find the products they want, this information is relatively accurate. Further, characters in fields with segmentation ambiguity are marked with multiple segmentation position tags, and characters in fields without segmentation ambiguity are annotated completely. Automatic annotation can thus be completed while the quality of the training data is maintained, fields with segmentation ambiguity no longer need to be annotated manually, and annotation cost is saved.
Embodiment of the method provided by the above embodiments of the present application can be in server, terminal or similar fortune It calculates and is executed in device.For running on computer terminals, Fig. 5 is a kind of generation participle training data of the embodiment of the present application Method terminal hardware block diagram.As shown in figure 5, terminal 10 may include one or more (figures In only show one) (processor 102 can include but is not limited to Micro-processor MCV or programmable logic device to processor 102 The processing unit of FPGA etc.), memory 104 for storing data and the transmission module 106 for communication function.Ability Domain those of ordinary skill is appreciated that structure shown in fig. 5 is only to illustrate, and does not cause to limit to the structure of above-mentioned electronic device It is fixed.For example, terminal 10 may also include than shown in Fig. 5 more perhaps less component or have with shown in Fig. 5 not Same configuration.
The memory 104 may be used to store the software programs and modules of application software, such as the program instructions/modules corresponding to the short message sending method in the embodiments of the present invention. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the short message sending method of the above application program. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and this remote memory may be connected to the computer terminal 10 through a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission module 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission module 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
The above computer terminal may be a terminal device or software that a customer operates. Specifically, it may be a terminal device such as a smartphone, tablet computer, laptop, desktop computer, smartwatch, or other wearable device. It may also be software that runs on such a terminal device, for example application software such as Mobile Taobao, Alipay, or a browser.
The processor implements the following steps when executing the instructions:
S1: performing word segmentation on the text to be processed to determine the fields in the text that have segmentation ambiguity;
S2: labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
S3: using the text to be processed, once labeled with segmentation position identifiers, as training data for the word segmentation model.
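Steps S2 and S3 can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the tag letters `B`, `M`, `E`, `S` (start, intermediate, end, single character), the `(text, is_ambiguous)` input format standing in for the output of S1, and the function names are all assumptions made for the example.

```python
from typing import List, Set, Tuple

# Assumed tag letters: B = start, M = intermediate, E = end, S = single character.

def full_tags(word: str) -> List[Set[str]]:
    """Exactly one tag per character, for a word whose segmentation is certain."""
    if len(word) == 1:
        return [{"S"}]
    return [{"B"}] + [{"M"} for _ in word[1:-1]] + [{"E"}]

def candidate_tags(field: str) -> List[Set[str]]:
    """Several allowed tags per character, for a field with segmentation ambiguity."""
    if len(field) == 1:
        # Assumption: a one-character field can only be a single-character word.
        return [{"S"}]
    middle = [{"B", "M", "E", "S"} for _ in field[1:-1]]
    return [{"B", "S"}] + middle + [{"E", "S"}]

def build_training_record(
    segments: List[Tuple[str, bool]]
) -> List[Tuple[str, Set[str]]]:
    """Turn (text, is_ambiguous) pairs (hypothetical S1 output) into labeled characters."""
    record: List[Tuple[str, Set[str]]] = []
    for text, is_ambiguous in segments:
        tags = candidate_tags(text) if is_ambiguous else full_tags(text)
        record.extend(zip(text, tags))
    return record
```

For instance, `build_training_record([("红色", False), ("连衣裙", True)])` gives each character of the unambiguous word a single tag and each character of the ambiguous field a set of permitted tags. The record as a whole is what would be passed to model training.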
In one embodiment, the above segmentation position identifier may include, but is not limited to, at least one of the following: a start identifier, an end identifier, an intermediate identifier, and a single-character identifier.
In one embodiment, the processor may specifically be configured to label each character in the field that has segmentation ambiguity with multiple segmentation position identifiers as follows:
the first character of the field that has segmentation ambiguity is labeled with the start identifier or the single-character identifier;
the last character of the field that has segmentation ambiguity is labeled with the end identifier or the single-character identifier;
each character of the field that has segmentation ambiguity other than the first and last characters is labeled with the start identifier, the end identifier, the intermediate identifier, or the single-character identifier.
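One way to see what this labeling rule permits is to enumerate the segmentations of a field that remain consistent with it. The sketch below is illustrative only. It assumes the tag letters `B`, `M`, `E`, `S` for the start, intermediate, end, and single-character identifiers, and the conventional transition constraints between them (for example, a start tag must be followed by an intermediate or end tag). Neither the letters nor the transition table appears in the text above.

```python
from itertools import product
from typing import List

# Assumed transition table: which tag may directly follow which.
NEXT = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}

def allowed_tags(index: int, length: int) -> str:
    """Tags permitted at a position of an ambiguous field, per the rule above."""
    if length == 1:
        return "S"  # assumption for the degenerate one-character case
    if index == 0:
        return "BS"
    if index == length - 1:
        return "ES"
    return "BMES"

def consistent_segmentations(field: str) -> List[List[str]]:
    """List every segmentation whose tag sequence satisfies the rule."""
    n = len(field)
    results: List[List[str]] = []
    for tags in product(*(allowed_tags(i, n) for i in range(n))):
        # Discard tag sequences that break the transition constraints.
        if any(b not in NEXT[a] for a, b in zip(tags, tags[1:])):
            continue
        words, current = [], ""
        for ch, tag in zip(field, tags):
            current += ch
            if tag in "ES":  # an end or single-character tag closes a word
                words.append(current)
                current = ""
        results.append(words)
    return results
```

Under these assumptions, a field of n characters keeps all 2^(n-1) of its possible segmentations open. The rule fixes only that the field starts and ends on a word boundary, and the trained model is left to choose among the remaining options.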
In one embodiment, the processor may specifically be configured to determine the fields in the text to be processed that have segmentation ambiguity by means of dictionary-based forward maximum matching.
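Dictionary-based forward maximum matching can be sketched as below. The function name, the `max_len` parameter, and the toy dictionary in the usage note are assumptions for illustration. The sketch shows only the matching itself, since this passage does not specify how the ambiguous fields are derived from the match result.

```python
from typing import List, Set

def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

With the toy dictionary `{"研究", "研究生", "生命", "起源"}`, the input `"研究生命起源"` is split into `["研究生", "命", "起源"]`. The greedy choice of `研究生` overlaps the dictionary word `生命`, which is the kind of overlap that makes a field ambiguous.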
In one embodiment, the processor may also be configured to determine, before performing word segmentation on the text to be processed, whether the text contains a separator entered by the user, and, if it does, to label the two characters immediately before and after the separator with segmentation position identifiers.
In one embodiment, the above separator may include, but is not limited to, at least one of the following: a space, an underscore, a hyphen, a comma, and a semicolon.
In one embodiment, the processor may specifically be configured to label the two characters before and after the separator with segmentation position identifiers as follows:
the first character after the separator is labeled with the start identifier or the single-character identifier;
the first character before the separator is labeled with the end identifier or the single-character identifier.
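The separator rule can be sketched as follows. Several details are assumptions rather than statements from the text: the tag letters `B`, `E`, `S` for the start, end, and single-character identifiers, the ASCII separator set, the removal of separators from the returned text, and the choice to intersect the constraints when a character sits between two separators.

```python
from typing import Dict, List, Set, Tuple

# Assumed separator characters: space, underscore, hyphen, comma, semicolon.
SEPARATORS = {" ", "_", "-", ",", ";"}

def label_around_separators(text: str) -> Tuple[str, Dict[int, Set[str]]]:
    """Return the text without separators and {character index: allowed tags}."""
    chars: List[str] = []
    constraints: Dict[int, Set[str]] = {}
    boundary_pending = False
    for ch in text:
        if ch in SEPARATORS:
            if chars:
                last = len(chars) - 1
                # The character before a separator must close a word: E or S.
                previous = constraints.get(last, {"B", "M", "E", "S"})
                constraints[last] = previous & {"E", "S"}
                boundary_pending = True
            continue
        if boundary_pending:
            # The character after a separator must open a word: B or S.
            constraints[len(chars)] = {"B", "S"}
            boundary_pending = False
        chars.append(ch)
    return "".join(chars), constraints
```

For the query `"红色 连衣裙"`, this returns the cleaned text `"红色连衣裙"` with index 1 limited to `{"E", "S"}` and index 2 limited to `{"B", "S"}`. All other characters are left unconstrained for the later segmentation step.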
With the method and server for generating word segmentation training data provided by this application, after the text to be processed is obtained, if the text contains a field with segmentation ambiguity, the characters in that field are labeled with multiple segmentation position identifiers rather than being fully labeled. This solves the problem that existing approaches, which also fully label fields with segmentation ambiguity, require manual annotation. Labor cost is therefore effectively saved, and word segmentation training data is generated efficiently while the validity of the training data is preserved.
Although this application provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive effort. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual device or client product executes the method, the steps may be executed in the order shown in the embodiments or drawings, or in parallel (for example, in a parallel-processor or multithreaded environment).
The devices or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. For convenience of description, the above device is described in terms of separate modules divided by function. When this application is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. A module that realizes a certain function may also be implemented as a combination of multiple submodules or subunits.
The methods, devices, or modules described in this application may be implemented by a controller in the form of computer-readable program code, and the controller may be implemented in any appropriate manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the devices it contains for realizing various functions may be regarded as structures within the hardware component. Indeed, the devices for realizing various functions may be regarded both as software modules that implement the method and as structures within the hardware component.
Some of the modules in the device described in this application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by software plus the necessary hardware. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product, and can also be embodied in the implementation of a data migration. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. The same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. All or part of this application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Although this application has been described through embodiments, those of ordinary skill in the art will appreciate that it admits many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes.

Claims (19)

1. A method for generating word segmentation training data, characterized in that the method comprises:
performing word segmentation on text to be processed to determine the fields in the text to be processed that have segmentation ambiguity;
labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, once labeled with segmentation position identifiers, as word segmentation model training data.
2. The method according to claim 1, characterized in that the segmentation position identifier includes at least one of the following: a start identifier, an end identifier, an intermediate identifier, and a single-character identifier.
3. The method according to claim 2, characterized in that labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers comprises:
labeling the first character of the field that has segmentation ambiguity with the start identifier or the single-character identifier;
labeling the last character of the field that has segmentation ambiguity with the end identifier or the single-character identifier;
labeling each character of the field that has segmentation ambiguity other than the first and last characters with the start identifier, the end identifier, the intermediate identifier, or the single-character identifier.
4. The method according to claim 1, characterized in that, after performing word segmentation on the text to be processed to determine the fields in the text to be processed that have segmentation ambiguity, the method further comprises:
labeling each character in the fields that have no segmentation ambiguity with its corresponding segmentation position identifier.
5. The method according to claim 1, characterized in that performing word segmentation on the text to be processed to determine the fields in the text to be processed that have segmentation ambiguity comprises:
performing word segmentation on the text to be processed by dictionary-based forward maximum matching, to determine the fields in the text to be processed that have segmentation ambiguity.
6. The method according to claim 1, characterized in that, before performing word segmentation on the text to be processed, the method further comprises:
determining whether the text to be processed contains a separator entered by the user;
if it is determined that a separator entered by the user is present, labeling the two characters before and after the separator with segmentation position identifiers.
7. The method according to claim 6, characterized in that the separator includes at least one of the following: a space, an underscore, a hyphen, a comma, and a semicolon.
8. The method according to claim 7, characterized in that labeling the two characters before and after the separator with segmentation position identifiers comprises:
labeling the first character after the separator with the start identifier or the single-character identifier;
labeling the first character before the separator with the end identifier or the single-character identifier.
9. The method according to claim 1, characterized in that the text to be processed includes a search request on an e-commerce platform.
10. A method for generating word segmentation training data, characterized in that the method comprises:
labeling, based on one or more of a user query dictionary and a product dictionary, each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, once labeled with segmentation position identifiers, as word segmentation model training data.
11. A server, comprising a processor and a memory for storing instructions executable by the processor, wherein the processor implements the following steps when executing the instructions:
performing word segmentation on text to be processed to determine the fields in the text to be processed that have segmentation ambiguity;
labeling each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers;
using the text to be processed, once labeled with segmentation position identifiers, as word segmentation model training data.
12. The server according to claim 11, characterized in that the segmentation position identifier includes at least one of the following: a start identifier, an end identifier, an intermediate identifier, and a single-character identifier.
13. The server according to claim 12, characterized in that the processor is specifically configured to label each character in the fields that have segmentation ambiguity with multiple segmentation position identifiers as follows:
labeling the first character of the field that has segmentation ambiguity with the start identifier or the single-character identifier;
labeling the last character of the field that has segmentation ambiguity with the end identifier or the single-character identifier;
labeling each character of the field that has segmentation ambiguity other than the first and last characters with the start identifier, the end identifier, the intermediate identifier, or the single-character identifier.
14. The server according to claim 11, characterized in that the processor is further configured to, after performing word segmentation on the text to be processed to determine the fields in the text to be processed that have segmentation ambiguity, label each character in the fields that have no segmentation ambiguity with its corresponding segmentation position identifier.
15. The server according to claim 11, characterized in that the processor is specifically configured to determine the fields in the text to be processed that have segmentation ambiguity by dictionary-based forward maximum matching.
16. The server according to claim 11, characterized in that the processor is further configured to determine, before performing word segmentation on the text to be processed, whether the text to be processed contains a separator entered by the user, and, if it is determined that a separator entered by the user is present, to label the two characters before and after the separator with segmentation position identifiers.
17. The server according to claim 16, characterized in that the separator includes at least one of the following: a space, an underscore, a hyphen, a comma, and a semicolon.
18. The server according to claim 16, characterized in that the processor is specifically configured to label the two characters before and after the separator with segmentation position identifiers as follows:
labeling the first character after the separator with the start identifier or the single-character identifier;
labeling the first character before the separator with the end identifier or the single-character identifier.
19. A computer-readable storage medium storing computer instructions which, when executed, implement the steps of the method according to any one of claims 1 to 9.
CN201710589616.0A 2017-07-19 2017-07-19 A kind of method and server generating participle training data Pending CN109284763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710589616.0A CN109284763A (en) 2017-07-19 2017-07-19 A kind of method and server generating participle training data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710589616.0A CN109284763A (en) 2017-07-19 2017-07-19 A kind of method and server generating participle training data

Publications (1)

Publication Number Publication Date
CN109284763A true CN109284763A (en) 2019-01-29

Family

ID=65184825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710589616.0A Pending CN109284763A (en) 2017-07-19 2017-07-19 A kind of method and server generating participle training data

Country Status (1)

Country Link
CN (1) CN109284763A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457436A (en) * 2019-07-30 2019-11-15 腾讯科技(深圳)有限公司 Information labeling method, apparatus, computer readable storage medium and electronic equipment
CN110728137A (en) * 2019-10-10 2020-01-24 京东数字科技控股有限公司 Method and device for word segmentation
CN111563399A (en) * 2019-02-14 2020-08-21 阿里巴巴集团控股有限公司 Method and device for acquiring structured information of electronic medical record
CN111797626A (en) * 2019-03-21 2020-10-20 阿里巴巴集团控股有限公司 Named entity identification method and device
CN114548103A (en) * 2020-11-25 2022-05-27 马上消费金融股份有限公司 Training method of named entity recognition model and recognition method of named entity

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114282A (en) * 2007-07-12 2008-01-30 华为技术有限公司 Participle processing method and equipment
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
CN101950284A (en) * 2010-09-27 2011-01-19 北京新媒传信科技有限公司 Chinese word segmentation method and system
CN102402502A (en) * 2011-11-24 2012-04-04 北京趣拿信息技术有限公司 Word segmentation processing method and device for search engine
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN103324612A (en) * 2012-03-22 2013-09-25 北京百度网讯科技有限公司 Method and device for segmenting word
CN103778161A (en) * 2012-10-26 2014-05-07 同程网络科技股份有限公司 Word segmentation ambiguity elimination method applicable to Chinese word bank
CN103902521A (en) * 2012-12-24 2014-07-02 高德软件有限公司 Chinese statement identification method and device
CN104933023A (en) * 2015-05-12 2015-09-23 深圳市华傲数据技术有限公司 Chinese address word segmentation and annotation method
CN105138514A (en) * 2015-08-24 2015-12-09 昆明理工大学 Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction
US20160027433A1 (en) * 2014-07-24 2016-01-28 International Business Machines Corporation Method of selecting training text for language model, and method of training language model using the training text, and computer and computer program for executing the methods
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN106202039A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese portmanteau word disambiguation method based on condition random field
CN106407186A (en) * 2016-10-09 2017-02-15 新译信息科技(深圳)有限公司 Word segmentation model building method and apparatus
CN106708807A (en) * 2017-02-10 2017-05-24 深圳市空谷幽兰人工智能科技有限公司 Non-supervision word segmentation mode training method and device
CN106778887A (en) * 2016-12-27 2017-05-31 努比亚技术有限公司 The terminal and method of sentence flag sequence are determined based on condition random field

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
CN101114282A (en) * 2007-07-12 2008-01-30 华为技术有限公司 Participle processing method and equipment
CN101950284A (en) * 2010-09-27 2011-01-19 北京新媒传信科技有限公司 Chinese word segmentation method and system
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN102402502A (en) * 2011-11-24 2012-04-04 北京趣拿信息技术有限公司 Word segmentation processing method and device for search engine
CN103324612A (en) * 2012-03-22 2013-09-25 北京百度网讯科技有限公司 Method and device for segmenting word
CN103778161A (en) * 2012-10-26 2014-05-07 同程网络科技股份有限公司 Word segmentation ambiguity elimination method applicable to Chinese word bank
CN103902521A (en) * 2012-12-24 2014-07-02 高德软件有限公司 Chinese statement identification method and device
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
US20160027433A1 (en) * 2014-07-24 2016-01-28 International Business Machines Corporation Method of selecting training text for language model, and method of training language model using the training text, and computer and computer program for executing the methods
CN104933023A (en) * 2015-05-12 2015-09-23 深圳市华傲数据技术有限公司 Chinese address word segmentation and annotation method
CN105138514A (en) * 2015-08-24 2015-12-09 昆明理工大学 Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN106202039A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese portmanteau word disambiguation method based on condition random field
CN106407186A (en) * 2016-10-09 2017-02-15 新译信息科技(深圳)有限公司 Word segmentation model building method and apparatus
CN106778887A (en) * 2016-12-27 2017-05-31 努比亚技术有限公司 The terminal and method of sentence flag sequence are determined based on condition random field
CN106708807A (en) * 2017-02-10 2017-05-24 深圳市空谷幽兰人工智能科技有限公司 Non-supervision word segmentation mode training method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周祺: "基于统计与词典相结合的中文分词的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王靖等: "一种优化的用于中文分词的CRF机器学习模型", 《软件时空》 *
许高建等: "一种改进的中文分词歧义消除算法研究", 《合肥工业大学学报(自然科学版)》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563399A (en) * 2019-02-14 2020-08-21 阿里巴巴集团控股有限公司 Method and device for acquiring structured information of electronic medical record
CN111563399B (en) * 2019-02-14 2023-04-28 阿里巴巴集团控股有限公司 Method and device for obtaining structured information of electronic medical record
CN111797626A (en) * 2019-03-21 2020-10-20 阿里巴巴集团控股有限公司 Named entity identification method and device
CN110457436A (en) * 2019-07-30 2019-11-15 腾讯科技(深圳)有限公司 Information labeling method, apparatus, computer readable storage medium and electronic equipment
CN110457436B (en) * 2019-07-30 2022-12-27 腾讯科技(深圳)有限公司 Information labeling method and device, computer readable storage medium and electronic equipment
CN110728137A (en) * 2019-10-10 2020-01-24 京东数字科技控股有限公司 Method and device for word segmentation
CN114548103A (en) * 2020-11-25 2022-05-27 马上消费金融股份有限公司 Training method of named entity recognition model and recognition method of named entity
CN114548103B (en) * 2020-11-25 2024-03-29 马上消费金融股份有限公司 Named entity recognition model training method and named entity recognition method

Similar Documents

Publication Publication Date Title
CN109284763A (en) A kind of method and server generating participle training data
CN101996195B (en) Searching method and device of voice information in audio files and equipment
CN103123618B (en) Text similarity acquisition methods and device
CN105528372B (en) A kind of address search method and equipment
CN110515896B (en) Model resource management method, model file manufacturing method, device and system
CN110442710A (en) A kind of short text semantic understanding of knowledge based map and accurate matching process and device
CN102122280B (en) Method and system for intelligently extracting content object
CN104469832B (en) Mobile communications network accident analysis locating assist system
CN109408821B (en) Corpus generation method and device, computing equipment and storage medium
CN107832440B (en) Data mining method, device, server and computer readable storage medium
CN103631874B (en) UGC label classification determining method and device for social platform
CN109409248A (en) Semanteme marking method, apparatus and system based on deep semantic network
CN104504135A (en) Promotion account structure generation method and device
CN104077385A (en) Classification and retrieval method of files
CN105069063A (en) Picture searching method and apparatus
CN106815193A (en) Model training method and device and wrong word recognition methods and device
CN103473285A (en) Web information extraction method and device based on location markers
CN110457704B (en) Target field determination method and device, storage medium and electronic device
CN112380356A (en) Method, device, electronic equipment and medium for constructing catering knowledge graph
CN106933919A (en) The connection method of tables of data and device
CN103514284B (en) Data display system and data display method
CN103559177A (en) Geographical name identification method and geographical name identification device
CN103853771B (en) A kind of method for pushing and system of search result
CN105447064B (en) Electronic map data making and using method and device
CN104715040A (en) Data classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190129
