CN109829162A - Text segmentation method and device - Google Patents

Text segmentation method and device

Info

Publication number
CN109829162A
Authority
CN
China
Prior art keywords
label
character
character string
word segmentation
dictionary
Prior art date
Legal status
Granted
Application number
CN201910094380.2A
Other languages
Chinese (zh)
Other versions
CN109829162B (en)
Inventor
王李鹏
Current Assignee
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd
Priority to CN201910094380.2A
Publication of CN109829162A
Application granted
Publication of CN109829162B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a text segmentation method and device. The method comprises: converting the text to be segmented into a character sequence; matching the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determining the matched strings that match the standard words, and assigning corresponding dictionary labels to each character of the matched strings and to each character outside the matched strings, to obtain a dictionary label sequence; determining at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences; determining, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is labelled with each candidate segmentation label sequence; determining the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and segmenting the text to be segmented on the basis of the target segmentation label sequence.

Description

Text segmentation method and device
Technical field
This application relates to the field of big data technology, and in particular to a text segmentation method and device.
Background
In natural language processing, word segmentation is the basis of other language processing tasks, and the accuracy of segmentation is particularly important for them. At present, when texts are analysed and processed, texts that contain unstructured data are difficult to segment.
Taking electronic health records as an example, an electronic health record contains a large amount of unstructured data, such as the history of present illness, progress notes and case summaries. Automatic segmentation of such unstructured data is the most basic step in analysing and mining electronic health records, and it is also a very laborious task.
It can therefore be seen that a technical solution is currently needed that can quickly and accurately segment text containing unstructured data.
Summary of the invention
In view of this, the purpose of the application is to provide a text segmentation method and device that can quickly and accurately segment text containing unstructured data.
In a first aspect, the application provides a text segmentation method, comprising:
converting the text to be segmented into a character sequence;
matching the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determining the matched strings that match the standard words, and assigning corresponding dictionary labels to each character of the matched strings and to each character outside the matched strings, to obtain a dictionary label sequence;
determining at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences;
determining, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is labelled with each candidate segmentation label sequence;
determining the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and segmenting the text to be segmented on the basis of the target segmentation label sequence.
In a second aspect, the application provides a text segmentation device, comprising:
a conversion module, configured to convert the text to be segmented into a character sequence;
a first determining module, configured to match the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determine the matched strings that match the standard words, and assign corresponding dictionary labels to each character of the matched strings and to each character outside the matched strings, to obtain a dictionary label sequence;
a second determining module, configured to determine at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences;
a conditional probability prediction module, configured to determine, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is labelled with each candidate segmentation label sequence;
a word segmentation processing module, configured to determine the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and to segment the text to be segmented on the basis of the target segmentation label sequence.
In a third aspect, an embodiment of the application further provides an electronic device, comprising a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor they perform the steps of the text segmentation method of the first aspect or of any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the application further provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the steps of the text segmentation method of the first aspect or of any possible implementation of the first aspect.
The application provides a text segmentation method and device. The text to be segmented is first converted into a character sequence; the substrings of preset length in the character sequence are then matched against the standard words in a pre-built dictionary, and a dictionary label sequence is obtained from the matching result. In addition, at least one candidate segmentation label is determined for each character in the sequence, giving multiple candidate segmentation label sequences. The dictionary label sequence and the character sequence are then used as model input, and the conditional probability prediction model predicts the conditional probability that the character sequence is labelled with each candidate segmentation label sequence; the target segmentation label sequence is determined from the obtained conditional probabilities, and the text to be segmented is segmented on the basis of the target segmentation label sequence.
The above scheme combines two segmentation-prediction processes, one based on dictionary matching and one based on a conditional probability prediction model. On the one hand, the dictionary label sequence obtained through dictionary matching serves as a reference factor for the prediction of the conditional probability prediction model, so that the segmentation result finally predicted by the model is more accurate, improving the accuracy of the predicted segmentation result. On the other hand, a conditional probability prediction model is introduced that, given the character sequence and the dictionary label sequence of the text to be segmented, predicts the conditional probability of the sequence being labelled with a particular segmentation label sequence; the segmentation label sequence of the whole character sequence can thus be obtained directly, i.e. the segmentation labels of all characters in the text are obtained in a single prediction pass, which also improves the efficiency of text segmentation.
To make the above objects, features and advantages of the application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To explain the technical solutions of the embodiments of the application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the application and are therefore not to be regarded as limiting its scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a flow diagram of the text segmentation method provided by an embodiment of the application;
Fig. 2 is a flow diagram of matching the text to be segmented based on the forward maximum matching algorithm, provided by an embodiment of the application;
Fig. 3 is a flow diagram of matching the text to be segmented based on the backward (reverse) maximum matching algorithm, provided by an embodiment of the application;
Fig. 4 is a flow diagram of predicting the segmentation label sequence with which the character sequence is labelled, provided by an embodiment of the application;
Fig. 5 is a flow diagram of the training process of the conditional probability prediction model provided by an embodiment of the application;
Fig. 6 is a structural diagram of a text segmentation device provided by an embodiment of the application;
Fig. 7 is a structural diagram of an electronic device provided by an embodiment of the application.
Detailed description of embodiments
To make the purposes, technical solutions and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. The components of the embodiments, as generally described and illustrated in the drawings, can be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the claimed scope of the application, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art on the basis of the embodiments of the application without creative effort fall within the scope of protection of the application.
At present, when segmenting text containing unstructured data, a supervised segmentation method requires a sample corpus carrying manually annotated segmentation labels to be built as a training set, and a segmentation prediction model is then trained on this corpus to predict the segmentation result of a text. Because the sample corpus in the training set is large, a supervised segmentation method requires a great deal of manual labelling effort; the labour cost is high, building a reasonably comprehensive training set is difficult, and the building efficiency is low. If, instead, an unsupervised segmentation method is used to determine the segmentation result, the accuracy of the segmentation result is lower than that of a supervised method.
In view of the above problems, the application provides a text segmentation method and device. Referring to Fig. 1, which is a flow diagram of the text segmentation method provided by an embodiment of the application, the method comprises the following steps:
Step 101: convert the text to be segmented into a character sequence.
Step 102: match the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determine the matched strings that match the standard words, and assign corresponding dictionary labels to each character of the matched strings and to each character outside the matched strings, to obtain a dictionary label sequence.
Step 103: determine at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences.
Step 104: determine, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is labelled with each candidate segmentation label sequence.
Step 105: determine the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and segment the text to be segmented on the basis of the target segmentation label sequence.
Since the text to be segmented consists of multiple characters, it can be split character by character into individual characters, which are then arranged in order to form the character sequence. By converting the text to be segmented into a character sequence, the segmentation of the text can be transformed into the problem of predicting the segmentation label of each character in the sequence; because each character may take several possible segmentation labels, the text can be segmented once the target segmentation label of each character has been determined.
On this basis, the embodiments of the application propose that the target segmentation label sequence of the whole character sequence can be predicted directly with a conditional probability prediction model, so that the target segmentation labels of all characters in the sequence are obtained in a single prediction pass, which improves the efficiency of text segmentation. Moreover, to improve the accuracy of the predicted target segmentation labels, before the conditional probability prediction model predicts the target segmentation label sequence, the substrings of preset length contained in the character sequence are first matched against the standard words in the pre-built dictionary, and a dictionary label sequence is obtained from the matching result; the dictionary label sequence is subsequently fed into the conditional probability prediction model together with the character sequence as a reference factor, so that a more accurate prediction result is obtained.
The dictionary-based matching process and the prediction process based on the conditional probability prediction model are described in detail below.
Implementation process one: the dictionary-based matching process
It should be understood that the dictionary-based matching process is used both when training the conditional probability prediction model, to generate the sample dictionary label sequence of each sample character sequence in the sample set, and when predicting segmentation label sequences with the trained conditional probability prediction model, to generate the dictionary label sequence of the text to be segmented. Since the two processes are based on the same technical idea, the application focuses on the process of generating the dictionary label sequence of the text to be segmented.
In the embodiments of the application, the substrings of preset length contained in the character sequence are matched against the standard words in the pre-built dictionary, and the dictionary label sequence is determined from the matching result. The specific process of building the dictionary can follow the prior art and is not expanded upon here.
In a specific implementation, the substrings of preset length contained in the character sequence are first matched against the standard words in the pre-built dictionary to determine the matched strings that match the standard words. Corresponding dictionary labels are then assigned to each character of the matched strings in the character sequence and to each character outside the matched strings, yielding the dictionary label sequence.
For example, the matching process may use a forward maximum matching algorithm, a backward (reverse) maximum matching algorithm, or a bidirectional maximum matching algorithm. Bidirectional maximum matching can be understood as comparing the matching result obtained by the forward maximum matching algorithm with the matching result obtained by the backward maximum matching algorithm, in order to determine the better matching result.
It should be noted that a substring of preset length is a substring containing at least one character. When the forward or backward maximum matching algorithm is used for dictionary matching, the substring of preset length may be a substring that contains at least one character and whose total number of characters does not exceed the number of characters in the longest standard word in the dictionary.
Each of the above matching algorithms is explained below.
(1) Forward maximum matching algorithm.
Referring to Fig. 2, which is a flow diagram of matching the character sequence based on the forward maximum matching algorithm, the process comprises the following steps:
Step 201: take the next a characters from the character sequence, from front to back, as the candidate string to be matched.
In one example, the value of a may be the total number of characters of the longest standard word in the dictionary.
Step 202: judge whether the dictionary contains a standard word identical to the candidate string.
If the judgement result is yes, go to step 203; if the judgement result is no, go to step 204.
Step 203: determine the candidate string as a matched string that matches a standard word, then return to step 201 and take the next string of length a, until all characters in the character sequence have been traversed.
Step 204: remove the last character of the candidate string, form a new candidate string from the remaining characters and go back to step 202, until a matched string matching a standard word is found, then return to step 201 and take the next string of length a; alternatively, after all characters of the candidate string have been removed, return to step 201 and take the next string of length a.
After the character sequence has been matched according to the above process, a first matching result is obtained, which records the matched strings in the character sequence and the characters outside the matched strings. A matched string may consist of several characters or of a single character; the application does not limit this.
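As an illustration of steps 201 to 204, the following is a minimal Python sketch of forward maximum matching. It follows the "drop the last character and retry" branch of step 204, and all function and variable names are assumptions made for this sketch rather than names defined in the patent.

```python
def forward_max_match(chars, dictionary, max_len):
    """Forward maximum matching: scan left to right, at each position try the
    longest candidate substring first and shorten it until a dictionary word
    (or a single character) remains."""
    matches, i = [], 0
    while i < len(chars):
        length = min(max_len, len(chars) - i)
        while length > 1 and ''.join(chars[i:i + length]) not in dictionary:
            length -= 1                      # drop the last character and retry (step 204)
        if ''.join(chars[i:i + length]) in dictionary:
            matches.append((i, i + length))  # record the span of a matched string (step 203)
        i += length                          # move past the candidate either way
    return matches
```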
(2) Backward (reverse) maximum matching algorithm.
Referring to Fig. 3, which is a flow diagram of matching the character sequence based on the backward maximum matching algorithm, the process comprises the following steps:
Step 301: take the next a characters from the character sequence, from back to front, as the candidate string to be matched.
Here the meaning of a is the same as in the forward maximum matching algorithm above.
Step 302: judge whether the dictionary contains a standard word identical to the candidate string.
If the judgement result is yes, go to step 303; if the judgement result is no, go to step 304.
Step 303: determine the candidate string as a matched string that matches a standard word, then return to step 301 and take the next string of length a, until all characters in the character sequence have been traversed.
Step 304: remove the first character of the candidate string, form a new candidate string from the remaining characters and go back to step 302, until a matched string matching a standard word is found, then return to step 301 and take the next string of length a; alternatively, after all characters of the candidate string have been removed, return to step 301 and take the next string of length a.
After the character sequence has been matched according to the above process, a second matching result is obtained, which records the matched strings in the character sequence and the characters outside the matched strings. A matched string may consist of several characters or of a single character; the application does not limit this.
(3) Bidirectional maximum matching.
After the first matching result and the second matching result have been obtained by the matching processes of Fig. 2 and Fig. 3, the two results can be compared and the better one selected as the final matching result.
If the first matching result and the second matching result are identical, either can be chosen as the final matching result.
If the first matching result and the second matching result differ, the number of matched strings, the number of characters outside the matched strings and the number of single-character matched strings in the two results can be compared, and the better result selected as the final matching result. For example, the final matching result can be selected on the principle that more matched strings are better, fewer characters outside the matched strings are better, and fewer single-character matched strings are better, as in the sketch below.
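A corresponding sketch of backward maximum matching and of one possible bidirectional selection rule. It is an assumed simplification of the principles just listed (only the "more matched strings" criterion is applied), not the patent's exact procedure.

```python
def backward_max_match(chars, dictionary, max_len):
    """Reverse (backward) maximum matching: same idea as the forward pass,
    scanning from the end of the sequence toward the front."""
    matches, j = [], len(chars)
    while j > 0:
        length = min(max_len, j)
        while length > 1 and ''.join(chars[j - length:j]) not in dictionary:
            length -= 1                      # drop the first character and retry (step 304)
        if ''.join(chars[j - length:j]) in dictionary:
            matches.append((j - length, j))
        j -= length
    return sorted(matches)

def pick_match_result(fwd, bwd):
    """Bidirectional matching: keep either result if they agree, otherwise
    prefer the one with more matched strings (fewer unmatched characters and
    fewer single-character matches could be compared as further tie-breakers)."""
    if fwd == bwd:
        return fwd
    return max(fwd, bwd, key=len)
```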
In a possible embodiment, after the matched strings in the character sequence and the characters outside the matched strings have been determined, a dictionary label is assigned to each character in the character sequence according to the following rule, giving the dictionary label sequence composed of these dictionary labels:
for any character in the character sequence, if the character belongs to a matched string, the first dictionary label is assigned to it; if the character lies outside the matched strings, the second dictionary label is assigned to it.
In one example the first dictionary label is denoted 1 and the second dictionary label is denoted 0. In practice the two dictionary labels can of course be configured as needed, for example the first dictionary label denoted Y and the second denoted N; the application does not limit this.
For example, taking an electronic health record as the text to be segmented, suppose the record contains the sentence "双肺未闻及干湿性啰音，未闻及胸膜摩擦音。" ("no dry or moist rales heard in either lung, no pleural friction rub heard"), and that the dictionary contains "干湿性啰音" (dry and moist rales) and "摩擦音" (friction rub). Then "干湿性啰音" and "摩擦音" are determined as matched strings, and the dictionary label sequence shown in Table 1 is generated according to the above embodiment (the upper row of Table 1 is the character sequence of the text to be segmented; the first dictionary label is denoted 1 and the second dictionary label 0):
Table 1
Character sequence: 双 肺 未 闻 及 干 湿 性 啰 音 ， 未 闻 及 胸 膜 摩 擦 音 。
Dictionary label:   0  0  0  0  0  1  1  1  1  1  0  0  0  0  0  0  1  1  1  0
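A minimal sketch of how the dictionary label sequence of Table 1 could be produced from the matched spans. The 1/0 encoding follows the example above; the function name and the span representation are assumptions made for this sketch.

```python
def build_dictionary_labels(chars, matches):
    """Assign the first dictionary label (1) to every character inside a
    matched string and the second dictionary label (0) to every other
    character, yielding the dictionary label sequence."""
    labels = [0] * len(chars)
    for start, end in matches:          # (start, end) spans from dictionary matching
        for k in range(start, end):
            labels[k] = 1
    return labels

chars = list("双肺未闻及干湿性啰音")
print(build_dictionary_labels(chars, [(5, 10)]))   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```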
Implementation process two: the prediction process based on the pre-trained conditional probability prediction model.
In the embodiments of the application, before predicting the target segmentation label sequence with which the character sequence is labelled, all the segmentation label sequences with which it might be labelled are first determined.
In a possible embodiment, at least one candidate segmentation label is determined for each character in the character sequence; one segmentation label is then chosen arbitrarily from the candidate labels of each character as its target segmentation label, and the sequence formed by the chosen labels of all characters constitutes one candidate segmentation label sequence.
Each character may be labelled with at least one of the following segmentation labels: a first label for the start of a word, a second label for the middle of a word, a third label for the end of a word, and a fourth label for a single-character word. In one example, the first label (start of a word) is denoted B (Begin), the second label (middle of a word) is denoted I (Intermediate), the third label (end of a word) is denoted E (End), and the fourth label (a word formed by a single character) is denoted S (Single).
Thus for each character there are four possible segmentation labels, B, I, E and S. If the character sequence contains p characters and one of the four labels is chosen for each character as its target label, 4^p candidate segmentation label sequences can be generated, as the short sketch below illustrates.
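A tiny illustration of the 4^p combinatorics, purely for intuition; in practice the model scores sequences without materialising all of the candidates.

```python
from itertools import product

def candidate_label_sequences(p):
    """Every candidate segmentation label sequence for a p-character sequence:
    each position independently takes one of the four labels B/I/E/S."""
    return product("BIES", repeat=p)     # 4**p sequences in total

print(sum(1 for _ in candidate_label_sequences(3)))   # 64 == 4**3
```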
In the embodiments of the application, after the candidate segmentation label sequences have been determined, the conditional probability that the character sequence is labelled with each candidate segmentation label sequence is predicted on the basis of the character sequence, the dictionary label sequence and the pre-trained conditional probability prediction model.
The specific prediction process, shown in Fig. 4, is as follows:
Step 401: determine multiple feature templates according to the character sequence and/or the dictionary label sequence.
Step 402: generate at least one state function and at least one transfer function according to the determined feature templates.
Step 403: determine the value of each state function and of each transfer function under the condition that the character sequence is labelled with each candidate segmentation label sequence.
Step 404: input the values of the state functions and transfer functions corresponding to each candidate segmentation label sequence into the pre-trained conditional probability prediction model, and compute the conditional probability that the character sequence is labelled with each candidate segmentation label sequence.
To aid understanding of the prediction flow shown in Fig. 4, the feature templates determined from the character sequence and/or the dictionary label sequence are introduced first.
For example, a feature template may be at least one of the following:
a character feature template representing a single character in the character sequence;
a character feature template representing the association between different characters in the character sequence;
a dictionary feature template representing a single dictionary label in the dictionary label sequence;
a dictionary feature template representing the association between different dictionary labels in the dictionary label sequence;
a compound feature template composed of a character feature template and a dictionary feature template.
Each of the above feature templates can also be used as a unigram template (Unigram template) or a bigram template (Bigram template).
A unigram template is used to determine state functions; its style is, for example, Uk:%x[i,j], where the letter U indicates that the template is a unigram template and k is the serial number of the template. x denotes the two-dimensional sequence composed of the character sequence and the dictionary label sequence; j indicates the column: j=0 refers to the first column, i.e. the character sequence, and j=1 refers to the second column, i.e. the dictionary label sequence. i indicates the i-th position (the current position) in the character sequence or the dictionary label sequence: when j=0, x[i,0] denotes the character at the i-th position of the character sequence, and when j=1, x[i,1] denotes the dictionary label at the i-th position of the dictionary label sequence.
A bigram template is used to determine transfer functions; its style is, for example, Bk:%x[i,j], where the letter B indicates that the template is a bigram template. The other parameters have the same meanings as in the unigram template above and are not repeated here.
Continuing with the correspondence between the character sequence of the electronic health record shown in Table 1 and its dictionary label sequence, the above feature templates are illustrated as follows.
For the character sequence formed by the electronic health record "双肺未闻及干湿性啰音，未闻及胸膜摩擦音。", the feature templates that can be generated are shown in Table 2:
Table 2
U01 to U18 in Table 2 are unigram templates and B01 is a bigram template.
U01 to U05 are character feature templates representing single characters in the character sequence. For example, U01:%x[i-2,0] denotes the character at position i-2 of the character sequence, i.e. the character two positions before the current position; U03:%x[i,0] denotes the character at position i, i.e. the current character; U05:%x[i+2,0] denotes the character at position i+2, i.e. the character two positions after the current position.
U06 to U12 are character feature templates representing the association between different characters in the character sequence. For example, U06:%x[i-2,0]/%x[i-1,0] denotes the character at position i-2 of the character sequence together with the character at position i-1; U07:%x[i-1,0]/%x[i,0] denotes the character at position i-1 together with the character at position i.
U13 is a dictionary feature template representing a single dictionary label in the dictionary label sequence. For example, U13:%x[i,1] denotes the dictionary label at position i of the dictionary label sequence.
U14 is a compound feature template composed of a character feature template and a dictionary feature template. U14:%x[i,0]/%x[i,1] denotes the character at position i of the character sequence together with the dictionary label at position i of the dictionary label sequence.
U15 to U18 are dictionary feature templates representing the association between different dictionary labels in the dictionary label sequence. For example, U15:%x[i-2,1]/%x[i-1,1] denotes the dictionary label at position i-2 of the dictionary label sequence together with the dictionary label at position i-1.
B01 is a bigram template; B01 can also be regarded as a character feature template representing a single character in the character sequence, and B01:%x[i,0] denotes the character at position i of the character sequence. In practice, dictionary feature templates and compound feature templates can of course also form bigram templates; the application does not limit this.
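For orientation, the templates explicitly described above could be written in a CRF++-style template file as sketched below. Offsets are written relative to the current position, so %x[-2,0] corresponds to %x[i-2,0] in the notation of the text; the template numbers not described in the text are deliberately omitted rather than guessed.

```
# Unigram templates over the character column (j = 0)
U01:%x[-2,0]
U03:%x[0,0]
U04:%x[1,0]
U05:%x[2,0]
# Unigram templates combining neighbouring characters
U06:%x[-2,0]/%x[-1,0]
U07:%x[-1,0]/%x[0,0]
U08:%x[0,0]/%x[1,0]
# Dictionary-label column (j = 1) and compound templates
U13:%x[0,1]
U14:%x[0,0]/%x[0,1]
U15:%x[-2,1]/%x[-1,1]
U17:%x[0,1]/%x[1,1]
# Bigram template (generates transfer/transition features)
B01:%x[0,0]
```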
In the embodiments of the application, a unigram template generates state functions s(y,x,i,j), and each unigram template can generate W*p state functions, where p is the number of characters in the character sequence, which equals the number of dictionary labels in the dictionary label sequence and the number of segmentation labels in a segmentation label sequence, and W is the number of segmentation label types; in the disclosure W=4, i.e. the four segmentation labels B, E, I and S.
Continuing the example above, as shown in Table 1 the character sequence contains the 16 characters 双, 肺, 未, 闻, 及, 干, 湿, 性, 啰, 音, ，, 胸, 膜, 摩, 擦 and 。, i.e. p=16, and there are four segmentation label types B, E, I and S, i.e. W=4, so each unigram template can generate 16*4=64 state functions.
A bigram template generates transfer functions t(y,x,i,j), and each bigram template can generate W*W*p transfer functions, where p and W have the same meanings as above.
Continuing the example, as shown in Table 1, each bigram template can generate 16*4*4=256 transfer functions.
Further, after all the above feature templates have been determined, state functions can be generated from the unigram templates and transfer functions from the bigram templates, as follows.
Embodiment one:
Since a unigram template may be one or more of a character feature template, a dictionary feature template and a compound feature template, the state functions s(y,x,i,j) generated from unigram templates cover the following cases.
First, suppose the character sequence contains p characters, the dictionary label sequence contains p dictionary labels and a segmentation label sequence contains p segmentation labels, the three numbers being equal.
Case 1: if the feature template is a character feature template, the state function generated from it is

s(y, x, i, j) = k1, if x_{i±d, j=0} = m and y_i = n1; otherwise s(y, x, i, j) = k2,

where x denotes the two-dimensional sequence composed of the character sequence and the dictionary label sequence; when j=0 it refers to the character sequence in the two-dimensional sequence; x_{i±d, j=0} denotes the character at position i±d of the character sequence, with i any integer from 1 to p and d any integer from 0 to p-i; y denotes a segmentation label sequence and y_i its i-th segmentation label.
That is, s(y,x,i,j) takes the value k1 under the condition that the character at position i±d of the character sequence is m and the i-th segmentation label of the segmentation label sequence y is n1, and otherwise takes the value k2.
Case 2: if the feature template is a dictionary feature template, the state function generated from it is

s(y, x, i, j) = k1, if x_{i±d, j=1} = h and y_i = n1; otherwise s(y, x, i, j) = k2,

where, when j=1, x refers to the dictionary label sequence in the two-dimensional sequence and x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, with i any integer from 1 to p and d any integer from 0 to p-i; the other parameters have the same meanings as above.
That is, s(y,x,i,j) takes the value k1 under the condition that the dictionary label at position i±d of the dictionary label sequence is h and the i-th segmentation label of the segmentation label sequence y is n1, and otherwise takes the value k2.
Case 3: if the feature template is a compound feature template, the state function generated from it is

s(y, x, i, j) = k1, if x_{i±d, j=0} = m, x_{i±d, j=1} = h and y_i = n1; otherwise s(y, x, i, j) = k2.

That is, s(y,x,i,j) takes the value k1 under the condition that the character at position i±d of the character sequence is m, the dictionary label at position i±d of the dictionary label sequence is h and the i-th segmentation label of the segmentation label sequence y is n1, and otherwise takes the value k2.
Here k1 may, for example, take the value 1 and k2 the value 0. In practice the values of k1 and k2 can of course be configured as required; the application does not limit this.
The segmentation labels n1 and n2 may each be any one of the four labels B, I, E and S above.
For ease of understanding, the generated state functions s(y,x,i,j) are illustrated below with reference to Tables 1 and 2.
Example one: suppose the character feature template is U03:%x[i,0] and the character at position i of the character sequence is "双". Then, using U03:%x[i,0], four state functions s(y,x,i,j) are generated, one for each of the four possible segmentation labels B, I, E and S of the current position; for instance, s1 takes the value 1 if the character at position i is "双" and the i-th segmentation label is B, and 0 otherwise, and s2 to s4 are defined in the same way for the labels I, E and S.
For the four state functions s1 to s4 determined from template U03:%x[i,0], in order to determine their values for any one of the candidate segmentation label sequences of the character sequence, each character in the character sequence is traversed in turn and the values of the corresponding state functions are determined. Suppose the character currently traversed is "双": if the segmentation label of "双" in that segmentation label sequence is B, then s1 takes the value 1 and the other state functions s2 to s4 take the value 0. The values of the state functions or transfer functions generated by the other feature templates are determined in the same way and are not described one by one here.
Example two: suppose the character feature template is U04:%x[i+1,0] and the character at position i+1 of the character sequence is "肺". Then, using U04:%x[i+1,0], four state functions are generated, one for each of the labels B, I, E and S of position i, each taking the value 1 only when the character at position i+1 is "肺" and the i-th segmentation label equals the corresponding label.
Example three: suppose the character feature template is U08:%x[i,0]/%x[i+1,0], the character at position i of the character sequence is "双" and the character at position i+1 is "肺". Then, using U08:%x[i,0]/%x[i+1,0], four state functions are generated, each taking the value 1 only when the character at position i is "双", the character at position i+1 is "肺" and the i-th segmentation label equals the corresponding label.
Of course, for the other character feature templates among the unigram templates, state functions can be generated in the manner of examples one to three, which is not expanded upon here.
Example four: suppose the dictionary feature template is U13:%x[i,1] and the dictionary label at position i of the dictionary label sequence is 0. Then, using U13:%x[i,1], four state functions are generated, each taking the value 1 only when the dictionary label at position i is 0 and the i-th segmentation label equals the corresponding label.
Example five: suppose the dictionary feature template is U17:%x[i,1]/%x[i+1,1], the dictionary label at position i of the dictionary label sequence is 0 and the dictionary label at position i+1 is 0. Then, using U17:%x[i,1]/%x[i+1,1], four state functions are generated, each taking the value 1 only when the dictionary labels at positions i and i+1 are both 0 and the i-th segmentation label equals the corresponding label.
Of course, for the other dictionary feature templates among the unigram templates, state functions can be generated in the manner of examples four and five, which is not expanded upon here.
Example six: suppose the compound feature template is U14:%x[i,0]/%x[i,1], the character at position i of the character sequence is "双" and the dictionary label at position i of the dictionary label sequence is 0. Then, using U14:%x[i,0]/%x[i,1], four state functions are generated, each taking the value 1 only when the character at position i is "双", the dictionary label at position i is 0 and the i-th segmentation label equals the corresponding label.
Of course, for the other compound feature templates among the unigram templates, state functions can be generated in the manner of example six, which is not expanded upon here.
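In code, each state function generated from a unigram template is simply an indicator over one observation condition and one label. A simplified Python sketch follows (with k1=1, k2=0, and the column folded into the closure so the j argument is implicit); all names here are assumptions for illustration.

```python
def make_state_function(offset, column, value, label):
    """Returns s(y, x, i): 1 if x[i+offset][column] == value and y[i] == label,
    otherwise 0 (i.e. k1 = 1, k2 = 0)."""
    def s(y, x, i):
        pos = i + offset
        if 0 <= pos < len(x) and x[pos][column] == value and y[i] == label:
            return 1
        return 0
    return s

# The four state functions generated from U03:%x[i,0] when the current
# character is "双": one function per segmentation label.
x = [("双", 0), ("肺", 0)]            # two-dimensional sequence: (character, dictionary label)
s_funcs = [make_state_function(0, 0, "双", tag) for tag in "BIES"]
print([f(["B", "E"], x, 0) for f in s_funcs])   # [1, 0, 0, 0]
```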
Embodiment two:
A bigram template may likewise be one or more of a character feature template, a dictionary feature template and a compound feature template. The transfer functions generated from bigram templates cover the following cases.
First, suppose the character sequence contains p characters, the dictionary label sequence contains p dictionary labels and a segmentation label sequence contains p segmentation labels, the three numbers being equal.
Case 1: if the feature template is a character feature template, the transfer function generated from it is

t(y, x, i, j) = k1, if x_{i±d, j=0} = m, y_i = n1 and y_{i-1} = n2; otherwise t(y, x, i, j) = k2,

where x denotes the two-dimensional sequence composed of the character sequence and the dictionary label sequence; when j=0 it refers to the character sequence; x_{i±d, j=0} denotes the character at position i±d of the character sequence, with i any integer from 1 to p and d any integer from 0 to p-i; y denotes a segmentation label sequence, y_i its i-th segmentation label and y_{i-1} its (i-1)-th segmentation label.
That is, t(y,x,i,j) takes the value k1 under the condition that the character at position i±d of the character sequence is m, the i-th segmentation label of y is n1 and the (i-1)-th segmentation label of y is n2, and otherwise takes the value k2.
Case 2: if the feature template is a dictionary feature template, the transfer function generated from it is

t(y, x, i, j) = k1, if x_{i±d, j=1} = h, y_i = n1 and y_{i-1} = n2; otherwise t(y, x, i, j) = k2,

where, when j=1, x refers to the dictionary label sequence and x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, with i any integer from 1 to p (p being the total number of characters in the character sequence) and d any integer from 0 to p-i; y denotes a segmentation label sequence, y_i its i-th segmentation label and y_{i-1} its (i-1)-th segmentation label.
That is, t(y,x,i,j) takes the value k1 under the condition that the dictionary label at position i±d of the dictionary label sequence is h, the i-th segmentation label of y is n1 and the (i-1)-th segmentation label of y is n2, and otherwise takes the value k2.
Case 3: if the feature template is a compound feature template, the transfer function generated from it is

t(y, x, i, j) = k1, if x_{i±d, j=0} = m, x_{i±d, j=1} = h, y_i = n1 and y_{i-1} = n2; otherwise t(y, x, i, j) = k2.

That is, t(y,x,i,j) takes the value k1 under the condition that the character at position i±d of the character sequence is m, the dictionary label at position i±d of the dictionary label sequence is h, the i-th segmentation label of y is n1 and the (i-1)-th segmentation label of y is n2, and otherwise takes the value k2.
For ease of understanding, the generated transfer functions t(y,x,i,j) are illustrated below with reference to Tables 1 and 2.
Suppose the character feature template is B01:%x[i,0] and the character at position i of the character sequence is "肺". Then, using B01:%x[i,0], 16 transfer functions t(y,x,i,j) can be generated. Taking the case y_i = B, four transfer functions can be generated, one for each possible value B, I, E, S of y_{i-1}, each taking the value 1 only when the character at position i is "肺", y_i = B and y_{i-1} equals the corresponding label.
Of course, for y_i = I, y_i = E and y_i = S, four transfer functions t(y,x,i,j) can likewise be generated in each case, which is not expanded upon here.
Further, after the state functions and transfer functions have been obtained in the above manner, the value of each state function and of each transfer function can be determined under the condition that the character sequence is labelled with each candidate segmentation label sequence. The values of the state functions and transfer functions corresponding to each candidate segmentation label sequence are then input into the pre-trained conditional probability prediction model, and the conditional probability that the character sequence is labelled with each candidate segmentation label sequence is computed.
In the embodiments of the application, the pre-trained conditional probability prediction model is a conditional random field (CRF). A conditional random field can be understood as the conditional probability distribution model of one set of output random variables given another set of input random variables, under the assumption that the output random variables form a Markov random field. In the scenario of segmenting the text to be segmented, the input random variable is the two-dimensional sequence x composed of the character sequence and the dictionary label sequence, and the output is the segmentation label sequence y.
In the embodiments of the application, segmenting the text to be segmented is thus actually converted into the problem of predicting the conditional probability that the character sequence is labelled with each candidate segmentation label sequence: the larger the predicted conditional probability of a segmentation label sequence, the more likely it is to be the correct segmentation label sequence.
For example, the conditional random field is computed as

p(y|x) = (1/Z(x)) · exp( Σ_{i,k} λ_k·t_k(y, x, i, j) + Σ_{i,l} μ_l·s_l(y, x, i, j) ),

where the normalising factor is

Z(x) = Σ_y exp( Σ_{i,k} λ_k·t_k(y, x, i, j) + Σ_{i,l} μ_l·s_l(y, x, i, j) ),

the outer sum running over the candidate segmentation label sequences y. In the above formula:
p(y|x) denotes the conditional probability that the two-dimensional sequence x composed of the character sequence and the dictionary label sequence is labelled with the segmentation label sequence y;
i denotes the i-th position in the character sequence or the dictionary label sequence;
j denotes the column of the two-dimensional sequence x: j=0 refers to the character sequence in x and j=1 to the dictionary label sequence in x;
p denotes the number of characters in the character sequence, which also equals the number of dictionary labels in the dictionary label sequence and the number of segmentation labels in a segmentation label sequence;
M is the number of candidate segmentation label sequences y obtained by labelling the character sequence x;
Z(x) is the normalising factor;
s_l(y,x,i,j) denotes the l-th state function and L the total number of state functions generated from the unigram templates; a unigram template may be at least one of a character feature template, a dictionary feature template and a compound feature template, and if there are e1 unigram templates then L = e1*W*p, where W is the number of segmentation label types;
t_k(y,x,i,j) denotes the k-th transfer function and K the total number of transfer functions generated from the bigram templates; a bigram template may be at least one of a character feature template, a dictionary feature template and a compound feature template, and if there are e2 bigram templates then K = e2*W*W*p, where W has the same meaning as above;
μ_l is the first weight, the weight of the state function, and λ_k is the second weight, the weight of the transfer function. The weights λ_k and μ_l of the transfer functions and state functions are obtained by training the conditional probability prediction model; the specific solution process is explained below.
From the above conditional random field formula it can be seen that, when computing the conditional probability that the character sequence is labelled with each candidate segmentation label sequence, the value of every state function and every transfer function can be determined for the given segmentation label sequence; substituting these values into the above conditional probability prediction model then yields the conditional probability that the character sequence is labelled with that segmentation label sequence.
For example, continuing with the character sequence and dictionary label sequence of Table 1 and the feature templates of Table 2: if the unigram templates U01 to U18 of Table 2 are selected to generate the state functions s_l(y,x,i,j), the total number of state functions that can be generated is L = 18*4*16 = 1152, i.e. s_1(y,x,i,j) to s_1152(y,x,i,j). If the bigram template B01 of Table 2 is selected to generate the transfer functions t_k(y,x,i,j), the total number of transfer functions that can be generated is K = 4*4*16 = 256, i.e. t_1(y,x,i,j) to t_256(y,x,i,j).
Given the two-dimensional sequence x formed by the character sequence and dictionary label sequence of Table 1 and a particular candidate segmentation label sequence y, the value of each state function and each transfer function can be determined starting from i=1, j=0 up to i=p (p=16 in this example), j=1, and the conditional probability that the character sequence is labelled with the given segmentation label sequence can then be computed.
When evaluating any state function, its condition may be "y_i = n1 and x_{i±d,j=0} = m", "y_i = n1 and x_{i±d,j=1} = h", or "y_i = n1, x_{i±d,j=0} = m and x_{i±d,j=1} = h"; by judging whether the condition of the state function holds, the state function is assigned the value 1 if it holds and 0 otherwise.
When evaluating any transfer function, its condition may be, for example, "y_i = n1, y_{i-1} = n2 and x_{i±d,j=0} = m"; by judging whether the condition of the transfer function holds, the transfer function is assigned the value 1 if it holds and 0 otherwise.
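Put together, the probability computation amounts to weighting every feature-function value, summing, exponentiating and normalising. The brute-force sketch below follows the formula literally by enumerating all candidates, which is only practical for very short sequences; it illustrates the formula rather than how CRF implementations actually compute it, and all names are assumptions.

```python
import math

def conditional_probability(x, y, feature_funcs, candidates):
    """p(y|x) for a list of (weight, f) pairs, where f(labels, x, i) returns
    0 or 1 and stands in for the weighted s_l and t_k above."""
    def score(labels):
        return sum(w * f(labels, x, i)
                   for i in range(len(x))
                   for w, f in feature_funcs)
    z = sum(math.exp(score(list(c))) for c in candidates)   # normalising factor Z(x)
    return math.exp(score(y)) / z

# candidates would be e.g. itertools.product("BIES", repeat=len(x)),
# i.e. all 4**p candidate segmentation label sequences.
```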
It should be noted that, because the dictionary labels of the matched strings are already marked in the dictionary label sequence, in some special scenarios where the accuracy of the dictionary matching process is high, the dictionary labels of the matched strings can be treated as a segmentation result that can be relied on, and the segmentation labels of the matched strings can be derived directly from their dictionary labels. In that case it is unnecessary to assign every possible segmentation label to the characters of the matched strings in the character sequence; their segmentation labels can be configured directly from the result marked in the dictionary label sequence. This also reduces the number of conditional probabilities the conditional probability prediction model has to predict, making the segmentation prediction process more efficient.
For example, continuing with the electronic health record and dictionary label sequence of Table 1: after the record has been matched against the dictionary to obtain the dictionary label sequence, "干湿性啰音" and "摩擦音" are determined as matched strings, i.e. they can be treated as words that have already been segmented. Originally the segmentation labels of the characters of "干湿性啰音" and "摩擦音" have 4^8 possible combinations; but with the dictionary label sequence as a reference factor, when the conditional probability prediction model predicts the probability of each candidate segmentation label sequence, the labels of "干湿性啰音" can be fixed as "B(干) I(湿) I(性) I(啰) E(音)" and the labels of "摩擦音" as "B(摩) I(擦) E(音)", so these 4^8 combinations reduce to a single one and the number of candidate segmentation label sequences shrinks accordingly. Compared with computing the conditional probability of every candidate segmentation label sequence one by one, the scheme provided by the application therefore also reduces the number of conditional probabilities the prediction model has to predict, making the segmentation prediction process more efficient.
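A sketch of this shortcut, which pre-fixes the labels of characters inside matched strings; the treatment of single-character matches as S is an assumption added for completeness, and the function name is illustrative only.

```python
def fixed_labels_from_matches(n_chars, matches):
    """Pre-fix B/I/E (or S) labels for characters inside matched strings;
    None marks positions whose label is still left to the CRF prediction."""
    fixed = [None] * n_chars
    for start, end in matches:
        if end - start == 1:
            fixed[start] = "S"
        else:
            fixed[start] = "B"
            for k in range(start + 1, end - 1):
                fixed[k] = "I"
            fixed[end - 1] = "E"
    return fixed

# "干湿性啰音" spanning positions 5..9 is fixed to B I I I E:
print(fixed_labels_from_matches(10, [(5, 10)]))
# [None, None, None, None, None, 'B', 'I', 'I', 'I', 'E']
```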
After obtaining character string and being marked as the conditional probability of each participle sequence label, it can will meet default The corresponding participle sequence label of the conditional probability of condition is determined as target participle sequence label.Illustratively, by conditional probability The corresponding participle sequence label of the maximum conditional probability of numerical value is determined as target participle sequence label.Then, it is segmented based on target Sequence label treats participle text and carries out word segmentation processing.
In one example, continuing with the electronic health record shown in Table 1: after comparing the conditional probabilities corresponding to the candidate participle label sequences, the participle label sequence corresponding to the highest conditional probability is selected as the target participle label sequence, as shown in Table 3:
Table 3
Electronic health record | Dictionary label | Target participle label
double | 0 | B
lung | 0 | E
not | 0 | B
hear | 0 | E
and | 0 | S
dry | 1 | B
wet | 1 | I
property | 1 | I
rale | 1 | I
sound | 1 | E
, | 0 | S
not | 0 | B
hear | 0 | E
and | 0 | S
film | 0 | B
chest | 0 | E
rub | 1 | B
wipe | 1 | I
sound | 1 | E
. | 0 | S
After word segmentation is performed on the electronic health record based on the target participle label sequence shown in Table 3, the resulting segmentation includes: "double lungs", "not heard", "and", "dry moist rales", ",", "film chest", "fricative", ".".
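For illustration, the following sketch decodes such a target participle label sequence back into words; the Chinese characters are a reconstruction of the example from its English glosses and are only an assumption:

```python
# Minimal sketch (the Chinese characters are reconstructed from the English
# glosses and are only an assumption): decode a target participle label
# sequence of B/I/E/S labels back into words.

def decode(chars, tags):
    words, buf = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = [ch]
        else:               # "I" or "E"
            buf.append(ch)
            if tag == "E":
                words.append("".join(buf))
                buf = []
    return words

chars = list("双肺未闻及干湿性啰音，")
tags = ["B", "E", "B", "E", "S", "B", "I", "I", "I", "E", "S"]
print(decode(chars, tags))
# ['双肺', '未闻', '及', '干湿性啰音', '，']
```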
In the embodiment of the present application, when the conditional probability prediction model is used to calculate the conditional probability of the character sequence being marked as each participle label sequence, the influencing factors of the conditional probability include not only the values of the state functions and transfer functions corresponding to each participle label sequence, but also the weights λ_k and μ_l of the transfer functions and state functions. The weights λ_k and μ_l are obtained by training the conditional probability prediction model.
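This weighting matches the standard linear-chain conditional random field; assuming the patent follows that standard form (the training description below refers to a "conditional random field formula"), the conditional probability would be:

```latex
P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) \;+\; \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\left( \sum_{i,k} \lambda_k\, t_k(y'_{i-1}, y'_i, x, i) \;+\; \sum_{i,l} \mu_l\, s_l(y'_i, x, i) \right)
```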
The training process of the conditional probability prediction model of the embodiment of the present application is described below. Fig. 5 is a schematic flowchart of the training process of the conditional probability prediction model provided by the embodiment of the present application, which includes the following steps:
Step 501: obtain a sample set. The sample set includes multiple groups of samples, and each group of samples includes the sample character sequence corresponding to a sample text to be segmented, a sample dictionary label sequence and at least one sample participle label sequence.
Step 502: for each group of samples, determine, according to at least one of the sample character sequence and the sample dictionary label sequence, the value of each state function and the value of each transfer function in the case where the sample character sequence of the group is marked as each sample participle label sequence.
Step 503: input the values of the state functions and the transfer functions determined for each group of samples into the conditional probability prediction model to be trained, and determine the conditional probability function corresponding to each group of samples. The conditional probability function includes a first weight for the state functions and a second weight for the transfer functions.
Step 504: input the conditional probability function corresponding to each group of samples, as an independent variable, into a preset loss function, and determine the loss value of the preset loss function by adjusting the values of the first weight and the second weight included in the preset loss function.
Step 505: when the loss value meets a preset convergence condition, determine the first current value of the first weight and the second current value of the second weight, and determine the conditional probability prediction model obtained in the case where the first weight takes the first current value and the second weight takes the second current value.
Specifically, after the conditional probability function is input into the preset loss function as an independent variable, the parameters to be trained, λ_k and μ_l, can be given initial values and then adjusted and updated according to Newton's iterative method or gradient descent, until the loss value of the preset loss function stops updating, i.e. meets the preset convergence condition. The values of the parameters λ_k and μ_l are thereby obtained, which determines λ_k and μ_l in the conditional random field formula and yields the conditional probability prediction model.
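A simplified, brute-force sketch of such a training loop is shown below; it folds λ_k and μ_l into a single weight vector, uses numerical gradients for brevity, and normalizes only over enumerated candidate sequences, so it illustrates the procedure rather than reproducing the actual implementation:

```python
# Hedged sketch of the training loop described above: adjust the weights by
# gradient descent on a loss (here the negative conditional log-likelihood
# over enumerated candidate sequences), stopping when the loss converges.
# Real CRF training uses analytic gradients and the forward algorithm.

import math

def sequence_score(weights, feature_values):
    # feature_values: per-feature values f_k(y, x) for one candidate sequence y
    return sum(w * f for w, f in zip(weights, feature_values))

def neg_log_likelihood(weights, samples):
    loss = 0.0
    for candidates, gold_index in samples:
        scores = [sequence_score(weights, fv) for fv in candidates]
        z = sum(math.exp(s) for s in scores)
        loss -= scores[gold_index] - math.log(z)
    return loss

def train(samples, num_features, lr=0.1, epochs=200, tol=1e-6):
    weights = [0.0] * num_features          # initial values of the parameters
    prev, eps = float("inf"), 1e-5
    for _ in range(epochs):
        grad = []
        for k in range(num_features):       # numerical gradient for brevity
            w_plus, w_minus = weights[:], weights[:]
            w_plus[k] += eps
            w_minus[k] -= eps
            grad.append((neg_log_likelihood(w_plus, samples) -
                         neg_log_likelihood(w_minus, samples)) / (2 * eps))
        weights = [w - lr * g for w, g in zip(weights, grad)]
        cur = neg_log_likelihood(weights, samples)
        if abs(prev - cur) < tol:            # preset convergence condition
            break
        prev = cur
    return weights

# one toy sample: two candidate sequences with two feature values each,
# the first candidate being the gold (manually annotated) one
samples = [([[2.0, 1.0], [0.5, 0.0]], 0)]
print(train(samples, num_features=2))
```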
In the embodiment of the present application, during the training of the conditional probability prediction model, the dictionary label sequence can also serve as a reference factor for predicting the participle label sequence, so model convergence can be accelerated. That is, the conditional probability prediction model can be trained with a relatively small amount of sample corpus, so a large amount of sample corpus with manually annotated participle labels is not required, which saves labor cost and improves the efficiency of building the training set. After the conditional probability prediction model is obtained, its prediction accuracy can be tested on a test sample set; the specific test process is not expanded on here.
Based on the same inventive concept, the embodiment of the present application further provides a text word segmentation device corresponding to the text word segmentation method. Since the principle by which the device in the embodiment of the present application solves the problem is similar to that of the above text word segmentation method of the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Referring to Fig. 6, which is a schematic structural diagram of a text word segmentation device 60 provided by the embodiment of the present application, the device includes:
a conversion module 61, configured to convert a text to be segmented into a character sequence;
a first determining module 62, configured to match character strings of preset lengths included in the character sequence against the standard words in a pre-built dictionary, determine the matched character strings that match the standard words, and assign corresponding dictionary labels to each character of the matched character strings and each character other than the matched character strings in the character sequence, to obtain a dictionary label sequence;
a second determining module 63, configured to determine at least one candidate participle label corresponding to each character in the character sequence, to obtain multiple participle label sequences;
a conditional probability prediction module 64, configured to determine, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability of the character sequence being marked as each participle label sequence;
a word segmentation processing module 65, configured to determine the participle label sequence whose conditional probability meets a preset condition as the target participle label sequence, and perform word segmentation on the text to be segmented based on the target participle label sequence.
In some embodiments of the present application, when determining, according to the character sequence, the dictionary label sequence and the pre-trained conditional probability prediction model, the conditional probability of the character sequence being marked as each participle label sequence, the conditional probability prediction module 64 is specifically configured to:
determine multiple feature templates according to the character sequence and/or the dictionary label sequence;
generate at least one state function and at least one transfer function according to the determined feature templates;
determine the value of each state function and the value of each transfer function in the case where the character sequence is marked as each participle label sequence;
input the values of the state functions and the transfer functions corresponding to each participle label sequence into the pre-trained conditional probability prediction model, and calculate the conditional probability of the character sequence being marked as each participle label sequence.
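The patent does not spell out how the weighted sums are normalized in this last step; assuming the standard CRF form given earlier, a minimal sketch of turning per-candidate scores into conditional probabilities might look like this (the scores and names are hypothetical):

```python
# Hypothetical sketch of the last step above: normalize the per-candidate
# sums of weighted state and transfer function values into conditional
# probabilities. A full CRF normalizes over all possible label sequences
# (e.g. with the forward algorithm); normalizing over the enumerated
# candidates only is a simplification for illustration.

import math

def conditional_probabilities(candidate_scores):
    m = max(candidate_scores)
    exps = [math.exp(s - m) for s in candidate_scores]  # shift by max for stability
    z = sum(exps)
    return [e / z for e in exps]

scores = [3.2, 1.1, 0.4]                     # hypothetical candidate scores
probs = conditional_probabilities(scores)
target = max(range(len(probs)), key=probs.__getitem__)
print(target, probs[target])                 # index of the target participle label sequence
```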
In some embodiments of the present application, the feature templates include at least one of the following templates:
a character feature template for indicating a single character in the character sequence;
a character feature template for indicating an association relationship between different characters in the character sequence;
a dictionary feature template for indicating a single dictionary label in the dictionary label sequence;
a dictionary feature template for indicating an association relationship between different dictionary labels in the dictionary label sequence;
a compound feature template composed of a character feature template and a dictionary feature template.
In some embodiments of the present application, the character sequence includes p characters, the dictionary label sequence includes p dictionary labels, and the participle label sequence includes p participle labels;
if the feature templates include the character feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module 64 according to the character feature template is: s(y, x, i, j) = 1 if y_i = n1 and x_{i±d, j=0} = m, and 0 otherwise;
if the feature templates include the dictionary feature template, the state function s(y, x, i, j) generated according to the dictionary feature template is: s(y, x, i, j) = 1 if y_i = n1 and x_{i±d, j=1} = h, and 0 otherwise;
if the feature templates include the compound feature template, the state function s(y, x, i, j) generated according to the compound feature template is: s(y, x, i, j) = 1 if y_i = n1, x_{i±d, j=0} = m and x_{i±d, j=1} = h, and 0 otherwise;
wherein x denotes the two-dimensional sequence composed of the character sequence and the dictionary label sequence; y denotes the participle label sequence; when j = 0, x_{·, j=0} denotes the character sequence in the two-dimensional sequence; when j = 1, x_{·, j=1} denotes the dictionary label sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character sequence, x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, and d takes any integer from 0 to p−i; y_i denotes the i-th participle label of the participle label sequence y; n1 denotes a specified participle label, m denotes the character at position i±d in the character sequence, and h denotes the dictionary label at position i±d in the dictionary label sequence.
In some embodiments of the present application, the character sequence includes p characters, the dictionary label sequence includes p dictionary labels, and the participle label sequence includes p participle labels;
if the feature templates include the character feature template, the transfer function t(y, x, i, j) generated by the conditional probability prediction module 64 according to the character feature template is: t(y, x, i, j) = 1 if y_i = n1, y_{i-1} = n2 and x_{i±d, j=0} = m, and 0 otherwise;
wherein x denotes the two-dimensional sequence composed of the character sequence and the dictionary label sequence; y denotes the participle label sequence; when j = 0, x_{·, j=0} denotes the character sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character sequence, and d takes any integer from 0 to p−i; y_i denotes the i-th participle label of the participle label sequence y, and y_{i-1} denotes the (i−1)-th participle label; n1 and n2 denote specified participle labels, and m denotes the character at position i±d in the character sequence.
In some embodiments of the present application, the at least one participle label includes: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the end position of a word, and a fourth label for a single-character word;
In some embodiments of the present application, when determining at least one candidate participle label corresponding to each character in the character sequence to obtain multiple participle label sequences, the second determining module 63 is specifically configured to:
determine the at least one candidate participle label corresponding to each character in the character sequence;
arbitrarily select one participle label from the at least one candidate participle label corresponding to each character as that character's target participle label, and take the sequence composed of the target participle labels corresponding to the characters as one participle label sequence.
In some embodiments of the present application, when assigning corresponding dictionary labels to each character of the matched character strings and each character other than the matched character strings in the character sequence to obtain the dictionary label sequence, the first determining module 62 is specifically configured to:
assign a dictionary label to each character in the character sequence according to the following rule, to obtain the dictionary label sequence composed of dictionary labels:
for any character in the character sequence, if the character is a character in a matched character string, assign the first dictionary label to the character; if the character is a character other than the matched character strings, assign the second dictionary label to the character.
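A hedged sketch of this labeling rule, using a greedy longest-match against an assumed sample dictionary (the matching strategy, entries and names are illustrative; the patent only requires matching character strings of preset lengths), is given below:

```python
# Hedged sketch of the rule above (dictionary entries and helper names are
# assumptions): greedily match the longest standard word starting at each
# position and assign dictionary label 1 (the first dictionary label) to
# matched characters and 0 (the second dictionary label) to all others.

def dictionary_labels(chars, dictionary, max_len=8):
    labels = [0] * len(chars)
    i = 0
    while i < len(chars):
        matched = 0
        for length in range(min(max_len, len(chars) - i), 0, -1):
            if tuple(chars[i:i + length]) in dictionary:
                matched = length
                break
        if matched:
            labels[i:i + matched] = [1] * matched
            i += matched
        else:
            i += 1
    return labels

dictionary = {("dry", "wet", "property", "rale", "sound"), ("rub", "wipe", "sound")}
chars = ["not", "hear", "and", "dry", "wet", "property", "rale", "sound"]
print(dictionary_labels(chars, dictionary))  # [0, 0, 0, 1, 1, 1, 1, 1]
```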
In some embodiments of the present application, the device further includes:
a model training module 66, configured to train and obtain the conditional probability prediction model in the following manner:
obtain a sample set, where the sample set includes multiple groups of samples and each group of samples includes the sample character sequence corresponding to a sample text to be segmented, a sample dictionary label sequence and at least one sample participle label sequence;
for each group of samples, determine, according to at least one of the sample character sequence and the sample dictionary label sequence, the value of each state function and the value of each transfer function in the case where the sample character sequence of the group is marked as each sample participle label sequence;
input the values of the state functions and the transfer functions determined for each group of samples into the conditional probability prediction model to be trained, and determine the conditional probability function corresponding to each group of samples, where the conditional probability function includes the first weight of the state functions and the second weight of the transfer functions;
input the conditional probability function corresponding to each group of samples, as an independent variable, into a preset loss function, and determine the loss value of the preset loss function by adjusting the values of the first weight and the second weight included in the preset loss function;
when the loss value meets a preset convergence condition, determine the first current value of the first weight and the second current value of the second weight, and determine the conditional probability prediction model obtained in the case where the first weight takes the first current value and the second weight takes the second current value.
For the description of the processing flow of each module in the device and the interaction flow between the modules, reference may be made to the relevant description in the above method embodiments, which is not detailed here.
The embodiment of the present application provides an electronic device 700. Fig. 7 is a schematic structural diagram of the electronic device 700 provided by the embodiment of the present application, which includes a processor 701, a memory 702 and a bus 703. The memory 702 stores machine-readable instructions executable by the processor 701. When the electronic device runs, the processor 701 and the memory 702 communicate through the bus 703, and the processor executes the machine-readable instructions to perform the steps of the text word segmentation method proposed in the above method embodiments.
The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, the steps of the text word segmentation method proposed in the above method embodiments are performed.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the above text word segmentation method can be performed, so that text containing unstructured data can be segmented quickly and accurately.
The present application provides a text word segmentation method and device. The text to be segmented can first be converted into a character sequence; character strings of preset lengths in the character sequence can then be matched against the standard words in a pre-built dictionary, and a dictionary label sequence can be obtained based on the matching result. Multiple candidate participle label sequences can also be obtained by determining at least one candidate participle label for each character in the character sequence. Further, the dictionary label sequence and the character sequence can be used as model inputs, and the conditional probability prediction model can predict the conditional probability of the character sequence being marked as each participle label sequence. The target participle label sequence can then be determined based on the obtained conditional probabilities, and word segmentation can be performed on the text to be segmented based on the target participle label sequence.
The above approach combines two segmentation prediction processes, one based on dictionary matching and one based on the conditional probability prediction model. By combining them, on the one hand, the dictionary label sequence obtained through dictionary matching serves as a reference factor for the prediction of the conditional probability prediction model, so the segmentation result finally predicted by the conditional probability prediction model is more accurate, improving the accuracy of the predicted segmentation result. On the other hand, the conditional probability prediction model is introduced: given the character sequence and the dictionary label sequence corresponding to the text to be segmented, it predicts the conditional probability of the character sequence being marked as a certain participle label sequence, so the participle label sequence corresponding to the character sequence can be obtained directly. In this way, the participle labels of all characters in the text to be segmented can be obtained through a single prediction process, which also improves the efficiency of text segmentation.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems and devices described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A text word segmentation method, characterized by comprising:
converting a text to be segmented into a character sequence;
matching character strings of preset lengths included in the character sequence against standard words in a pre-built dictionary, determining the matched character strings that match the standard words, and assigning corresponding dictionary labels to each character of the matched character strings and each character other than the matched character strings in the character sequence, to obtain a dictionary label sequence;
determining at least one candidate participle label corresponding to each character in the character sequence, to obtain multiple participle label sequences;
determining, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability of the character sequence being marked as each participle label sequence;
determining the participle label sequence whose conditional probability meets a preset condition as the target participle label sequence, and performing word segmentation on the text to be segmented based on the target participle label sequence.
2. the method as described in claim 1, which is characterized in that described according to the character string, the dictionary sequence label And conditional probability prediction model trained in advance, determine that the character string is marked as the condition of every kind of participle sequence label Probability, comprising:
According to the character string and/or the dictionary sequence label, multiple feature templates are determined;
According to determining multiple feature templates, at least one function of state and at least one transfer function are generated;
Determine in the case where the character string is marked as every kind of participle sequence label the value of each function of state and each The value of a transfer function;
The value of value and each transfer function that every kind segments the corresponding each function of state of sequence label is input in advance In trained conditional probability prediction model, calculate separately the character string be marked as every kind participle sequence label condition it is general Rate.
3. method according to claim 2, which is characterized in that the feature templates include at least one of lower template:
For indicating the character feature template of single character in the character string;
For indicating the character feature template of the incidence relation of kinds of characters in the character string;
For indicating the dictionary feature templates of single dictionary label in the dictionary sequence label;
For indicating the dictionary feature templates of the incidence relation in the dictionary sequence label between different dictionary labels;
The compound characteristics template being made of the character feature template and the dictionary feature templates.
4. method as claimed in claim 3, which is characterized in that the character string includes p character, the dictionary label sequence Column include p dictionary label, and the participle sequence label includes p participle label;
If the feature templates include the character feature template, according to the character feature template, the function of state s of generation (y, x, i, j) are as follows:
If the feature templates include the dictionary feature templates, according to the dictionary feature templates, the function of state s of generation (y, x, i, j) are as follows:
If the feature templates include the compound characteristics template, according to the compound characteristics template, the function of state s of generation (y, x, i, j) are as follows:
Wherein, x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence; y denotes the participle label sequence; when j = 0, x_{·, j=0} denotes the character string in the two-dimensional sequence; when j = 1, x_{·, j=1} denotes the dictionary label sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, and d takes any integer from 0 to p−i; y_i denotes the i-th participle label of the participle label sequence y; n1 denotes a specified participle label, m denotes the character at position i±d in the character string, and h denotes the dictionary label at position i±d in the dictionary label sequence.
5. method as claimed in claim 3, which is characterized in that the character string includes p character, the participle label sequence Column include p participle label;
If the feature templates include the character feature template, according to the character feature template, the transfer function t of generation (y, x, i, j) are as follows:
Wherein, x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence; y denotes the participle label sequence; when j = 0, x_{·, j=0} denotes the character string in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, and d takes any integer from 0 to p−i; y_i denotes the i-th participle label of the participle label sequence y, and y_{i-1} denotes the (i−1)-th participle label of the participle label sequence y; n1 and n2 denote specified participle labels, and m denotes the character at position i±d in the character string.
6. method as claimed in claim 1 to 5, which is characterized in that at least one participle label includes: word First label of starting position, second label in the middle position of word, word end position third label and monosyllabic word 4th label of language;
It determines the corresponding at least one participle label of each character in the character string, obtains a variety of participle sequence labels, wrap It includes:
Determine the corresponding at least one participle label of each character in the character string;
Arbitrarily select a kind of participle label as target participle label from the corresponding at least one participle label of each character, and Using sequence composed by the corresponding target participle label of each character as a kind of participle sequence label.
7. method as claimed in claim 1 to 5, which is characterized in that for matched character string described in the character string Each character and each character in addition to the matched character string distribute corresponding dictionary label respectively, obtain dictionary label Sequence, comprising:
Each character in the character string is distributed to dictionary label according to following rule, obtain the word being made of dictionary label Allusion quotation sequence label:
For any one character in the character string, if the character is the character in the matched character string, for the word Symbol the first dictionary label of distribution will be character distribution second if the character is the character in addition to the matched character string Dictionary label.
8. the method as described in claim 1, which is characterized in that obtain the conditional probability according to following manner training and predict mould Type:
Sample set is obtained, includes multiple groups sample in the sample set, includes that sample text to be segmented is corresponding in every group of sample Sample character string, sample dictionary sequence label and at least one sample segment sequence label;
This group of sample is determined according at least one of the sample character string, sample dictionary sequence label for every group of sample Sample character string in this is marked as in the case where every kind of sample participle sequence label the value of each function of state and each The value of a transfer function;
The value of the value for each function of state determined by every group of sample and each transfer function is input to wait train Conditional probability prediction model in, determine the corresponding conditional probability function of every group of sample, include institute in the conditional probability function State the first weight of function of state and the second weight of the transfer function;
The corresponding conditional probability function of every group of sample determined is input in default loss function as independent variable, passes through tune The value for first weight for including in the whole default loss function and the value of second weight determine described default The penalty values of loss function;
When the penalty values meet the default condition of convergence, the first current value and second weight of first weight are determined The second current value, and determine first weight is first current value, second weight is described second current The conditional probability prediction model obtained in the case where value.
9. a kind of text segments device characterized by comprising
Conversion module, for being character string by text conversion to be segmented;
First determining module, for by the character string for meeting preset length for including in the character string and the word that in advance constructs Standard words in allusion quotation are matched, the determining and matched matched character string of the standard words, are described in the character string Each character with character string and each character in addition to the matched character string distribute corresponding dictionary label respectively, obtain Dictionary sequence label;
Second determining module obtains more for determining the corresponding at least one participle label of each character in the character string Kind participle sequence label;
Conditional probability prediction module, for according to the character string, the dictionary sequence label and condition trained in advance Probabilistic Prediction Model determines that the character string is marked as the conditional probability of every kind of participle sequence label;
Word segmentation processing module, for the corresponding participle sequence label of the conditional probability for meeting preset condition to be determined as target participle Sequence label, and word segmentation processing is carried out to the text to be segmented based on target participle sequence label.
10. device as claimed in claim 9, which is characterized in that the conditional probability prediction module, according to the character sequence Column, the dictionary sequence label and conditional probability prediction model trained in advance, determine that the character string is marked as often When the conditional probability of kind participle sequence label, it is specifically used for:
According to the character string and/or the dictionary sequence label, multiple feature templates are determined;
According to determining multiple feature templates, at least one function of state and at least one transfer function are generated;
Determine in the case where the character string is marked as every kind of participle sequence label the value of each function of state and each The value of a transfer function;
The value of value and each transfer function that every kind segments the corresponding each function of state of sequence label is input in advance In trained conditional probability prediction model, calculate separately the character string be marked as every kind participle sequence label condition it is general Rate.
11. device as claimed in claim 10, which is characterized in that the feature templates include at least one in lower template Kind:
For indicating the character feature template of single character in the character string;
For indicating the character feature template of the incidence relation of kinds of characters in the character string;
For indicating the dictionary feature templates of single dictionary label in the dictionary sequence label;
For indicating the dictionary feature templates of the incidence relation in the dictionary sequence label between different dictionary labels;
The compound characteristics template being made of the character feature template and the dictionary feature templates.
12. device as claimed in claim 11, which is characterized in that the character string includes p character, the dictionary label Sequence includes p dictionary label, and the participle sequence label includes p participle label;
If the feature templates include the character feature template, the conditional probability prediction module is according to the character feature Template, the function of state s (y, x, i, j) of generation are as follows:
If the feature templates include the dictionary feature templates, the conditional probability prediction module is according to the dictionary feature Template, the function of state s (y, x, i, j) of generation are as follows:
If the feature templates include the compound characteristics template, the conditional probability prediction module is according to the compound characteristics Template, the function of state s (y, x, i, j) of generation are as follows:
Wherein, x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence; y denotes the participle label sequence; when j = 0, x_{·, j=0} denotes the character string in the two-dimensional sequence; when j = 1, x_{·, j=1} denotes the dictionary label sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, and d takes any integer from 0 to p−i; y_i denotes the i-th participle label of the participle label sequence y; n1 denotes a specified participle label, m denotes the character at position i±d in the character string, and h denotes the dictionary label at position i±d in the dictionary label sequence.
13. device as claimed in claim 11, which is characterized in that the character string includes p character, the dictionary label Sequence includes p dictionary label, and the participle sequence label includes p participle label;
If the feature templates include the character feature template, the conditional probability prediction module is according to the character feature Template, the transfer function t (y, x, i, j) of generation are as follows:
Wherein, x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence; y denotes the participle label sequence; when j = 0, x_{·, j=0} denotes the character string in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, and d takes any integer from 0 to p−i; y_i denotes the i-th participle label of the participle label sequence y, and y_{i-1} denotes the (i−1)-th participle label of the participle label sequence y; n1 and n2 denote specified participle labels, and m denotes the character at position i±d in the character string.
14. the device as described in claim 9 to 13 is any, which is characterized in that at least one described participle label includes: word The first label of starting position, second label in middle position of word, word end position third label and individual character 4th label of word;
Second determining module, the corresponding at least one participle label of each character, obtains in determining the character string When a variety of participle sequence labels, it is specifically used for:
Determine the corresponding at least one participle label of each character in the character string;
Arbitrarily select a kind of participle label as target participle label from the corresponding at least one participle label of each character, and Using sequence composed by the corresponding target participle label of each character as a kind of participle sequence label.
15. the device as described in claim 9 to 13 is any, which is characterized in that first determining module, for the character Each character of matched character string described in sequence and each character in addition to the matched character string distribute corresponding respectively Dictionary label is specifically used for when obtaining dictionary sequence label:
Each character in the character string is distributed to dictionary label according to following rule, obtain the word being made of dictionary label Allusion quotation sequence label:
For any one character in the character string, if the character is the character in the matched character string, for the word Symbol the first dictionary label of distribution will be character distribution second if the character is the character in addition to the matched character string Dictionary label.
CN201910094380.2A 2019-01-30 2019-01-30 Text word segmentation method and device Active CN109829162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910094380.2A CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910094380.2A CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Publications (2)

Publication Number Publication Date
CN109829162A true CN109829162A (en) 2019-05-31
CN109829162B CN109829162B (en) 2022-04-08

Family

ID=66863299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910094380.2A Active CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Country Status (1)

Country Link
CN (1) CN109829162B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082909A (en) * 2007-06-28 2007-12-05 腾讯科技(深圳)有限公司 Method and system for dividing Chinese sentences for recognizing deriving word
WO2010024052A1 (en) * 2008-08-27 2010-03-04 日本電気株式会社 Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same
CN102262634A (en) * 2010-05-24 2011-11-30 北京大学深圳研究生院 Automatic questioning and answering method and system
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN102929870A (en) * 2011-08-05 2013-02-13 北京百度网讯科技有限公司 Method for establishing word segmentation model, word segmentation method and devices using methods
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103678318A (en) * 2012-08-31 2014-03-26 富士通株式会社 Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QI-YU JIANG; HONG-YI LI; JIA-FEN LIANG; QING-XIANG WANG等: ""Multi-combined Features Text Mining of TCM Medical Cases with CRF"", 《2016 8TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION (ITME)》 *
YI-FENG PAN; XINWEN HOU; CHENG-LIN LIU: ""Text Localization in Natural Scene Images Based on Conditional Random Field"", 《2009 10TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION》 *
周祺: ""基于统计与词典相结合的中文分词的研究与实现"", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688853A (en) * 2019-08-12 2020-01-14 平安科技(深圳)有限公司 Sequence labeling method and device, computer equipment and storage medium
WO2021027125A1 (en) * 2019-08-12 2021-02-18 平安科技(深圳)有限公司 Sequence labeling method and apparatus, computer device and storage medium
CN111831929A (en) * 2019-09-24 2020-10-27 北京嘀嘀无限科技发展有限公司 Method and device for acquiring POI information
CN111831929B (en) * 2019-09-24 2024-01-02 北京嘀嘀无限科技发展有限公司 Method and device for acquiring POI information
CN110795938B (en) * 2019-11-11 2023-11-10 北京小米智能科技有限公司 Text sequence word segmentation method, device and storage medium
CN110795938A (en) * 2019-11-11 2020-02-14 北京小米智能科技有限公司 Text sequence word segmentation method, device and storage medium
CN111026282B (en) * 2019-11-27 2023-05-23 上海明品医学数据科技有限公司 Control method for judging whether medical data labeling is carried out in input process
CN111026282A (en) * 2019-11-27 2020-04-17 上海明品医学数据科技有限公司 Control method for judging whether to label medical data in input process
CN111695355A (en) * 2020-05-26 2020-09-22 平安银行股份有限公司 Address text recognition method, device, medium and electronic equipment
CN111695355B (en) * 2020-05-26 2024-05-14 平安银行股份有限公司 Address text recognition method and device, medium and electronic equipment
CN112101021A (en) * 2020-09-03 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for realizing standard word mapping
CN112464667A (en) * 2020-11-18 2021-03-09 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112464667B (en) * 2020-11-18 2021-11-16 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112861531A (en) * 2021-03-22 2021-05-28 北京小米移动软件有限公司 Word segmentation method, word segmentation device, storage medium and electronic equipment
CN112861531B (en) * 2021-03-22 2023-11-14 北京小米移动软件有限公司 Word segmentation method, device, storage medium and electronic equipment
CN113609850A (en) * 2021-07-02 2021-11-05 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium
CN113609850B (en) * 2021-07-02 2024-05-17 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium
CN117493540A (en) * 2023-12-28 2024-02-02 荣耀终端有限公司 Text matching method, terminal device and computer readable storage medium

Also Published As

Publication number Publication date
CN109829162B (en) 2022-04-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant