CN109829162B - Text word segmentation method and device

Publication number: CN109829162B
Application number: CN201910094380.2A
Inventor: 王李鹏
Assignee: New H3C Big Data Technologies Co Ltd
Legal status: Active
Abstract

The application provides a text word segmentation method and device. The method includes: converting a text to be segmented into a character sequence; matching character strings satisfying a preset length in the character sequence against standard words in a pre-constructed dictionary, determining the character strings that match standard words, and assigning a corresponding dictionary label to each character of the matched character strings and to each character outside them, to obtain a dictionary label sequence; determining at least one word segmentation label for each character in the character sequence, to obtain multiple candidate word segmentation label sequences; determining the conditional probability that the character sequence is labeled with each word segmentation label sequence, based on the character sequence, the dictionary label sequence, and a pre-trained conditional probability prediction model; and determining the word segmentation label sequence whose conditional probability satisfies a preset condition as the target word segmentation label sequence, then segmenting the text based on that target sequence.

Description

Text word segmentation method and device
Technical Field
The application relates to the technical field of big data, in particular to a text word segmentation method and device.
Background
In natural language processing, word segmentation underlies most other language processing tasks, so segmentation accuracy is critical to them. At present, segmenting text that contains unstructured data remains difficult.
Taking electronic medical records as an example: because they contain a great deal of unstructured data, such as medical history entries and case summaries, automatic word segmentation of this data is both the most basic step in analyzing and mining electronic medical records and a very difficult task.
Therefore, a technical solution is needed for rapidly and accurately segmenting text that contains unstructured data.
Disclosure of Invention
In view of the above, an object of the present application is to provide a text word segmentation method and device, which can rapidly and accurately segment a text containing unstructured data.
In a first aspect, the present application provides a text word segmentation method, including:
converting the text to be segmented into a character sequence;
matching character strings meeting preset length contained in the character sequence with standard words in a pre-constructed dictionary, determining matched character strings matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
determining the conditional probability that the character sequence is labeled with each word segmentation label sequence, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model;
determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence, and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
In a second aspect, the present application provides a text word segmentation apparatus, including:
the conversion module is used for converting the text to be segmented into a character sequence;
the first determining module is used for matching the character strings which are contained in the character sequence and meet the preset length with standard words in a dictionary which is constructed in advance, determining matched character strings which are matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
the second determining module is used for determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
the conditional probability prediction module is used for determining the conditional probability that the character sequence is labeled with each word segmentation tag sequence, according to the character sequence, the dictionary tag sequence and a pre-trained conditional probability prediction model;
and the word segmentation processing module is used for determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, this application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the text segmentation method in the first aspect or any one of the possible implementations of the first aspect.
The application provides a text word segmentation method and device. First, the text to be segmented is converted into a character sequence. Character strings satisfying a preset length in the character sequence are then matched against standard words in a pre-constructed dictionary, and a dictionary label sequence is obtained from the matching result; multiple candidate word segmentation label sequences are obtained by determining at least one word segmentation label for each character in the character sequence. The dictionary label sequence and the character sequence are then fed to the conditional probability prediction model, which predicts the conditional probability that the character sequence is labeled with each candidate word segmentation label sequence; the target word segmentation label sequence is determined from these probabilities, and the text is segmented based on it.
The method combines two segmentation prediction processes: dictionary matching and prediction by the conditional probability prediction model. On one hand, the dictionary label sequence obtained by dictionary matching serves as a reference factor for the conditional probability prediction model, which improves the accuracy of the predicted segmentation result. On the other hand, given the character sequence and dictionary label sequence of the text to be segmented, the conditional probability prediction model predicts the conditional probability that the character sequence is labeled with a given word segmentation label sequence, so the word segmentation labels of all characters are obtained in a single prediction pass, which improves segmentation efficiency.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating a text word segmentation method provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a process of matching a to-be-segmented text based on a forward maximum matching algorithm according to an embodiment of the present application;
fig. 3 is a schematic flowchart illustrating a process of matching a to-be-segmented text based on an inverse maximum matching algorithm according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating a word segmentation tag sequence for predicting that a character sequence is marked according to an embodiment of the present application;
FIG. 5 is a flow chart illustrating a training process of a conditional probability prediction model provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram illustrating a text word segmentation apparatus according to an embodiment of the present application;
fig. 7 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
At present, when a supervised-learning word segmentation method is applied to text containing unstructured data, a training set of sample corpora with manually annotated segmentation labels must first be constructed, and a segmentation prediction model is then trained on that set. Because the training set must be large, supervised methods consume substantial manpower for labeling, the labor cost is high, and building a comprehensive training set is difficult and inefficient. Unsupervised word segmentation methods avoid the labeling cost, but their segmentation accuracy is lower than that of supervised methods.
In order to solve the problems, the application provides a text word segmentation method and a text word segmentation device. Referring to fig. 1, a schematic flow chart of a text word segmentation method provided in an embodiment of the present application includes the following steps:
step 101, converting a text to be segmented into a character sequence.
Step 102, matching character strings satisfying the preset length in the character sequence with standard words in a pre-constructed dictionary, determining the matched character strings, and assigning corresponding dictionary labels to each character of the matched character strings and to each character outside them, to obtain a dictionary label sequence.
Step 103, determining at least one word segmentation label corresponding to each character in the character sequence, to obtain multiple word segmentation label sequences.
Step 104, determining the conditional probability that the character sequence is labeled with each word segmentation label sequence, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model.
Step 105, determining the word segmentation label sequence whose conditional probability satisfies the preset condition as the target word segmentation label sequence, and segmenting the text to be segmented based on the target word segmentation label sequence.
The text to be segmented consists of multiple characters, so it can be split into individual characters, which are then arranged in order to form a character sequence. By converting the text to be segmented into a character sequence, the segmentation task becomes the task of predicting the word segmentation label corresponding to each character in the sequence.
Based on this, the embodiment of the application proposes that a conditional probability prediction model can directly predict the target word segmentation label sequence of the whole character sequence, so that the target labels of all characters are obtained in one prediction pass, improving segmentation efficiency. In addition, to improve prediction accuracy, before the target word segmentation label sequence is predicted, the character strings satisfying the preset length in the character sequence are matched against the standard words in the pre-constructed dictionary and a dictionary label sequence is derived from the matching result; this dictionary label sequence is then input to the conditional probability prediction model together with the character sequence as a reference factor, yielding a more accurate prediction.
Next, a dictionary-based matching process and a prediction process based on a conditional probability prediction model will be described in detail.
Implementation process I: dictionary-based matching
It should be understood that the matching process based on the dictionary can be applied to the process of training the conditional probability prediction model to generate a sample dictionary tag sequence corresponding to the sample character sequence in the sample set, or can be applied to the process of predicting the word segmentation tag sequence based on the conditional probability prediction model obtained by training to generate a dictionary tag sequence corresponding to the text to be segmented. The two processes are based on the same technical concept, so the process of generating the dictionary label sequence corresponding to the text to be segmented is emphasized in the application.
In the embodiment of the application, the character strings which are included in the character sequence and meet the preset length can be matched with the standard words in the pre-constructed dictionary, and the dictionary label sequence is determined based on the matching result. The specific process of constructing the dictionary may refer to the prior art, and is not described in this application.
In specific implementation, a character string which meets a preset length and is included in the character sequence may be first matched with a standard word in a dictionary established in advance, and a matching character string matched with the standard word may be determined. And then, corresponding dictionary labels can be respectively allocated to each character of the matched character string in the character sequence and each character except the matched character string to obtain a dictionary label sequence.
Illustratively, the matching process may employ a forward maximum matching algorithm, an inverse maximum matching algorithm, or a two-way maximum matching algorithm. The bidirectional maximum matching algorithm may be understood as a process of determining a correct word segmentation result by comparing a word segmentation result obtained by a forward maximum matching algorithm with a word segmentation result obtained by a reverse maximum matching algorithm.
It should be noted that the character string satisfying the preset length may be a character string including at least one character. If the dictionary matching is performed by applying the forward maximum matching algorithm or the reverse maximum matching algorithm, the character strings satisfying the preset length may be: a character string which at least contains one character and the total number of the contained characters does not exceed the total number of the characters of the longest standard word in the dictionary.
The following describes the matching algorithm:
(1) Forward maximum matching algorithm.
Referring to fig. 2, a schematic flowchart of matching a character sequence based on a forward maximum matching algorithm is shown, which includes the following steps:
step 201, a characters are sequentially taken from the front to the back in the character sequence as character strings to be matched.
In one example, a may be taken as the total number of characters of the longest standard word in the dictionary.
Step 202, judging whether a standard word same as the character string to be matched exists in the dictionary.
If yes, go to step 203; if the determination result is negative, step 204 is executed.
Step 203, determine the character string to be matched as a matched character string, then return to step 201 and take the next character string of length a, until all characters in the character sequence have been traversed.
Step 204, remove the last character of the character string to be matched, form a new character string to be matched from the remaining characters, and execute step 202 again; once a matched character string is found, return to step 201 and take the next character string of length a. Alternatively, if all characters of the character string to be matched have been removed without finding a match, return to step 201 and take the next character string of length a.
After matching the character sequences based on the matching process, a first matching result can be obtained, and the first matching result records the matching character strings in the character sequences and characters except the matching character strings. The matching character string may be composed of a plurality of characters, or may be composed of a single character, which is not limited in this application.
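The forward matching flow of Fig. 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the `(string, is_match)` output format and the treatment of an unmatched single character are our assumptions:

```python
def forward_max_match(chars, dictionary, max_len):
    """Forward maximum matching (the Fig. 2 flow): scan left to right;
    at each position try the longest candidate string first, shortening
    it one character at a time until a dictionary word matches.
    A character that matches no dictionary word is emitted by itself."""
    result = []  # list of (string, is_match) pairs
    i = 0
    while i < len(chars):
        matched = None
        for length in range(min(max_len, len(chars) - i), 0, -1):
            candidate = chars[i:i + length]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is not None:
            result.append((matched, True))
            i += len(matched)
        else:
            result.append((chars[i], False))
            i += 1
    return result
```

Here `max_len` plays the role of a, the character count of the longest standard word in the dictionary.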
(2) Reverse maximum matching algorithm.
Referring to fig. 3, a schematic flowchart of matching a character sequence based on an inverse maximum matching algorithm is shown, which includes the following steps:
and 301, sequentially taking a characters from the back to the front from the character sequence as a character string to be matched.
Wherein the meaning of a is the same as that described in the forward maximum matching algorithm.
Step 302, judging whether a standard word same as the character string to be matched exists in the dictionary.
If yes, go to step 303; if the determination result is negative, go to step 304.
Step 303, determine the character string to be matched as a matched character string, then return to step 301 and take the next character string of length a, until all characters in the character sequence have been traversed.
Step 304, remove the first character of the character string to be matched, form a new character string to be matched from the remaining characters, and execute step 302 again; once a matched character string is found, return to step 301 and take the next character string of length a. Alternatively, if all characters of the character string to be matched have been removed without finding a match, return to step 301 and take the next character string of length a.
After matching the character sequence based on the matching process, a second matching result can be obtained, and the matching character string in the character sequence and the characters except the matching character string are recorded in the second matching result. The matching character string may be composed of a plurality of characters, or may be composed of a single character, which is not limited in this application.
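The reverse flow of Fig. 3 mirrors the forward one, scanning from the end of the sequence and shortening candidates from the front. Again a minimal sketch under the same assumed `(string, is_match)` representation:

```python
def backward_max_match(chars, dictionary, max_len):
    """Reverse maximum matching (the Fig. 3 flow): scan right to left;
    at each end position try the longest candidate first, dropping the
    first character at a time until a dictionary word matches."""
    result = []
    j = len(chars)
    while j > 0:
        matched = None
        for length in range(min(max_len, j), 0, -1):
            candidate = chars[j - length:j]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is not None:
            result.append((matched, True))
            j -= len(matched)
        else:
            result.append((chars[j - 1], False))
            j -= 1
    result.reverse()  # segments were collected from right to left
    return result
```

The two directions can disagree: for "abcd" with dictionary {"ab", "bcd"}, forward matching finds "ab" while reverse matching finds "bcd", which is exactly the situation the bidirectional algorithm below resolves.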
(3) Bidirectional maximum matching algorithm.
After obtaining the first matching result and the second matching result based on the matching processes shown in fig. 2 and fig. 3, the first matching result and the second matching result may be compared, and a better matching result may be selected as a final matching result.
If the first matching result and the second matching result are consistent, any one of the matching results can be selected as a final matching result.
If the first matching result is inconsistent with the second, the two results can be compared on the number of matched character strings, the number of characters outside matched strings, and the number of single-character matched strings, and the better result selected as the final one. For example, the final result may be selected on the principle that more matched character strings are better, fewer characters outside matched strings are better, and fewer single-character matched strings are better.
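The selection heuristics above can be sketched as a tie-breaking comparison. The `(string, is_match)` representation and the exact ordering of the three criteria are our reading of the description, not specified by the patent:

```python
def choose_bidirectional(first, second):
    """Choose between a forward and a reverse matching result.
    Each result is a list of (string, is_match) pairs. Preference:
    more matched strings, then fewer characters outside matched
    strings, then fewer single-character matched strings."""
    if first == second:
        return first

    def score(result):
        n_matched = sum(1 for s, m in result if m)                  # more is better
        n_outside = sum(len(s) for s, m in result if not m)         # fewer is better
        n_single = sum(1 for s, m in result if m and len(s) == 1)   # fewer is better
        return (-n_matched, n_outside, n_single)

    return min(first, second, key=score)
```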
In one possible implementation, after determining the matching character string and the characters except the matching character string in the character sequence, the dictionary tag may be assigned to each character in the character sequence according to the following rule, so as to obtain a dictionary tag sequence composed of dictionary tags:
for any character in the character sequence, if the character is a character in a matching character string, a first dictionary label is allocated to the character, and if the character is a character except the matching character string, a second dictionary label is allocated to the character.
In one example, the first dictionary tag can be represented by 1 and the second dictionary tag can be represented by 0. Of course, in practical applications, the first dictionary label and the second dictionary label may be configured according to actual requirements, for example, the first dictionary label is represented by Y, and the second dictionary label is represented by N, which is not limited in the present application.
For example, take an electronic medical record as the text to be segmented, and suppose it records a finding such as "no dry-wet rales or pleural fricatives heard in the lungs". If the dictionary contains "dry-wet rales" and "fricatives", those two strings are determined as matched character strings, and the dictionary label sequence shown in Table 1 can then be generated according to the above embodiment (the left side of Table 1 is the character sequence of the text; the first dictionary label is denoted 1 and the second dictionary label 0):
Table 1 (rendered as an image in the original): the character sequence of the example text with its assigned dictionary labels, 1 for each character inside a matched string and 0 for every other character.
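The labeling rule behind Table 1 reduces to a one-pass expansion of the matching result. The `(string, is_match)` input format is an assumed representation of the matcher's output, and 1/0 are the label symbols chosen in the example above:

```python
def dictionary_label_sequence(match_result):
    """Expand a matching result (a list of (string, is_match) pairs)
    into per-character dictionary labels: 1 for each character inside
    a matched string, 0 for every other character."""
    labels = []
    for s, is_match in match_result:
        labels.extend([1 if is_match else 0] * len(s))
    return labels
```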
Implementation process II: prediction based on a pre-trained conditional probability prediction model
In the embodiment of the present application, before predicting the target word segmentation tag sequence of the character sequence, every word segmentation tag sequence with which the character sequence could be labeled may first be determined.
In one possible implementation, at least one word segmentation label is determined for each character in the character sequence; one label is then selected from each character's candidates as that character's target label, and the sequence formed by the target labels of all characters constitutes one word segmentation label sequence.
The word segmentation labels that may be assigned to a character are: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the ending position of a word, and a fourth label for a word consisting of a single character. In one example, B (Begin) denotes the first label, I (Intermediate) the second, E (End) the third, and S (Single) the fourth.
Since each character has 4 candidate labels (B, I, E, S), taking any one of them as the target label for each character means that a character sequence of length p yields 4^p possible word segmentation label sequences.
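The 4^p candidate space can be enumerated directly, as a small illustration of the combinatorics (a sketch only; the patent does not require materializing all sequences):

```python
from itertools import product

TAGS = ("B", "I", "E", "S")  # Begin / Intermediate / End / Single

def candidate_tag_sequences(p):
    """Enumerate all 4**p candidate word segmentation label sequences
    for a character sequence of length p."""
    return list(product(TAGS, repeat=p))
```

Exhaustive enumeration grows exponentially in p; in practice, conditional-probability models of this kind are decoded with dynamic programming (Viterbi-style search) rather than by scoring every sequence.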
In the embodiment of the application, after the multiple word segmentation tag sequences are determined, the conditional probability that the character sequence is labeled with each of them can be predicted based on the character sequence, the dictionary tag sequence and the pre-trained conditional probability prediction model.
The specific prediction process is shown in fig. 4:
step 401, determining a plurality of feature templates according to the character sequence and/or the dictionary tag sequence.
Step 402, generating at least one state function and at least one transfer function according to the plurality of determined feature templates.
Step 403, determining the value of each state function and each transfer function when the character sequence is labeled with each word segmentation tag sequence.
Step 404, inputting the values of the state functions and transfer functions corresponding to each word segmentation tag sequence into the pre-trained conditional probability prediction model, and calculating the conditional probability that the character sequence is labeled with each word segmentation tag sequence.
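Steps 403-404 follow the usual conditional-random-field recipe: each candidate sequence gets a score, the weighted sum of its state and transfer function values, and scores are normalized into conditional probabilities. A minimal sketch, with `score_fn` standing in for the model's learned weighted sum (an assumption, not the patent's trained model):

```python
import math

def conditional_probabilities(candidates, score_fn):
    """CRF-style normalization: P(y|x) = exp(score(y)) / Z, where
    score(y) is the weighted sum of state- and transfer-function
    values for labeling the character sequence with y, and Z sums
    exp(score) over all candidate label sequences."""
    scores = {y: score_fn(y) for y in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}
```

The sequence whose probability satisfies the preset condition (e.g. the maximum) is then taken as the target word segmentation tag sequence of step 105.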
To facilitate an understanding of the prediction flow illustrated in fig. 4, a plurality of feature templates determined from a character sequence and/or a dictionary tag sequence will first be described.
Illustratively, the feature templates may include at least one of the following:
a character feature template representing an individual character in the character sequence;
a character feature template representing the association between different characters in the character sequence;
a dictionary feature template representing an individual dictionary tag in the dictionary tag sequence;
a dictionary feature template representing the association between different dictionary tags in the dictionary tag sequence;
a composite feature template composed of a character feature template and a dictionary feature template.
Each of the above three kinds of feature templates (character, dictionary, and composite) can be used as a unary template (Unigram template) or a binary template (Bigram template).
Wherein, the unary template can be used to determine the state function, the template format is Uk:% x [ i, j ], wherein the letter U indicates that the template is the unary template; k represents a serial number of the template; x represents a two-dimensional sequence consisting of a sequence of characters and a sequence of dictionary tags; in the present disclosure, j denotes a position of a column, and indicates a first column when j is 0, the first column refers to a character sequence in the two-dimensional sequence, and indicates a second column when j is 1, the second column refers to a dictionary tag sequence in the two-dimensional sequence; in the present disclosure, i represents the i-th position in the character sequence or the dictionary tag sequence, that is, the current position, and when j is 0, x [ i,0] represents the character at the i-th position in the character sequence in the two-dimensional sequence, and when j is 1, x [ i,1] represents the dictionary tag at the i-th position in the dictionary tag sequence in the two-dimensional sequence.
A binary template may be used to determine a transfer function. Its format is, for example, Bk:%x[i,j], where the letter B indicates that the template is a binary template; the other parameters are as described for the unary template and are not repeated here.
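The `%x[i,j]` macro notation above matches the feature-template syntax used by CRF toolkits such as CRF++. As an illustration only (the helper name and padding token are hypothetical, not part of the patent), a template can be expanded against the two-dimensional sequence x like this:

```python
# Expand a CRF++-style template such as "U03:%x[0,0]" or
# "U14:%x[0,0]/%x[0,1]" at a given position of a two-dimensional
# sequence x, where column 0 holds characters and column 1 holds
# dictionary tags. Offsets in the template are relative to the
# current position i; out-of-range positions yield a padding token.
import re

def expand_template(template, x, i):
    """Return the feature string produced by `template` at position i."""
    name, spec = template.split(":", 1)
    parts = []
    for off, col in re.findall(r"%x\[(-?\d+),(\d+)\]", spec):
        pos = i + int(off)
        val = x[pos][int(col)] if 0 <= pos < len(x) else "_PAD_"
        parts.append(val)
    return name + "=" + "/".join(parts)

x = [("double", "0"), ("lung", "0"), ("not", "0")]
print(expand_template("U03:%x[0,0]", x, 1))          # U03=lung
print(expand_template("U14:%x[0,0]/%x[0,1]", x, 1))  # U14=lung/0
print(expand_template("U04:%x[1,0]", x, 1))          # U04=not
```

Each expanded string is one concrete feature; the state and transfer functions described below are indicators over such features paired with segmentation tags.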
Illustratively, the above feature templates are described below using the correspondence between the character sequence of the electronic medical record and the corresponding dictionary tag sequence shown in table 1.
For the character sequence formed from the electronic medical record, the feature templates that can be generated are shown in table 2:
TABLE 2
[Table 2 is rendered as images in the original publication; it lists the unary feature templates U01-U18 and the binary template B01, individual entries of which are quoted below.]
Among them, U01 to U18 shown in table 2 are unary templates, and B01 is a binary template.
U01-U05 are character feature templates for representing a single character in the character sequence. For example, U01:%x[i-2,0] represents the character at the (i-2)-th position in the character sequence, i.e., the character two positions before the current position; U03:%x[i,0] represents the character at the i-th position, i.e., the character at the current position; U05:%x[i+2,0] represents the character at the (i+2)-th position, i.e., the character two positions after the current position.
U06-U12 are character feature templates used to represent associations of different characters in the character sequence. For example, U06:%x[i-2,0]/%x[i-1,0] denotes the pair formed by the characters at the (i-2)-th and (i-1)-th positions in the character sequence; U07:%x[i-1,0]/%x[i,0] denotes the pair formed by the characters at the (i-1)-th and i-th positions.
U13 is a dictionary feature template for representing a single dictionary tag in the dictionary tag sequence. For example, U13:%x[i,1] represents the dictionary tag at the i-th position in the dictionary tag sequence.
U14 is a composite feature template composed of a character feature template and a dictionary feature template. U14:%x[i,0]/%x[i,1] represents the character at the i-th position in the character sequence together with the dictionary tag at the i-th position in the dictionary tag sequence.
U15-U18 are dictionary feature templates used to represent associations between different dictionary tags in the dictionary tag sequence. For example, U15:%x[i-2,1]/%x[i-1,1] represents the dictionary tags at the (i-2)-th and (i-1)-th positions in the dictionary tag sequence.
B01 is a binary template; in terms of content, B01 is also a character feature template for representing a single character in the character sequence, and B01:%x[i,0] represents the character at the i-th position of the character sequence. Of course, in practical applications, dictionary feature templates and composite feature templates may also be used as binary templates, which is not limited in this application.
In the embodiment of the present application, the unary templates generate state functions s(y, x, i, j), and each unary template can generate W × p state functions, where p denotes the number of characters contained in the character sequence (which equals the number of dictionary tags contained in the dictionary tag sequence and the number of segmentation tags contained in the segmentation tag sequence), and W denotes the number of segmentation tag types. In the present disclosure W = 4, i.e., the four segmentation tags "B", "E", "I" and "S".
Following the above example, as shown in table 1, the character sequence includes the 16 characters "double", "lung", "not", "smell", "and", "dry", "wet", "property", "rale", "sound", "，", "membrane", "chest", "rub", "rub", "。", i.e., p = 16; the segmentation tag types are the 4 tags "B", "E", "I" and "S", i.e., W = 4. It follows that each unary template can generate 16 × 4 = 64 state functions.
The binary templates generate transfer functions t(y, x, i, j), and each binary template can generate W² × p transfer functions, where p and W have the same meanings as above; the extra factor of W arises because a transfer function also conditions on the segmentation tag at the previous position.
Continuing with the above example, as shown in table 1, it follows that each binary template can generate 16 × 4² = 256 transfer functions.
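The counts above follow directly from the stated assumptions (p characters, W tag types) and can be checked in a few lines:

```python
# Number of functions generated per template, following the text:
# each unary template yields W * p state functions (one per position
# and segmentation tag), and each binary template yields W^2 * p
# transfer functions (one per position and pair of previous/current
# segmentation tags).
p = 16  # characters in the example sequence
W = 4   # segmentation tag types: B, E, I, S

state_funcs_per_unary = W * p
transfer_funcs_per_binary = W * W * p
print(state_funcs_per_unary)       # 64
print(transfer_funcs_per_binary)   # 256
```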
Further, after determining the various feature templates, a state function may be generated based on a unary template, and a transfer function may also be generated based on a binary template, where a specific embodiment is as follows:
the first embodiment,
Since the unary template may be one or more of a character feature template, a dictionary feature template, and a composite feature template, the state function s (y, x, i, j) generated based on the unary template includes the following cases:
first, assume that the character sequence includes p characters, the dictionary tag sequence includes p dictionary tags, and the word segmentation tag sequence includes p word segmentation tags, which are equal to each other.
Case 1: if the feature template comprises a character feature template, the state function s(y, x, i, j) generated according to the character feature template is:

s(y, x, i, j) = k1, if x_(i±d, j=0) = m and y_i = n1
s(y, x, i, j) = k2, otherwise

wherein x represents the two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; j = 0 indicates the character sequence in the two-dimensional sequence; x_(i±d, j=0) denotes the character at the (i±d)-th position of the character sequence, where i is any integer from 1 to p and d is any integer from 0 to p-i; y represents a segmentation tag sequence; y_i denotes the i-th segmentation tag of the segmentation tag sequence y.
That is, s(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m and the i-th segmentation tag of the segmentation tag sequence y is n1, and takes the value k2 otherwise.
Case 2: if the feature template comprises a dictionary feature template, the state function s(y, x, i, j) generated according to the dictionary feature template is:

s(y, x, i, j) = k1, if x_(i±d, j=1) = h and y_i = n1
s(y, x, i, j) = k2, otherwise

wherein j = 1 indicates the dictionary tag sequence in the two-dimensional sequence; x_(i±d, j=1) denotes the dictionary tag at the (i±d)-th position of the dictionary tag sequence, where i is any integer from 1 to p and d is any integer from 0 to p-i; the other parameters have the same meanings as above.
That is, s(y, x, i, j) takes the value k1 when the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h and the i-th segmentation tag of the segmentation tag sequence y is n1, and takes the value k2 otherwise.
Case 3: if the feature template comprises a composite feature template, the state function s(y, x, i, j) generated according to the composite feature template is:

s(y, x, i, j) = k1, if x_(i±d, j=0) = m, x_(i±d, j=1) = h and y_i = n1
s(y, x, i, j) = k2, otherwise

That is, s(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m, the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h, and the i-th segmentation tag of the segmentation tag sequence y is n1, and takes the value k2 otherwise.
Here k1 may, for example, take the value 1 and k2 may take the value 0. Of course, in practical applications, the values of k1 and k2 may be configured according to the actual situation, which is not limited in this application.
The segmentation tags n1 and n2 may each be any one of the four segmentation tags B, I, E and S.
For ease of understanding, the generated state function s (y, x, i, j) is illustrated below with reference to the contents of tables 1 and 2.
Example one: assume the character feature template is U03:%x[i,0] and the character at the i-th position in the character sequence is "double". Then the state functions s(y, x, i, j) generated by U03:%x[i,0] cover the following four cases (in the embodiment of the present disclosure, there are four candidate segmentation tags for the character at each position, namely B, I, E and S):

s1(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = B; 0 otherwise
s2(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = I; 0 otherwise
s3(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = E; 0 otherwise
s4(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = S; 0 otherwise
For the four state functions s1 to s4 determined by template U03:%x[i,0], and for any one of the plurality of segmentation tag sequences determined from the character sequence, the values of s1 to s4 corresponding to that segmentation tag sequence are determined by traversing each character in the character sequence in turn and determining the value of each state function at each character. Suppose the currently traversed character is "double": if the segmentation tag corresponding to "double" in the segmentation tag sequence is "B", then s1 takes the value 1 and the other state functions s2 to s4 take the value 0. The same procedure applies to determining the values of the state functions or transfer functions generated by the other feature templates, and is not repeated for each of them.
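The traversal just described can be sketched in a few lines. This is an illustration under assumed helper names, not the patent's implementation: the four state functions generated by U03 for the character "double" are indicators over (character, tag) pairs, evaluated at each position of a candidate tag sequence.

```python
# Sketch: the four state functions that template U03:%x[i,0] generates
# for the character "double", one per segmentation tag. Each is an
# indicator: 1 when the character at the current position is "double"
# AND the segmentation tag at that position matches, else 0.
def make_state_fn(char, tag):
    def s(y, x, i):
        return 1 if x[i][0] == char and y[i] == tag else 0
    return s

s1, s2, s3, s4 = (make_state_fn("double", t) for t in "BIES")

x = [("double", "0"), ("lung", "0")]  # (character, dictionary tag)
y = ["B", "E"]                        # a candidate segmentation tag sequence

print([s(y, x, 0) for s in (s1, s2, s3, s4)])  # [1, 0, 0, 0]
print([s(y, x, 1) for s in (s1, s2, s3, s4)])  # [0, 0, 0, 0]
```

At position 0 the character is "double" and its tag is "B", so only s1 fires; at position 1 the character differs, so all four functions are 0.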
Example two: assume the character feature template is U04:%x[i+1,0] and the character at the (i+1)-th position in the character sequence is "lung". Then the state functions s(y, x, i, j) generated using U04:%x[i+1,0] cover the following four cases:

s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = B; 0 otherwise
s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = I; 0 otherwise
s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = E; 0 otherwise
s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = S; 0 otherwise
example three, assuming that the character feature template is U08% x [ i,0 ]/% x [ i +1,0], the character at the ith position in the character sequence points to the character "double", and the character at the ith +1 position in the character sequence points to the character "lung", then the state function s (y, x, i, j) generated using U08% x [ i,0 ]/% x [ i +1,0] is the following four cases:
Figure BDA0001964143650000175
Figure BDA0001964143650000176
Figure BDA0001964143650000177
Figure BDA0001964143650000178
of course, for other character feature templates in the unary template, the state function may also be generated in the manner of the above example one to example three, and the detailed description is not repeated.
Example four: assume the dictionary feature template is U13:%x[i,1] and the dictionary tag at the i-th position in the dictionary tag sequence is 0. Then the state functions s(y, x, i, j) generated using U13:%x[i,1] cover the following four cases:

s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = B; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = I; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = E; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = S; 0 otherwise
example five, assuming that the dictionary feature template is U17% x [ i,1 ]/% x [ i +1,1], the dictionary label at the ith position in the dictionary label sequence is 0, and the dictionary label at the ith +1 position is 0, then the state function s (y, x, i, j) generated using U17% x [ i,1 ]/% x [ i +1,1] is the following four cases:
Figure BDA0001964143650000185
Figure BDA0001964143650000186
Figure BDA0001964143650000187
Figure BDA0001964143650000188
of course, for other dictionary feature templates in the unary template, the state function may also be generated in the manner of the above example four to example five, and the detailed description is not repeated.
Example six: assume the composite feature template is U14:%x[i,0]/%x[i,1], the character at the i-th position in the character sequence is "double", and the dictionary tag at the i-th position in the dictionary tag sequence is 0. Then the state functions s(y, x, i, j) generated using U14:%x[i,0]/%x[i,1] cover the following four cases:

s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = B; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = I; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = E; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = S; 0 otherwise
of course, for other composite feature templates in the unary template, the state function may also be generated in the manner described in the above sixth example, and the description thereof will not be specifically provided.
Second embodiment:
The binary template may also be one or more of a character feature template, a dictionary feature template, and a composite feature template. The transfer function generated based on the binary template comprises the following conditions:
first, assume that the character sequence includes p characters, the dictionary tag sequence includes p dictionary tags, and the word segmentation tag sequence includes p word segmentation tags, which are equal to each other.
Case 1: if the feature template comprises a character feature template, the transfer function t(y, x, i, j) generated according to the character feature template is:

t(y, x, i, j) = k1, if x_(i±d, j=0) = m, y_i = n1 and y_(i-1) = n2
t(y, x, i, j) = k2, otherwise

wherein x represents the two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; j = 0 indicates the character sequence in the two-dimensional sequence; x_(i±d, j=0) denotes the character at the (i±d)-th position of the character sequence, where i is any integer from 1 to p and d is any integer from 0 to p-i; y represents a segmentation tag sequence; y_i denotes the i-th segmentation tag and y_(i-1) the (i-1)-th segmentation tag of the segmentation tag sequence y.
That is, t(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m, the i-th segmentation tag of the segmentation tag sequence y is n1 and the (i-1)-th segmentation tag is n2, and takes the value k2 otherwise.
Case 2: if the feature template comprises a dictionary feature template, the transfer function t(y, x, i, j) generated according to the dictionary feature template is:

t(y, x, i, j) = k1, if x_(i±d, j=1) = h, y_i = n1 and y_(i-1) = n2
t(y, x, i, j) = k2, otherwise

wherein j = 1 indicates the dictionary tag sequence in the two-dimensional sequence; x_(i±d, j=1) denotes the dictionary tag at the (i±d)-th position of the dictionary tag sequence, where i is any integer from 1 to p, p is the total number of characters contained in the character sequence, and d is any integer from 0 to p-i; the other parameters have the same meanings as above.
That is, t(y, x, i, j) takes the value k1 when the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h, the i-th segmentation tag of the segmentation tag sequence y is n1 and the (i-1)-th segmentation tag is n2, and takes the value k2 otherwise.
Case 3: if the feature template comprises a composite feature template, the transfer function t(y, x, i, j) generated according to the composite feature template is:

t(y, x, i, j) = k1, if x_(i±d, j=0) = m, x_(i±d, j=1) = h, y_i = n1 and y_(i-1) = n2
t(y, x, i, j) = k2, otherwise

That is, t(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m, the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h, the i-th segmentation tag of the segmentation tag sequence y is n1 and the (i-1)-th segmentation tag is n2, and takes the value k2 otherwise.
For ease of understanding, the generated transfer function t (y, x, i, j) is illustrated below with reference to the contents of tables 1 and 2.
Suppose the character feature template is B01:%x[i,0] and the character at the i-th position in the character sequence is "lung". Then B01:%x[i,0] generates transfer functions t(y, x, i, j) covering 16 cases (four choices of y_(i-1) times four choices of y_i). For the case y_i = B, the following four transfer functions can be generated:

t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = B and y_i = B; 0 otherwise
t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = I and y_i = B; 0 otherwise
t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = E and y_i = B; 0 otherwise
t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = S and y_i = B; 0 otherwise
Of course, for each of the three remaining cases y_i = I, y_i = E and y_i = S, four further transfer functions t(y, x, i, j) can likewise be generated, and a detailed description thereof is not provided.
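Such a transfer function can be sketched as an indicator over the character together with the (previous tag, current tag) pair. The helper names below are illustrative only:

```python
# Sketch: one of the 16 transfer functions generated by the binary
# template B01:%x[i,0] for the character "lung" -- an indicator over
# the previous and current segmentation tags as well as the character.
def make_transfer_fn(char, prev_tag, cur_tag):
    def t(y, x, i):
        return 1 if (i > 0 and x[i][0] == char
                     and y[i - 1] == prev_tag and y[i] == cur_tag) else 0
    return t

# y_(i-1) = "B", y_i = "E": fires when "lung" ends a word begun at the
# previous character, as in the word "double lung".
t_BE = make_transfer_fn("lung", "B", "E")

x = [("double", "0"), ("lung", "0")]
print(t_BE(["B", "E"], x, 1))  # 1
print(t_BE(["B", "I"], x, 1))  # 0
```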
Further, after the state functions and the transfer functions are obtained according to the above manner, the values of the state functions and the values of the transfer functions can be determined when the character sequence is marked as each word segmentation label sequence. And then inputting the values of the state functions and the values of the transfer functions corresponding to each participle label sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability of the character sequence marked as each participle label sequence.
In the embodiment of the present application, the pre-trained conditional probability prediction model is a Conditional Random Field (CRF). A conditional random field is understood to be a conditional probability distribution model of a set of output random variables given the set of input random variables, the assumption of the model being that the output random variables constitute a markov random field. In the scene of word segmentation of a text to be segmented, the input random variable can be a two-dimensional sequence x consisting of a character sequence and a dictionary tag sequence, and the output random variable can be a word segmentation tag sequence y.
In the embodiment of the application, segmenting the text to be segmented can in fact be converted into the problem of predicting the conditional probability of the character sequence being marked as each segmentation tag sequence: the higher the predicted conditional probability of a segmentation tag sequence, the more likely that segmentation tag sequence is correct.
Illustratively, the formula for the conditional random field is:
p(y|x) = (1/Z(x)) · exp( Σ_i Σ_l μ_l · s_l(y, x, i, j) + Σ_i Σ_k λ_k · t_k(y, x, i, j) )

wherein the normalization factor is

Z(x) = Σ_y exp( Σ_i Σ_l μ_l · s_l(y, x, i, j) + Σ_i Σ_k λ_k · t_k(y, x, i, j) )
in the above formula, p (y | x) represents a conditional probability that a two-dimensional sequence x composed of a character sequence and a dictionary tag sequence is labeled as a participle tag sequence y;
i represents the ith position in the character sequence or dictionary tag sequence;
j indexes the columns of the two-dimensional sequence x: when j is 0, it indicates the character sequence in the two-dimensional sequence x, and when j is 1, it indicates the dictionary tag sequence in the two-dimensional sequence x;
p represents the number of characters contained in the character sequence, the number of dictionary labels contained in the dictionary label sequence and the number of participle labels contained in the participle label sequence;
m is the number of word segmentation label sequences y obtained by carrying out word segmentation labeling on the character sequences x;
z (x) is a normalization factor;
s_l(y, x, i, j) represents the l-th state function, and L represents the total number of state functions generated according to the unary templates, where a unary template may comprise at least one of a character feature template, a dictionary feature template and a composite feature template; assuming the number of unary templates is e1, then L = e1 × W × p, where W is the number of segmentation tag types;
t_k(y, x, i, j) represents the k-th transfer function, and K represents the total number of transfer functions generated according to the binary templates, where a binary template may likewise comprise at least one of a character feature template, a dictionary feature template and a composite feature template; assuming the number of binary templates is e2, then K = e2 × W² × p, with W and p having the same meanings as above.
μ_l is the first weight, i.e., the weight of the state function s_l, and λ_k is the second weight, i.e., the weight of the transfer function t_k. The weights λ_k and μ_l are solved by training the conditional probability prediction model; the specific solution process will be described in detail later.
It can be known from the above calculation formula of the conditional random field that, when the conditional probability of a character sequence marked as each participle tag sequence is calculated, because the values of each state function and each transfer function can be obtained under the condition of giving the participle tag sequence, the conditional probability of the character sequence marked as the given participle tag sequence can be obtained by substituting the values of each state function and each transfer function into the conditional probability prediction model.
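The computation just described can be illustrated numerically. The feature functions and weights below are toy values chosen for the sketch, not the trained model: for each candidate tag sequence y, the weighted state and transfer function values are summed, exponentiated, and normalized by Z(x).

```python
# Toy sketch of the conditional random field computation: score each
# candidate segmentation tag sequence by the weighted sum of its state
# and transfer function values, then normalise with Z(x) to obtain
# p(y|x). Weights (mu, lam) here are illustrative constants.
import itertools
import math

def score(y, x, state_fns, transfer_fns):
    total = 0.0
    for i in range(len(x)):
        total += sum(mu * s(y, x, i) for mu, s in state_fns)
        total += sum(lam * t(y, x, i) for lam, t in transfer_fns)
    return total

x = [("double", "0"), ("lung", "0")]
# mu = 2.0 rewards tag "B" on "double"; lam = 1.5 rewards the B -> E
# transition on "lung".
state_fns = [(2.0, lambda y, x, i:
              1 if x[i][0] == "double" and y[i] == "B" else 0)]
transfer_fns = [(1.5, lambda y, x, i:
                 1 if i > 0 and x[i][0] == "lung"
                 and y[i - 1] == "B" and y[i] == "E" else 0)]

candidates = [list(y) for y in itertools.product("BIES", repeat=len(x))]
scores = {tuple(y): score(y, x, state_fns, transfer_fns) for y in candidates}
Z = sum(math.exp(v) for v in scores.values())
best = max(scores, key=scores.get)
print(best)  # ('B', 'E')
print(math.exp(scores[best]) / Z)
```

The sequence ("B", "E") fires both features and therefore receives the highest conditional probability among the 16 candidates.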
Illustratively, following the character sequence and corresponding dictionary tag sequence described above in table 1: if the unary templates U01-U18 in table 2 are used to generate the state functions s_l(y, x, i, j), the total number of state functions that can be generated is L = 18 × 4 × 16 = 1152, i.e., s_1(y, x, i, j) to s_1152(y, x, i, j). If the binary template B01 in table 2 is used to generate the transfer functions t_k(y, x, i, j), the total number of transfer functions that can be generated is K = 1 × 4² × 16 = 256, i.e., t_1(y, x, i, j) to t_256(y, x, i, j).
When a two-dimensional sequence x composed of the character sequence and the corresponding dictionary tag sequence described in table 1 and a segmentation tag sequence y are given, the value of each state function and the value of each transfer function may be determined starting from i = 1 and j = 0 until the values are determined for i = p (p = 16 in this example) and j = 1; the conditional probability of the character sequence being marked as the given segmentation tag sequence can then be found.
When evaluating any one state function, its setting condition may be "y_i = n1 and x_(i±d, j=0) = m", "y_i = n1 and x_(i±d, j=1) = h", or "y_i = n1, x_(i±d, j=0) = m and x_(i±d, j=1) = h", depending on the feature template that generated it. If the setting condition is satisfied, the state function is determined to take the value 1; if not, it takes the value 0.
When evaluating any transfer function, its setting condition may be, for example, "y_i = n1, y_(i-1) = n2 and x_(i±d, j=0) = m". If the setting condition is satisfied, the transfer function is determined to take the value 1; if not, it takes the value 0.
It should be noted that the dictionary tags corresponding to a matching character string are already marked in the dictionary tag sequence. In some specific scenarios, when the dictionary-matching process is sufficiently accurate, the dictionary tags of a matching character string can be treated as equivalent to a segmentation result, and the segmentation tags of the matching character string can be derived directly from its dictionary tags. In that case, there is no need to try every candidate segmentation tag for each character of the matching character string; instead, the segmentation tags of the matching character string can be configured directly based on the result marked in the dictionary tag sequence. This saves invocations of the conditional probability prediction model and makes the segmentation prediction process more efficient.
For example, continuing with the electronic medical record and the corresponding dictionary tag sequence shown in table 1: after matching the electronic medical record against the dictionary to obtain the dictionary tag sequence, it can be determined that "dry-wet rale" and "fricative" are matching character strings, i.e., they can be taken as already-segmented words. The segmentation tags originally corresponding to "dry-wet rale" and "fricative" have 4^8 possibilities. In this scheme, with the dictionary tag sequence as a reference factor, when predicting the conditional probability of the electronic medical record being marked as each segmentation tag sequence, the segmentation tags corresponding to "dry-wet rale" can be fixed as B-I-I-I-E and those corresponding to "fricative" as B-I-E, so that the candidate segmentation tag sequences can be reduced by 3^8, i.e., 3^8 fewer possible segmentation tag sequences. Compared with calculating the conditional probabilities one by one for all segmentation tag sequences, this saves invocations of the conditional probability prediction model and makes the segmentation prediction process more efficient.
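The size of this saving can be sketched numerically. The counts below are assumptions taken from the running example (p = 16 characters per the text; the two matched strings cover 5 + 3 = 8 characters, each of which would otherwise admit 4 candidate tags):

```python
# Sketch: candidate-space reduction when dictionary-matched spans are
# assigned fixed segmentation tags. With W = 4 tags per character,
# n free characters contribute 4**n candidate tag combinations; fixing
# the 8 characters of "dry-wet rale" (5) and "fricative" (3) removes
# that entire factor from the search.
W = 4
total_chars = 16        # p in the example sequence (per the text)
fixed_chars = 5 + 3     # characters covered by the matched strings

unconstrained = W ** total_chars
constrained = W ** (total_chars - fixed_chars)
print(unconstrained // constrained)  # 65536, i.e. a factor of 4**8
```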
After the conditional probability of the character sequence being marked as each segmentation tag sequence is obtained, the segmentation tag sequence corresponding to the conditional probability satisfying the preset condition can be determined as the target segmentation tag sequence. Illustratively, the segmentation tag sequence corresponding to the maximum conditional probability is determined as the target segmentation tag sequence. Word segmentation processing is then performed on the text to be segmented based on the target segmentation tag sequence.
In one example, continuing with the electronic medical record shown in table 1: after comparing the conditional probabilities corresponding to the candidate segmentation tag sequences, the segmentation tag sequence corresponding to the highest conditional probability is selected as the target segmentation tag sequence, as shown in table 3:
TABLE 3

Electronic medical record | Dictionary tag sequence | Target word segmentation tag sequence
double                    | 0                       | B
lung                      | 0                       | E
not                       | 0                       | B
smell                     | 0                       | E
and                       | 0                       | S
dry                       | 1                       | B
wet                       | 1                       | I
property                  | 1                       | I
rale                      | 1                       | I
sound                     | 1                       | E
，                        | 0                       | S
not                       | 0                       | B
smell                     | 0                       | E
and                       | 0                       | S
membrane                  | 0                       | B
chest                     | 0                       | E
rub                       | 1                       | B
rub                       | 1                       | I
sound                     | 1                       | E
。                        | 0                       | S
After performing word segmentation processing on the electronic medical record based on the target segmentation tag sequence shown in table 3, the obtained segmentation result includes: "double lung", "not smelling", "and", "dry-wet rale", "，", "not smelling", "and", "membrane chest", "fricative", "。".
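The segmentation step itself — turning the target B/I/E/S tag sequence into words — can be sketched as a simple decoder. The example uses placeholder Latin characters rather than the Chinese characters of the medical record:

```python
# Decode a B/I/E/S segmentation tag sequence into words: "B" begins a
# multi-character word, "I" continues it, "E" ends it, and "S" marks a
# single-character word.
def decode(chars, tags):
    words, buf = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = [ch]
        elif tag == "I":
            buf.append(ch)
        else:  # "E"
            buf.append(ch)
            words.append("".join(buf))
            buf = []
    return words

print(decode(list("ABCDE"), ["B", "E", "B", "E", "S"]))
# ['AB', 'CD', 'E']
```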
In the embodiment of the application, when the conditional probability prediction model is used to calculate the conditional probability of the character sequence being marked as each segmentation tag sequence, the influencing factors of the conditional probability include the values of the state functions and transfer functions corresponding to each segmentation tag sequence, and the weights λ_k of the transfer functions and μ_l of the state functions. The weights λ_k and μ_l are solved by training the conditional probability prediction model.
Next, a training process of the conditional probability prediction model in the embodiment of the present application will be described. Referring to fig. 5, a schematic flow chart of a training process of a conditional probability prediction model provided in the embodiment of the present application is shown, including the following steps:
step 501, a sample set is obtained, wherein the sample set comprises a plurality of groups of samples, and each group of samples comprises a sample character sequence, a sample dictionary tag sequence and at least one sample word segmentation tag sequence corresponding to a sample text to be segmented.
Step 502, determining, for each group of samples, a value of each state function and a value of each transfer function when the sample character sequences in the group of samples are marked as each sample word segmentation tag sequence according to at least one of the sample character sequences and the sample dictionary tag sequences.
Step 503, inputting the values of the state functions and the values of the transfer functions determined by each group of samples into a conditional probability prediction model to be trained, and determining the conditional probability functions corresponding to each group of samples, where the conditional probability functions include the first weights of the state functions and the second weights of the transfer functions.
Step 504, inputting the determined conditional probability function corresponding to each group of samples as an independent variable into a preset loss function, and determining a loss value of the preset loss function by adjusting a value of the first weight and a value of the second weight included in the preset loss function.
And 505, when the loss value meets a preset convergence condition, determining a first current value of the first weight and a second current value of the second weight, and determining a conditional probability prediction model obtained under the condition that the first weight is the first current value and the second weight is the second current value.
Specifically, after the conditional probability function is input as an argument into the preset loss function, initial values may be assigned to the two parameters to be trained, λ_k and μ_l, and the parameters are then adjusted and updated according to Newton's iteration method or gradient descent until the loss value of the preset loss function satisfies the preset convergence condition, at which point updating stops. The resulting values of the parameters λ_k and μ_l determine the λ_k and μ_l in the conditional random field formula, thereby yielding the conditional probability prediction model.
In the embodiment of the application, during training of the conditional probability prediction model, the dictionary tag sequence can also be used as a reference factor for predicting the segmentation tag sequence, which accelerates model convergence: a relatively small amount of sample corpus suffices to train the conditional probability prediction model. This avoids the need for a large amount of sample corpus with manually annotated segmentation tags, saves labor cost, and improves the efficiency of constructing the training set. After the conditional probability prediction model is obtained, its prediction accuracy can be tested with a test sample set; the specific test process is not described here.
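Steps 501-505 amount to fitting the weights by gradient-based optimization of the (log-)likelihood. The following is a deliberately tiny sketch under assumed toy data and features (one state-function weight mu and one transfer-function weight lam, one training pair, plain gradient ascent), not the patent's training procedure:

```python
# Toy sketch of steps 501-505: fit weights mu (state function) and lam
# (transfer function) by gradient ascent on the log-likelihood
# log p(y*|x) of a single labelled sample. The gradient of each weight
# is the observed feature count minus the model-expected feature count.
import itertools
import math

x = [("double", "0"), ("lung", "0")]
y_true = ("B", "E")  # the annotated segmentation tag sequence

s = lambda y, i: 1 if x[i][0] == "double" and y[i] == "B" else 0
t = lambda y, i: 1 if i > 0 and y[i - 1] == "B" and y[i] == "E" else 0

def feats(y):
    return (sum(s(y, i) for i in range(len(x))),
            sum(t(y, i) for i in range(len(x))))

cands = list(itertools.product("BIES", repeat=len(x)))
mu, lam, lr = 0.0, 0.0, 0.5
for step in range(200):
    exp_scores = {y: math.exp(mu * feats(y)[0] + lam * feats(y)[1])
                  for y in cands}
    Z = sum(exp_scores.values())
    exp_s = sum(w * feats(y)[0] for y, w in exp_scores.items()) / Z
    exp_t = sum(w * feats(y)[1] for y, w in exp_scores.items()) / Z
    mu += lr * (feats(y_true)[0] - exp_s)
    lam += lr * (feats(y_true)[1] - exp_t)

best = max(cands, key=lambda y: mu * feats(y)[0] + lam * feats(y)[1])
print(best)  # ('B', 'E')
```

After training, both weights are positive and the annotated sequence receives the highest score, mirroring how the trained conditional random field favors tag sequences consistent with the training corpus.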
Based on the same application concept, a text word segmentation device corresponding to the text word segmentation method is further provided in the embodiment of the present application, and because the principle of solving the problem of the device in the embodiment of the present application is similar to that of the text word segmentation method in the embodiment of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 6, a schematic structural diagram of a text word segmentation apparatus 60 provided in the embodiment of the present application includes:
the conversion module 61 is used for converting the text to be segmented into a character sequence;
a first determining module 62, configured to match a character string that meets a preset length and is included in the character sequence with a standard word in a dictionary that is constructed in advance, determine a matching character string that matches the standard word, and assign a corresponding dictionary tag to each character of the matching character string and each character except the matching character string in the character sequence, respectively, so as to obtain a dictionary tag sequence;
a second determining module 63, configured to determine at least one word segmentation tag corresponding to each character in the character sequence, so as to obtain multiple word segmentation tag sequences;
a conditional probability prediction module 64, configured to determine a conditional probability that the character sequence is marked as each word segmentation tag sequence according to the character sequence, the dictionary tag sequence, and a pre-trained conditional probability prediction model;
and a word segmentation processing module 65, configured to determine a word segmentation tag sequence corresponding to the conditional probability meeting a preset condition as a target word segmentation tag sequence, and perform word segmentation processing on the text to be word segmented based on the target word segmentation tag sequence.
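The job of the word segmentation processing module 65 can be sketched as follows: take the tag sequence with the highest conditional probability as the target sequence, then cut the text at word boundaries. The tag names B/M/E/S (word begin, middle, end, single character) are assumed shorthand for the first to fourth word segmentation labels described in this application.

```python
def segment(text, probabilities):
    """Pick the target word segmentation tag sequence by maximum conditional
    probability, then split the text wherever a word ends."""
    target = max(probabilities, key=probabilities.get)
    words, start = [], 0
    for i, tag in enumerate(target):
        if tag in ("E", "S"):       # third label (word end) or fourth (single)
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):           # tolerate a sequence ending mid-word
        words.append(text[start:])
    return words
```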
In some embodiments of the present application, when determining the conditional probability that the character sequence is marked as each word segmentation tag sequence according to the character sequence, the dictionary tag sequence and a pre-trained conditional probability prediction model, the conditional probability prediction module 64 is specifically configured to:
determining a plurality of feature templates according to the character sequence and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the plurality of determined feature templates;
determining the values of each state function and each transfer function under the condition that the character sequence is marked as each word segmentation tag sequence;
and inputting the values of the state functions and the values of the transfer functions corresponding to each word segmentation tag sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability that the character sequence is marked as each word segmentation tag sequence.
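The calculation in the steps above can be sketched by brute force: score every candidate word segmentation tag sequence with a weighted sum of feature-function values and normalise the scores into probabilities. The feature functions and weights below are illustrative placeholders, and a practical conditional-random-field implementation would use dynamic programming (forward-backward) rather than enumerating all sequences.

```python
import math
from itertools import product

def sequence_probabilities(x, label_sets, features, weights):
    """Return {y: P(y | x)} over all candidate tag sequences, where each
    feature function f(y, x, i) is a state or transfer function value."""
    candidates = [tuple(y) for y in product(*label_sets)]
    scores = []
    for y in candidates:
        # Weighted sum of feature-function values over all positions.
        score = sum(w * sum(f(y, x, i) for i in range(len(y)))
                    for f, w in zip(features, weights))
        scores.append(score)
    z = sum(math.exp(s) for s in scores)  # normalisation term Z(x)
    return {y: math.exp(s) / z for y, s in zip(candidates, scores)}
```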
In some embodiments of the present application, the feature template comprises at least one of:
a character feature template for representing individual characters in the sequence of characters;
the character feature template is used for representing the incidence relation of different characters in the character sequence;
a dictionary feature template for representing a single dictionary tag in the sequence of dictionary tags;
a dictionary feature template for representing an associative relationship between different dictionary tags in the sequence of dictionary tags;
and the composite characteristic template consists of the character characteristic template and the dictionary characteristic template.
In some embodiments of the present application, the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template includes the character feature template, the conditional probability prediction module 64 generates a state function s (y, x, i, j) according to the character feature template, where s (y, x, i, j) is:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=0} = m; s(y, x, i, j) = 0, otherwise.
if the feature template includes the dictionary feature template, the conditional probability prediction module 64 generates a state function s (y, x, i, j) according to the dictionary feature template, where s (y, x, i, j) is:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
if the feature template includes the composite feature template, the conditional probability prediction module 64 generates a state function s (y, x, i, j) according to the composite feature template, where s (y, x, i, j) is:
s(y, x, i, j) = 1, if y_i = n_1, x_{i±d, j=0} = m and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; when j = 1, x refers to the dictionary tag sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, and x_{i±d, j=1} represents the dictionary label at the (i±d)-th position of the dictionary tag sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, m represents a given character at the (i±d)-th position in the character sequence, and h represents a given dictionary label at the (i±d)-th position in the dictionary tag sequence.
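The three kinds of state functions can be mirrored in code. Below, x is modelled as a pair of rows (row 0 the character sequence, row 1 the dictionary tag sequence), and instantiating a template fixes the constants n_1, d and the expected value; a composite template additionally checks the second row. For brevity only the +d offset of the i±d pair is shown, and all names are illustrative.

```python
def make_state_fn(n1, d, j, value, j2=None, value2=None):
    """Return s(y, x, i): 1 when y[i] equals n1 and the (i+d)-th element of
    row j of x equals value; a composite template also requires the (i+d)-th
    element of row j2 to equal value2. Otherwise 0."""
    def s(y, x, i):
        if y[i] != n1 or x[j][i + d] != value:
            return 0
        if j2 is not None and x[j2][i + d] != value2:
            return 0  # composite condition on the other row failed
        return 1
    return s
```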
In some embodiments of the present application, the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template includes the character feature template, the conditional probability prediction module 64 generates a transfer function t (y, x, i, j) according to the character feature template, where t (y, x, i, j) is:
t(y, x, i, j) = 1, if y_i = n_1, y_{i−1} = n_2 and x_{i±d, j=0} = m; t(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y, and y_{i−1} represents the (i−1)-th word segmentation label; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, n_2 represents a given value of the (i−1)-th word segmentation label, and m represents a given character at the (i±d)-th position in the character sequence.
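Correspondingly, a transfer function generated from a character feature template can be sketched as a closure over the given labels n_1, n_2 and character m (again with only the +d offset shown; row 0 of x is the character sequence, and all names are illustrative).

```python
def make_transfer_fn(n1, n2, d, m):
    """Return t(y, x, i): 1 when y[i] == n1, y[i-1] == n2 and the character
    at position i+d of the character sequence equals m; otherwise 0."""
    def t(y, x, i):
        # i >= 1 because a transfer function looks at the previous label.
        return 1 if (i >= 1 and y[i] == n1 and y[i - 1] == n2
                     and x[0][i + d] == m) else 0
    return t
```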
In some embodiments of the present application, the at least one word segmentation tag comprises: the first label of the starting position of the word, the second label of the middle position of the word, the third label of the ending position of the word and the fourth label of the single word;
in some embodiments of the present application, when determining at least one word segmentation tag corresponding to each character in the character sequence to obtain multiple word segmentation tag sequences, the second determining module 63 is specifically configured to:
determining at least one word segmentation label corresponding to each character in the character sequence;
and randomly selecting one word segmentation label from at least one word segmentation label corresponding to each character as a target word segmentation label, and taking a sequence formed by the target word segmentation labels corresponding to the characters as a word segmentation label sequence.
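To illustrate the four word segmentation labels, the sketch below encodes a known segmentation as a per-character label sequence, using the assumed tag names B (first label, word start), M (second, word middle), E (third, word end) and S (fourth, single character):

```python
def words_to_labels(words):
    """Map a list of segmented words to per-character B/M/E/S labels."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")      # single-character word
        else:
            # First char B, last char E, any chars between are M.
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels
```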
In some embodiments of the present application, when the first determining module 62 assigns a corresponding dictionary tag to each character of the matching character string and each character except the matching character string in the character sequence, respectively, to obtain a dictionary tag sequence, specifically:
allocating a dictionary label to each character in the character sequence according to the following rules to obtain a dictionary label sequence consisting of dictionary labels:
and aiming at any character in the character sequence, if the character is the character in the matched character string, a first dictionary label is allocated to the character, and if the character is the character except for the matched character string, a second dictionary label is allocated to the character.
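The dictionary-labelling rule above can be sketched with forward maximum matching. The tag values "D" (first dictionary label, for characters inside a matched string) and "O" (second dictionary label, for characters outside), the maximum match length, and the matching strategy are illustrative assumptions; the patent only requires that matched and unmatched characters receive different labels.

```python
def dictionary_label_sequence(chars, dictionary, max_len=4):
    """Assign each character the first ("D") or second ("O") dictionary label."""
    labels = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        # Try the longest candidate string starting at position i first.
        for length in range(min(max_len, len(chars) - i), 1, -1):
            if "".join(chars[i:i + length]) in dictionary:
                labels[i:i + length] = ["D"] * length
                i += length
                break
        else:
            i += 1  # no matched string starts here
    return labels
```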
In some embodiments of the present application, the apparatus further comprises:
a model training module 66, configured to train the conditional probability prediction model according to the following manners:
obtaining a sample set, wherein the sample set comprises a plurality of groups of samples, and each group of samples comprises a sample character sequence, a sample dictionary label sequence and at least one sample word segmentation label sequence corresponding to a sample text to be segmented;
for each group of samples, determining values of each state function and each transfer function under the condition that the sample character sequences in the group of samples are marked as word segmentation label sequences of each sample according to at least one of the sample character sequences and the sample dictionary label sequences;
inputting the values of the state functions and the values of the transfer functions determined by each group of samples into a conditional probability prediction model to be trained, and determining the conditional probability function corresponding to each group of samples, wherein the conditional probability function comprises a first weight of the state function and a second weight of the transfer function;
inputting the determined conditional probability function corresponding to each group of samples into a preset loss function as an independent variable, and determining a loss value of the preset loss function by adjusting the value of the first weight and the value of the second weight included in the preset loss function;
and when the loss value meets a preset convergence condition, determining a first current value of the first weight and a second current value of the second weight, and determining a conditional probability prediction model obtained under the condition that the first weight is the first current value and the second weight is the second current value.
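As a sketch of how the conditional probability function can feed a preset loss function, the snippet below computes a per-sample negative log-likelihood from precomputed feature-function value sums and the first/second weights. The negative log-likelihood choice is an assumption; the patent does not fix a particular loss function.

```python
import math

def sample_nll(feature_values_per_candidate, true_index, weights):
    """feature_values_per_candidate[c][k] is the summed value of the k-th
    state/transfer function when the sample character sequence is marked
    with the c-th candidate tag sequence; true_index selects the sample's
    labelled word segmentation tag sequence."""
    scores = [sum(w * v for w, v in zip(weights, vals))
              for vals in feature_values_per_candidate]
    z = sum(math.exp(s) for s in scores)
    return -(scores[true_index] - math.log(z))  # -log P(true | sample)
```

Summing this quantity over all sample groups gives a loss whose value decreases as the weights are adjusted toward the convergence condition.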
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present application further provides an electronic device 700. As shown in fig. 7, which is a schematic structural diagram of the electronic device 700 provided in the embodiment of the present application, the electronic device includes: a processor 701, a memory 702 and a bus 703. The memory 702 stores machine-readable instructions executable by the processor 701; when the electronic device is operating, the processor 701 and the memory 702 communicate via the bus 703, and the processor executes the machine-readable instructions to perform the steps of the text word segmentation method set forth in the above method embodiments.
The present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to execute the steps of the text word segmentation method proposed in the above method embodiments.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when the computer program on the storage medium is executed, the text word segmentation method can be executed, so that the text containing the unstructured data can be rapidly and accurately segmented.
The application provides a text word segmentation method and a text word segmentation device, firstly, a text to be segmented can be converted into a character sequence, then, a character string which meets a preset length in the character sequence can be matched with a standard word in a pre-constructed dictionary, a dictionary label sequence can be obtained based on a matching result, and various word segmentation label sequences can be obtained by determining at least one word segmentation label corresponding to each character in the character sequence. Further, the dictionary tag sequence and the character sequence can be used as input of the model, the conditional probability when the character sequence is marked as each word segmentation tag sequence is predicted by using the conditional probability prediction model, then the target word segmentation tag sequence is determined based on the obtained conditional probability, and word segmentation processing is carried out on the text to be segmented based on the target word segmentation tag sequence.
The method comprises two word segmentation prediction processes: dictionary matching and prediction by the conditional probability prediction model. By combining the two, on one hand, the dictionary tag sequence obtained by dictionary matching serves as a reference factor during prediction based on the conditional probability prediction model, which improves the accuracy of the predicted word segmentation result; on the other hand, the conditional probability prediction model is introduced so that, given the character sequence and the dictionary tag sequence corresponding to the text to be segmented, the conditional probability that the character sequence is marked as a certain word segmentation tag sequence is predicted, and the word segmentation tag sequence corresponding to the character sequence is obtained directly. That is, the word segmentation tags corresponding to all characters in the text to be recognized are obtained through a single prediction process, which improves text word segmentation efficiency.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A text word segmentation method is characterized by comprising the following steps:
converting a text to be word segmented into a character sequence;
matching character strings meeting preset length contained in the character sequence with standard words in a pre-constructed dictionary, determining matched character strings matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
determining the conditional probability of the character sequence marked as each participle label sequence according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model;
determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence, and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
2. The method of claim 1, wherein determining a conditional probability that the character sequence is marked as each word segmentation tag sequence based on the character sequence, the dictionary tag sequence, and a pre-trained conditional probability prediction model comprises:
determining a plurality of feature templates according to the character sequence and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the plurality of determined feature templates;
determining the values of each state function and each transfer function under the condition that the character sequence is marked as each word segmentation tag sequence;
and inputting the values of the state functions and the values of the transfer functions corresponding to each word segmentation tag sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability that the character sequence is marked as each word segmentation tag sequence.
3. The method of claim 2, wherein the feature template comprises at least one of:
a character feature template for representing individual characters in the sequence of characters;
the character feature template is used for representing the incidence relation of different characters in the character sequence;
a dictionary feature template for representing a single dictionary tag in the sequence of dictionary tags;
a dictionary feature template for representing an associative relationship between different dictionary tags in the sequence of dictionary tags;
and the composite characteristic template consists of the character characteristic template and the dictionary characteristic template.
4. The method of claim 3, wherein the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template comprises the character feature template, generating a state function s (y, x, i, j) according to the character feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=0} = m; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the dictionary feature template, generating a state function s (y, x, i, j) according to the dictionary feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the composite feature template, generating a state function s (y, x, i, j) according to the composite feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1, x_{i±d, j=0} = m and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; when j = 1, x refers to the dictionary tag sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, and x_{i±d, j=1} represents the dictionary label at the (i±d)-th position of the dictionary tag sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, m represents a given character at the (i±d)-th position in the character sequence, and h represents a given dictionary label at the (i±d)-th position in the dictionary tag sequence.
5. The method of claim 3, wherein the sequence of characters comprises p characters and the sequence of word segmentation tags comprises p word segmentation tags;
if the feature template comprises the character feature template, generating a transfer function t (y, x, i, j) according to the character feature template, wherein the transfer function t (y, x, i, j) is as follows:
t(y, x, i, j) = 1, if y_i = n_1, y_{i−1} = n_2 and x_{i±d, j=0} = m; t(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y, and y_{i−1} represents the (i−1)-th word segmentation label; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, n_2 represents a given value of the (i−1)-th word segmentation label, and m represents a given character at the (i±d)-th position in the character sequence.
6. The method of any of claims 1 to 5, wherein the at least one word segmentation tag comprises: the first label of the starting position of the word, the second label of the middle position of the word, the third label of the ending position of the word and the fourth label of the single word;
determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences, including:
determining at least one word segmentation label corresponding to each character in the character sequence;
and randomly selecting one word segmentation label from at least one word segmentation label corresponding to each character as a target word segmentation label, and taking a sequence formed by the target word segmentation labels corresponding to the characters as a word segmentation label sequence.
7. The method according to any one of claims 1 to 5, wherein assigning a corresponding dictionary label to each character of the matching character string and each character except the matching character string in the character sequence respectively to obtain a dictionary label sequence comprises:
allocating a dictionary label to each character in the character sequence according to the following rules to obtain a dictionary label sequence consisting of dictionary labels:
and aiming at any character in the character sequence, if the character is the character in the matched character string, a first dictionary label is allocated to the character, and if the character is the character except for the matched character string, a second dictionary label is allocated to the character.
8. The method of claim 1, wherein the conditional probability prediction model is trained according to:
obtaining a sample set, wherein the sample set comprises a plurality of groups of samples, and each group of samples comprises a sample character sequence, a sample dictionary label sequence and at least one sample word segmentation label sequence corresponding to a sample text to be segmented;
for each group of samples, determining the values of each state function and each transfer function under the condition that the sample character sequences in the group of samples are marked as word segmentation label sequences of each sample according to at least one of the sample character sequences and the sample dictionary label sequences;
inputting the values of the state functions and the values of the transfer functions determined by each group of samples into a conditional probability prediction model to be trained, and determining the conditional probability function corresponding to each group of samples, wherein the conditional probability function comprises a first weight of the state function and a second weight of the transfer function;
inputting the determined conditional probability function corresponding to each group of samples into a preset loss function as an independent variable, and determining a loss value of the preset loss function by adjusting the value of the first weight and the value of the second weight included in the preset loss function;
and when the loss value meets a preset convergence condition, determining a first current value of the first weight and a second current value of the second weight, and determining a conditional probability prediction model obtained under the condition that the first weight is the first current value and the second weight is the second current value.
9. A text segmentation apparatus, comprising:
the conversion module is used for converting the text to be segmented into a character sequence;
the first determining module is used for matching the character strings which are contained in the character sequence and meet the preset length with standard words in a dictionary which is constructed in advance, determining matched character strings which are matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
the second determining module is used for determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
the conditional probability prediction module is used for determining the conditional probability that the character sequence is marked as each word segmentation tag sequence according to the character sequence, the dictionary tag sequence and a pre-trained conditional probability prediction model;
and the word segmentation processing module is used for determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
10. The apparatus of claim 9, wherein the conditional probability prediction module, when determining the conditional probability that the character sequence is marked as each word segmentation tag sequence based on the character sequence, the dictionary tag sequence, and a pre-trained conditional probability prediction model, is specifically configured to:
determining a plurality of feature templates according to the character sequence and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the plurality of determined feature templates;
determining the values of each state function and each transfer function under the condition that the character sequence is marked as each word segmentation tag sequence;
and inputting the values of the state functions and the values of the transfer functions corresponding to each word segmentation tag sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability that the character sequence is marked as each word segmentation tag sequence.
11. The apparatus of claim 10, wherein the feature template comprises at least one of:
a character feature template for representing individual characters in the sequence of characters;
the character feature template is used for representing the incidence relation of different characters in the character sequence;
a dictionary feature template for representing a single dictionary tag in the sequence of dictionary tags;
a dictionary feature template for representing an associative relationship between different dictionary tags in the sequence of dictionary tags;
and the composite characteristic template consists of the character characteristic template and the dictionary characteristic template.
12. The apparatus of claim 11, wherein the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template comprises the character feature template, the conditional probability prediction module generates a state function s (y, x, i, j) according to the character feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=0} = m; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the dictionary feature template, the conditional probability prediction module generates a state function s (y, x, i, j) according to the dictionary feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the composite feature template, the conditional probability prediction module generates a state function s (y, x, i, j) according to the composite feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1, x_{i±d, j=0} = m and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; when j = 1, x refers to the dictionary tag sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, and x_{i±d, j=1} represents the dictionary label at the (i±d)-th position of the dictionary tag sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, m represents a given character at the (i±d)-th position in the character sequence, and h represents a given dictionary label at the (i±d)-th position in the dictionary tag sequence.
13. The apparatus of claim 11, wherein the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template comprises the character feature template, the conditional probability prediction module generates a transfer function t(y, x, i, j) according to the character feature template, wherein the transfer function t(y, x, i, j) is as follows:
t(y, x, i, j) = 1, if y_i = n_1, y_{i−1} = n_2 and x_{i±d, j=0} = m; otherwise t(y, x, i, j) = 0
wherein x represents a two-dimensional sequence composed of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j is 0, x refers to the character sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, where d is any integer from 0 to p−i; y_i represents the ith word segmentation label of the word segmentation label sequence y; y_{i−1} represents the (i−1)th word segmentation label of the word segmentation label sequence y; n_1 represents a given word segmentation label for the ith position; n_2 represents a given word segmentation label for the (i−1)th position; and m represents a given character at the (i±d)-th position of the character sequence.
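The transfer function of claim 13 additionally conditions on the previous label y_{i−1}, which is how a CRF scores label-to-label transitions. A self-contained Python sketch (the labels "B"/"E"/"S" and dictionary tags "D"/"O" are illustrative assumptions, not from the claims):

```python
def char_transfer(n1, n2, d, m):
    """Character-template transfer function: returns 1 when the ith word
    segmentation label is n1, the (i-1)th label is n2, and the character
    at position i+d of the character sequence (row j=0 of x) is m."""
    def t(y, x, i):
        p = i + d
        return 1 if (i >= 1 and 0 <= p < len(x[0]) and y[i] == n1
                     and y[i - 1] == n2 and x[0][p] == m) else 0
    return t

# Hypothetical example: x = (character sequence, dictionary tag sequence).
x = (["研", "究", "生"], ["D", "D", "O"])
y = ["B", "E", "S"]
print(char_transfer("E", "B", 0, "究")(y, x, 1))  # 1: B-to-E transition at "究"
print(char_transfer("E", "B", 0, "究")(y, x, 0))  # 0: no previous label at i=0
```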
14. The apparatus of any of claims 9 to 13, wherein the at least one word segmentation tag comprises: a first label marking the starting position of a word, a second label marking a middle position of a word, a third label marking the ending position of a word, and a fourth label marking a single-character word;
the second determining module, when determining at least one word segmentation tag corresponding to each character in the character sequence to obtain multiple word segmentation tag sequences, is specifically configured to:
determining at least one word segmentation label corresponding to each character in the character sequence;
and randomly selecting one word segmentation label from at least one word segmentation label corresponding to each character as a target word segmentation label, and taking a sequence formed by the target word segmentation labels corresponding to the characters as a word segmentation label sequence.
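Claim 14 forms each candidate word segmentation label sequence by picking one label per character from that character's candidate set. The sketch below enumerates every such combination for a short text; the four labels B/M/E/S and the per-position candidate sets are illustrative assumptions, not constraints stated in the claims.

```python
from itertools import product

# Illustrative labels: B = word start, M = word middle,
# E = word end, S = single-character word.
def candidate_labels(i, n):
    """A hypothetical candidate set per position; the real candidate sets
    would come from the model, this only shows a plausible shape."""
    if n == 1:
        return ["S"]
    if i == 0:
        return ["B", "S"]   # a text cannot begin mid-word
    if i == n - 1:
        return ["E", "S"]   # a text cannot end on a word start or middle
    return ["B", "M", "E", "S"]

def label_sequences(chars):
    """Form every sequence obtained by selecting one candidate label
    for each character, as claim 14 describes."""
    return [list(p) for p in product(*(candidate_labels(i, len(chars))
                                       for i in range(len(chars))))]

seqs = label_sequences(["中", "国"])
print(len(seqs))  # 2 candidates x 2 candidates = 4 sequences
print(seqs[0])    # ['B', 'E']
```

The conditional probability prediction model then scores each enumerated sequence, and the sequence whose probability meets the preset condition is chosen as the target.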
15. The apparatus according to any one of claims 9 to 13, wherein the first determining module, when assigning a corresponding dictionary tag to each character of the matching character string and each character except the matching character string in the character sequence, respectively, to obtain a dictionary tag sequence, is specifically configured to:
allocating a dictionary label to each character in the character sequence according to the following rules to obtain a dictionary label sequence consisting of dictionary labels:
and aiming at any character in the character sequence, if the character is the character in the matched character string, a first dictionary label is allocated to the character, and if the character is the character except for the matched character string, a second dictionary label is allocated to the character.
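The dictionary labeling rule of claim 15 can be sketched as follows: substrings up to a preset length are matched against the dictionary, and each character receives the first dictionary label if it lies inside a match, the second otherwise. The tag names "D"/"O" and the preset length of 4 are illustrative assumptions, not values from the claims.

```python
def dictionary_tag_sequence(chars, dictionary, max_len=4):
    """Assign each character the first dictionary label ("D") if it lies
    inside any substring of up to max_len characters that matches a
    standard word in the dictionary, and the second label ("O") otherwise."""
    tags = ["O"] * len(chars)
    text = "".join(chars)
    for start in range(len(chars)):
        # Prefer longer matches starting at this position.
        for length in range(min(max_len, len(chars) - start), 0, -1):
            if text[start:start + length] in dictionary:
                for k in range(start, start + length):
                    tags[k] = "D"
                break
    return tags

print(dictionary_tag_sequence(list("研究生命"), {"研究生", "生命"}))
# ['D', 'D', 'D', 'D']: "研究生" covers positions 0-2, "生命" covers 2-3
```

Note that matches starting at every position are considered, so overlapping dictionary words (as in the example) all contribute to the label sequence.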
CN201910094380.2A 2019-01-30 2019-01-30 Text word segmentation method and device Active CN109829162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910094380.2A CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Publications (2)

Publication Number Publication Date
CN109829162A CN109829162A (en) 2019-05-31
CN109829162B true CN109829162B (en) 2022-04-08

Family

ID=66863299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910094380.2A Active CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Country Status (1)

Country Link
CN (1) CN109829162B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688853B (en) * 2019-08-12 2022-09-30 平安科技(深圳)有限公司 Sequence labeling method and device, computer equipment and storage medium
CN111831929B (en) * 2019-09-24 2024-01-02 北京嘀嘀无限科技发展有限公司 Method and device for acquiring POI information
CN110795938B (en) * 2019-11-11 2023-11-10 北京小米智能科技有限公司 Text sequence word segmentation method, device and storage medium
CN111026282B (en) * 2019-11-27 2023-05-23 上海明品医学数据科技有限公司 Control method for judging whether medical data labeling is carried out in input process
CN111695355A (en) * 2020-05-26 2020-09-22 平安银行股份有限公司 Address text recognition method, device, medium and electronic equipment
CN112101021A (en) * 2020-09-03 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for realizing standard word mapping
CN112464667B (en) * 2020-11-18 2021-11-16 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112861531B (en) * 2021-03-22 2023-11-14 北京小米移动软件有限公司 Word segmentation method, device, storage medium and electronic equipment
CN113609850A (en) * 2021-07-02 2021-11-05 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium
CN117493540A (en) * 2023-12-28 2024-02-02 荣耀终端有限公司 Text matching method, terminal device and computer readable storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
CN101082909A (en) * 2007-06-28 2007-12-05 腾讯科技(深圳)有限公司 Method and system for dividing Chinese sentences for recognizing deriving word
WO2010024052A1 (en) * 2008-08-27 2010-03-04 日本電気株式会社 Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN102262634A (en) * 2010-05-24 2011-11-30 北京大学深圳研究生院 Automatic questioning and answering method and system
CN102929870A (en) * 2011-08-05 2013-02-13 北京百度网讯科技有限公司 Method for establishing word segmentation model, word segmentation method and devices using methods
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103678318A (en) * 2012-08-31 2014-03-26 富士通株式会社 Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment

Non-Patent Citations (3)

Title
"Research and Implementation of Chinese Word Segmentation Based on Combining Statistics and a Dictionary"; Zhou Qi; China Excellent Doctoral and Master's Dissertations Full-text Database (Master's), Information Science and Technology Series; 20160315; full text *
Qi-yu Jiang; Hong-yi Li; Jia-fen Liang; Qing-xiang Wang et al. "Multi-combined Features Text Mining of TCM Medical Cases with CRF". 2016 8th International Conference on Information Technology in Medicine and Education (ITME). 2017. *
Yi-Feng Pan; Xinwen Hou; Cheng-Lin Liu. "Text Localization in Natural Scene Images Based on Conditional Random Field". 2009 10th International Conference on Document Analysis and Recognition. 2009. *

Also Published As

Publication number Publication date
CN109829162A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN109829162B (en) Text word segmentation method and device
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN107193807B (en) Artificial intelligence-based language conversion processing method and device and terminal
CN110163181B (en) Sign language identification method and device
CN106557563B (en) Query statement recommendation method and device based on artificial intelligence
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN110619127B (en) Mongolian Chinese machine translation method based on neural network turing machine
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN113591457A (en) Text error correction method, device, equipment and storage medium
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN108268439B (en) Text emotion processing method and device
CN110472062B (en) Method and device for identifying named entity
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN112381079A (en) Image processing method and information processing apparatus
CN114021573B (en) Natural language processing method, device, equipment and readable storage medium
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN110399477A (en) A kind of literature summary extracting method, equipment and can storage medium
CN113722436A (en) Text information extraction method and device, computer equipment and storage medium
CN113780418A (en) Data screening method, system, equipment and storage medium
CN109977430B (en) Text translation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant