CN109829162B - Text word segmentation method and device

Publication number: CN109829162B
Application number: CN201910094380.2A
Inventor: 王李鹏
Assignee: New H3C Big Data Technologies Co Ltd
Legal status: Active
Abstract

The application provides a text word segmentation method and device. The method includes: converting a text to be segmented into a character sequence; matching character strings satisfying a preset length in the character sequence against standard words in a pre-constructed dictionary, determining the character strings that match standard words, and assigning a corresponding dictionary label to each character of the matched character strings and to each character outside them, to obtain a dictionary label sequence; determining at least one word segmentation label for each character in the character sequence, to obtain multiple candidate word segmentation label sequences; determining the conditional probability that the character sequence is labeled with each word segmentation label sequence, based on the character sequence, the dictionary label sequence, and a pre-trained conditional probability prediction model; and determining the word segmentation label sequence whose conditional probability satisfies a preset condition as the target word segmentation label sequence, then segmenting the text based on that target sequence.

Description

Text word segmentation method and device
Technical Field
The application relates to the technical field of big data, in particular to a text word segmentation method and device.
Background
In natural language processing, word segmentation underlies most other language processing tasks, so segmentation accuracy is critical to them. At present, segmenting text that contains unstructured data remains difficult.
Taking electronic medical records as an example: because they contain a great deal of unstructured data, such as medical history entries and case summaries, automatic word segmentation of this data is both the most basic step in analyzing and mining electronic medical records and a very difficult task.
Therefore, a technical solution is needed for rapidly and accurately segmenting text that contains unstructured data.
Disclosure of Invention
In view of the above, an object of the present application is to provide a text word segmentation method and device, which can rapidly and accurately segment a text containing unstructured data.
In a first aspect, the present application provides a text word segmentation method, including:
converting the text to be segmented into a character sequence;
matching character strings meeting preset length contained in the character sequence with standard words in a pre-constructed dictionary, determining matched character strings matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
determining the conditional probability that the character sequence is labeled with each word segmentation label sequence, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model;
determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence, and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
In a second aspect, the present application provides a text word segmentation apparatus, including:
the conversion module is used for converting the text to be segmented into a character sequence;
the first determining module is used for matching the character strings which are contained in the character sequence and meet the preset length with standard words in a dictionary which is constructed in advance, determining matched character strings which are matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
the second determining module is used for determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
the conditional probability prediction module is used for determining the conditional probability that the character sequence is labeled with each word segmentation tag sequence, according to the character sequence, the dictionary tag sequence and a pre-trained conditional probability prediction model;
and the word segmentation processing module is used for determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the method of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, this application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the text segmentation method in the first aspect or any one of the possible implementations of the first aspect.
The application provides a text word segmentation method and device. First, the text to be segmented is converted into a character sequence. Character strings satisfying a preset length in the character sequence are then matched against standard words in a pre-constructed dictionary, and a dictionary label sequence is obtained from the matching result; multiple candidate word segmentation label sequences are obtained by determining at least one word segmentation label for each character in the character sequence. The dictionary label sequence and the character sequence are then fed to the conditional probability prediction model, which predicts the conditional probability that the character sequence is labeled with each candidate word segmentation label sequence; the target word segmentation label sequence is determined from these probabilities, and the text is segmented based on it.
The method combines two segmentation prediction processes: dictionary matching and prediction by the conditional probability prediction model. On one hand, the dictionary label sequence obtained by dictionary matching serves as a reference factor for the conditional probability prediction model, which improves the accuracy of the predicted segmentation result. On the other hand, given the character sequence and dictionary label sequence of the text to be segmented, the conditional probability prediction model predicts the conditional probability that the character sequence is labeled with a given word segmentation label sequence, so the word segmentation labels of all characters are obtained in a single prediction pass, which improves segmentation efficiency.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating a text word segmentation method provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a process of matching a to-be-segmented text based on a forward maximum matching algorithm according to an embodiment of the present application;
fig. 3 is a schematic flowchart illustrating a process of matching a to-be-segmented text based on an inverse maximum matching algorithm according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating a word segmentation tag sequence for predicting that a character sequence is marked according to an embodiment of the present application;
FIG. 5 is a flow chart illustrating a training process of a conditional probability prediction model provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram illustrating a text word segmentation apparatus according to an embodiment of the present application;
fig. 7 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
At present, when a supervised-learning word segmentation method is applied to text containing unstructured data, a training set of sample corpora with manually annotated segmentation labels must first be constructed, and a segmentation prediction model is then trained on that set. Because the training set must be large, supervised methods consume substantial manpower for labeling, the labor cost is high, and building a comprehensive training set is difficult and inefficient. Unsupervised word segmentation methods avoid the labeling cost, but their segmentation accuracy is lower than that of supervised methods.
In order to solve the problems, the application provides a text word segmentation method and a text word segmentation device. Referring to fig. 1, a schematic flow chart of a text word segmentation method provided in an embodiment of the present application includes the following steps:
step 101, converting a text to be segmented into a character sequence.
Step 102, matching character strings satisfying the preset length in the character sequence with standard words in a pre-constructed dictionary, determining the matched character strings, and assigning corresponding dictionary labels to each character of the matched character strings and to each character outside them, to obtain a dictionary label sequence.
Step 103, determining at least one word segmentation label corresponding to each character in the character sequence, to obtain multiple word segmentation label sequences.
Step 104, determining the conditional probability that the character sequence is labeled with each word segmentation label sequence, according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model.
Step 105, determining the word segmentation label sequence whose conditional probability satisfies the preset condition as the target word segmentation label sequence, and segmenting the text to be segmented based on the target word segmentation label sequence.
The text to be segmented consists of multiple characters, so it can be split into individual characters, which are then arranged in order to form a character sequence. By converting the text to be segmented into a character sequence, the segmentation task becomes the task of predicting the word segmentation label corresponding to each character in the sequence.
Based on this, the embodiment of the application proposes that a conditional probability prediction model can directly predict the target word segmentation label sequence of the whole character sequence, so that the target labels of all characters are obtained in one prediction pass, improving segmentation efficiency. In addition, to improve prediction accuracy, before the target word segmentation label sequence is predicted, the character strings satisfying the preset length in the character sequence are matched against the standard words in the pre-constructed dictionary and a dictionary label sequence is derived from the matching result; this dictionary label sequence is then input to the conditional probability prediction model together with the character sequence as a reference factor, yielding a more accurate prediction.
Next, a dictionary-based matching process and a prediction process based on a conditional probability prediction model will be described in detail.
Implementation process I: dictionary-based matching
It should be understood that the matching process based on the dictionary can be applied to the process of training the conditional probability prediction model to generate a sample dictionary tag sequence corresponding to the sample character sequence in the sample set, or can be applied to the process of predicting the word segmentation tag sequence based on the conditional probability prediction model obtained by training to generate a dictionary tag sequence corresponding to the text to be segmented. The two processes are based on the same technical concept, so the process of generating the dictionary label sequence corresponding to the text to be segmented is emphasized in the application.
In the embodiment of the application, the character strings which are included in the character sequence and meet the preset length can be matched with the standard words in the pre-constructed dictionary, and the dictionary label sequence is determined based on the matching result. The specific process of constructing the dictionary may refer to the prior art, and is not described in this application.
In specific implementation, a character string which meets a preset length and is included in the character sequence may be first matched with a standard word in a dictionary established in advance, and a matching character string matched with the standard word may be determined. And then, corresponding dictionary labels can be respectively allocated to each character of the matched character string in the character sequence and each character except the matched character string to obtain a dictionary label sequence.
Illustratively, the matching process may employ a forward maximum matching algorithm, an inverse maximum matching algorithm, or a two-way maximum matching algorithm. The bidirectional maximum matching algorithm may be understood as a process of determining a correct word segmentation result by comparing a word segmentation result obtained by a forward maximum matching algorithm with a word segmentation result obtained by a reverse maximum matching algorithm.
It should be noted that the character string satisfying the preset length may be a character string including at least one character. If the dictionary matching is performed by applying the forward maximum matching algorithm or the reverse maximum matching algorithm, the character strings satisfying the preset length may be: a character string which at least contains one character and the total number of the contained characters does not exceed the total number of the characters of the longest standard word in the dictionary.
The following describes the matching algorithm:
(1) Forward maximum matching algorithm.
Referring to fig. 2, a schematic flowchart of matching a character sequence based on a forward maximum matching algorithm is shown, which includes the following steps:
step 201, a characters are sequentially taken from the front to the back in the character sequence as character strings to be matched.
In one example, a may be taken as the total number of characters of the longest standard word in the dictionary.
Step 202, judging whether a standard word same as the character string to be matched exists in the dictionary.
If yes, go to step 203; if the determination result is negative, step 204 is executed.
Step 203, determine the character string to be matched as a matched character string, then return to step 201 and take the next character string of length a, until all characters in the character sequence have been traversed.
Step 204, remove the last character of the character string to be matched, form a new character string to be matched from the remaining characters, and execute step 202 again; once a matched character string is found, return to step 201 and take the next character string of length a. Alternatively, if all characters of the character string to be matched have been removed without finding a match, return to step 201 and take the next character string of length a.
After matching the character sequences based on the matching process, a first matching result can be obtained, and the first matching result records the matching character strings in the character sequences and characters except the matching character strings. The matching character string may be composed of a plurality of characters, or may be composed of a single character, which is not limited in this application.
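The forward matching flow of Fig. 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the `(string, is_match)` output format and the treatment of an unmatched single character are our assumptions:

```python
def forward_max_match(chars, dictionary, max_len):
    """Forward maximum matching (the Fig. 2 flow): scan left to right;
    at each position try the longest candidate string first, shortening
    it one character at a time until a dictionary word matches.
    A character that matches no dictionary word is emitted by itself."""
    result = []  # list of (string, is_match) pairs
    i = 0
    while i < len(chars):
        matched = None
        for length in range(min(max_len, len(chars) - i), 0, -1):
            candidate = chars[i:i + length]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is not None:
            result.append((matched, True))
            i += len(matched)
        else:
            result.append((chars[i], False))
            i += 1
    return result
```

Here `max_len` plays the role of a, the character count of the longest standard word in the dictionary.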
(2) Reverse maximum matching algorithm.
Referring to fig. 3, a schematic flowchart of matching a character sequence based on an inverse maximum matching algorithm is shown, which includes the following steps:
and 301, sequentially taking a characters from the back to the front from the character sequence as a character string to be matched.
Wherein the meaning of a is the same as that described in the forward maximum matching algorithm.
Step 302, judging whether a standard word same as the character string to be matched exists in the dictionary.
If yes, go to step 303; if the determination result is negative, go to step 304.
Step 303, determine the character string to be matched as a matched character string, then return to step 301 and take the next character string of length a, until all characters in the character sequence have been traversed.
Step 304, remove the first character of the character string to be matched, form a new character string to be matched from the remaining characters, and execute step 302 again; once a matched character string is found, return to step 301 and take the next character string of length a. Alternatively, if all characters of the character string to be matched have been removed without finding a match, return to step 301 and take the next character string of length a.
After matching the character sequence based on the matching process, a second matching result can be obtained, and the matching character string in the character sequence and the characters except the matching character string are recorded in the second matching result. The matching character string may be composed of a plurality of characters, or may be composed of a single character, which is not limited in this application.
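The reverse flow of Fig. 3 mirrors the forward one, scanning from the end of the sequence and shortening candidates from the front. Again a minimal sketch under the same assumed `(string, is_match)` representation:

```python
def backward_max_match(chars, dictionary, max_len):
    """Reverse maximum matching (the Fig. 3 flow): scan right to left;
    at each end position try the longest candidate first, dropping the
    first character at a time until a dictionary word matches."""
    result = []
    j = len(chars)
    while j > 0:
        matched = None
        for length in range(min(max_len, j), 0, -1):
            candidate = chars[j - length:j]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is not None:
            result.append((matched, True))
            j -= len(matched)
        else:
            result.append((chars[j - 1], False))
            j -= 1
    result.reverse()  # segments were collected from right to left
    return result
```

The two directions can disagree: for "abcd" with dictionary {"ab", "bcd"}, forward matching finds "ab" while reverse matching finds "bcd", which is exactly the situation the bidirectional algorithm below resolves.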
(3) Bidirectional maximum matching algorithm.
After obtaining the first matching result and the second matching result based on the matching processes shown in fig. 2 and fig. 3, the first matching result and the second matching result may be compared, and a better matching result may be selected as a final matching result.
If the first matching result and the second matching result are consistent, any one of the matching results can be selected as a final matching result.
If the first matching result is inconsistent with the second, the two results can be compared on the number of matched character strings, the number of characters outside matched strings, and the number of single-character matched strings, and the better result selected as the final one. For example, the final result may be selected on the principle that more matched character strings are better, fewer characters outside matched strings are better, and fewer single-character matched strings are better.
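The selection heuristics above can be sketched as a tie-breaking comparison. The `(string, is_match)` representation and the exact ordering of the three criteria are our reading of the description, not specified by the patent:

```python
def choose_bidirectional(first, second):
    """Choose between a forward and a reverse matching result.
    Each result is a list of (string, is_match) pairs. Preference:
    more matched strings, then fewer characters outside matched
    strings, then fewer single-character matched strings."""
    if first == second:
        return first

    def score(result):
        n_matched = sum(1 for s, m in result if m)                  # more is better
        n_outside = sum(len(s) for s, m in result if not m)         # fewer is better
        n_single = sum(1 for s, m in result if m and len(s) == 1)   # fewer is better
        return (-n_matched, n_outside, n_single)

    return min(first, second, key=score)
```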
In one possible implementation, after determining the matching character string and the characters except the matching character string in the character sequence, the dictionary tag may be assigned to each character in the character sequence according to the following rule, so as to obtain a dictionary tag sequence composed of dictionary tags:
for any character in the character sequence, if the character is a character in a matching character string, a first dictionary label is allocated to the character, and if the character is a character except the matching character string, a second dictionary label is allocated to the character.
In one example, the first dictionary tag can be represented by 1 and the second dictionary tag can be represented by 0. Of course, in practical applications, the first dictionary label and the second dictionary label may be configured according to actual requirements, for example, the first dictionary label is represented by Y, and the second dictionary label is represented by N, which is not limited in the present application.
For example, take an electronic medical record as the text to be segmented, and suppose it records a finding such as "no dry-wet rales or pleural fricatives heard in the lungs". If the dictionary contains "dry-wet rales" and "fricatives", those two strings are determined as matched character strings, and the dictionary label sequence shown in Table 1 can then be generated according to the above embodiment (the left side of Table 1 is the character sequence of the text; the first dictionary label is denoted 1 and the second dictionary label 0):
Table 1 (rendered as an image in the original): the character sequence of the example text with its assigned dictionary labels, 1 for each character inside a matched string and 0 for every other character.
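The labeling rule behind Table 1 reduces to a one-pass expansion of the matching result. The `(string, is_match)` input format is an assumed representation of the matcher's output, and 1/0 are the label symbols chosen in the example above:

```python
def dictionary_label_sequence(match_result):
    """Expand a matching result (a list of (string, is_match) pairs)
    into per-character dictionary labels: 1 for each character inside
    a matched string, 0 for every other character."""
    labels = []
    for s, is_match in match_result:
        labels.extend([1 if is_match else 0] * len(s))
    return labels
```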
Implementation process II: prediction based on a pre-trained conditional probability prediction model
In the embodiment of the present application, before predicting the target word segmentation tag sequence of the character sequence, every word segmentation tag sequence with which the character sequence could be labeled may first be determined.
In one possible implementation, at least one word segmentation label is determined for each character in the character sequence; one label is then selected from each character's candidates as that character's target label, and the sequence formed by the target labels of all characters constitutes one word segmentation label sequence.
The word segmentation labels that may be assigned to a character are: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the ending position of a word, and a fourth label for a word consisting of a single character. In one example, B (Begin) denotes the first label, I (Intermediate) the second, E (End) the third, and S (Single) the fourth.
Since each character has 4 candidate labels (B, I, E, S), taking any one of them as the target label for each character means that a character sequence of length p yields 4^p possible word segmentation label sequences.
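The 4^p candidate space can be enumerated directly, as a small illustration of the combinatorics (a sketch only; the patent does not require materializing all sequences):

```python
from itertools import product

TAGS = ("B", "I", "E", "S")  # Begin / Intermediate / End / Single

def candidate_tag_sequences(p):
    """Enumerate all 4**p candidate word segmentation label sequences
    for a character sequence of length p."""
    return list(product(TAGS, repeat=p))
```

Exhaustive enumeration grows exponentially in p; in practice, conditional-probability models of this kind are decoded with dynamic programming (Viterbi-style search) rather than by scoring every sequence.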
In the embodiment of the application, after the multiple word segmentation tag sequences are determined, the conditional probability that the character sequence is labeled with each of them can be predicted based on the character sequence, the dictionary tag sequence and the pre-trained conditional probability prediction model.
The specific prediction process is shown in fig. 4:
step 401, determining a plurality of feature templates according to the character sequence and/or the dictionary tag sequence.
Step 402, generating at least one state function and at least one transfer function according to the plurality of determined feature templates.
Step 403, determining the value of each state function and each transfer function when the character sequence is labeled with each word segmentation tag sequence.
Step 404, inputting the values of the state functions and transfer functions corresponding to each word segmentation tag sequence into the pre-trained conditional probability prediction model, and calculating the conditional probability that the character sequence is labeled with each word segmentation tag sequence.
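Steps 403-404 follow the usual conditional-random-field recipe: each candidate sequence gets a score, the weighted sum of its state and transfer function values, and scores are normalized into conditional probabilities. A minimal sketch, with `score_fn` standing in for the model's learned weighted sum (an assumption, not the patent's trained model):

```python
import math

def conditional_probabilities(candidates, score_fn):
    """CRF-style normalization: P(y|x) = exp(score(y)) / Z, where
    score(y) is the weighted sum of state- and transfer-function
    values for labeling the character sequence with y, and Z sums
    exp(score) over all candidate label sequences."""
    scores = {y: score_fn(y) for y in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}
```

The sequence whose probability satisfies the preset condition (e.g. the maximum) is then taken as the target word segmentation tag sequence of step 105.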
To facilitate an understanding of the prediction flow illustrated in fig. 4, a plurality of feature templates determined from a character sequence and/or a dictionary tag sequence will first be described.
Illustratively, the feature templates may include at least one of the following:
a character feature template representing an individual character in the character sequence;
a character feature template representing the association between different characters in the character sequence;
a dictionary feature template representing an individual dictionary tag in the dictionary tag sequence;
a dictionary feature template representing the association between different dictionary tags in the dictionary tag sequence;
a composite feature template composed of a character feature template and a dictionary feature template.
Each of the above three kinds of feature templates (character, dictionary, and composite) can be used as a unary template (Unigram template) or a binary template (Bigram template).
Wherein, the unary template can be used to determine the state function, the template format is Uk:% x [ i, j ], wherein the letter U indicates that the template is the unary template; k represents a serial number of the template; x represents a two-dimensional sequence consisting of a sequence of characters and a sequence of dictionary tags; in the present disclosure, j denotes a position of a column, and indicates a first column when j is 0, the first column refers to a character sequence in the two-dimensional sequence, and indicates a second column when j is 1, the second column refers to a dictionary tag sequence in the two-dimensional sequence; in the present disclosure, i represents the i-th position in the character sequence or the dictionary tag sequence, that is, the current position, and when j is 0, x [ i,0] represents the character at the i-th position in the character sequence in the two-dimensional sequence, and when j is 1, x [ i,1] represents the dictionary tag at the i-th position in the dictionary tag sequence in the two-dimensional sequence.
A binary template may be used to determine a transfer function. Its format is, for example, Bk:%x[i,j], where the letter B indicates that the template is a binary template; the other parameters are as described for the unary template and are not repeated here.
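The `%x[i,j]` macro notation above matches the feature-template syntax used by CRF toolkits such as CRF++. As an illustration only (the helper name and padding token are hypothetical, not part of the patent), a template can be expanded against the two-dimensional sequence x like this:

```python
# Expand a CRF++-style template such as "U03:%x[0,0]" or
# "U14:%x[0,0]/%x[0,1]" at a given position of a two-dimensional
# sequence x, where column 0 holds characters and column 1 holds
# dictionary tags. Offsets in the template are relative to the
# current position i; out-of-range positions yield a padding token.
import re

def expand_template(template, x, i):
    """Return the feature string produced by `template` at position i."""
    name, spec = template.split(":", 1)
    parts = []
    for off, col in re.findall(r"%x\[(-?\d+),(\d+)\]", spec):
        pos = i + int(off)
        val = x[pos][int(col)] if 0 <= pos < len(x) else "_PAD_"
        parts.append(val)
    return name + "=" + "/".join(parts)

x = [("double", "0"), ("lung", "0"), ("not", "0")]
print(expand_template("U03:%x[0,0]", x, 1))          # U03=lung
print(expand_template("U14:%x[0,0]/%x[0,1]", x, 1))  # U14=lung/0
print(expand_template("U04:%x[1,0]", x, 1))          # U04=not
```

Each expanded string is one concrete feature; the state and transfer functions described below are indicators over such features paired with segmentation tags.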
Illustratively, the above feature templates are described below using the correspondence between the character sequence of the electronic medical record and the corresponding dictionary tag sequence shown in table 1.
For the character sequence formed from the electronic medical record, the feature templates that can be generated are shown in table 2:
TABLE 2
[Table 2 is rendered as images in the original publication; it lists the unary feature templates U01-U18 and the binary template B01, individual entries of which are quoted below.]
Among them, U01 to U18 shown in table 2 are unary templates, and B01 is a binary template.
U01-U05 are character feature templates for representing a single character in the character sequence. For example, U01:%x[i-2,0] represents the character at the (i-2)-th position in the character sequence, i.e., the character two positions before the current position; U03:%x[i,0] represents the character at the i-th position, i.e., the character at the current position; U05:%x[i+2,0] represents the character at the (i+2)-th position, i.e., the character two positions after the current position.
U06-U12 are character feature templates used to represent associations of different characters in the character sequence. For example, U06:%x[i-2,0]/%x[i-1,0] denotes the pair formed by the characters at the (i-2)-th and (i-1)-th positions in the character sequence; U07:%x[i-1,0]/%x[i,0] denotes the pair formed by the characters at the (i-1)-th and i-th positions.
U13 is a dictionary feature template for representing a single dictionary tag in the dictionary tag sequence. For example, U13:%x[i,1] represents the dictionary tag at the i-th position in the dictionary tag sequence.
U14 is a composite feature template composed of a character feature template and a dictionary feature template. U14:%x[i,0]/%x[i,1] represents the character at the i-th position in the character sequence together with the dictionary tag at the i-th position in the dictionary tag sequence.
U15-U18 are dictionary feature templates used to represent associations between different dictionary tags in the dictionary tag sequence. For example, U15:%x[i-2,1]/%x[i-1,1] represents the dictionary tags at the (i-2)-th and (i-1)-th positions in the dictionary tag sequence.
B01 is a binary template; in terms of content, B01 is also a character feature template for representing a single character in the character sequence, and B01:%x[i,0] represents the character at the i-th position of the character sequence. Of course, in practical applications, dictionary feature templates and composite feature templates may also be used as binary templates, which is not limited in this application.
In the embodiment of the present application, the unary templates generate state functions s(y, x, i, j), and each unary template can generate W × p state functions, where p denotes the number of characters contained in the character sequence (which equals the number of dictionary tags contained in the dictionary tag sequence and the number of segmentation tags contained in the segmentation tag sequence), and W denotes the number of segmentation tag types. In the present disclosure W = 4, i.e., the four segmentation tags "B", "E", "I" and "S".
Following the above example, as shown in table 1, the character sequence includes the 16 characters "double", "lung", "not", "smell", "and", "dry", "wet", "property", "rale", "sound", "，", "membrane", "chest", "rub", "rub", "。", i.e., p = 16; the segmentation tag types are the 4 tags "B", "E", "I" and "S", i.e., W = 4. It follows that each unary template can generate 16 × 4 = 64 state functions.
The binary templates generate transfer functions t(y, x, i, j), and each binary template can generate W² × p transfer functions, where p and W have the same meanings as above; the extra factor of W arises because a transfer function also conditions on the segmentation tag at the previous position.
Continuing with the above example, as shown in table 1, it follows that each binary template can generate 16 × 4² = 256 transfer functions.
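The counts above follow directly from the stated assumptions (p characters, W tag types) and can be checked in a few lines:

```python
# Number of functions generated per template, following the text:
# each unary template yields W * p state functions (one per position
# and segmentation tag), and each binary template yields W^2 * p
# transfer functions (one per position and pair of previous/current
# segmentation tags).
p = 16  # characters in the example sequence
W = 4   # segmentation tag types: B, E, I, S

state_funcs_per_unary = W * p
transfer_funcs_per_binary = W * W * p
print(state_funcs_per_unary)       # 64
print(transfer_funcs_per_binary)   # 256
```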
Further, after determining the various feature templates, a state function may be generated based on a unary template, and a transfer function may also be generated based on a binary template, where a specific embodiment is as follows:
the first embodiment,
Since the unary template may be one or more of a character feature template, a dictionary feature template, and a composite feature template, the state function s (y, x, i, j) generated based on the unary template includes the following cases:
first, assume that the character sequence includes p characters, the dictionary tag sequence includes p dictionary tags, and the word segmentation tag sequence includes p word segmentation tags, which are equal to each other.
Case 1: if the feature template comprises a character feature template, the state function s(y, x, i, j) generated according to the character feature template is:

s(y, x, i, j) = k1, if x_(i±d, j=0) = m and y_i = n1
s(y, x, i, j) = k2, otherwise

wherein x represents the two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; j = 0 indicates the character sequence in the two-dimensional sequence; x_(i±d, j=0) denotes the character at the (i±d)-th position of the character sequence, where i is any integer from 1 to p and d is any integer from 0 to p-i; y represents a segmentation tag sequence; y_i denotes the i-th segmentation tag of the segmentation tag sequence y.
That is, s(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m and the i-th segmentation tag of the segmentation tag sequence y is n1, and takes the value k2 otherwise.
Case 2: if the feature template comprises a dictionary feature template, the state function s(y, x, i, j) generated according to the dictionary feature template is:

s(y, x, i, j) = k1, if x_(i±d, j=1) = h and y_i = n1
s(y, x, i, j) = k2, otherwise

wherein j = 1 indicates the dictionary tag sequence in the two-dimensional sequence; x_(i±d, j=1) denotes the dictionary tag at the (i±d)-th position of the dictionary tag sequence, where i is any integer from 1 to p and d is any integer from 0 to p-i; the other parameters have the same meanings as above.
That is, s(y, x, i, j) takes the value k1 when the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h and the i-th segmentation tag of the segmentation tag sequence y is n1, and takes the value k2 otherwise.
Case 3: if the feature template comprises a composite feature template, the state function s(y, x, i, j) generated according to the composite feature template is:

s(y, x, i, j) = k1, if x_(i±d, j=0) = m, x_(i±d, j=1) = h and y_i = n1
s(y, x, i, j) = k2, otherwise

That is, s(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m, the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h, and the i-th segmentation tag of the segmentation tag sequence y is n1, and takes the value k2 otherwise.
Here k1 may, for example, take the value 1 and k2 may take the value 0. Of course, in practical applications, the values of k1 and k2 may be configured according to the actual situation, which is not limited in this application.
The segmentation tags n1 and n2 may each be any one of the four segmentation tags B, I, E and S.
For ease of understanding, the generated state function s (y, x, i, j) is illustrated below with reference to the contents of tables 1 and 2.
Example one: assume the character feature template is U03:%x[i,0] and the character at the i-th position in the character sequence is "double". Then the state functions s(y, x, i, j) generated by U03:%x[i,0] cover the following four cases (in the embodiment of the present disclosure, there are four candidate segmentation tags for the character at each position, namely B, I, E and S):

s1(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = B; 0 otherwise
s2(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = I; 0 otherwise
s3(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = E; 0 otherwise
s4(y, x, i, j) = 1, if x_(i, j=0) = "double" and y_i = S; 0 otherwise
For the four state functions s1 to s4 determined by template U03:%x[i,0], and for any one of the plurality of segmentation tag sequences determined from the character sequence, the values of s1 to s4 corresponding to that segmentation tag sequence are determined by traversing each character in the character sequence in turn and determining the value of each state function at each character. Suppose the currently traversed character is "double": if the segmentation tag corresponding to "double" in the segmentation tag sequence is "B", then s1 takes the value 1 and the other state functions s2 to s4 take the value 0. The same procedure applies to determining the values of the state functions or transfer functions generated by the other feature templates, and is not repeated for each of them.
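The traversal just described can be sketched in a few lines. This is an illustration under assumed helper names, not the patent's implementation: the four state functions generated by U03 for the character "double" are indicators over (character, tag) pairs, evaluated at each position of a candidate tag sequence.

```python
# Sketch: the four state functions that template U03:%x[i,0] generates
# for the character "double", one per segmentation tag. Each is an
# indicator: 1 when the character at the current position is "double"
# AND the segmentation tag at that position matches, else 0.
def make_state_fn(char, tag):
    def s(y, x, i):
        return 1 if x[i][0] == char and y[i] == tag else 0
    return s

s1, s2, s3, s4 = (make_state_fn("double", t) for t in "BIES")

x = [("double", "0"), ("lung", "0")]  # (character, dictionary tag)
y = ["B", "E"]                        # a candidate segmentation tag sequence

print([s(y, x, 0) for s in (s1, s2, s3, s4)])  # [1, 0, 0, 0]
print([s(y, x, 1) for s in (s1, s2, s3, s4)])  # [0, 0, 0, 0]
```

At position 0 the character is "double" and its tag is "B", so only s1 fires; at position 1 the character differs, so all four functions are 0.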
Example two: assume the character feature template is U04:%x[i+1,0] and the character at the (i+1)-th position in the character sequence is "lung". Then the state functions s(y, x, i, j) generated using U04:%x[i+1,0] cover the following four cases:

s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = B; 0 otherwise
s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = I; 0 otherwise
s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = E; 0 otherwise
s(y, x, i, j) = 1, if x_(i+1, j=0) = "lung" and y_i = S; 0 otherwise
example three, assuming that the character feature template is U08% x [ i,0 ]/% x [ i +1,0], the character at the ith position in the character sequence points to the character "double", and the character at the ith +1 position in the character sequence points to the character "lung", then the state function s (y, x, i, j) generated using U08% x [ i,0 ]/% x [ i +1,0] is the following four cases:
Figure BDA0001964143650000175
Figure BDA0001964143650000176
Figure BDA0001964143650000177
Figure BDA0001964143650000178
of course, for other character feature templates in the unary template, the state function may also be generated in the manner of the above example one to example three, and the detailed description is not repeated.
Example four: assume the dictionary feature template is U13:%x[i,1] and the dictionary tag at the i-th position in the dictionary tag sequence is 0. Then the state functions s(y, x, i, j) generated using U13:%x[i,1] cover the following four cases:

s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = B; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = I; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = E; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=1) = 0 and y_i = S; 0 otherwise
example five, assuming that the dictionary feature template is U17% x [ i,1 ]/% x [ i +1,1], the dictionary label at the ith position in the dictionary label sequence is 0, and the dictionary label at the ith +1 position is 0, then the state function s (y, x, i, j) generated using U17% x [ i,1 ]/% x [ i +1,1] is the following four cases:
Figure BDA0001964143650000185
Figure BDA0001964143650000186
Figure BDA0001964143650000187
Figure BDA0001964143650000188
of course, for other dictionary feature templates in the unary template, the state function may also be generated in the manner of the above example four to example five, and the detailed description is not repeated.
Example six: assume the composite feature template is U14:%x[i,0]/%x[i,1], the character at the i-th position in the character sequence is "double", and the dictionary tag at the i-th position in the dictionary tag sequence is 0. Then the state functions s(y, x, i, j) generated using U14:%x[i,0]/%x[i,1] cover the following four cases:

s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = B; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = I; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = E; 0 otherwise
s(y, x, i, j) = 1, if x_(i, j=0) = "double", x_(i, j=1) = 0 and y_i = S; 0 otherwise
of course, for other composite feature templates in the unary template, the state function may also be generated in the manner described in the above sixth example, and the description thereof will not be specifically provided.
Second embodiment:
The binary template may also be one or more of a character feature template, a dictionary feature template, and a composite feature template. The transfer function generated based on the binary template comprises the following conditions:
first, assume that the character sequence includes p characters, the dictionary tag sequence includes p dictionary tags, and the word segmentation tag sequence includes p word segmentation tags, which are equal to each other.
Case 1: if the feature template comprises a character feature template, the transfer function t(y, x, i, j) generated according to the character feature template is:

t(y, x, i, j) = k1, if x_(i±d, j=0) = m, y_i = n1 and y_(i-1) = n2
t(y, x, i, j) = k2, otherwise

wherein x represents the two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; j = 0 indicates the character sequence in the two-dimensional sequence; x_(i±d, j=0) denotes the character at the (i±d)-th position of the character sequence, where i is any integer from 1 to p and d is any integer from 0 to p-i; y represents a segmentation tag sequence; y_i denotes the i-th segmentation tag and y_(i-1) the (i-1)-th segmentation tag of the segmentation tag sequence y.
That is, t(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m, the i-th segmentation tag of the segmentation tag sequence y is n1 and the (i-1)-th segmentation tag is n2, and takes the value k2 otherwise.
Case 2: if the feature template comprises a dictionary feature template, the transfer function t(y, x, i, j) generated according to the dictionary feature template is:

t(y, x, i, j) = k1, if x_(i±d, j=1) = h, y_i = n1 and y_(i-1) = n2
t(y, x, i, j) = k2, otherwise

wherein j = 1 indicates the dictionary tag sequence in the two-dimensional sequence; x_(i±d, j=1) denotes the dictionary tag at the (i±d)-th position of the dictionary tag sequence, where i is any integer from 1 to p, p is the total number of characters contained in the character sequence, and d is any integer from 0 to p-i; the other parameters have the same meanings as above.
That is, t(y, x, i, j) takes the value k1 when the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h, the i-th segmentation tag of the segmentation tag sequence y is n1 and the (i-1)-th segmentation tag is n2, and takes the value k2 otherwise.
Case 3: if the feature template comprises a composite feature template, the transfer function t(y, x, i, j) generated according to the composite feature template is:

t(y, x, i, j) = k1, if x_(i±d, j=0) = m, x_(i±d, j=1) = h, y_i = n1 and y_(i-1) = n2
t(y, x, i, j) = k2, otherwise

That is, t(y, x, i, j) takes the value k1 when the character at the (i±d)-th position of the character sequence is m, the dictionary tag at the (i±d)-th position of the dictionary tag sequence is h, the i-th segmentation tag of the segmentation tag sequence y is n1 and the (i-1)-th segmentation tag is n2, and takes the value k2 otherwise.
For ease of understanding, the generated transfer function t (y, x, i, j) is illustrated below with reference to the contents of tables 1 and 2.
Suppose the character feature template is B01:%x[i,0] and the character at the i-th position in the character sequence is "lung". Then B01:%x[i,0] generates transfer functions t(y, x, i, j) covering 16 cases (four choices of y_(i-1) times four choices of y_i). For the case y_i = B, the following four transfer functions can be generated:

t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = B and y_i = B; 0 otherwise
t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = I and y_i = B; 0 otherwise
t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = E and y_i = B; 0 otherwise
t(y, x, i, j) = 1, if x_(i, j=0) = "lung", y_(i-1) = S and y_i = B; 0 otherwise
Of course, for each of the three remaining cases y_i = I, y_i = E and y_i = S, four further transfer functions t(y, x, i, j) can likewise be generated, and a detailed description thereof is not provided.
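Such a transfer function can be sketched as an indicator over the character together with the (previous tag, current tag) pair. The helper names below are illustrative only:

```python
# Sketch: one of the 16 transfer functions generated by the binary
# template B01:%x[i,0] for the character "lung" -- an indicator over
# the previous and current segmentation tags as well as the character.
def make_transfer_fn(char, prev_tag, cur_tag):
    def t(y, x, i):
        return 1 if (i > 0 and x[i][0] == char
                     and y[i - 1] == prev_tag and y[i] == cur_tag) else 0
    return t

# y_(i-1) = "B", y_i = "E": fires when "lung" ends a word begun at the
# previous character, as in the word "double lung".
t_BE = make_transfer_fn("lung", "B", "E")

x = [("double", "0"), ("lung", "0")]
print(t_BE(["B", "E"], x, 1))  # 1
print(t_BE(["B", "I"], x, 1))  # 0
```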
Further, after the state functions and the transfer functions are obtained according to the above manner, the values of the state functions and the values of the transfer functions can be determined when the character sequence is marked as each word segmentation label sequence. And then inputting the values of the state functions and the values of the transfer functions corresponding to each participle label sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability of the character sequence marked as each participle label sequence.
In the embodiment of the present application, the pre-trained conditional probability prediction model is a Conditional Random Field (CRF). A conditional random field is understood to be a conditional probability distribution model of a set of output random variables given the set of input random variables, the assumption of the model being that the output random variables constitute a markov random field. In the scene of word segmentation of a text to be segmented, the input random variable can be a two-dimensional sequence x consisting of a character sequence and a dictionary tag sequence, and the output random variable can be a word segmentation tag sequence y.
In the embodiment of the application, segmenting the text to be segmented can in fact be converted into the problem of predicting the conditional probability of the character sequence being marked as each segmentation tag sequence: the higher the predicted conditional probability of a segmentation tag sequence, the more likely that segmentation tag sequence is correct.
Illustratively, the formula for the conditional random field is:
p(y|x) = (1/Z(x)) · exp( Σ_i Σ_l μ_l · s_l(y, x, i, j) + Σ_i Σ_k λ_k · t_k(y, x, i, j) )

wherein the normalization factor is

Z(x) = Σ_y exp( Σ_i Σ_l μ_l · s_l(y, x, i, j) + Σ_i Σ_k λ_k · t_k(y, x, i, j) )
in the above formula, p (y | x) represents a conditional probability that a two-dimensional sequence x composed of a character sequence and a dictionary tag sequence is labeled as a participle tag sequence y;
i represents the ith position in the character sequence or dictionary tag sequence;
j indexes the columns of the two-dimensional sequence x: when j is 0, it indicates the character sequence in the two-dimensional sequence x, and when j is 1, it indicates the dictionary tag sequence in the two-dimensional sequence x;
p represents the number of characters contained in the character sequence, the number of dictionary labels contained in the dictionary label sequence and the number of participle labels contained in the participle label sequence;
m is the number of word segmentation label sequences y obtained by carrying out word segmentation labeling on the character sequences x;
z (x) is a normalization factor;
s_l(y, x, i, j) represents the l-th state function, and L represents the total number of state functions generated according to the unary templates, where a unary template may comprise at least one of a character feature template, a dictionary feature template and a composite feature template; assuming the number of unary templates is e1, then L = e1 × W × p, where W is the number of segmentation tag types;
t_k(y, x, i, j) represents the k-th transfer function, and K represents the total number of transfer functions generated according to the binary templates, where a binary template may likewise comprise at least one of a character feature template, a dictionary feature template and a composite feature template; assuming the number of binary templates is e2, then K = e2 × W² × p, with W and p having the same meanings as above.
μ_l is the first weight, i.e., the weight of the state function s_l, and λ_k is the second weight, i.e., the weight of the transfer function t_k. The weights λ_k and μ_l are solved by training the conditional probability prediction model; the specific solution process will be described in detail later.
It can be known from the above calculation formula of the conditional random field that, when the conditional probability of a character sequence marked as each participle tag sequence is calculated, because the values of each state function and each transfer function can be obtained under the condition of giving the participle tag sequence, the conditional probability of the character sequence marked as the given participle tag sequence can be obtained by substituting the values of each state function and each transfer function into the conditional probability prediction model.
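The computation just described can be illustrated numerically. The feature functions and weights below are toy values chosen for the sketch, not the trained model: for each candidate tag sequence y, the weighted state and transfer function values are summed, exponentiated, and normalized by Z(x).

```python
# Toy sketch of the conditional random field computation: score each
# candidate segmentation tag sequence by the weighted sum of its state
# and transfer function values, then normalise with Z(x) to obtain
# p(y|x). Weights (mu, lam) here are illustrative constants.
import itertools
import math

def score(y, x, state_fns, transfer_fns):
    total = 0.0
    for i in range(len(x)):
        total += sum(mu * s(y, x, i) for mu, s in state_fns)
        total += sum(lam * t(y, x, i) for lam, t in transfer_fns)
    return total

x = [("double", "0"), ("lung", "0")]
# mu = 2.0 rewards tag "B" on "double"; lam = 1.5 rewards the B -> E
# transition on "lung".
state_fns = [(2.0, lambda y, x, i:
              1 if x[i][0] == "double" and y[i] == "B" else 0)]
transfer_fns = [(1.5, lambda y, x, i:
                 1 if i > 0 and x[i][0] == "lung"
                 and y[i - 1] == "B" and y[i] == "E" else 0)]

candidates = [list(y) for y in itertools.product("BIES", repeat=len(x))]
scores = {tuple(y): score(y, x, state_fns, transfer_fns) for y in candidates}
Z = sum(math.exp(v) for v in scores.values())
best = max(scores, key=scores.get)
print(best)  # ('B', 'E')
print(math.exp(scores[best]) / Z)
```

The sequence ("B", "E") fires both features and therefore receives the highest conditional probability among the 16 candidates.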
Illustratively, following the character sequence and corresponding dictionary tag sequence described above in table 1: if the unary templates U01-U18 in table 2 are used to generate the state functions s_l(y, x, i, j), the total number of state functions that can be generated is L = 18 × 4 × 16 = 1152, i.e., s_1(y, x, i, j) to s_1152(y, x, i, j). If the binary template B01 in table 2 is used to generate the transfer functions t_k(y, x, i, j), the total number of transfer functions that can be generated is K = 1 × 4² × 16 = 256, i.e., t_1(y, x, i, j) to t_256(y, x, i, j).
When a two-dimensional sequence x composed of the character sequence and the corresponding dictionary tag sequence described in table 1 and a segmentation tag sequence y are given, the value of each state function and the value of each transfer function may be determined starting from i = 1 and j = 0 until the values are determined for i = p (p = 16 in this example) and j = 1; the conditional probability of the character sequence being marked as the given segmentation tag sequence can then be found.
When evaluating any one state function, its setting condition may be "y_i = n1 and x_(i±d, j=0) = m", "y_i = n1 and x_(i±d, j=1) = h", or "y_i = n1, x_(i±d, j=0) = m and x_(i±d, j=1) = h", depending on the feature template that generated it. If the setting condition is satisfied, the state function is determined to take the value 1; if not, it takes the value 0.
When evaluating any transfer function, its setting condition may be, for example, "y_i = n1, y_(i-1) = n2 and x_(i±d, j=0) = m". If the setting condition is satisfied, the transfer function is determined to take the value 1; if not, it takes the value 0.
It should be noted that the dictionary tags corresponding to a matching character string are already marked in the dictionary tag sequence. In some specific scenarios, when the dictionary-matching process is sufficiently accurate, the dictionary tags of a matching character string can be treated as equivalent to a segmentation result, and the segmentation tags of the matching character string can be derived directly from its dictionary tags. In that case, there is no need to try every candidate segmentation tag for each character of the matching character string; instead, the segmentation tags of the matching character string can be configured directly based on the result marked in the dictionary tag sequence. This saves invocations of the conditional probability prediction model and makes the segmentation prediction process more efficient.
For example, continuing with the electronic medical record and the corresponding dictionary tag sequence shown in table 1: after matching the electronic medical record against the dictionary to obtain the dictionary tag sequence, it can be determined that "dry-wet rale" and "fricative" are matching character strings, i.e., they can be taken as already-segmented words. The segmentation tags originally corresponding to "dry-wet rale" and "fricative" have 4^8 possibilities. In this scheme, with the dictionary tag sequence as a reference factor, when predicting the conditional probability of the electronic medical record being marked as each segmentation tag sequence, the segmentation tags corresponding to "dry-wet rale" can be fixed as B-I-I-I-E and those corresponding to "fricative" as B-I-E, so that the candidate segmentation tag sequences can be reduced by 3^8, i.e., 3^8 fewer possible segmentation tag sequences. Compared with calculating the conditional probabilities one by one for all segmentation tag sequences, this saves invocations of the conditional probability prediction model and makes the segmentation prediction process more efficient.
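The size of this saving can be sketched numerically. The counts below are assumptions taken from the running example (p = 16 characters per the text; the two matched strings cover 5 + 3 = 8 characters, each of which would otherwise admit 4 candidate tags):

```python
# Sketch: candidate-space reduction when dictionary-matched spans are
# assigned fixed segmentation tags. With W = 4 tags per character,
# n free characters contribute 4**n candidate tag combinations; fixing
# the 8 characters of "dry-wet rale" (5) and "fricative" (3) removes
# that entire factor from the search.
W = 4
total_chars = 16        # p in the example sequence (per the text)
fixed_chars = 5 + 3     # characters covered by the matched strings

unconstrained = W ** total_chars
constrained = W ** (total_chars - fixed_chars)
print(unconstrained // constrained)  # 65536, i.e. a factor of 4**8
```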
After the conditional probability of the character sequence being marked as each segmentation tag sequence is obtained, the segmentation tag sequence corresponding to the conditional probability satisfying the preset condition can be determined as the target segmentation tag sequence. Illustratively, the segmentation tag sequence corresponding to the maximum conditional probability is determined as the target segmentation tag sequence. Word segmentation processing is then performed on the text to be segmented based on the target segmentation tag sequence.
In one example, continuing with the electronic medical record shown in table 1: after comparing the conditional probabilities corresponding to the candidate segmentation tag sequences, the segmentation tag sequence corresponding to the highest conditional probability is selected as the target segmentation tag sequence, as shown in table 3:
TABLE 3

Electronic medical record | Dictionary tag sequence | Target word segmentation tag sequence
double                    | 0                       | B
lung                      | 0                       | E
not                       | 0                       | B
smell                     | 0                       | E
and                       | 0                       | S
dry                       | 1                       | B
wet                       | 1                       | I
property                  | 1                       | I
rale                      | 1                       | I
sound                     | 1                       | E
，                        | 0                       | S
not                       | 0                       | B
smell                     | 0                       | E
and                       | 0                       | S
membrane                  | 0                       | B
chest                     | 0                       | E
rub                       | 1                       | B
rub                       | 1                       | I
sound                     | 1                       | E
。                        | 0                       | S
After performing word segmentation processing on the electronic medical record based on the target segmentation tag sequence shown in table 3, the obtained segmentation result includes: "double lung", "not smelling", "and", "dry-wet rale", "，", "not smelling", "and", "membrane chest", "fricative", "。".
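The segmentation step itself — turning the target B/I/E/S tag sequence into words — can be sketched as a simple decoder. The example uses placeholder Latin characters rather than the Chinese characters of the medical record:

```python
# Decode a B/I/E/S segmentation tag sequence into words: "B" begins a
# multi-character word, "I" continues it, "E" ends it, and "S" marks a
# single-character word.
def decode(chars, tags):
    words, buf = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = [ch]
        elif tag == "I":
            buf.append(ch)
        else:  # "E"
            buf.append(ch)
            words.append("".join(buf))
            buf = []
    return words

print(decode(list("ABCDE"), ["B", "E", "B", "E", "S"]))
# ['AB', 'CD', 'E']
```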
In the embodiment of the application, when the conditional probability prediction model is used to calculate the conditional probability of the character sequence being marked as each segmentation tag sequence, the influencing factors of the conditional probability include the values of the state functions and transfer functions corresponding to each segmentation tag sequence, and the weights λ_k of the transfer functions and μ_l of the state functions. The weights λ_k and μ_l are solved by training the conditional probability prediction model.
Next, a training process of the conditional probability prediction model in the embodiment of the present application will be described. Referring to fig. 5, a schematic flow chart of a training process of a conditional probability prediction model provided in the embodiment of the present application is shown, including the following steps:
step 501, a sample set is obtained, wherein the sample set comprises a plurality of groups of samples, and each group of samples comprises a sample character sequence, a sample dictionary tag sequence and at least one sample word segmentation tag sequence corresponding to a sample text to be segmented.
Step 502, determining, for each group of samples, a value of each state function and a value of each transfer function when the sample character sequences in the group of samples are marked as each sample word segmentation tag sequence according to at least one of the sample character sequences and the sample dictionary tag sequences.
Step 503, inputting the values of the state functions and the values of the transfer functions determined by each group of samples into a conditional probability prediction model to be trained, and determining the conditional probability functions corresponding to each group of samples, where the conditional probability functions include the first weights of the state functions and the second weights of the transfer functions.
Step 504, inputting the determined conditional probability function corresponding to each group of samples as an independent variable into a preset loss function, and determining a loss value of the preset loss function by adjusting a value of the first weight and a value of the second weight included in the preset loss function.
And 505, when the loss value meets a preset convergence condition, determining a first current value of the first weight and a second current value of the second weight, and determining a conditional probability prediction model obtained under the condition that the first weight is the first current value and the second weight is the second current value.
Specifically, after the conditional probability function is input as an argument into the preset loss function, initial values may be assigned to the two parameters to be trained, λ_k and μ_l, and the parameters are then adjusted and updated according to Newton's iteration method or gradient descent until the loss value of the preset loss function satisfies the preset convergence condition, at which point updating stops. The resulting values of the parameters λ_k and μ_l determine the λ_k and μ_l in the conditional random field formula, thereby yielding the conditional probability prediction model.
In the embodiment of the application, during training of the conditional probability prediction model, the dictionary tag sequence can also be used as a reference factor for predicting the segmentation tag sequence, which accelerates model convergence: a relatively small amount of sample corpus suffices to train the conditional probability prediction model. This avoids the need for a large amount of sample corpus with manually annotated segmentation tags, saves labor cost, and improves the efficiency of constructing the training set. After the conditional probability prediction model is obtained, its prediction accuracy can be tested with a test sample set; the specific test process is not described here.
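Steps 501-505 amount to fitting the weights by gradient-based optimization of the (log-)likelihood. The following is a deliberately tiny sketch under assumed toy data and features (one state-function weight mu and one transfer-function weight lam, one training pair, plain gradient ascent), not the patent's training procedure:

```python
# Toy sketch of steps 501-505: fit weights mu (state function) and lam
# (transfer function) by gradient ascent on the log-likelihood
# log p(y*|x) of a single labelled sample. The gradient of each weight
# is the observed feature count minus the model-expected feature count.
import itertools
import math

x = [("double", "0"), ("lung", "0")]
y_true = ("B", "E")  # the annotated segmentation tag sequence

s = lambda y, i: 1 if x[i][0] == "double" and y[i] == "B" else 0
t = lambda y, i: 1 if i > 0 and y[i - 1] == "B" and y[i] == "E" else 0

def feats(y):
    return (sum(s(y, i) for i in range(len(x))),
            sum(t(y, i) for i in range(len(x))))

cands = list(itertools.product("BIES", repeat=len(x)))
mu, lam, lr = 0.0, 0.0, 0.5
for step in range(200):
    exp_scores = {y: math.exp(mu * feats(y)[0] + lam * feats(y)[1])
                  for y in cands}
    Z = sum(exp_scores.values())
    exp_s = sum(w * feats(y)[0] for y, w in exp_scores.items()) / Z
    exp_t = sum(w * feats(y)[1] for y, w in exp_scores.items()) / Z
    mu += lr * (feats(y_true)[0] - exp_s)
    lam += lr * (feats(y_true)[1] - exp_t)

best = max(cands, key=lambda y: mu * feats(y)[0] + lam * feats(y)[1])
print(best)  # ('B', 'E')
```

After training, both weights are positive and the annotated sequence receives the highest score, mirroring how the trained conditional random field favors tag sequences consistent with the training corpus.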
Based on the same application concept, a text word segmentation device corresponding to the text word segmentation method is further provided in the embodiment of the present application, and because the principle of solving the problem of the device in the embodiment of the present application is similar to that of the text word segmentation method in the embodiment of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 6, a schematic structural diagram of a text word segmentation apparatus 60 provided in the embodiment of the present application includes:
the conversion module 61 is used for converting the text to be segmented into a character sequence;
a first determining module 62, configured to match a character string that meets a preset length and is included in the character sequence with a standard word in a dictionary that is constructed in advance, determine a matching character string that matches the standard word, and assign a corresponding dictionary tag to each character of the matching character string and each character except the matching character string in the character sequence, respectively, so as to obtain a dictionary tag sequence;
a second determining module 63, configured to determine at least one word segmentation tag corresponding to each character in the character sequence, so as to obtain multiple word segmentation tag sequences;
a conditional probability prediction module 64, configured to determine a conditional probability that the character sequence is marked as each word segmentation tag sequence according to the character sequence, the dictionary tag sequence, and a pre-trained conditional probability prediction model;
and a word segmentation processing module 65, configured to determine a word segmentation tag sequence corresponding to the conditional probability meeting a preset condition as a target word segmentation tag sequence, and perform word segmentation processing on the text to be word segmented based on the target word segmentation tag sequence.
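The job of the word segmentation processing module 65 can be sketched as follows: take the tag sequence with the highest conditional probability as the target sequence, then cut the text at word boundaries. The tag names B/M/E/S (word begin, middle, end, single character) are assumed shorthand for the first to fourth word segmentation labels described in this application.

```python
def segment(text, probabilities):
    """Pick the target word segmentation tag sequence by maximum conditional
    probability, then split the text wherever a word ends."""
    target = max(probabilities, key=probabilities.get)
    words, start = [], 0
    for i, tag in enumerate(target):
        if tag in ("E", "S"):       # third label (word end) or fourth (single)
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):           # tolerate a sequence ending mid-word
        words.append(text[start:])
    return words
```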
In some embodiments of the present application, when determining the conditional probability that the character sequence is marked as each word segmentation tag sequence according to the character sequence, the dictionary tag sequence and a pre-trained conditional probability prediction model, the conditional probability prediction module 64 is specifically configured to:
determining a plurality of feature templates according to the character sequence and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the plurality of determined feature templates;
determining the values of each state function and each transfer function under the condition that the character sequence is marked as each word segmentation tag sequence;
and inputting the values of the state functions and the values of the transfer functions corresponding to each word segmentation tag sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability that the character sequence is marked as each word segmentation tag sequence.
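The calculation in the steps above can be sketched by brute force: score every candidate word segmentation tag sequence with a weighted sum of feature-function values and normalise the scores into probabilities. The feature functions and weights below are illustrative placeholders, and a practical conditional-random-field implementation would use dynamic programming (forward-backward) rather than enumerating all sequences.

```python
import math
from itertools import product

def sequence_probabilities(x, label_sets, features, weights):
    """Return {y: P(y | x)} over all candidate tag sequences, where each
    feature function f(y, x, i) is a state or transfer function value."""
    candidates = [tuple(y) for y in product(*label_sets)]
    scores = []
    for y in candidates:
        # Weighted sum of feature-function values over all positions.
        score = sum(w * sum(f(y, x, i) for i in range(len(y)))
                    for f, w in zip(features, weights))
        scores.append(score)
    z = sum(math.exp(s) for s in scores)  # normalisation term Z(x)
    return {y: math.exp(s) / z for y, s in zip(candidates, scores)}
```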
In some embodiments of the present application, the feature template comprises at least one of:
a character feature template for representing individual characters in the sequence of characters;
the character feature template is used for representing the incidence relation of different characters in the character sequence;
a dictionary feature template for representing a single dictionary tag in the sequence of dictionary tags;
a dictionary feature template for representing an associative relationship between different dictionary tags in the sequence of dictionary tags;
and the composite characteristic template consists of the character characteristic template and the dictionary characteristic template.
In some embodiments of the present application, the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template includes the character feature template, the conditional probability prediction module 64 generates a state function s (y, x, i, j) according to the character feature template, where s (y, x, i, j) is:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=0} = m; s(y, x, i, j) = 0, otherwise.
if the feature template includes the dictionary feature template, the conditional probability prediction module 64 generates a state function s (y, x, i, j) according to the dictionary feature template, where s (y, x, i, j) is:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
if the feature template includes the composite feature template, the conditional probability prediction module 64 generates a state function s (y, x, i, j) according to the composite feature template, where s (y, x, i, j) is:
s(y, x, i, j) = 1, if y_i = n_1, x_{i±d, j=0} = m and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; when j = 1, x refers to the dictionary tag sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, and x_{i±d, j=1} represents the dictionary label at the (i±d)-th position of the dictionary tag sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, m represents a given character at the (i±d)-th position in the character sequence, and h represents a given dictionary label at the (i±d)-th position in the dictionary tag sequence.
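The three kinds of state functions can be mirrored in code. Below, x is modelled as a pair of rows (row 0 the character sequence, row 1 the dictionary tag sequence), and instantiating a template fixes the constants n_1, d and the expected value; a composite template additionally checks the second row. For brevity only the +d offset of the i±d pair is shown, and all names are illustrative.

```python
def make_state_fn(n1, d, j, value, j2=None, value2=None):
    """Return s(y, x, i): 1 when y[i] equals n1 and the (i+d)-th element of
    row j of x equals value; a composite template also requires the (i+d)-th
    element of row j2 to equal value2. Otherwise 0."""
    def s(y, x, i):
        if y[i] != n1 or x[j][i + d] != value:
            return 0
        if j2 is not None and x[j2][i + d] != value2:
            return 0  # composite condition on the other row failed
        return 1
    return s
```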
In some embodiments of the present application, the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template includes the character feature template, the conditional probability prediction module 64 generates a transfer function t (y, x, i, j) according to the character feature template, where t (y, x, i, j) is:
t(y, x, i, j) = 1, if y_i = n_1, y_{i−1} = n_2 and x_{i±d, j=0} = m; t(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y, and y_{i−1} represents the (i−1)-th word segmentation label; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, n_2 represents a given value of the (i−1)-th word segmentation label, and m represents a given character at the (i±d)-th position in the character sequence.
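Correspondingly, a transfer function generated from a character feature template can be sketched as a closure over the given labels n_1, n_2 and character m (again with only the +d offset shown; row 0 of x is the character sequence, and all names are illustrative).

```python
def make_transfer_fn(n1, n2, d, m):
    """Return t(y, x, i): 1 when y[i] == n1, y[i-1] == n2 and the character
    at position i+d of the character sequence equals m; otherwise 0."""
    def t(y, x, i):
        # i >= 1 because a transfer function looks at the previous label.
        return 1 if (i >= 1 and y[i] == n1 and y[i - 1] == n2
                     and x[0][i + d] == m) else 0
    return t
```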
In some embodiments of the present application, the at least one word segmentation tag comprises: the first label of the starting position of the word, the second label of the middle position of the word, the third label of the ending position of the word and the fourth label of the single word;
in some embodiments of the present application, when determining at least one word segmentation tag corresponding to each character in the character sequence to obtain multiple word segmentation tag sequences, the second determining module 63 is specifically configured to:
determining at least one word segmentation label corresponding to each character in the character sequence;
and randomly selecting one word segmentation label from at least one word segmentation label corresponding to each character as a target word segmentation label, and taking a sequence formed by the target word segmentation labels corresponding to the characters as a word segmentation label sequence.
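To illustrate the four word segmentation labels, the sketch below encodes a known segmentation as a per-character label sequence, using the assumed tag names B (first label, word start), M (second, word middle), E (third, word end) and S (fourth, single character):

```python
def words_to_labels(words):
    """Map a list of segmented words to per-character B/M/E/S labels."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")      # single-character word
        else:
            # First char B, last char E, any chars between are M.
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels
```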
In some embodiments of the present application, when the first determining module 62 assigns a corresponding dictionary tag to each character of the matching character string and each character except the matching character string in the character sequence, respectively, to obtain a dictionary tag sequence, specifically:
allocating a dictionary label to each character in the character sequence according to the following rules to obtain a dictionary label sequence consisting of dictionary labels:
and aiming at any character in the character sequence, if the character is the character in the matched character string, a first dictionary label is allocated to the character, and if the character is the character except for the matched character string, a second dictionary label is allocated to the character.
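The dictionary-labelling rule above can be sketched with forward maximum matching. The tag values "D" (first dictionary label, for characters inside a matched string) and "O" (second dictionary label, for characters outside), the maximum match length, and the matching strategy are illustrative assumptions; the patent only requires that matched and unmatched characters receive different labels.

```python
def dictionary_label_sequence(chars, dictionary, max_len=4):
    """Assign each character the first ("D") or second ("O") dictionary label."""
    labels = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        # Try the longest candidate string starting at position i first.
        for length in range(min(max_len, len(chars) - i), 1, -1):
            if "".join(chars[i:i + length]) in dictionary:
                labels[i:i + length] = ["D"] * length
                i += length
                break
        else:
            i += 1  # no matched string starts here
    return labels
```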
In some embodiments of the present application, the apparatus further comprises:
a model training module 66, configured to train the conditional probability prediction model according to the following manners:
obtaining a sample set, wherein the sample set comprises a plurality of groups of samples, and each group of samples comprises a sample character sequence, a sample dictionary label sequence and at least one sample word segmentation label sequence corresponding to a sample text to be segmented;
for each group of samples, determining values of each state function and each transfer function under the condition that the sample character sequences in the group of samples are marked as word segmentation label sequences of each sample according to at least one of the sample character sequences and the sample dictionary label sequences;
inputting the values of the state functions and the values of the transfer functions determined by each group of samples into a conditional probability prediction model to be trained, and determining the conditional probability function corresponding to each group of samples, wherein the conditional probability function comprises a first weight of the state function and a second weight of the transfer function;
inputting the determined conditional probability function corresponding to each group of samples into a preset loss function as an independent variable, and determining a loss value of the preset loss function by adjusting the value of the first weight and the value of the second weight included in the preset loss function;
and when the loss value meets a preset convergence condition, determining a first current value of the first weight and a second current value of the second weight, and determining a conditional probability prediction model obtained under the condition that the first weight is the first current value and the second weight is the second current value.
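As a sketch of how the conditional probability function can feed a preset loss function, the snippet below computes a per-sample negative log-likelihood from precomputed feature-function value sums and the first/second weights. The negative log-likelihood choice is an assumption; the patent does not fix a particular loss function.

```python
import math

def sample_nll(feature_values_per_candidate, true_index, weights):
    """feature_values_per_candidate[c][k] is the summed value of the k-th
    state/transfer function when the sample character sequence is marked
    with the c-th candidate tag sequence; true_index selects the sample's
    labelled word segmentation tag sequence."""
    scores = [sum(w * v for w, v in zip(weights, vals))
              for vals in feature_values_per_candidate]
    z = sum(math.exp(s) for s in scores)
    return -(scores[true_index] - math.log(z))  # -log P(true | sample)
```

Summing this quantity over all sample groups gives a loss whose value decreases as the weights are adjusted toward the convergence condition.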
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present application further provides an electronic device 700. As shown in fig. 7, which is a schematic structural diagram of the electronic device 700 provided in the embodiment of the present application, the electronic device includes: a processor 701, a memory 702 and a bus 703. The memory 702 stores machine-readable instructions executable by the processor 701; when the electronic device is operating, the processor 701 and the memory 702 communicate via the bus 703, and the processor executes the machine-readable instructions to perform the steps of the text word segmentation method set forth in the above method embodiments.
The present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to execute the steps of the text word segmentation method proposed in the above method embodiments.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when the computer program on the storage medium is executed, the text word segmentation method can be executed, so that the text containing the unstructured data can be rapidly and accurately segmented.
The application provides a text word segmentation method and a text word segmentation device, firstly, a text to be segmented can be converted into a character sequence, then, a character string which meets a preset length in the character sequence can be matched with a standard word in a pre-constructed dictionary, a dictionary label sequence can be obtained based on a matching result, and various word segmentation label sequences can be obtained by determining at least one word segmentation label corresponding to each character in the character sequence. Further, the dictionary tag sequence and the character sequence can be used as input of the model, the conditional probability when the character sequence is marked as each word segmentation tag sequence is predicted by using the conditional probability prediction model, then the target word segmentation tag sequence is determined based on the obtained conditional probability, and word segmentation processing is carried out on the text to be segmented based on the target word segmentation tag sequence.
The method comprises two word segmentation prediction processes: dictionary matching and prediction by the conditional probability prediction model. By combining the two, on one hand, the dictionary tag sequence obtained by dictionary matching serves as a reference factor during prediction based on the conditional probability prediction model, which improves the accuracy of the predicted word segmentation result; on the other hand, the conditional probability prediction model is introduced so that, given the character sequence and the dictionary tag sequence corresponding to the text to be segmented, the conditional probability that the character sequence is marked as a certain word segmentation tag sequence is predicted, and the word segmentation tag sequence corresponding to the character sequence is obtained directly. That is, the word segmentation tags corresponding to all characters in the text to be recognized are obtained through a single prediction process, which improves text word segmentation efficiency.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A text word segmentation method is characterized by comprising the following steps:
converting a text to be word segmented into a character sequence;
matching character strings meeting preset length contained in the character sequence with standard words in a pre-constructed dictionary, determining matched character strings matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
determining the conditional probability of the character sequence marked as each participle label sequence according to the character sequence, the dictionary label sequence and a pre-trained conditional probability prediction model;
determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence, and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
2. The method of claim 1, wherein determining a conditional probability that the character sequence is marked as each word segmentation tag sequence based on the character sequence, the dictionary tag sequence, and a pre-trained conditional probability prediction model comprises:
determining a plurality of feature templates according to the character sequence and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the plurality of determined feature templates;
determining the values of each state function and each transfer function under the condition that the character sequence is marked as each word segmentation tag sequence;
and inputting the values of the state functions and the values of the transfer functions corresponding to each word segmentation tag sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability that the character sequence is marked as each word segmentation tag sequence.
3. The method of claim 2, wherein the feature template comprises at least one of:
a character feature template for representing individual characters in the sequence of characters;
the character feature template is used for representing the incidence relation of different characters in the character sequence;
a dictionary feature template for representing a single dictionary tag in the sequence of dictionary tags;
a dictionary feature template for representing an associative relationship between different dictionary tags in the sequence of dictionary tags;
and the composite characteristic template consists of the character characteristic template and the dictionary characteristic template.
4. The method of claim 3, wherein the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template comprises the character feature template, generating a state function s (y, x, i, j) according to the character feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=0} = m; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the dictionary feature template, generating a state function s (y, x, i, j) according to the dictionary feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the composite feature template, generating a state function s (y, x, i, j) according to the composite feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1, x_{i±d, j=0} = m and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; when j = 1, x refers to the dictionary tag sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, and x_{i±d, j=1} represents the dictionary label at the (i±d)-th position of the dictionary tag sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, m represents a given character at the (i±d)-th position in the character sequence, and h represents a given dictionary label at the (i±d)-th position in the dictionary tag sequence.
5. The method of claim 3, wherein the sequence of characters comprises p characters and the sequence of word segmentation tags comprises p word segmentation tags;
if the feature template comprises the character feature template, generating a transfer function t (y, x, i, j) according to the character feature template, wherein the transfer function t (y, x, i, j) is as follows:
t(y, x, i, j) = 1, if y_i = n_1, y_{i−1} = n_2 and x_{i±d, j=0} = m; t(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y, and y_{i−1} represents the (i−1)-th word segmentation label; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, n_2 represents a given value of the (i−1)-th word segmentation label, and m represents a given character at the (i±d)-th position in the character sequence.
6. The method of any of claims 1 to 5, wherein the at least one word segmentation tag comprises: the first label of the starting position of the word, the second label of the middle position of the word, the third label of the ending position of the word and the fourth label of the single word;
determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences, including:
determining at least one word segmentation label corresponding to each character in the character sequence;
and randomly selecting one word segmentation label from at least one word segmentation label corresponding to each character as a target word segmentation label, and taking a sequence formed by the target word segmentation labels corresponding to the characters as a word segmentation label sequence.
7. The method according to any one of claims 1 to 5, wherein assigning a corresponding dictionary label to each character of the matching character string and each character except the matching character string in the character sequence respectively to obtain a dictionary label sequence comprises:
allocating a dictionary label to each character in the character sequence according to the following rules to obtain a dictionary label sequence consisting of dictionary labels:
and aiming at any character in the character sequence, if the character is the character in the matched character string, a first dictionary label is allocated to the character, and if the character is the character except for the matched character string, a second dictionary label is allocated to the character.
8. The method of claim 1, wherein the conditional probability prediction model is trained according to:
obtaining a sample set, wherein the sample set comprises a plurality of groups of samples, and each group of samples comprises a sample character sequence, a sample dictionary label sequence and at least one sample word segmentation label sequence corresponding to a sample text to be segmented;
for each group of samples, determining the values of each state function and each transfer function under the condition that the sample character sequences in the group of samples are marked as word segmentation label sequences of each sample according to at least one of the sample character sequences and the sample dictionary label sequences;
inputting the values of the state functions and the values of the transfer functions determined by each group of samples into a conditional probability prediction model to be trained, and determining the conditional probability function corresponding to each group of samples, wherein the conditional probability function comprises a first weight of the state function and a second weight of the transfer function;
inputting the determined conditional probability function corresponding to each group of samples into a preset loss function as an independent variable, and determining a loss value of the preset loss function by adjusting the value of the first weight and the value of the second weight included in the preset loss function;
and when the loss value meets a preset convergence condition, determining a first current value of the first weight and a second current value of the second weight, and determining a conditional probability prediction model obtained under the condition that the first weight is the first current value and the second weight is the second current value.
9. A text segmentation apparatus, comprising:
the conversion module is used for converting the text to be segmented into a character sequence;
the first determining module is used for matching the character strings which are contained in the character sequence and meet the preset length with standard words in a dictionary which is constructed in advance, determining matched character strings which are matched with the standard words, and respectively allocating corresponding dictionary labels to each character of the matched character strings in the character sequence and each character except the matched character strings to obtain a dictionary label sequence;
the second determining module is used for determining at least one word segmentation label corresponding to each character in the character sequence to obtain a plurality of word segmentation label sequences;
the conditional probability prediction module is used for determining the conditional probability that the character sequence is marked as each word segmentation tag sequence according to the character sequence, the dictionary tag sequence and a pre-trained conditional probability prediction model;
and the word segmentation processing module is used for determining a word segmentation label sequence corresponding to the conditional probability meeting the preset condition as a target word segmentation label sequence and performing word segmentation processing on the text to be word segmented based on the target word segmentation label sequence.
10. The apparatus of claim 9, wherein the conditional probability prediction module, when determining the conditional probability that the character sequence is marked as each word segmentation tag sequence based on the character sequence, the dictionary tag sequence, and a pre-trained conditional probability prediction model, is specifically configured to:
determining a plurality of feature templates according to the character sequence and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the plurality of determined feature templates;
determining the values of each state function and each transfer function under the condition that the character sequence is marked as each word segmentation tag sequence;
and inputting the values of the state functions and the values of the transfer functions corresponding to each word segmentation tag sequence into a pre-trained conditional probability prediction model, and respectively calculating the conditional probability that the character sequence is marked as each word segmentation tag sequence.
11. The apparatus of claim 10, wherein the feature template comprises at least one of:
a character feature template for representing individual characters in the sequence of characters;
the character feature template is used for representing the incidence relation of different characters in the character sequence;
a dictionary feature template for representing a single dictionary tag in the sequence of dictionary tags;
a dictionary feature template for representing an associative relationship between different dictionary tags in the sequence of dictionary tags;
and the composite characteristic template consists of the character characteristic template and the dictionary characteristic template.
12. The apparatus of claim 11, wherein the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template comprises the character feature template, the conditional probability prediction module generates a state function s (y, x, i, j) according to the character feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=0} = m; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the dictionary feature template, the conditional probability prediction module generates a state function s (y, x, i, j) according to the dictionary feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1 and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
if the feature template comprises the composite feature template, the conditional probability prediction module generates a state function s (y, x, i, j) according to the composite feature template, wherein the state function s (y, x, i, j) is as follows:
s(y, x, i, j) = 1, if y_i = n_1, x_{i±d, j=0} = m and x_{i±d, j=1} = h; s(y, x, i, j) = 0, otherwise.
wherein x represents a two-dimensional sequence consisting of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j = 0, x refers to the character sequence in the two-dimensional sequence; when j = 1, x refers to the dictionary tag sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, and x_{i±d, j=1} represents the dictionary label at the (i±d)-th position of the dictionary tag sequence, where d is any integer from 0 to p−i; y_i represents the i-th word segmentation label of the word segmentation tag sequence y; n_1 represents a given value of the i-th word segmentation label of the word segmentation tag sequence y, m represents a given character at the (i±d)-th position in the character sequence, and h represents a given dictionary label at the (i±d)-th position in the dictionary tag sequence.
13. The apparatus of claim 11, wherein the character sequence comprises p characters, the dictionary tag sequence comprises p dictionary tags, and the word segmentation tag sequence comprises p word segmentation tags;
if the feature template comprises the character feature template, the conditional probability prediction module generates a transfer function t(y, x, i, j) according to the character feature template, wherein the transfer function t(y, x, i, j) is as follows:
t(y, x, i, j) = 1, if y_i = n_1, y_{i−1} = n_2 and x_{i±d, j=0} = m; otherwise t(y, x, i, j) = 0
wherein x represents a two-dimensional sequence composed of the character sequence and the dictionary tag sequence; y represents the word segmentation tag sequence; when j is 0, x refers to the character sequence in the two-dimensional sequence; i is any integer from 1 to p; x_{i±d, j=0} represents the character at the (i±d)-th position of the character sequence, where d is any integer from 0 to p−i; y_i represents the ith word segmentation label of the word segmentation label sequence y; y_{i−1} represents the (i−1)th word segmentation label of the word segmentation label sequence y; n_1 represents a given word segmentation label for the ith position; n_2 represents a given word segmentation label for the (i−1)th position; and m represents a given character at the (i±d)-th position of the character sequence.
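The transfer function of claim 13 additionally conditions on the previous label y_{i−1}, which is how a CRF scores label-to-label transitions. A self-contained Python sketch (the labels "B"/"E"/"S" and dictionary tags "D"/"O" are illustrative assumptions, not from the claims):

```python
def char_transfer(n1, n2, d, m):
    """Character-template transfer function: returns 1 when the ith word
    segmentation label is n1, the (i-1)th label is n2, and the character
    at position i+d of the character sequence (row j=0 of x) is m."""
    def t(y, x, i):
        p = i + d
        return 1 if (i >= 1 and 0 <= p < len(x[0]) and y[i] == n1
                     and y[i - 1] == n2 and x[0][p] == m) else 0
    return t

# Hypothetical example: x = (character sequence, dictionary tag sequence).
x = (["研", "究", "生"], ["D", "D", "O"])
y = ["B", "E", "S"]
print(char_transfer("E", "B", 0, "究")(y, x, 1))  # 1: B-to-E transition at "究"
print(char_transfer("E", "B", 0, "究")(y, x, 0))  # 0: no previous label at i=0
```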
14. The apparatus of any of claims 9 to 13, wherein the at least one word segmentation tag comprises: a first label marking the starting position of a word, a second label marking a middle position of a word, a third label marking the ending position of a word, and a fourth label marking a single-character word;
the second determining module, when determining at least one word segmentation tag corresponding to each character in the character sequence to obtain multiple word segmentation tag sequences, is specifically configured to:
determining at least one word segmentation label corresponding to each character in the character sequence;
and randomly selecting one word segmentation label from at least one word segmentation label corresponding to each character as a target word segmentation label, and taking a sequence formed by the target word segmentation labels corresponding to the characters as a word segmentation label sequence.
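Claim 14 forms each candidate word segmentation label sequence by picking one label per character from that character's candidate set. The sketch below enumerates every such combination for a short text; the four labels B/M/E/S and the per-position candidate sets are illustrative assumptions, not constraints stated in the claims.

```python
from itertools import product

# Illustrative labels: B = word start, M = word middle,
# E = word end, S = single-character word.
def candidate_labels(i, n):
    """A hypothetical candidate set per position; the real candidate sets
    would come from the model, this only shows a plausible shape."""
    if n == 1:
        return ["S"]
    if i == 0:
        return ["B", "S"]   # a text cannot begin mid-word
    if i == n - 1:
        return ["E", "S"]   # a text cannot end on a word start or middle
    return ["B", "M", "E", "S"]

def label_sequences(chars):
    """Form every sequence obtained by selecting one candidate label
    for each character, as claim 14 describes."""
    return [list(p) for p in product(*(candidate_labels(i, len(chars))
                                       for i in range(len(chars))))]

seqs = label_sequences(["中", "国"])
print(len(seqs))  # 2 candidates x 2 candidates = 4 sequences
print(seqs[0])    # ['B', 'E']
```

The conditional probability prediction model then scores each enumerated sequence, and the sequence whose probability meets the preset condition is chosen as the target.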
15. The apparatus according to any one of claims 9 to 13, wherein the first determining module, when assigning a corresponding dictionary tag to each character of the matching character string and each character except the matching character string in the character sequence, respectively, to obtain a dictionary tag sequence, is specifically configured to:
allocating a dictionary label to each character in the character sequence according to the following rules to obtain a dictionary label sequence consisting of dictionary labels:
and aiming at any character in the character sequence, if the character is the character in the matched character string, a first dictionary label is allocated to the character, and if the character is the character except for the matched character string, a second dictionary label is allocated to the character.
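The dictionary labeling rule of claim 15 can be sketched as follows: substrings up to a preset length are matched against the dictionary, and each character receives the first dictionary label if it lies inside a match, the second otherwise. The tag names "D"/"O" and the preset length of 4 are illustrative assumptions, not values from the claims.

```python
def dictionary_tag_sequence(chars, dictionary, max_len=4):
    """Assign each character the first dictionary label ("D") if it lies
    inside any substring of up to max_len characters that matches a
    standard word in the dictionary, and the second label ("O") otherwise."""
    tags = ["O"] * len(chars)
    text = "".join(chars)
    for start in range(len(chars)):
        # Prefer longer matches starting at this position.
        for length in range(min(max_len, len(chars) - start), 0, -1):
            if text[start:start + length] in dictionary:
                for k in range(start, start + length):
                    tags[k] = "D"
                break
    return tags

print(dictionary_tag_sequence(list("研究生命"), {"研究生", "生命"}))
# ['D', 'D', 'D', 'D']: "研究生" covers positions 0-2, "生命" covers 2-3
```

Note that matches starting at every position are considered, so overlapping dictionary words (as in the example) all contribute to the label sequence.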
CN201910094380.2A 2019-01-30 2019-01-30 Text word segmentation method and device Active CN109829162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910094380.2A CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Publications (2)

Publication Number Publication Date
CN109829162A CN109829162A (en) 2019-05-31
CN109829162B true CN109829162B (en) 2022-04-08

Family

ID=66863299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910094380.2A Active CN109829162B (en) 2019-01-30 2019-01-30 Text word segmentation method and device

Country Status (1)

Country Link
CN (1) CN109829162B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688853B (en) * 2019-08-12 2022-09-30 平安科技(深圳)有限公司 Sequence labeling method and device, computer equipment and storage medium
CN111831929B (en) * 2019-09-24 2024-01-02 北京嘀嘀无限科技发展有限公司 Method and device for acquiring POI information
CN110795938B (en) * 2019-11-11 2023-11-10 北京小米智能科技有限公司 Text sequence word segmentation method, device and storage medium
CN111026282B (en) * 2019-11-27 2023-05-23 上海明品医学数据科技有限公司 Control method for judging whether medical data labeling is carried out in input process
CN111695355A (en) * 2020-05-26 2020-09-22 平安银行股份有限公司 Address text recognition method, device, medium and electronic equipment
CN112101021A (en) * 2020-09-03 2020-12-18 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for realizing standard word mapping
CN112464667B (en) * 2020-11-18 2021-11-16 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112861531B (en) * 2021-03-22 2023-11-14 北京小米移动软件有限公司 Word segmentation method, device, storage medium and electronic equipment
CN113609850A (en) * 2021-07-02 2021-11-05 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium
CN117493540A (en) * 2023-12-28 2024-02-02 荣耀终端有限公司 Text matching method, terminal device and computer readable storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
CN101082909A (en) * 2007-06-28 2007-12-05 腾讯科技(深圳)有限公司 Method and system for dividing Chinese sentences for recognizing deriving word
WO2010024052A1 (en) * 2008-08-27 2010-03-04 日本電気株式会社 Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN102262634A (en) * 2010-05-24 2011-11-30 北京大学深圳研究生院 Automatic questioning and answering method and system
CN102929870A (en) * 2011-08-05 2013-02-13 北京百度网讯科技有限公司 Method for establishing word segmentation model, word segmentation method and devices using methods
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103678318A (en) * 2012-08-31 2014-03-26 富士通株式会社 Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment

Non-Patent Citations (3)

Title
"Research and Implementation of Chinese Word Segmentation Based on Combining Statistics and a Dictionary"; Zhou Qi; China Excellent Doctoral and Master's Dissertations Full-text Database (Master's), Information Science and Technology Series; 20160315; full text *
Qi-yu Jiang; Hong-yi Li; Jia-fen Liang; Qing-xiang Wang et al. "Multi-combined Features Text Mining of TCM Medical Cases with CRF". 2016 8th International Conference on Information Technology in Medicine and Education (ITME). 2017. *
Yi-Feng Pan; Xinwen Hou; Cheng-Lin Liu. "Text Localization in Natural Scene Images Based on Conditional Random Field". 2009 10th International Conference on Document Analysis and Recognition. 2009. *

Also Published As

Publication number Publication date
CN109829162A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN109829162B (en) Text word segmentation method and device
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN107193807B (en) Artificial intelligence-based language conversion processing method and device and terminal
CN110163181B (en) Sign language identification method and device
CN106557563B (en) Query statement recommendation method and device based on artificial intelligence
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN110619127B (en) Mongolian Chinese machine translation method based on neural network turing machine
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN113591457A (en) Text error correction method, device, equipment and storage medium
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN108268439B (en) Text emotion processing method and device
CN110472062B (en) Method and device for identifying named entity
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN112381079A (en) Image processing method and information processing apparatus
CN114021573B (en) Natural language processing method, device, equipment and readable storage medium
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN110399477A (en) A kind of literature summary extracting method, equipment and can storage medium
CN113722436A (en) Text information extraction method and device, computer equipment and storage medium
CN113780418A (en) Data screening method, system, equipment and storage medium
CN109977430B (en) Text translation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant