CN109829162A - Text segmentation method and device - Google Patents
- Publication number
- CN109829162A CN109829162A CN201910094380.2A CN201910094380A CN109829162A CN 109829162 A CN109829162 A CN 109829162A CN 201910094380 A CN201910094380 A CN 201910094380A CN 109829162 A CN109829162 A CN 109829162A
- Authority
- CN
- China
- Prior art keywords
- label
- character
- character string
- word segmentation
- dictionary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
This application provides a text segmentation method and device. The method comprises: converting the text to be segmented into a character sequence; matching the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary to determine the substrings that match a standard word; assigning a dictionary label to each character of the matched substrings and to each character outside them, yielding a dictionary label sequence; determining at least one candidate segmentation label for each character in the sequence, yielding multiple candidate segmentation label sequences; determining, from the character sequence, the dictionary label sequence, and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is tagged with each candidate segmentation label sequence; taking the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence; and segmenting the text to be segmented based on the target segmentation label sequence.
Description
Technical field
This application relates to the field of big data technology, and in particular to a text segmentation method and device.
Background technique
In natural language processing, word segmentation underlies most other language-processing tasks, so segmentation accuracy is critical to them. At present, text containing unstructured data is particularly difficult to segment.
Take electronic health records as an example. They contain a great deal of unstructured data, such as medical histories, disease-course records, and case summaries. Automatically segmenting this kind of unstructured data is the most basic prerequisite for analyzing and mining electronic health records, and it is also a very arduous task.
A technical solution is therefore needed that can segment text containing unstructured data quickly and accurately.
Summary of the invention
In view of this, the purpose of this application is to provide a text segmentation method and device that can segment text containing unstructured data quickly and accurately.
In a first aspect, this application provides a text segmentation method, comprising:
converting the text to be segmented into a character sequence;
matching the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determining the substrings that match a standard word, and assigning a dictionary label to each character of the matched substrings and to each character outside them, to obtain a dictionary label sequence;
determining at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences;
determining, from the character sequence, the dictionary label sequence, and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is tagged with each candidate segmentation label sequence;
taking the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and performing word segmentation on the text to be segmented based on the target segmentation label sequence.
In a second aspect, this application provides a text segmentation device, comprising:
a conversion module, configured to convert the text to be segmented into a character sequence;
a first determining module, configured to match the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determine the substrings that match a standard word, and assign a dictionary label to each character of the matched substrings and to each character outside them, to obtain a dictionary label sequence;
a second determining module, configured to determine at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences;
a conditional probability prediction module, configured to determine, from the character sequence, the dictionary label sequence, and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is tagged with each candidate segmentation label sequence;
a segmentation module, configured to take the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and to segment the text to be segmented based on it.
In a third aspect, an embodiment of this application further provides an electronic device comprising a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the text segmentation method of the first aspect or of any possible embodiment of the first aspect.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the steps of the text segmentation method of the first aspect or of any possible embodiment of the first aspect.
This application provides a text segmentation method and device. The text to be segmented is first converted into a character sequence; the substrings of preset length in that sequence are then matched against the standard words in a pre-built dictionary, and a dictionary label sequence is obtained from the matching result. Multiple candidate segmentation label sequences are also obtained by determining at least one candidate segmentation label for each character. The dictionary label sequence and the character sequence are then taken as model input, and the conditional probability prediction model predicts the conditional probability of the character sequence being tagged with each candidate segmentation label sequence; the target segmentation label sequence is determined from these probabilities, and the text to be segmented is then segmented based on the target segmentation label sequence.
This approach combines two segmentation predictors: dictionary matching and a conditional probability prediction model. On the one hand, using the dictionary label sequence obtained by dictionary matching as a reference factor for the conditional probability prediction model makes the segmentation result finally predicted by the model more accurate, improving the accuracy of the predicted segmentation. On the other hand, the conditional probability prediction model, given the character sequence and dictionary label sequence of the text to be segmented, predicts the conditional probability of the character sequence being tagged with each candidate segmentation label sequence, so the segmentation label of every character in the text can be obtained in a single prediction pass, which also improves the efficiency of text segmentation.
To make the above objects, features, and advantages of this application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of this application more clearly, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of this application and therefore should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other relevant drawings from these drawings without creative effort.
Fig. 1 shows a flow diagram of the text segmentation method provided by the embodiments of this application;
Fig. 2 shows a flow diagram, provided by the embodiments of this application, of matching the text to be segmented with the forward maximum matching algorithm;
Fig. 3 shows a flow diagram, provided by the embodiments of this application, of matching the text to be segmented with the reverse maximum matching algorithm;
Fig. 4 shows a flow diagram, provided by the embodiments of this application, of predicting the segmentation label sequence with which a character sequence is tagged;
Fig. 5 shows a flow diagram of the training process of the conditional probability prediction model provided by the embodiments of this application;
Fig. 6 shows a structural diagram of a text segmentation device provided by the embodiments of this application;
Fig. 7 shows a structural diagram of an electronic device provided by the embodiments of this application.
Detailed description of the embodiments
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. The components of the embodiments, as generally described and illustrated in the drawings, can be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the claimed scope of this application but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of this application.
At present, when segmenting text that contains unstructured data, a segmentation method based on supervised learning requires building, as a training set, a sample corpus carrying manually annotated segmentation labels, and then training a segmentation prediction model on that corpus to predict the segmentation of text. Because the sample corpus is large, supervised segmentation methods consume considerable manpower annotating segmentation labels: the labor cost is high, and building a reasonably comprehensive training set is difficult and slow. If, instead, the segmentation is determined with a method based on unsupervised learning, the accuracy of the segmentation result is lower than that of supervised methods.
To address these problems, this application provides a text segmentation method and device. Fig. 1 shows a flow diagram of the text segmentation method provided by the embodiments of this application, which includes the following steps:
Step 101: convert the text to be segmented into a character sequence.
Step 102: match the substrings of preset length contained in the character sequence against the standard words in a pre-built dictionary, determine the substrings that match a standard word, and assign a dictionary label to each character of the matched substrings and to each character outside them, to obtain a dictionary label sequence.
Step 103: determine at least one candidate segmentation label for each character in the character sequence, to obtain multiple candidate segmentation label sequences.
Step 104: determine, from the character sequence, the dictionary label sequence, and a pre-trained conditional probability prediction model, the conditional probability that the character sequence is tagged with each candidate segmentation label sequence.
Step 105: take the segmentation label sequence whose conditional probability satisfies a preset condition as the target segmentation label sequence, and segment the text to be segmented based on the target segmentation label sequence.
Since the text to be segmented consists of multiple characters, it can be split into individual characters, which are then arranged in order to form a character sequence. By converting the text to a character sequence, the segmentation task becomes the task of predicting a segmentation label for each character in the sequence; because each character has several possible segmentation labels, determining the target segmentation label of each character yields the segmentation of the text.
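As a minimal sketch of step 101 (the function name here is illustrative, not from the patent):

```python
def to_char_sequence(text: str) -> list[str]:
    """Split the text to be segmented into a sequence of single characters."""
    return list(text)

chars = to_char_sequence("abc")
print(chars)  # ['a', 'b', 'c'] - each character becomes one element
```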
On this basis, the embodiments of this application propose that the conditional probability prediction model can directly predict the target segmentation label sequence of the entire character sequence, so that the target segmentation label of every character in the sequence is obtained in a single prediction pass, improving the efficiency of text segmentation. Moreover, to improve the accuracy of the predicted target segmentation labels, before the model predicts the target segmentation label sequence, the substrings of preset length in the character sequence are first matched against the standard words in the pre-built dictionary, a dictionary label sequence is obtained from the matching result, and that dictionary label sequence is then input into the conditional probability prediction model together with the character sequence as a reference factor, yielding a more accurate prediction.
The dictionary-based matching process and the prediction process based on the conditional probability prediction model are described in detail below.
Implementation flow 1: the dictionary-based matching process
It should be noted that the dictionary-based matching process applies both during training of the conditional probability prediction model, to generate the sample dictionary label sequences of the sample character sequences in the sample set, and during prediction with the trained model, to generate the dictionary label sequence of the text to be segmented. Since the two processes share the same technical idea, this application focuses on the process of generating the dictionary label sequence of the text to be segmented.
In the embodiments of this application, the substrings of preset length contained in the character sequence are matched against the standard words in the pre-built dictionary, and the dictionary label sequence is determined from the matching result. The specific process of building the dictionary follows the prior art and is not expanded upon here.
In a specific implementation, the substrings of preset length contained in the character sequence are first matched against the standard words in the pre-built dictionary to determine the substrings that match a standard word. A dictionary label is then assigned to each character of the matched substrings and to each character outside them, yielding the dictionary label sequence.
For example, the matching process may use the forward maximum matching algorithm, the reverse maximum matching algorithm, or the bidirectional maximum matching algorithm. Bidirectional maximum matching compares the result obtained by the forward maximum matching algorithm with the result obtained by the reverse maximum matching algorithm in order to determine the correct segmentation result.
Note that a substring of preset length is a substring containing at least one character. When the forward or reverse maximum matching algorithm is used for dictionary matching, a substring of preset length is one that contains at least one character and whose total number of characters does not exceed the number of characters of the longest standard word in the dictionary.
The matching algorithms are described in turn below:
(1) Forward maximum matching algorithm.
Fig. 2 shows the flow of matching a character sequence with the forward maximum matching algorithm, which includes the following steps:
Step 201: take the next a characters from the character sequence, from front to back, as the candidate string to be matched.
In one example, a may be the number of characters of the longest standard word in the dictionary.
Step 202: judge whether the dictionary contains a standard word identical to the candidate string.
If yes, go to step 203; if no, go to step 204.
Step 203: record the candidate string as a substring matching a standard word, then return to step 201 and take the next string of length a, until every character in the sequence has been traversed.
Step 204: remove the last character of the candidate string, form a new candidate string from the remaining characters, and go back to step 202, until a substring matching a standard word is found; then return to step 201 and take the next string of length a. Alternatively, once all characters of the candidate string have been removed, return to step 201 and take the next string of length a.
After the character sequence has been matched by this process, a first matching result is obtained, recording the matched substrings in the sequence and the characters outside them. A matched substring may consist of multiple characters or of a single character; this application does not limit this.
(2) Reverse maximum matching algorithm.
Fig. 3 shows the flow of matching a character sequence with the reverse maximum matching algorithm, which includes the following steps:
Step 301: take the next a characters from the character sequence, from back to front, as the candidate string to be matched.
The meaning of a is the same as in the forward maximum matching algorithm above.
Step 302: judge whether the dictionary contains a standard word identical to the candidate string.
If yes, go to step 303; if no, go to step 304.
Step 303: record the candidate string as a substring matching a standard word, then return to step 301 and take the next string of length a, until every character in the sequence has been traversed.
Step 304: remove the first character of the candidate string, form a new candidate string from the remaining characters, and go back to step 302, until a substring matching a standard word is found; then return to step 301 and take the next string of length a. Alternatively, once all characters of the candidate string have been removed, return to step 301 and take the next string of length a.
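Steps 301 to 304 mirror the forward version, scanning from the back and dropping the first character instead of the last. A sketch under the same assumptions as before:

```python
def reverse_max_match(chars, dictionary, a):
    """Reverse maximum matching: scan right to left; at each position try
    the longest candidate ending there and drop the first character until
    a dictionary word is found (steps 301-304). Returns (start, end)
    spans in left-to-right order."""
    matched = []
    j = len(chars)
    while j > 0:
        length = min(a, j)
        while length > 0:
            if "".join(chars[j - length:j]) in dictionary:
                matched.append((j - length, j))
                break
            length -= 1
        j -= max(length, 1)  # move left past the match, or by one if none
    return list(reversed(matched))
```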
After the character sequence has been matched by this process, a second matching result is obtained, recording the matched substrings in the sequence and the characters outside them. A matched substring may consist of multiple characters or of a single character; this application does not limit this.
(3) Bidirectional maximum matching algorithm.
After the first and second matching results have been obtained by the processes shown in Fig. 2 and Fig. 3, the two results can be compared and the better one selected as the final matching result.
If the first and second matching results are consistent, either can be chosen as the final matching result.
If they are inconsistent, the two results can be compared by the number of matched substrings, the number of characters outside any matched substring, and the number of single-character matched substrings, and the better one selected. For example, the final matching result may be selected on the principle that more matched substrings are better, fewer characters outside matched substrings are better, and fewer single-character matched substrings are better.
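The selection principle just described can be sketched as a scoring function (illustrative Python; the lexicographic ordering of the three stated preferences is an assumption about how they combine, since the patent does not fix their priority):

```python
def choose_result(result_a, result_b, n_chars):
    """Pick the better of two matching results, each given as a list of
    (start, end) spans over an n_chars-long character sequence.
    Preferences, in the assumed order: more matched substrings, fewer
    characters outside any matched substring, fewer single-character
    matched substrings."""
    def score(matched):
        covered = sum(end - start for start, end in matched)
        singles = sum(1 for start, end in matched if end - start == 1)
        # Higher tuple compares as better.
        return (len(matched), -(n_chars - covered), -singles)
    return result_a if score(result_a) >= score(result_b) else result_b
```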
In one possible embodiment, after the matched substrings in the character sequence and the characters outside them have been determined, each character in the sequence is assigned a dictionary label according to the following rule, yielding the dictionary label sequence: for any character in the sequence, if the character belongs to a matched substring, it is assigned the first dictionary label; otherwise it is assigned the second dictionary label.
In one example, the first dictionary label is denoted 1 and the second dictionary label 0. Of course, in practice the two labels can be set as needed; for example, the first dictionary label could be denoted Y and the second N. This application does not limit this.
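The labeling rule above can be sketched directly (illustrative Python, using 1 and 0 for the first and second dictionary labels as in the example):

```python
def dictionary_labels(n_chars, matched):
    """Assign a dictionary label to every character: 1 for characters
    inside a matched substring (first dictionary label), 0 otherwise
    (second dictionary label)."""
    labels = [0] * n_chars
    for start, end in matched:
        for i in range(start, end):
            labels[i] = 1
    return labels
```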
As an example, suppose the text to be segmented is an electronic health record containing the sentence "no dry or moist rales heard in either lung, no pleural friction rub heard", and suppose the dictionary contains "dry and moist rales" and "friction rub". Then "dry and moist rales" and "friction rub" are determined to be matched substrings, and the dictionary label sequence shown in Table 1 can be generated according to the embodiment above (the left column of Table 1 is the character sequence of the text to be segmented; the first dictionary label is denoted 1 and the second dictionary label 0):
Table 1
Implementation flow 2: the prediction process based on the pre-trained conditional probability prediction model.
In the embodiments of this application, before predicting the target segmentation label sequence of the character sequence, every segmentation label sequence with which the character sequence could be tagged is first determined.
In one possible embodiment, at least one candidate segmentation label is determined for each character in the character sequence; one label may then be chosen from each character's candidates as its target segmentation label, and the sequence formed by the chosen target labels of all characters constitutes one candidate segmentation label sequence.
Each character may be tagged with at least one of the following segmentation labels: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the end position of a word, and a fourth label for a single-character word. In one example, the first label (starting position of a word) is denoted B (Begin), the second label (middle position of a word) I (Intermediate), the third label (end position of a word) E (End), and the fourth label (a word consisting of a single character) S (Single).
Thus each character has four possible segmentation labels: B, I, E, and S. If the character sequence contains p characters and any one of the four labels may be chosen for each character as its target segmentation label, 4^p candidate segmentation label sequences can be generated.
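The 4^p count can be illustrated by enumerating the candidates for a small p (a sketch only; in practice the model scores sequences implicitly rather than by enumeration):

```python
from itertools import product

LABELS = ("B", "I", "E", "S")  # begin, intermediate, end, single

def candidate_label_sequences(p: int):
    """Enumerate all 4**p candidate segmentation label sequences for a
    p-character sequence (feasible only for small p)."""
    return ["".join(seq) for seq in product(LABELS, repeat=p)]

seqs = candidate_label_sequences(2)
print(len(seqs))  # 16 == 4**2
```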
In the embodiments of this application, after the candidate segmentation label sequences have been determined, the conditional probability of the character sequence being tagged with each candidate segmentation label sequence is predicted from the character sequence, the dictionary label sequence, and the pre-trained conditional probability prediction model.
The specific prediction process is shown in Fig. 4:
Step 401: determine multiple feature templates from the character sequence and/or the dictionary label sequence.
Step 402: generate at least one state function and at least one transition function from the determined feature templates.
Step 403: determine the value of each state function and each transition function for the case where the character sequence is tagged with each candidate segmentation label sequence.
Step 404: input the values of the state functions and transition functions for each candidate segmentation label sequence into the pre-trained conditional probability prediction model, and compute the conditional probability of the character sequence being tagged with each candidate segmentation label sequence.
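The four steps above match the usual form of a linear-chain conditional random field, where the conditional probability is proportional to the exponentiated weighted sum of state-function and transition-function values. A brute-force sketch under that assumption (the specific feature functions and weights below are illustrative, not the patent's; real decoders normalize via dynamic programming rather than enumeration):

```python
import math
from itertools import product

LABELS = ("B", "I", "E", "S")

def score(chars, dict_labels, label_seq, state_fns, trans_fns, weights):
    """Weighted sum of state- and transition-function values for one
    candidate label sequence (steps 402-403)."""
    s = 0.0
    for i, y in enumerate(label_seq):
        for k, f in enumerate(state_fns):
            s += weights["state"][k] * f(chars, dict_labels, i, y)
    for i in range(1, len(label_seq)):
        for k, g in enumerate(trans_fns):
            s += weights["trans"][k] * g(label_seq[i - 1], label_seq[i])
    return s

def conditional_probability(chars, dict_labels, label_seq,
                            state_fns, trans_fns, weights):
    """P(label_seq | chars, dict_labels), normalized by brute force over
    all 4**p candidate sequences (illustration only)."""
    num = math.exp(score(chars, dict_labels, label_seq,
                         state_fns, trans_fns, weights))
    z = sum(math.exp(score(chars, dict_labels, y,
                           state_fns, trans_fns, weights))
            for y in product(LABELS, repeat=len(chars)))
    return num / z

# Hypothetical features: reward non-S labels on dictionary-matched
# characters, and reward a B -> E transition.
state_fns = [lambda c, d, i, y: 1.0 if d[i] == 1 and y != "S" else 0.0]
trans_fns = [lambda prev, cur: 1.0 if (prev, cur) == ("B", "E") else 0.0]
weights = {"state": [1.0], "trans": [2.0]}
p_be = conditional_probability(list("ab"), [1, 1], "BE",
                               state_fns, trans_fns, weights)
```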
To aid understanding of the prediction flow shown in Fig. 4, the feature templates determined from the character sequence and/or the dictionary label sequence are introduced first.
For example, the feature templates may include at least one of the following:
a character feature template representing a single character in the character sequence;
a character feature template representing the relationship between different characters in the character sequence;
a dictionary feature template representing a single dictionary label in the dictionary label sequence;
a dictionary feature template representing the relationship between different dictionary labels in the dictionary label sequence;
a compound feature template composed of a character feature template and a dictionary feature template.
Any of the above feature templates may serve as a unigram template or a bigram template.
A unigram template is used to determine a state function; its style is, for example, Uk:%x[i,j], where the letter U indicates that the template is a unigram template; k is the serial number of the template; and x denotes the two-dimensional sequence formed by the character sequence and the dictionary label sequence. In this disclosure, j denotes the column: j=0 refers to the first column, which is the character sequence, and j=1 refers to the second column, which is the dictionary label sequence. Also in this disclosure, i denotes the i-th position, i.e., the current position, in the character sequence or the dictionary label sequence; thus x[i,0] denotes the character at the i-th position of the character sequence, and x[i,1] denotes the dictionary label at the i-th position of the dictionary label sequence.
A bigram template is used to determine a transition function; its style is, for example, Bk:%x[i,j], where the letter B indicates that the template is a bigram template. The other parameters are as described for the unigram template above and are not repeated here.
Continuing with the correspondence between the character sequence of the electronic health record in Table 1 and its dictionary label sequence, the feature templates are illustrated. For the character sequence formed from the record "no dry or moist rales heard in either lung, no pleural friction rub heard", the feature templates that can be generated are shown in Table 2:
Table 2
U01 to U18 in Table 2 are unigram templates, and B01 is a bigram template.
U01 to U05 are character feature templates that indicate a single character in the character string. For example, U01:%x[i-2, 0] indicates the character at the (i-2)-th position in the character string, i.e., the character two positions before the current position; U03:%x[i, 0] indicates the character at the i-th position in the character string, i.e., the character at the current position; U05:%x[i+2, 0] indicates the character at the (i+2)-th position in the character string, i.e., the character two positions after the current position.
U06 to U12 are character feature templates that indicate the association between different characters in the character string. For example, U06:%x[i-2, 0]/%x[i-1, 0] indicates the character at the (i-2)-th position together with the character at the (i-1)-th position in the character string; U07:%x[i-1, 0]/%x[i, 0] indicates the character at the (i-1)-th position together with the character at the i-th position in the character string.
U13 is a dictionary feature template that indicates a single dictionary label in the dictionary sequence label. For example, U13:%x[i, 1] indicates the dictionary label at the i-th position in the dictionary sequence label.
U14 is a compound feature template composed of a character feature template and a dictionary feature template. U14:%x[i, 0]/%x[i, 1] indicates the character at the i-th position in the character string together with the dictionary label at the i-th position in the dictionary sequence label.
U15 to U18 are dictionary feature templates that indicate the association between different dictionary labels in the dictionary sequence label. For example, U15:%x[i-2, 1]/%x[i-1, 1] indicates the dictionary label at the (i-2)-th position together with the dictionary label at the (i-1)-th position in the dictionary sequence label.
B01 is a binary template; B01 can also be regarded as a character feature template indicating a single character in the character string, and B01:%x[i, 0] indicates the character at the i-th position in the character string. Of course, in practical applications a dictionary feature template or a compound feature template may also constitute a binary template; the application does not limit this.
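The template expansion described above can be sketched in code. The snippet below is an illustrative sketch only: the `TEMPLATES` table and the `extract` helper are assumptions introduced for illustration, not part of the patent; it merely shows how templates such as U03:%x[i, 0], U08:%x[i, 0]/%x[i+1, 0] and U14:%x[i, 0]/%x[i, 1] read cells from the two-dimensional sequence x, where x[i][0] is the character and x[i][1] is the dictionary label.

```python
# Illustrative sketch of expanding feature templates against a
# two-dimensional sequence x. Each template is a list of
# (row offset d, column j) pairs; names here are assumptions.
TEMPLATES = {
    "U03": [(0, 0)],           # character at the current position
    "U08": [(0, 0), (1, 0)],   # current character / next character
    "U13": [(0, 1)],           # dictionary label at the current position
    "U14": [(0, 0), (0, 1)],   # current character / current dictionary label
}

def extract(x, i, offsets):
    """Read the cells x[i+d][j] named by a template; None if out of range."""
    parts = []
    for d, j in offsets:
        pos = i + d
        if pos < 0 or pos >= len(x):
            return None
        parts.append(str(x[pos][j]))
    return "/".join(parts)

# Tiny two-column sequence: (character, dictionary label).
x = [("double", 0), ("lung", 0), ("not", 0)]
features = {name: extract(x, 0, offs) for name, offs in TEMPLATES.items()}
print(features)
```

For example, at position i=0 the template U08 yields the pair "double/lung" and U14 yields "double/0".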
In the embodiment of the present application, each unary template can generate state functions s(y, x, i, j), and each unary template can generate W*p state functions, where p indicates the number of characters included in the character string, which also equals the number of dictionary labels included in the dictionary sequence label and the number of participle labels included in the participle sequence label (the three numbers are identical), and W indicates the number of types of participle labels. In the disclosure, W=4, i.e., the 4 participle labels "B", "E", "I", "S".
Continuing the above example, as shown in Table 1, the character string includes the 16 characters "double", "lung", "not", "heard", "and", "dry", "wet", "-ity", "rale", "sound", "，", "membrane", "chest", "rub", "wipe", "。", i.e., p=16; the participle label types include "B", "E", "I", "S", i.e., W=4. Therefore each unary template can generate 16*4=64 state functions.
Each binary template can generate transfer functions t(y, x, i, j), and each binary template can generate W*W*p transfer functions, where p and W have the same meanings as above.
Continuing the above example, as shown in Table 1, each binary template can generate 16*4*4=256 transfer functions.
Further, after the above feature templates are determined, state functions can be generated based on the unary templates and transfer functions can be generated based on the binary templates. Specific embodiments are as follows:
Embodiment one:
Since a unary template can be one or more of a character feature template, a dictionary feature template and a compound feature template, the state function s(y, x, i, j) generated based on the unary template covers the following situations:
First, assume that the character string includes p characters, the dictionary sequence label includes p dictionary labels, and the participle sequence label includes p participle labels, the three being equal.
Situation 1: if the feature template is a character feature template, then according to the character feature template, the generated state function s(y, x, i, j) is as follows:
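Based on the description below, the generated state function can be written in standard notation as follows (a reconstruction; the original formula figure is not reproduced in this text):

```latex
s(y, x, i, j) =
\begin{cases}
k_1, & \text{if } x_{i \pm d,\, j=0} = m \ \text{and}\ y_i = n_1 \\
k_2, & \text{otherwise}
\end{cases}
```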
Wherein, x indicates the two-dimensional sequence composed of the character string and the dictionary sequence label; when j=0, it indicates the character string in the two-dimensional sequence; x_{i±d, j=0} indicates the character at the (i±d)-th position of the character string, where i takes any integer from 1 to p and d takes any integer from 0 to p-i; y indicates the participle sequence label; y_i indicates the i-th participle label of the participle sequence label y.
s(y, x, i, j) takes the value k1 under the condition that the character at the (i±d)-th position of the character string is m and the i-th participle label of the participle sequence label y is n1; otherwise, s(y, x, i, j) takes the value k2.
Situation 2: if the feature template is a dictionary feature template, then according to the dictionary feature template, the generated state function s(y, x, i, j) is as follows:
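Based on the description below, this state function can be written in standard notation as follows (a reconstruction; the original formula figure is not reproduced in this text):

```latex
s(y, x, i, j) =
\begin{cases}
k_1, & \text{if } x_{i \pm d,\, j=1} = h \ \text{and}\ y_i = n_1 \\
k_2, & \text{otherwise}
\end{cases}
```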
Wherein, when j=1, it indicates the dictionary sequence label in the two-dimensional sequence; x_{i±d, j=1} indicates the dictionary label at the (i±d)-th position of the dictionary sequence label, where i takes any integer from 1 to p and d takes any integer from 0 to p-i; the other parameters have the same meanings as above.
s(y, x, i, j) takes the value k1 under the condition that the dictionary label at the (i±d)-th position of the dictionary sequence label is h and the i-th participle label of the participle sequence label y is n1; otherwise, s(y, x, i, j) takes the value k2.
Situation 3: if the feature template is a compound feature template, then according to the compound feature template, the generated state function s(y, x, i, j) is as follows:
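Based on the description below, this state function can be written in standard notation as follows (a reconstruction; the original formula figure is not reproduced in this text):

```latex
s(y, x, i, j) =
\begin{cases}
k_1, & \text{if } x_{i \pm d,\, j=0} = m,\ x_{i \pm d,\, j=1} = h \ \text{and}\ y_i = n_1 \\
k_2, & \text{otherwise}
\end{cases}
```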
s(y, x, i, j) takes the value k1 under the condition that the character at the (i±d)-th position of the character string is m, the dictionary label at the (i±d)-th position of the dictionary sequence label is h, and the i-th participle label of the participle sequence label y is n1; otherwise, s(y, x, i, j) takes the value k2.
Wherein, k1 may for example take the value 1 and k2 may for example take the value 0. Of course, in practical applications the values of k1 and k2 may also be configured according to the actual situation; the application does not limit this.
Wherein, the participle labels n1 and n2 may each be any one of the above four participle labels B, I, E, S.
For ease of understanding, the generated state functions s(y, x, i, j) are illustrated below with reference to the contents of Tables 1 and 2.
Example one: assume the character feature template is U03:%x[i, 0] and the character at the i-th position in the character string is the character "double". Then, using U03:%x[i, 0], the generated state functions s(y, x, i, j) cover the following four situations (in the embodiment of the disclosure, each position's character corresponds to four kinds of participle labels, i.e., B, I, E, S):
For the four state functions s1 to s4 determined for the template U03:%x[i, 0], in order to evaluate any one participle sequence label among the multiple participle sequence labels of the character string, it is necessary to determine the values s1 to s4 of the state functions corresponding to that participle sequence label. To this end, each character in the character string is traversed in turn and the value of the state function corresponding to each character is determined. Assume the character currently traversed is "double": if the participle label corresponding to the character "double" in the participle sequence label is "B", then s1 among the four state functions takes the value 1 and the other state functions s2 to s4 take the value 0. The way to determine the values of the state functions or transfer functions generated by the other feature templates can also refer to the above process and is not introduced one by one here.
Example two: assume the character feature template is U04:%x[i+1, 0] and the character at the (i+1)-th position in the character string is the character "lung". Then, using U04:%x[i+1, 0], the generated state functions s(y, x, i, j) cover the following four situations:
Example three: assume the character feature template is U08:%x[i, 0]/%x[i+1, 0], the character at the i-th position in the character string is the character "double" and the character at the (i+1)-th position is the character "lung". Then, using U08:%x[i, 0]/%x[i+1, 0], the generated state functions s(y, x, i, j) cover the following four situations:
Of course, for the other character feature templates among the unary templates, state functions can also be generated by referring to the manner of Examples one to three; this is not expanded specifically.
Example four: assume the dictionary feature template is U13:%x[i, 1] and the dictionary label at the i-th position in the dictionary sequence label is 0. Then, using U13:%x[i, 1], the generated state functions s(y, x, i, j) cover the following four situations:
Example five: assume the dictionary feature template is U17:%x[i, 1]/%x[i+1, 1], the dictionary label at the i-th position in the dictionary sequence label is 0 and the dictionary label at the (i+1)-th position is 0. Then, using U17:%x[i, 1]/%x[i+1, 1], the generated state functions s(y, x, i, j) cover the following four situations:
Of course, for the other dictionary feature templates among the unary templates, state functions can also be generated by referring to the manner of Examples four and five; this is not expanded specifically.
Example six: assume the compound feature template is U14:%x[i, 0]/%x[i, 1], the character at the i-th position in the character string is the character "double" and the dictionary label at the i-th position in the dictionary sequence label is 0. Then, using U14:%x[i, 0]/%x[i, 1], the generated state functions s(y, x, i, j) cover the following four situations:
Of course, for the other compound feature templates among the unary templates, state functions can also be generated by referring to the manner of Example six; this is not expanded specifically.
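The generation of four state functions per unary template described in the examples above can be sketched as follows. This is a minimal illustration under the stated assumptions (k1=1, k2=0, participle labels B, I, E, S); the helper names are not from the patent.

```python
# Illustrative sketch of Embodiment one: a unary character template plus
# a target cell value generates one indicator state function per
# participle label (value 1 = k1 when the condition holds, 0 = k2 otherwise).
LABELS = ["B", "I", "E", "S"]  # W = 4 participle label types

def make_state_functions(d, j, target):
    """For template %x[i+d, j] and cell value `target`, build one
    state function per participle label."""
    def make(label):
        def s(y, x, i):
            pos = i + d
            if 0 <= pos < len(x) and x[pos][j] == target and y[i] == label:
                return 1  # k1
            return 0      # k2
        return s
    return [make(label) for label in LABELS]

# Template U03:%x[i, 0] with the character "double" at the current position.
s1, s2, s3, s4 = make_state_functions(0, 0, "double")

x = [("double", 0), ("lung", 0)]   # (character, dictionary label)
y = ["B", "E"]                     # one candidate participle sequence label
print([f(y, x, 0) for f in (s1, s2, s3, s4)])  # [1, 0, 0, 0]
```

Only the state function whose participle label matches the candidate sequence at the current position takes the value 1, exactly as in Example one.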
Embodiment two:
The above binary template may likewise be one or more of a character feature template, a dictionary feature template and a compound feature template. The transfer function generated based on the binary template covers the following situations:
First, assume that the character string includes p characters, the dictionary sequence label includes p dictionary labels, and the participle sequence label includes p participle labels, the three being equal.
Situation 1: if the feature template is a character feature template, then according to the character feature template, the generated transfer function t(y, x, i, j) is as follows:
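Based on the description below, this transfer function can be written in standard notation as follows (a reconstruction; the original formula figure is not reproduced in this text):

```latex
t(y, x, i, j) =
\begin{cases}
k_1, & \text{if } x_{i \pm d,\, j=0} = m,\ y_i = n_1 \ \text{and}\ y_{i-1} = n_2 \\
k_2, & \text{otherwise}
\end{cases}
```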
Wherein, x indicates the two-dimensional sequence composed of the character string and the dictionary sequence label; when j=0, it indicates the character string in the two-dimensional sequence; x_{i±d, j=0} indicates the character at the (i±d)-th position of the character string, where i takes any integer from 1 to p and d takes any integer from 0 to p-i; y indicates the participle sequence label; y_i indicates the i-th participle label of the participle sequence label y; y_{i-1} indicates the (i-1)-th participle label of the participle sequence label y.
t(y, x, i, j) takes the value k1 under the condition that the character at the (i±d)-th position of the character string is m, the i-th participle label of the participle sequence label y is n1, and the (i-1)-th participle label of the participle sequence label y is n2; otherwise, t(y, x, i, j) takes the value k2.
Situation 2: if the feature template is a dictionary feature template, then according to the dictionary feature template, the generated transfer function t(y, x, i, j) is as follows:
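Based on the description below, this transfer function can be written in standard notation as follows (a reconstruction; the original formula figure is not reproduced in this text):

```latex
t(y, x, i, j) =
\begin{cases}
k_1, & \text{if } x_{i \pm d,\, j=1} = h,\ y_i = n_1 \ \text{and}\ y_{i-1} = n_2 \\
k_2, & \text{otherwise}
\end{cases}
```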
Wherein, x indicates the two-dimensional sequence composed of the character string and the dictionary sequence label; when j=1, it indicates the dictionary sequence label in the two-dimensional sequence; x_{i±d, j=1} indicates the dictionary label at the (i±d)-th position of the dictionary sequence label, where i takes any integer from 1 to p, p is the total number of characters included in the character string, and d takes any integer from 0 to p-i; y indicates the participle sequence label; y_i indicates the i-th participle label of the participle sequence label y; y_{i-1} indicates the (i-1)-th participle label of the participle sequence label y.
t(y, x, i, j) takes the value k1 under the condition that the dictionary label at the (i±d)-th position of the dictionary sequence label is h, the i-th participle label of the participle sequence label y is n1, and the (i-1)-th participle label of the participle sequence label y is n2; otherwise, t(y, x, i, j) takes the value k2.
Situation 3: if the feature template is a compound feature template, then according to the compound feature template, the generated transfer function t(y, x, i, j) is as follows:
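Based on the description below, this transfer function can be written in standard notation as follows (a reconstruction; the original formula figure is not reproduced in this text):

```latex
t(y, x, i, j) =
\begin{cases}
k_1, & \text{if } x_{i \pm d,\, j=0} = m,\ x_{i \pm d,\, j=1} = h,\ y_i = n_1 \ \text{and}\ y_{i-1} = n_2 \\
k_2, & \text{otherwise}
\end{cases}
```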
Wherein, t(y, x, i, j) takes the value k1 under the condition that the character at the (i±d)-th position of the character string is m, the dictionary label at the (i±d)-th position of the dictionary sequence label is h, the i-th participle label of the participle sequence label y is n1, and the (i-1)-th participle label of the participle sequence label y is n2; otherwise, t(y, x, i, j) takes the value k2.
For ease of understanding, the generated transfer functions t(y, x, i, j) are illustrated below with reference to the contents of Tables 1 and 2.
Assume the character feature template is B01:%x[i, 0] and the character at the i-th position in the character string is the character "lung". Then, using B01:%x[i, 0], the generated transfer functions t(y, x, i, j) cover 16 situations; taking the case y_i = B, the four transfer functions t(y, x, i, j) that can be generated are as follows:
Of course, four transfer functions t(y, x, i, j) can likewise be generated for each of the three cases y_i = I, y_i = E and y_i = S; this is not expanded specifically.
Further, after the state functions and transfer functions are obtained in the above manner, the value of each state function and the value of each transfer function can be determined for the case where the character string is marked as each kind of participle sequence label. The values of the state functions and transfer functions corresponding to each kind of participle sequence label are then input into the conditional probability prediction model trained in advance, and the conditional probability that the character string is marked as each kind of participle sequence label is calculated separately.
In the embodiment of the present application, the conditional probability prediction model trained in advance is a conditional random field (CRF). A conditional random field can be understood as a conditional probability distribution model of one group of output random variables given another group of input random variables, under the model assumption that the output random variables constitute a Markov random field. When applied to the scenario of segmenting the text to be segmented, the input random variable can be the two-dimensional sequence x composed of the character string and the dictionary sequence label, and the output random variable can be the participle sequence label y.
In the embodiment of the present application, segmenting the text to be segmented can actually be converted into the problem of predicting the conditional probability that the character string is marked as each kind of participle sequence label. The larger the predicted conditional probability of a participle sequence label, the greater the possibility that it is the correct participle sequence label.
Illustratively, the calculation formula of the conditional random field is as follows:
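The standard linear-chain conditional random field formulation consistent with the quantities explained below can be reconstructed as follows (the original formula figures are not reproduced in this text):

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k\, t_k(y, x, i, j) + \sum_{i,l} \mu_l\, s_l(y, x, i, j) \right)

Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k\, t_k(y, x, i, j) + \sum_{i,l} \mu_l\, s_l(y, x, i, j) \right)
```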
Wherein, in the above formula, p(y|x) indicates the conditional probability that the two-dimensional sequence x composed of the character string and the dictionary sequence label is marked as the participle sequence label y;
i indicates the i-th position in the character string or the dictionary sequence label;
j indicates the column of the two-dimensional sequence x: when j=0, it indicates the character string in the two-dimensional sequence x, and when j=1, it indicates the dictionary sequence label in the two-dimensional sequence x;
p indicates the number of characters included in the character string, which also equals the number of dictionary labels included in the dictionary sequence label and the number of participle labels included in the participle sequence label;
M is the number of participle sequence labels y obtained by participle-marking the character string x;
Z(x) is the normalizing factor;
s_l(y, x, i, j) indicates the l-th state function and L indicates the total number of state functions generated according to the unary templates; a unary template may include at least one of a character feature template, a dictionary feature template and a compound feature template. Assuming the number of unary templates is e1, then L=e1*W*p, where W is the number of types of participle labels.
t_k(y, x, i, j) indicates the k-th transfer function and K indicates the total number of transfer functions generated according to the binary templates; a binary template may include at least one of a character feature template, a dictionary feature template and a compound feature template. Assuming the number of binary templates is e2, then K=e2*W*W*p, where the meaning of W is the same as above.
Wherein, μ_l is the first weight, the weight of the state function, and λ_k is the second weight, the weight of the transfer function. The weights λ_k and μ_l of the transfer functions and state functions are solved by training the conditional probability prediction model; the specific solving process will be illustrated below.
From the calculation formula of the above conditional random field, when calculating the conditional probability that the character string is marked as each kind of participle sequence label, the value of each state function and the value of each transfer function can be obtained for a given participle sequence label; by substituting the values of the state functions and transfer functions into the above conditional probability prediction model, the conditional probability that the character string is marked as the given participle sequence label can be obtained.
Illustratively, continuing with the character string and corresponding dictionary sequence label described in Table 1 above and the feature templates described in Table 2 above: if the unary templates U01 to U18 in Table 2 are selected to generate the state functions s_l(y, x, i, j), then the total number of state functions that can be generated is L=18*4*16=1152, i.e., s_1(y, x, i, j) to s_1152(y, x, i, j). If the binary template B01 in Table 2 is selected to generate the transfer functions t_k(y, x, i, j), then the total number of transfer functions that can be generated is K=4*4*16=256, i.e., t_1(y, x, i, j) to t_256(y, x, i, j).
Given the two-dimensional sequence x formed by the character string and corresponding dictionary sequence label described in Table 1 and some participle sequence label y, the value of each state function and the value of each transfer function can be determined in turn starting from i=1, j=0 until i=p (p=16 in this example), j=1, and then the conditional probability when the character string is marked as the given participle sequence label can be obtained.
Wherein, when evaluating any one state function, the set condition of the state function may be "y_i=n1, x_{i±d, j=0}=m", "y_i=n1, x_{i±d, j=1}=h" or "y_i=n1, x_{i±d, j=0}=m, x_{i±d, j=1}=h"; by judging whether the set condition of the state function holds, the state function takes 1 if it holds, and 0 if it does not.
Wherein, when evaluating any one transfer function, the set condition of the transfer function may be "y_i=n1, y_{i-1}=n2, x_{i±d, j=0}=m"; by judging whether the set condition of the transfer function holds, the transfer function takes 1 if it holds, and 0 if it does not.
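The condition-judging procedure above can be sketched as follows; the helper names and the example weights are assumptions introduced for illustration. The sketch evaluates one state function and one transfer function and combines their values into the unnormalized exponent of the conditional random field formula.

```python
import math

# Illustrative sketch: judge the set conditions of a state function and
# a transfer function, then combine their values with the weights mu and
# lambda into the unnormalized CRF score exp(lambda*t + mu*s).

def state_value(y, x, i, d=0, j=0, m=None, n1=None):
    """1 if x[i+d][j] == m and y[i] == n1, else 0."""
    pos = i + d
    ok = 0 <= pos < len(x) and x[pos][j] == m and y[i] == n1
    return 1 if ok else 0

def transfer_value(y, x, i, d=0, m=None, n1=None, n2=None):
    """1 if x[i+d][0] == m, y[i] == n1 and y[i-1] == n2, else 0."""
    pos = i + d
    ok = (i >= 1 and 0 <= pos < len(x)
          and x[pos][0] == m and y[i] == n1 and y[i - 1] == n2)
    return 1 if ok else 0

x = [("double", 0), ("lung", 0)]   # (character, dictionary label)
y = ["B", "E"]                     # candidate participle sequence label

mu, lam = 0.5, 1.0  # example weights mu_l and lambda_k (assumptions)
s_val = state_value(y, x, 0, m="double", n1="B")
t_val = transfer_value(y, x, 1, m="lung", n1="E", n2="B")
score = math.exp(mu * s_val + lam * t_val)  # unnormalized numerator of p(y|x)
print(s_val, t_val)
```

Dividing such scores by the normalizing factor Z(x), the sum of the scores over all candidate participle sequence labels, yields the conditional probability.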
It should be noted that since the dictionary labels corresponding to the matched character strings have already been marked in the dictionary sequence label, in some special scenarios where the accuracy of the dictionary matching process is high, the dictionary labels corresponding to a matched character string can be regarded as a reliable word segmentation result. The participle labels corresponding to a matched character string can therefore be derived from its dictionary labels; in this way there is no need to assign every kind of participle label to each character of the matched character string in the character string, and the participle labels of the matched character string can be configured directly based on the result marked in the dictionary sequence label. This also saves the number of times the conditional probability prediction model predicts conditional probabilities, making the participle prediction process more efficient.
For example, continuing with the electronic health record and corresponding dictionary sequence label shown in Table 1: after the electronic health record is matched with the dictionary to obtain the dictionary sequence label, it can be determined that "dry moist rales" and "friction rub" are matched character strings, i.e., each can be taken as an already-segmented word. The 8 characters of "dry moist rales" and "friction rub" originally admit 4^8 possible combinations of participle labels; but with the dictionary sequence label as a reference factor in this scheme, when the conditional probability prediction model predicts the conditional probability that the electronic health record is marked as each kind of participle sequence label, the participle labels corresponding to "dry moist rales" in the participle sequence label can be determined directly as "B(dry) I(wet) I(-ity) I(rale) E(sound)" and the participle labels corresponding to "friction rub" as "B(rub) I(wipe) E(sound)", so the number of possible participle label combinations for these characters is reduced from 4^8 to 1. In this way, relative to comparing all participle sequence labels one by one and calculating their conditional probabilities, the above scheme provided by the application saves the number of times the conditional probability prediction model predicts conditional probabilities, making the participle prediction process more efficient.
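The saving described above can be quantified with a short sketch. The matched-string lengths 5 and 3 follow the example; the rest is illustrative only.

```python
# Illustrative count of the pruning described above: fixing the
# participle labels of dictionary-matched characters shrinks the
# candidate label space for those positions from W^n to 1.
W = 4                       # participle label types: B, E, I, S
matched_lengths = [5, 3]    # "dry moist rales" and "friction rub"
n = sum(matched_lengths)    # 8 characters covered by matched strings

free = W ** n               # candidate label combinations without pruning
pruned = 1                  # fixed to B I I I E and B I E by the dictionary
print(free, free - pruned)  # 65536 combinations before, 65535 eliminated
```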
After the conditional probability that the character string is marked as each participle sequence label is obtained, the participle sequence label corresponding to the conditional probability meeting the preset condition can be determined as the target participle sequence label. Illustratively, the participle sequence label corresponding to the conditional probability with the largest value is determined as the target participle sequence label. Then, word segmentation is performed on the text to be segmented based on the target participle sequence label.
In one example, continuing with the electronic health record shown in Table 1: after the conditional probabilities corresponding to the candidate participle sequence labels are compared, the participle sequence label corresponding to the conditional probability with the highest value is selected as the target participle sequence label, as shown in Table 3:
Table 3
Electronic health record character | Dictionary label | Target participle label |
double | 0 | B |
lung | 0 | E |
not | 0 | B |
heard | 0 | E |
and | 0 | S |
dry | 1 | B |
wet | 1 | I |
-ity | 1 | I |
rale | 1 | I |
sound | 1 | E |
， | 0 | S |
not | 0 | B |
heard | 0 | E |
and | 0 | S |
membrane | 0 | B |
chest | 0 | E |
rub | 1 | B |
wipe | 1 | I |
sound | 1 | E |
。 | 0 | S |
After word segmentation is performed on the electronic health record based on the target participle sequence label shown in Table 3, the obtained word segmentation result includes: "double lungs", "not heard", "and", "dry moist rales", "，", "membrane chest", "friction rub", "。".
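The conversion from the target participle sequence label in Table 3 to the word segmentation result can be sketched with a minimal B/I/E/S decoder. This is an illustrative sketch assuming a well-formed label sequence; the function name is not from the patent, and the characters 双, 肺, 未, 闻, 及 are used as stand-ins for the first rows of Table 3.

```python
# Minimal B/I/E/S decoder: B begins a word, I continues it, E ends it,
# S is a single-character word. Assumes a well-formed label sequence.
def decode(chars, labels):
    words, buf = [], []
    for ch, lab in zip(chars, labels):
        if lab == "S":
            words.append(ch)
        elif lab == "B":
            buf = [ch]
        elif lab == "I":
            buf.append(ch)
        else:  # "E"
            buf.append(ch)
            words.append("".join(buf))
            buf = []
    return words

chars = ["双", "肺", "未", "闻", "及"]
labels = ["B", "E", "B", "E", "S"]
print(decode(chars, labels))  # ['双肺', '未闻', '及']
```

Applying the same decoder to the full label column of Table 3 yields the word segmentation result listed above.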
In the embodiment of the present application, when using the conditional probability prediction model to calculate the conditional probability that the character string is marked as each participle sequence label, the influence factors of the conditional probability include, in addition to the values of the state functions and transfer functions corresponding to each participle sequence label, the weights λ_k and μ_l of the transfer functions and state functions. The weights λ_k and μ_l are solved by training the conditional probability prediction model.
In the following, the training process of the conditional probability prediction model of the embodiment of the present application is illustrated. Referring to Fig. 5, which is a flow diagram of the training process of the conditional probability prediction model provided by the embodiments of the present application, the process includes the following steps:
Step 501: obtain a sample set; the sample set includes multiple groups of samples, and each group of samples includes the sample character string corresponding to a sample text to be segmented, a sample dictionary sequence label and at least one sample participle sequence label.
Step 502: for each group of samples, according to at least one of the sample character string and the sample dictionary sequence label, determine the value of each state function and the value of each transfer function in the case where the sample character string in the group of samples is marked as each kind of sample participle sequence label.
Step 503: input the values of the state functions and transfer functions determined for each group of samples into the conditional probability prediction model to be trained, and determine the conditional probability function corresponding to each group of samples; the conditional probability function includes the first weights of the state functions and the second weights of the transfer functions.
Step 504: input the determined conditional probability function corresponding to each group of samples, as an independent variable, into a preset loss function, and determine the loss value of the preset loss function by adjusting the values of the first weights and the second weights included in the preset loss function.
Step 505: when the loss value meets a preset convergence condition, determine the first current values of the first weights and the second current values of the second weights, and determine the conditional probability prediction model obtained in the case where the first weights are the first current values and the second weights are the second current values.
Specifically, after the conditional probability function is input as an independent variable into the preset loss function, initial values can be assigned to the above two kinds of parameters to be trained, λ_k and μ_l, and the parameters to be trained λ_k and μ_l are adjusted and updated according to Newton's iteration method or the gradient descent method until the loss value of the preset loss function meets the preset convergence condition and stops updating. The values of the parameters to be trained λ_k and μ_l are thus obtained, which determines the λ_k and μ_l in the conditional random field formula, i.e., the conditional probability prediction model is obtained.
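Steps 501 to 505 can be sketched as a toy training loop. The setup below, a single sample with two candidate participle sequence labels, fixed feature totals, a negative log-likelihood loss and numerical gradients, is an assumption introduced for illustration only; the patent does not specify the loss function beyond the description above.

```python
import math

# Toy sketch of Steps 501-505: adjust the weights (lambda, mu) by
# gradient descent until the loss stops improving. Feature totals
# (sum of transfer values, sum of state values) for the correct label
# sequence and for one competing sequence are fixed assumptions.
correct = (2.0, 3.0)
wrong = (1.0, 1.0)

def loss(lam, mu):
    """Negative log of p(correct sequence | x) under the CRF formula."""
    s_good = lam * correct[0] + mu * correct[1]
    s_bad = lam * wrong[0] + mu * wrong[1]
    z = math.exp(s_good) + math.exp(s_bad)   # normalizing factor Z(x)
    return -(s_good - math.log(z))

lam = mu = 0.0   # initial values of the parameters to be trained
lr = 0.1         # learning rate
prev = loss(lam, mu)
for _ in range(200):
    eps = 1e-6   # numerical gradients, for brevity of the sketch
    g_lam = (loss(lam + eps, mu) - loss(lam - eps, mu)) / (2 * eps)
    g_mu = (loss(lam, mu + eps) - loss(lam, mu - eps)) / (2 * eps)
    lam -= lr * g_lam
    mu -= lr * g_mu
cur = loss(lam, mu)
print(round(prev, 4), round(cur, 4))  # the loss decreases during training
```

A real implementation would sum the loss over all groups of samples and typically use analytic gradients (or Newton-type updates, as the text notes), but the stopping logic is the same: iterate until the loss value converges, then freeze λ_k and μ_l.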
In the embodiment of the present application, during the training of the conditional probability prediction model, since the dictionary sequence label can also serve as a reference factor for predicting the participle sequence label, model convergence can be accelerated; that is to say, the conditional probability prediction model can be trained with a relatively small amount of sample corpus. This avoids the need for a large amount of sample corpus with manually marked participle labels, saving labor cost and improving the construction efficiency of the training set. After the conditional probability prediction model is obtained, its prediction accuracy can be tested through a test sample set; the specific test process is not expanded here.
Based on the same application concept, the embodiment of the present application further provides a text participle device corresponding to the text segmenting method. Since the principle by which the device in the embodiment of the present application solves the problem is similar to the above-mentioned text segmenting method of the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and repeated descriptions are omitted.
Referring to Fig. 6, which shows a structural schematic diagram of a text participle device 60 provided by the embodiments of the present application, the device comprises:
a conversion module 61, configured to convert the text to be segmented into a character string;
a first determining module 62, configured to match the character strings of preset length included in the character string with the standard words in the dictionary constructed in advance, determine the matched character strings that match the standard words, and distribute corresponding dictionary labels respectively to each character of the matched character strings in the character string and to each character other than the matched character strings, obtaining the dictionary sequence label;
a second determining module 63, configured to determine at least one participle label corresponding to each character in the character string, obtaining multiple kinds of participle sequence labels;
a conditional probability prediction module 64, configured to determine, according to the character string, the dictionary sequence label and the conditional probability prediction model trained in advance, the conditional probability that the character string is marked as each kind of participle sequence label;
a word segmentation processing module 65, configured to determine the participle sequence label corresponding to the conditional probability meeting the preset condition as the target participle sequence label, and perform word segmentation on the text to be segmented based on the target participle sequence label.
In some embodiments of the present application, when determining, according to the character string, the dictionary sequence label and the conditional probability prediction model trained in advance, the conditional probability that the character string is marked as each kind of participle sequence label, the conditional probability prediction module 64 is specifically configured to:
determine multiple feature templates according to the character string and/or the dictionary sequence label;
generate at least one state function and at least one transfer function according to the determined multiple feature templates;
determine the value of each state function and the value of each transfer function in the case where the character string is marked as each kind of participle sequence label;
input the values of the state functions and transfer functions corresponding to each kind of participle sequence label into the conditional probability prediction model trained in advance, and calculate separately the conditional probability that the character string is marked as each kind of participle sequence label.
In some embodiments of the present application, the feature templates include at least one of the following templates:
a character feature template for representing a single character in the character string;
a character feature template for representing an association between different characters in the character string;
a dictionary feature template for representing a single dictionary label in the dictionary label sequence;
a dictionary feature template for representing an association between different dictionary labels in the dictionary label sequence; and
a compound feature template composed of a character feature template and a dictionary feature template.
In some embodiments of the present application, the character string includes p characters, the dictionary label sequence includes p dictionary labels, and each participle label sequence includes p participle labels;
if the feature templates include the character feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module 64 according to the character feature template is as follows:
if the feature templates include the dictionary feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module 64 according to the dictionary feature template is as follows:
if the feature templates include the compound feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module 64 according to the compound feature template is as follows:
wherein x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence, and y denotes the participle label sequence; when j=0, x denotes the character string in the two-dimensional sequence, and when j=1, x denotes the dictionary label sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, and d takes 0 or any positive integer up to p-i; y_i denotes the i-th participle label of the participle label sequence y; n_1 denotes a given participle label value against which the i-th participle label of the participle label sequence y is matched, m denotes the character at position i±d in the character string, and h denotes the dictionary label at position i±d in the dictionary label sequence.
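As a concrete illustration of the state functions described above, the following Python sketch builds indicator-style state functions over the two-dimensional sequence x (row 0 holding the characters, row 1 the dictionary labels). The factory function, the label names "B"/"E"/"D", and the use of 0-based indexing are illustrative assumptions rather than the patent's exact formulas:

```python
def make_state_function(n1, d, j, value):
    """Indicator-style state function s(y, x, i): returns 1 when the i-th
    participle label equals n1 AND the element of row j of the two-dimensional
    sequence x at offset position i+d equals `value` (the character m when
    j=0, the dictionary label h when j=1); returns 0 otherwise.
    Offsets falling outside the sequence simply yield 0."""
    def s(y, x, i):
        pos = i + d
        if not (0 <= pos < len(x[j])):
            return 0
        return 1 if (y[i] == n1 and x[j][pos] == value) else 0
    return s

x = [list("北京"), ["D", "D"]]   # row 0: characters, row 1: dictionary labels
y = ["B", "E"]                   # one candidate participle label sequence
s_char = make_state_function("B", 0, 0, "北")   # character feature
s_dict = make_state_function("E", 0, 1, "D")    # dictionary feature
print(s_char(y, x, 0), s_dict(y, x, 1))
```

Each generated function evaluates to 1 or 0 for a given candidate label sequence; the model sums the weighted values of all such functions.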
In some embodiments of the present application, the character string includes p characters, the dictionary label sequence includes p dictionary labels, and each participle label sequence includes p participle labels;
if the feature templates include the character feature template, the transfer function t(y, x, i, j) generated by the conditional probability prediction module 64 according to the character feature template is as follows:
wherein x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence, and y denotes the participle label sequence; when j=0, x denotes the character string in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, and d takes 0 or any positive integer up to p-i; y_i denotes the i-th participle label of the participle label sequence y, and y_{i-1} denotes the (i-1)-th participle label of the participle label sequence y; n_1 and n_2 denote given participle label values against which the i-th and the (i-1)-th participle labels of the participle label sequence y are matched, respectively, and m denotes the character at position i±d in the character string.
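A transfer function differs from a state function only in that it also conditions on the previous participle label. A minimal sketch under the same illustrative assumptions as above (factory function, label names, 0-based indexing):

```python
def make_transfer_function(n1, n2, d, value):
    """Indicator-style transfer function t(y, x, i): returns 1 when the i-th
    participle label equals n1, the (i-1)-th participle label equals n2, and
    the character at offset position i+d equals `value`; 0 otherwise.
    i=0 has no predecessor, so the function yields 0 there."""
    def t(y, x, i):
        pos = i + d
        if i == 0 or not (0 <= pos < len(x[0])):
            return 0
        return 1 if (y[i] == n1 and y[i - 1] == n2 and x[0][pos] == value) else 0
    return t

x = [list("北京"), ["D", "D"]]   # row 0: characters, row 1: dictionary labels
y = ["B", "E"]
t_fn = make_transfer_function("E", "B", 0, "京")  # fires on the B->E transition at "京"
print(t_fn(y, x, 1))
```

The transfer functions are what let the model score label-to-label transitions (e.g. that a word-end label plausibly follows a word-start label) rather than each position in isolation.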
In some embodiments of the present application, the at least one participle label includes: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the end position of a word, and a fourth label for a single-character word.
In some embodiments of the present application, when determining the at least one participle label corresponding to each character in the character string to obtain a plurality of participle label sequences, the second determining module 63 is specifically configured to:
determine the at least one participle label corresponding to each character in the character string; and
arbitrarily select one participle label from the at least one participle label corresponding to each character as the target participle label of that character, and take the sequence composed of the target participle labels of all characters as one participle label sequence.
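Taken literally, selecting one label per character and forming every possible combination amounts to a Cartesian product over the per-character candidate labels. The sketch below shows this enumeration with the four labels named B/M/E/S (the common shorthand for begin/middle/end/single; the names themselves are an illustrative choice, and in practice the set of candidates per character may be restricted rather than enumerated exhaustively):

```python
from itertools import product

# Four participle labels: B = word start, M = word middle,
# E = word end, S = single-character word (names illustrative).
LABELS = ("B", "M", "E", "S")

def candidate_sequences(chars, candidates_per_char=None):
    """Every combination of one candidate label per character is one
    candidate participle label sequence; the prediction model then
    scores each candidate."""
    if candidates_per_char is None:
        candidates_per_char = [LABELS] * len(chars)
    return [list(seq) for seq in product(*candidates_per_char)]

seqs = candidate_sequences(list("中国"))
print(len(seqs))   # 4 labels per character over 2 characters
```

For a 2-character string this yields 4^2 = 16 candidate sequences, among them ["B", "E"], which corresponds to treating the two characters as one word.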
In some embodiments of the present application, when assigning a corresponding dictionary label to each character of the matched character strings in the character string and to each character outside the matched character strings, respectively, to obtain the dictionary label sequence, the first determining module 62 is specifically configured to:
assign a dictionary label to each character in the character string according to the following rule, to obtain a dictionary label sequence composed of dictionary labels:
for any character in the character string, if the character belongs to a matched character string, assign the first dictionary label to the character; if the character lies outside the matched character strings, assign the second dictionary label to the character.
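The matching-and-labelling rule above can be sketched in a few lines of Python. The label names "D"/"O" and the single preset length are illustrative assumptions; the patent leaves the concrete labels and length open:

```python
def dictionary_labels(chars, dictionary, length=2):
    """Assign the first dictionary label ('D') to every character falling
    inside a sub-string of the preset length that matches a standard word
    in the dictionary, and the second label ('O') to all other characters."""
    labels = ["O"] * len(chars)
    for start in range(len(chars) - length + 1):
        piece = "".join(chars[start:start + length])
        if piece in dictionary:                    # a matched character string
            for k in range(start, start + length):
                labels[k] = "D"                    # first dictionary label
    return labels

chars = list("我爱北京天安门")
dictionary = {"北京", "天安"}
print(dictionary_labels(chars, dictionary))
```

Here the characters of "北京" and "天安" receive the first dictionary label and the remaining characters the second, producing the dictionary label sequence that is later fed to the prediction model alongside the characters.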
In some embodiments of the present application, the device further includes:
a model training module 66, configured to train the conditional probability prediction model in the following manner:
obtain a sample set, the sample set including multiple groups of samples, each group of samples including a sample character string corresponding to a sample text to be segmented, a sample dictionary label sequence and at least one sample participle label sequence;
for each group of samples, determine, according to at least one of the sample character string and the sample dictionary label sequence, the value of each state function and the value of each transfer function in the case that the sample character string in the group is marked with each sample participle label sequence;
input the values of the state functions and transfer functions determined for each group of samples into the conditional probability prediction model to be trained, and determine the conditional probability function corresponding to each group of samples, the conditional probability function including a first weight of the state functions and a second weight of the transfer functions;
input the conditional probability function determined for each group of samples into a preset loss function as an independent variable, and determine the loss value of the preset loss function by adjusting the value of the first weight and the value of the second weight included in the preset loss function; and
when the loss value satisfies a preset convergence condition, determine the first current value of the first weight and the second current value of the second weight, and take the model obtained with the first weight at the first current value and the second weight at the second current value as the trained conditional probability prediction model.
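The training loop above — adjust the feature weights until a preset loss converges — can be sketched with a negative log-likelihood loss and plain gradient descent. The patent does not prescribe a particular loss or optimizer, so both are illustrative choices here; `feats[k]` stands in for the feature-function values of the k-th candidate label sequence:

```python
import math

def nll_and_gradient(feats, gold_index, w):
    """Negative log-likelihood of the gold label sequence for one sample,
    plus its gradient w.r.t. the feature weights. feats[k] is the vector
    of feature-function values of candidate sequence k."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    loss = -math.log(probs[gold_index])
    # gradient: expected feature value minus gold feature value
    grad = [sum(p * f[k] for p, f in zip(probs, feats)) - feats[gold_index][k]
            for k in range(len(w))]
    return loss, grad

feats = [[1.0, 1.0], [0.0, 1.0]]   # candidate 0 is the gold sequence
w = [0.0, 0.0]
for _ in range(50):                # descend until the loss is near convergence
    loss, grad = nll_and_gradient(feats, 0, w)
    w = [wi - 0.5 * gi for wi, gi in zip(w, grad)]
print(loss < 0.1)
```

The weight of the feature that distinguishes the gold sequence grows until the loss falls below the convergence threshold, mirroring the "adjust the first and second weights until the loss value satisfies the preset convergence condition" step.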
For the processing flow of each module in the device and the interaction flow between the modules, reference may be made to the relevant description in the foregoing method embodiments, which will not be repeated here.
An embodiment of the present application provides an electronic device 700. Fig. 7 is a schematic structural diagram of the electronic device 700 provided by the embodiment of the present application, which includes a processor 701, a memory 702 and a bus 703. The memory 702 stores machine-readable instructions executable by the processor 701. When the electronic device runs, the processor 701 communicates with the memory 702 through the bus 703, and the processor executes the machine-readable instructions to perform the steps of the text segmentation method proposed in the foregoing method embodiments.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it performs the steps of the text segmentation method proposed in the foregoing method embodiments.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the above text segmentation method can be executed, so that text containing unstructured data can be segmented quickly and accurately.
The present application provides a text segmentation method and device, which can first convert the text to be segmented into a character sequence, then match the sub-strings of a preset length in the character sequence against the standard words in a pre-built dictionary and obtain a dictionary label sequence based on the matching result, and can also obtain a plurality of participle label sequences by determining at least one participle label corresponding to each character in the character sequence. Further, with the dictionary label sequence and the character sequence as model inputs, the conditional probability prediction model can predict the conditional probability that the character sequence is marked with each participle label sequence, after which the target participle label sequence is determined based on the obtained conditional probabilities and word segmentation is performed on the text to be segmented based on the target participle label sequence.
The above approach includes two word-segmentation prediction processes: one based on dictionary matching and one based on the conditional probability prediction model. By combining the two, on the one hand, the dictionary label sequence obtained through dictionary matching serves as a reference factor in the prediction of the conditional probability prediction model, so that the segmentation result finally predicted by the conditional probability prediction model is more accurate, improving the accuracy of the predicted segmentation result; on the other hand, the conditional probability prediction model is introduced which, given the character sequence and the dictionary label sequence corresponding to the text to be segmented, predicts the conditional probability that the character sequence is marked with a certain participle label sequence, so that the participle label sequence corresponding to the character sequence can be obtained directly. In this way, the participle labels corresponding to all the characters in the text to be segmented can be obtained through a single prediction pass, which can also improve the efficiency of text segmentation.
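The combined scoring step — weighting the state- and transfer-function values for each candidate participle label sequence, normalising over all candidates, and selecting the highest-probability one as the target — can be sketched as a log-linear (softmax) model. The weight values and feature counts below are made up for illustration:

```python
import math

def conditional_probabilities(feature_values, w_state, w_transfer):
    """P(y|x) proportional to exp(sum_k w1_k*s_k + sum_l w2_l*t_l),
    normalised over all candidate participle label sequences.
    feature_values[k] = (state function values, transfer function values)
    of candidate k; w_state / w_transfer are the first and second weights."""
    scores = []
    for state_vals, trans_vals in feature_values:
        score = sum(w * v for w, v in zip(w_state, state_vals))
        score += sum(w * v for w, v in zip(w_transfer, trans_vals))
        scores.append(score)
    z = sum(math.exp(s) for s in scores)           # normalisation constant
    return [math.exp(s) / z for s in scores]

# Two candidate sequences; the first fires one state and one transfer feature.
feats = [([1.0], [1.0]), ([0.0], [0.0])]
probs = conditional_probabilities(feats, w_state=[2.0], w_transfer=[1.0])
best = max(range(len(probs)), key=probs.__getitem__)  # target sequence index
print(best)
```

The candidate whose features fire under the learned weights receives the higher conditional probability and is chosen as the target participle label sequence, from which the word boundaries are read off.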
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems and devices described above may refer to the corresponding processes in the foregoing method embodiments and will not be repeated here. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and there may be other divisions in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (15)
1. A text segmentation method, characterized by comprising:
converting a text to be segmented into a character string;
matching sub-strings of a preset length contained in the character string against standard words in a pre-built dictionary, determining matched character strings that match the standard words, and assigning a corresponding dictionary label to each character of the matched character strings in the character string and to each character outside the matched character strings, respectively, to obtain a dictionary label sequence;
determining at least one participle label corresponding to each character in the character string, to obtain a plurality of participle label sequences;
determining, according to the character string, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability that the character string is marked with each participle label sequence; and
taking the participle label sequence corresponding to the conditional probability that satisfies a preset condition as the target participle label sequence, and performing word segmentation on the text to be segmented based on the target participle label sequence.
2. The method according to claim 1, characterized in that the determining, according to the character string, the dictionary label sequence and the pre-trained conditional probability prediction model, the conditional probability that the character string is marked with each participle label sequence comprises:
determining a plurality of feature templates according to the character string and/or the dictionary label sequence;
generating at least one state function and at least one transfer function according to the determined feature templates;
determining the value of each state function and the value of each transfer function in the case that the character string is marked with each participle label sequence; and
inputting the values of the state functions and transfer functions corresponding to each participle label sequence into the pre-trained conditional probability prediction model, and separately calculating the conditional probability that the character string is marked with that participle label sequence.
3. The method according to claim 2, characterized in that the feature templates comprise at least one of the following templates:
a character feature template for representing a single character in the character string;
a character feature template for representing an association between different characters in the character string;
a dictionary feature template for representing a single dictionary label in the dictionary label sequence;
a dictionary feature template for representing an association between different dictionary labels in the dictionary label sequence; and
a compound feature template composed of a character feature template and a dictionary feature template.
4. The method according to claim 3, characterized in that the character string includes p characters, the dictionary label sequence includes p dictionary labels, and each participle label sequence includes p participle labels;
if the feature templates include the character feature template, the state function s(y, x, i, j) generated according to the character feature template is as follows:
if the feature templates include the dictionary feature template, the state function s(y, x, i, j) generated according to the dictionary feature template is as follows:
if the feature templates include the compound feature template, the state function s(y, x, i, j) generated according to the compound feature template is as follows:
wherein x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence, and y denotes the participle label sequence; when j=0, x denotes the character string in the two-dimensional sequence, and when j=1, x denotes the dictionary label sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, and d takes 0 or any positive integer up to p-i; y_i denotes the i-th participle label of the participle label sequence y; n_1 denotes a given participle label value against which the i-th participle label of the participle label sequence y is matched, m denotes the character at position i±d in the character string, and h denotes the dictionary label at position i±d in the dictionary label sequence.
5. The method according to claim 3, characterized in that the character string includes p characters and each participle label sequence includes p participle labels;
if the feature templates include the character feature template, the transfer function t(y, x, i, j) generated according to the character feature template is as follows:
wherein x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence, and y denotes the participle label sequence; when j=0, x denotes the character string in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, and d takes 0 or any positive integer up to p-i; y_i denotes the i-th participle label of the participle label sequence y, and y_{i-1} denotes the (i-1)-th participle label of the participle label sequence y; n_1 and n_2 denote given participle label values against which the i-th and the (i-1)-th participle labels of the participle label sequence y are matched, respectively, and m denotes the character at position i±d in the character string.
6. The method according to any one of claims 1 to 5, characterized in that the at least one participle label comprises: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the end position of a word, and a fourth label for a single-character word;
the determining at least one participle label corresponding to each character in the character string to obtain a plurality of participle label sequences comprises:
determining the at least one participle label corresponding to each character in the character string; and
arbitrarily selecting one participle label from the at least one participle label corresponding to each character as the target participle label of that character, and taking the sequence composed of the target participle labels of all characters as one participle label sequence.
7. The method according to any one of claims 1 to 5, characterized in that the assigning a corresponding dictionary label to each character of the matched character strings in the character string and to each character outside the matched character strings, respectively, to obtain the dictionary label sequence comprises:
assigning a dictionary label to each character in the character string according to the following rule, to obtain a dictionary label sequence composed of dictionary labels:
for any character in the character string, if the character belongs to a matched character string, assigning a first dictionary label to the character; if the character lies outside the matched character strings, assigning a second dictionary label to the character.
8. The method according to claim 1, characterized in that the conditional probability prediction model is trained in the following manner:
obtaining a sample set, the sample set including multiple groups of samples, each group of samples including a sample character string corresponding to a sample text to be segmented, a sample dictionary label sequence and at least one sample participle label sequence;
for each group of samples, determining, according to at least one of the sample character string and the sample dictionary label sequence, the value of each state function and the value of each transfer function in the case that the sample character string in the group is marked with each sample participle label sequence;
inputting the values of the state functions and transfer functions determined for each group of samples into the conditional probability prediction model to be trained, and determining the conditional probability function corresponding to each group of samples, the conditional probability function including a first weight of the state functions and a second weight of the transfer functions;
inputting the conditional probability function determined for each group of samples into a preset loss function as an independent variable, and determining the loss value of the preset loss function by adjusting the value of the first weight and the value of the second weight included in the preset loss function; and
when the loss value satisfies a preset convergence condition, determining the first current value of the first weight and the second current value of the second weight, and taking the model obtained with the first weight at the first current value and the second weight at the second current value as the trained conditional probability prediction model.
9. A text segmentation device, characterized by comprising:
a conversion module, configured to convert a text to be segmented into a character string;
a first determining module, configured to match sub-strings of a preset length contained in the character string against standard words in a pre-built dictionary, determine matched character strings that match the standard words, and assign a corresponding dictionary label to each character of the matched character strings in the character string and to each character outside the matched character strings, respectively, to obtain a dictionary label sequence;
a second determining module, configured to determine at least one participle label corresponding to each character in the character string, to obtain a plurality of participle label sequences;
a conditional probability prediction module, configured to determine, according to the character string, the dictionary label sequence and a pre-trained conditional probability prediction model, the conditional probability that the character string is marked with each participle label sequence; and
a word segmentation module, configured to take the participle label sequence corresponding to the conditional probability that satisfies a preset condition as the target participle label sequence, and to perform word segmentation on the text to be segmented based on the target participle label sequence.
10. The device according to claim 9, characterized in that, when determining, according to the character string, the dictionary label sequence and the pre-trained conditional probability prediction model, the conditional probability that the character string is marked with each participle label sequence, the conditional probability prediction module is specifically configured to:
determine a plurality of feature templates according to the character string and/or the dictionary label sequence;
generate at least one state function and at least one transfer function according to the determined feature templates;
determine the value of each state function and the value of each transfer function in the case that the character string is marked with each participle label sequence; and
input the values of the state functions and transfer functions corresponding to each participle label sequence into the pre-trained conditional probability prediction model, and separately calculate the conditional probability that the character string is marked with that participle label sequence.
11. The device according to claim 10, characterized in that the feature templates comprise at least one of the following templates:
a character feature template for representing a single character in the character string;
a character feature template for representing an association between different characters in the character string;
a dictionary feature template for representing a single dictionary label in the dictionary label sequence;
a dictionary feature template for representing an association between different dictionary labels in the dictionary label sequence; and
a compound feature template composed of a character feature template and a dictionary feature template.
12. The device according to claim 11, characterized in that the character string includes p characters, the dictionary label sequence includes p dictionary labels, and each participle label sequence includes p participle labels;
if the feature templates include the character feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module according to the character feature template is as follows:
if the feature templates include the dictionary feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module according to the dictionary feature template is as follows:
if the feature templates include the compound feature template, the state function s(y, x, i, j) generated by the conditional probability prediction module according to the compound feature template is as follows:
wherein x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence, and y denotes the participle label sequence; when j=0, x denotes the character string in the two-dimensional sequence, and when j=1, x denotes the dictionary label sequence in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, x_{i±d, j=1} denotes the dictionary label at position i±d of the dictionary label sequence, and d takes 0 or any positive integer up to p-i; y_i denotes the i-th participle label of the participle label sequence y; n_1 denotes a given participle label value against which the i-th participle label of the participle label sequence y is matched, m denotes the character at position i±d in the character string, and h denotes the dictionary label at position i±d in the dictionary label sequence.
13. The device according to claim 11, characterized in that the character string includes p characters, the dictionary label sequence includes p dictionary labels, and each participle label sequence includes p participle labels;
if the feature templates include the character feature template, the transfer function t(y, x, i, j) generated by the conditional probability prediction module according to the character feature template is as follows:
wherein x denotes the two-dimensional sequence composed of the character string and the dictionary label sequence, and y denotes the participle label sequence; when j=0, x denotes the character string in the two-dimensional sequence; i takes any integer from 1 to p; x_{i±d, j=0} denotes the character at position i±d of the character string, and d takes 0 or any positive integer up to p-i; y_i denotes the i-th participle label of the participle label sequence y, and y_{i-1} denotes the (i-1)-th participle label of the participle label sequence y; n_1 and n_2 denote given participle label values against which the i-th and the (i-1)-th participle labels of the participle label sequence y are matched, respectively, and m denotes the character at position i±d in the character string.
14. The device according to any one of claims 9 to 13, characterized in that the at least one participle label comprises: a first label for the starting position of a word, a second label for a middle position of a word, a third label for the end position of a word, and a fourth label for a single-character word;
when determining the at least one participle label corresponding to each character in the character string to obtain a plurality of participle label sequences, the second determining module is specifically configured to:
determine the at least one participle label corresponding to each character in the character string; and
arbitrarily select one participle label from the at least one participle label corresponding to each character as the target participle label of that character, and take the sequence composed of the target participle labels of all characters as one participle label sequence.
15. The device according to any one of claims 9 to 13, wherein the first determining module, when assigning a corresponding dictionary label to each character of the matched character string in the character string and to each character other than the matched character string, respectively, to obtain the dictionary label sequence, is specifically configured to:
assign a dictionary label to each character in the character string according to the following rule, to obtain a dictionary label sequence composed of dictionary labels:
for any character in the character string, if the character is a character in the matched character string, assign a first dictionary label to the character; if the character is a character other than the matched character string, assign a second dictionary label to the character.
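A minimal sketch of the labelling rule in claim 15, assuming hypothetical label values "D" (the first dictionary label, for characters inside a matched string) and "O" (the second dictionary label, for all other characters):

```python
# Assign each character a dictionary label: "D" if it falls inside a
# span matched against the pre-built dictionary, "O" otherwise.
def dictionary_labels(text, matched_spans):
    in_match = set()
    for start, end in matched_spans:  # spans are end-exclusive
        in_match.update(range(start, end))
    return ["D" if i in in_match else "O" for i in range(len(text))]

# Assume "北京大学" (positions 2..5) matched a standard word in the
# dictionary; the other characters fall outside any match.
labels = dictionary_labels("我爱北京大学", [(2, 6)])
```

The resulting dictionary label sequence is then fed, together with the character string, into the conditional probability prediction model.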
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910094380.2A CN109829162B (en) | 2019-01-30 | 2019-01-30 | Text word segmentation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910094380.2A CN109829162B (en) | 2019-01-30 | 2019-01-30 | Text word segmentation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109829162A true CN109829162A (en) | 2019-05-31 |
CN109829162B CN109829162B (en) | 2022-04-08 |
Family
ID=66863299
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910094380.2A Active CN109829162B (en) | 2019-01-30 | 2019-01-30 | Text word segmentation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109829162B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688853A (en) * | 2019-08-12 | 2020-01-14 | 平安科技(深圳)有限公司 | Sequence labeling method and device, computer equipment and storage medium |
CN110795938A (en) * | 2019-11-11 | 2020-02-14 | 北京小米智能科技有限公司 | Text sequence word segmentation method, device and storage medium |
CN111026282A (en) * | 2019-11-27 | 2020-04-17 | 上海明品医学数据科技有限公司 | Control method for judging whether to label medical data in input process |
CN111695355A (en) * | 2020-05-26 | 2020-09-22 | 平安银行股份有限公司 | Address text recognition method, device, medium and electronic equipment |
CN111831929A (en) * | 2019-09-24 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Method and device for acquiring POI information |
CN112101021A (en) * | 2020-09-03 | 2020-12-18 | 沈阳东软智能医疗科技研究院有限公司 | Method, device and equipment for realizing standard word mapping |
CN112464667A (en) * | 2020-11-18 | 2021-03-09 | 北京华彬立成科技有限公司 | Text entity identification method and device, electronic equipment and storage medium |
CN112861531A (en) * | 2021-03-22 | 2021-05-28 | 北京小米移动软件有限公司 | Word segmentation method, word segmentation device, storage medium and electronic equipment |
CN113609850A (en) * | 2021-07-02 | 2021-11-05 | 北京达佳互联信息技术有限公司 | Word segmentation processing method and device, electronic equipment and storage medium |
CN117493540A (en) * | 2023-12-28 | 2024-02-02 | 荣耀终端有限公司 | Text matching method, terminal device and computer readable storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101082909A (en) * | 2007-06-28 | 2007-12-05 | 腾讯科技(深圳)有限公司 | Method and system for dividing Chinese sentences for recognizing deriving word |
WO2010024052A1 (en) * | 2008-08-27 | 2010-03-04 | 日本電気株式会社 | Device for verifying speech recognition hypothesis, speech recognition device, and method and program used for same |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN102262634A (en) * | 2010-05-24 | 2011-11-30 | 北京大学深圳研究生院 | Automatic questioning and answering method and system |
CN102929870A (en) * | 2011-08-05 | 2013-02-13 | 北京百度网讯科技有限公司 | Method for establishing word segmentation model, word segmentation method and devices using methods |
CN103020034A (en) * | 2011-09-26 | 2013-04-03 | 北京大学 | Chinese words segmentation method and device |
CN103678318A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
CN108038103A (en) * | 2017-12-18 | 2018-05-15 | 北京百分点信息科技有限公司 | A kind of method, apparatus segmented to text sequence and electronic equipment |
Non-Patent Citations (3)
Title |
---|
QI-YU JIANG; HONG-YI LI; JIA-FEN LIANG; QING-XIANG WANG等: ""Multi-combined Features Text Mining of TCM Medical Cases with CRF"", 《2016 8TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION (ITME)》 * |
YI-FENG PAN; XINWEN HOU; CHENG-LIN LIU: ""Text Localization in Natural Scene Images Based on Conditional Random Field"", 《2009 10TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION》 * |
ZHOU QI: "Research and Implementation of Chinese Word Segmentation Combining Statistics and Dictionaries", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688853A (en) * | 2019-08-12 | 2020-01-14 | 平安科技(深圳)有限公司 | Sequence labeling method and device, computer equipment and storage medium |
WO2021027125A1 (en) * | 2019-08-12 | 2021-02-18 | 平安科技(深圳)有限公司 | Sequence labeling method and apparatus, computer device and storage medium |
CN111831929A (en) * | 2019-09-24 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Method and device for acquiring POI information |
CN111831929B (en) * | 2019-09-24 | 2024-01-02 | 北京嘀嘀无限科技发展有限公司 | Method and device for acquiring POI information |
CN110795938B (en) * | 2019-11-11 | 2023-11-10 | 北京小米智能科技有限公司 | Text sequence word segmentation method, device and storage medium |
CN110795938A (en) * | 2019-11-11 | 2020-02-14 | 北京小米智能科技有限公司 | Text sequence word segmentation method, device and storage medium |
CN111026282B (en) * | 2019-11-27 | 2023-05-23 | 上海明品医学数据科技有限公司 | Control method for judging whether medical data labeling is carried out in input process |
CN111026282A (en) * | 2019-11-27 | 2020-04-17 | 上海明品医学数据科技有限公司 | Control method for judging whether to label medical data in input process |
CN111695355A (en) * | 2020-05-26 | 2020-09-22 | 平安银行股份有限公司 | Address text recognition method, device, medium and electronic equipment |
CN111695355B (en) * | 2020-05-26 | 2024-05-14 | 平安银行股份有限公司 | Address text recognition method and device, medium and electronic equipment |
CN112101021A (en) * | 2020-09-03 | 2020-12-18 | 沈阳东软智能医疗科技研究院有限公司 | Method, device and equipment for realizing standard word mapping |
CN112464667A (en) * | 2020-11-18 | 2021-03-09 | 北京华彬立成科技有限公司 | Text entity identification method and device, electronic equipment and storage medium |
CN112464667B (en) * | 2020-11-18 | 2021-11-16 | 北京华彬立成科技有限公司 | Text entity identification method and device, electronic equipment and storage medium |
CN112861531A (en) * | 2021-03-22 | 2021-05-28 | 北京小米移动软件有限公司 | Word segmentation method, word segmentation device, storage medium and electronic equipment |
CN112861531B (en) * | 2021-03-22 | 2023-11-14 | 北京小米移动软件有限公司 | Word segmentation method, device, storage medium and electronic equipment |
CN113609850A (en) * | 2021-07-02 | 2021-11-05 | 北京达佳互联信息技术有限公司 | Word segmentation processing method and device, electronic equipment and storage medium |
CN113609850B (en) * | 2021-07-02 | 2024-05-17 | 北京达佳互联信息技术有限公司 | Word segmentation processing method and device, electronic equipment and storage medium |
CN117493540A (en) * | 2023-12-28 | 2024-02-02 | 荣耀终端有限公司 | Text matching method, terminal device and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109829162B (en) | 2022-04-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109829162A (en) | Text word segmentation method and device | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN106649288B (en) | Artificial intelligence based translation method and device | |
CN112256828B (en) | Medical entity relation extraction method, device, computer equipment and readable storage medium | |
Creutz et al. | Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0 | |
CN109948149B (en) | Text classification method and device | |
CN108509411A (en) | Semantic analysis and device | |
US20190204296A1 (en) | Nanopore sequencing base calling | |
US9483739B2 (en) | Transductive feature selection with maximum-relevancy and minimum-redundancy criteria | |
CN107957993B (en) | English sentence similarity calculation method and device | |
CN110163181B (en) | Sign language identification method and device | |
CN111310440B (en) | Text error correction method, device and system | |
CN107193807A (en) | Language conversion processing method, device and terminal based on artificial intelligence | |
US11347995B2 (en) | Neural architecture search with weight sharing | |
KR102134472B1 (en) | A method for searching optimal structure of convolution neural network using genetic algorithms | |
CN108108347B (en) | Dialogue mode analysis system and method | |
CN110222329A (en) | Chinese word segmentation method and device based on deep learning | |
CN110457470A (en) | Text classification model learning method and device | |
CN114528835A (en) | Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination | |
CN111488460B (en) | Data processing method, device and computer readable storage medium | |
CN113239697B (en) | Entity recognition model training method and device, computer equipment and storage medium | |
CN110569355A (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN113033709A (en) | Link prediction method and device | |
CN110334204B (en) | Exercise similarity calculation recommendation method based on user records | |
Yeh et al. | MSRCall: A multi-scale deep neural network to basecall Oxford nanopore sequences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||