CN108363686A - Character string word segmentation method, device, terminal device and storage medium
- Publication number: CN108363686A (application CN201810030722.XA)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a character string word segmentation method, device, terminal device and storage medium. The method includes: obtaining a Chinese character string to be processed; splitting the Chinese character string into single characters to obtain n single characters a_i, and combining a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1}; if the temporary word a_i a_{i+1} exists in a preset common-word lexicon, combining it with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2} and continuing to look that temporary word up in the common-word lexicon, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon; identifying a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and, starting from the single character a_k, continuing to combine it with adjacent single characters and look the combinations up. The technical solution of the invention does not require an initial word length to be set and avoids common words being split incorrectly, which improves the accuracy of character string word segmentation. The implementation is also simpler and executes more efficiently, which effectively improves the generality and efficiency of character string word segmentation.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a character string word segmentation method, device, terminal device and storage medium.
Background
At present, conventional character string word segmentation methods generally use the forward maximum matching algorithm. The algorithm first requires a maximum word length to be set. It then scans the character string to be recognized from left to right according to that maximum word length and matches a substring of the maximum word length against the words in a dictionary. If there is no match, it shortens the length and tries again, until a match is found or only a single character remains.
However, the choice of maximum word length affects both the accuracy and the efficiency of character string word segmentation. If the maximum word length is too short, long words are split incorrectly, which harms accuracy; if it is too long, the number of matching attempts increases markedly, which harms execution efficiency.
In practice, reaching a satisfactory segmentation result often also requires adjusting the maximum word length repeatedly, which makes segmentation inefficient.
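By way of illustration, the forward maximum matching algorithm described in the background can be sketched in Python as follows. This is a minimal sketch and is not taken from the patent: the function name `forward_maximum_match`, the parameter `max_len` and the sample lexicon are assumptions made for the example, and the sample string is an assumed Chinese rendering of a Shenzhen address.

```python
def forward_maximum_match(text, lexicon, max_len):
    """Background algorithm: try the longest window first, then shrink it."""
    words = []
    i = 0
    while i < len(text):
        # Window sizes from max_len (or the remaining length) down to 1.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical sample lexicon, for illustration only.
sample_lexicon = {"深圳", "深圳市", "罗湖", "罗湖区"}
print(forward_maximum_match("深圳市罗湖区", sample_lexicon, max_len=3))
print(forward_maximum_match("深圳市罗湖区", sample_lexicon, max_len=2))
```

With this assumed lexicon, a maximum word length of 3 keeps "深圳市" whole, whereas a maximum word length of 2 splits it into "深圳" and "市", which is the sensitivity to the maximum word length that the background describes.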
Summary of the invention
Embodiments of the present invention provide a character string word segmentation method, device, terminal device and storage medium, to solve the problems of low accuracy and low efficiency of character string word segmentation in the prior art.
In a first aspect, an embodiment of the present invention provides a character string word segmentation method, comprising:
obtaining a Chinese character string to be processed;
splitting the Chinese character string into single characters to obtain n single characters a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;
if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1};
if the temporary word a_i a_{i+1} exists in a preset common-word lexicon, combining the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to look up the temporary word a_i a_{i+1} a_{i+2} in the common-word lexicon, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon, where k ∈ [i+1, n];
if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon, identifying a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and starting again from the single character a_k: if k equals n, identifying a_k as a valid word; if k is less than n, taking a_k as a_i and continuing with the step of, if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1};
if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common-word lexicon and k = n, identifying a_i a_{i+1} a_{i+2} ... a_k as a valid word;
determining the identified valid words as the word segmentation result of the Chinese character string.
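The steps of the first aspect can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the function name `segment`, the use of a Python set as the common-word lexicon and the sample lexicon entries are illustrative and are not specified in the source. Indices are 0-based in the code, whereas the text counts from 1.

```python
def segment(text, lexicon):
    """Segment a Chinese string by greedy one-character extension."""
    chars = list(text)          # single-character split
    n = len(chars)
    words = []
    i = 0                       # 0-based index of the current start character
    while i < n:
        end = i + 1             # chars[i:end] is the word accepted so far
        # Extend by one character while the longer temporary word is found.
        while end < n and "".join(chars[i:end + 1]) in lexicon:
            end += 1
        # The last temporary word that was found (or the single character
        # itself, which is accepted without a lookup) becomes a valid word.
        words.append("".join(chars[i:end]))
        i = end                 # restart from the first unconsumed character
    return words


# Hypothetical sample lexicon, for illustration only.
sample_lexicon = {"深圳", "深圳市", "罗湖", "罗湖区"}
print(segment("深圳市罗湖区", sample_lexicon))
```

In this sketch a longer word is reached only through its shorter prefixes, because the temporary word grows one character at a time. With a lexicon that contains "罗湖区" but not "罗湖", the string "罗湖区" would therefore be split into three single characters.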
In a second aspect, an embodiment of the present invention provides a character string word segmentation device, comprising:
an obtaining module, configured to obtain a Chinese character string to be processed;
a splitting module, configured to split the Chinese character string into single characters to obtain n single characters a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;
a combination module, configured to, if i is less than n, combine a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1};
a loop lookup module, configured to, if the temporary word a_i a_{i+1} exists in a preset common-word lexicon, combine the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continue to look up the temporary word a_i a_{i+1} a_{i+2} in the common-word lexicon, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon, where k ∈ [i+1, n];
a first identification module, configured to, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon, identify a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and start again from the single character a_k: if k equals n, identify a_k as a valid word; if k is less than n, take a_k as a_i and continue with the step of, if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1};
a second identification module, configured to, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common-word lexicon and k = n, identify a_i a_{i+1} a_{i+2} ... a_k as a valid word;
a result determination module, configured to determine the identified valid words as the word segmentation result of the Chinese character string.
In a third aspect, an embodiment of the present invention provides a terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the character string word segmentation method when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, where the computer program implements the steps of the character string word segmentation method when executed by a processor.
Compared with the prior art, the character string word segmentation method provided by the embodiments of the present invention has the following advantages. In the character string word segmentation method, device, terminal device and storage medium provided by the embodiments of the present invention, the obtained Chinese character string to be processed is first split into single characters. Then, starting from the first single character, the character is combined with its adjacent single character into a temporary word, and the temporary word is looked up in the common-word lexicon. If the temporary word exists in the common-word lexicon, it is combined with the next adjacent single character into a new temporary word, which is again looked up in the common-word lexicon. As long as the lookup succeeds, combining and looking up continue, until the most recently combined temporary word cannot be found in the common-word lexicon. The previous temporary word is then taken as a valid word, and the single character that remains after the valid word is removed is combined with the next adjacent single character and looked up in the common-word lexicon, and so on, until all single characters of the Chinese character string have been processed. Because segmentation is performed by single-character splitting and combination, no initial word length needs to be set and common words are not split incorrectly, which improves the accuracy of character string word segmentation. In addition, compared with the conventional forward maximum matching algorithm, the technical solution of the embodiments of the present invention is simpler to implement and executes more efficiently, which effectively improves the generality and efficiency of character string word segmentation.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the implementation flow chart of the character string word segmentation method provided by Embodiment 1 of the present invention;
Fig. 2 is the implementation flow chart of step S1 in the character string word segmentation method provided by Embodiment 1 of the present invention;
Fig. 3 is the implementation flow chart, in the character string word segmentation method provided by Embodiment 1 of the present invention, of analyzing the degree of association between the policies corresponding to the policy information that contains two address strings;
Fig. 4 is the implementation flow chart, in the character string word segmentation method provided by Embodiment 1 of the present invention, of establishing the common-word lexicon and updating it;
Fig. 5 is the schematic diagram of the character string word segmentation device provided by Embodiment 2 of the present invention;
Fig. 6 is the schematic diagram of the terminal device provided by Embodiment 4 of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
Referring to Fig. 1, Fig. 1 shows the implementation flow of the character string word segmentation method provided by an embodiment of the present invention. The method can be applied to the matching analysis of policy information in the insurance industry. The details are as follows:
S1: Obtain a Chinese character string to be processed.
In the embodiments of the present invention, the manner of obtaining the Chinese character string to be processed is not limited. It may be a Chinese character string entered by a user, a Chinese character string recognized by image scanning or image recognition, or a Chinese character string extracted from stored text information.
Further, the Chinese character string may be obtained from the policy information in a policy database. Policy information is the attribute information of the insurance policy generated when a user purchases an insurance product, for example home address information or employer information.
It should be noted that the character string word segmentation method in the embodiments of the present invention segments Chinese character strings. If the character string to be segmented contains, in addition to Chinese characters, other characters such as English characters or Arabic numerals, the non-Chinese characters must first be identified, and word segmentation is then performed on each of the Chinese character strings separated by the non-Chinese characters.
S2: Split the Chinese character string to be processed into single characters to obtain n single characters a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string.
In the embodiments of the present invention, the Chinese character string to be processed is split character by character to obtain n Chinese characters a_i. The n Chinese characters a_i may be stored in an array in the order in which they appear in the Chinese character string, with each Chinese character as one element of the array. That is, the single characters obtained by the split are arranged from left to right in the same order as the Chinese characters in the Chinese character string.
For example, if the Chinese character string to be processed is "深圳市罗湖区" (Luohu District, Shenzhen City), the single-character split yields six single characters: a_1 = 深, a_2 = 圳, a_3 = 市, a_4 = 罗, a_5 = 湖, a_6 = 区.
S3: If i is less than n, combine a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1}.
Specifically, starting from the first character a_1, a_1 is combined with a_2 to obtain the temporary word a_1 a_2.
It should be noted that when i = n, a_i is already the last character of the Chinese character string to be processed and has no adjacent single character after it, so no new temporary word can be formed.
S4: If the temporary word a_i a_{i+1} exists in the preset common-word lexicon, combine the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continue to look up the temporary word a_i a_{i+1} a_{i+2} in the common-word lexicon, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon, where k ∈ [i+1, n].
In the embodiments of the present invention, the preset common-word lexicon contains common Chinese words, and the common-word lexicon may be updated periodically.
If the temporary word obtained in step S3 exists in the common-word lexicon, the temporary word a_i a_{i+1} is combined with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}. If the temporary word a_i a_{i+1} a_{i+2} still exists in the common-word lexicon, it is combined with a_{i+3} to obtain the temporary word a_i a_{i+1} a_{i+2} a_{i+3}, which is looked up in the common-word lexicon in turn, and so on. The lookup ends when a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon.
It should be noted that k is greater than or equal to i+1 and less than or equal to n. When k = n, the temporary word a_i a_{i+1} a_{i+2} ... a_n being looked up already reaches the last single character of the Chinese character string to be processed. Therefore, once the lookup of the temporary word a_i a_{i+1} a_{i+2} ... a_n is complete, no further lookup is performed, whether or not the temporary word exists in the common-word lexicon.
Continuing with the Chinese character string to be processed "深圳市罗湖区" from step S2 as an example: for the six single characters obtained in step S2 (a_1 = 深, a_2 = 圳, a_3 = 市, a_4 = 罗, a_5 = 湖, a_6 = 区), "深" and "圳" are first combined into the temporary word "深圳", which is looked up in the common-word lexicon. Since "深圳" exists in the common-word lexicon, it is combined with the adjacent "市" into the temporary word "深圳市", which is looked up in turn. Since "深圳市" also exists in the common-word lexicon, it is combined with the adjacent "罗" into the temporary word "深圳市罗". Since "深圳市罗" does not exist in the common-word lexicon, the lookup stops, and the current temporary word is "深圳市罗".
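The sequence of lookups in this worked example can be reproduced with a small Python sketch. It is a hedged illustration only: the helper name `trace_lookups`, the 0-based `start` index and the sample lexicon are assumptions, and the sample string is an assumed Chinese rendering of the Shenzhen example.

```python
def trace_lookups(chars, start, lexicon):
    """Return the (temporary word, found) pairs tried from position start."""
    lookups = []
    end = start + 2             # the first temporary word has two characters
    while end <= len(chars):
        temp = "".join(chars[start:end])
        found = temp in lexicon
        lookups.append((temp, found))
        if not found:
            break               # stop at the first temporary word not found
        end += 1                # otherwise append the next adjacent character
    return lookups


# Hypothetical sample lexicon, for illustration only.
sample_lexicon = {"深圳", "深圳市", "罗湖", "罗湖区"}
sample_chars = list("深圳市罗湖区")
print(trace_lookups(sample_chars, 0, sample_lexicon))
```

Starting from the first character, the sketch tries "深圳" and "深圳市", both of which are found, and then "深圳市罗", which is not found, so the lookup stops there.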
S5: If the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word lexicon, identify a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and start again from the single character a_k: if k equals n, identify a_k as a valid word; if k is less than n, take a_k as a_i and return to step S3.
Specifically, if the temporary word a_i a_{i+1} a_{i+2} ... a_k obtained in step S4 does not exist in the common-word lexicon, a_i a_{i+1} a_{i+2} ... a_{k-1} is identified as a valid word. Then, starting from the single character a_k, if k is less than n, a_k is taken as a_i and step S3 is executed again.
When k = i+1, that is, when a_i a_{i+1} does not exist in the common-word lexicon, the single character a_i is identified as a valid word, without checking whether the single character a_i itself exists in the common-word lexicon.
When k = n, if a_i a_{i+1} a_{i+2} ... a_n does not exist in the common-word lexicon, a_k is already the last character of the Chinese character string to be processed and has no adjacent single character, so no new temporary word can be formed. The flow therefore does not return to step S3; a_k is identified directly as a valid word, and the flow jumps to step S7.
Continuing with the Chinese character string to be processed "深圳市罗湖区" from step S2 as an example: since the temporary word "深圳市罗" does not exist in the common-word lexicon, "深圳市" is identified as a valid word. Then, starting from the single character "罗", the flow returns to step S3, where "罗" is combined with the adjacent "湖" into the new temporary word "罗湖", and the lookup continues in the common-word lexicon.
S6: If the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common-word lexicon and k = n, identify a_i a_{i+1} a_{i+2} ... a_k as a valid word.
Specifically, if the temporary word a_i a_{i+1} a_{i+2} ... a_k obtained in step S4 exists in the common-word lexicon and k = n, the temporary word has reached the last single character of the Chinese character string to be processed and exists in the common-word lexicon. It is therefore identified directly as a valid word, which completes the identification of valid words for the Chinese character string to be processed.
Continuing with the Chinese character string to be processed "深圳市罗湖区" from step S2 as an example: after "深圳市" has been identified as a valid word because the temporary word "深圳市罗" does not exist in the common-word lexicon, the flow starts again from the single character "罗". According to step S3, "罗" is combined with the adjacent "湖" into the new temporary word "罗湖", which is looked up in the common-word lexicon. If "罗湖" exists in the common-word lexicon, it is combined with the adjacent "区" into the temporary word "罗湖区", which is looked up in turn. If "罗湖区" exists in the common-word lexicon, then since k = n at this point, the loop ends and "罗湖区" is identified directly as a valid word.
S7: Determine the identified valid words as the word segmentation result of the Chinese character string to be processed.
Specifically, the valid words identified in steps S5 and S6 are taken as the word segmentation result of the Chinese character string.
Continuing with the Chinese character string to be processed "深圳市罗湖区" from step S2 as an example: the valid words identified in steps S5 and S6 are "深圳市" and "罗湖区", so the word segmentation result of the Chinese character string "深圳市罗湖区" is "深圳市" and "罗湖区".
In the embodiment corresponding to Fig. 1, the obtained Chinese character string to be processed is first split into single characters. Then, starting from the first single character, the character is combined with its adjacent single character into a temporary word, and the temporary word is looked up in the common-word lexicon. If the temporary word exists in the common-word lexicon, it is combined with the next adjacent single character into a new temporary word, which is again looked up in the common-word lexicon. As long as the lookup succeeds, combining and looking up continue, until the most recently combined temporary word cannot be found in the common-word lexicon. The previous temporary word is then taken as a valid word, and the single character that remains after the valid word is removed is combined with the next adjacent single character and looked up in the common-word lexicon, and so on, until all single characters of the Chinese character string have been processed. Because segmentation is performed by single-character splitting and combination, no initial word length needs to be set and common words are not split incorrectly, which improves the accuracy of character string word segmentation. In addition, compared with the conventional forward maximum matching algorithm, the technical solution of the embodiments of the present invention is simpler to implement and executes more efficiently, which effectively improves the generality and efficiency of character string word segmentation.
Next, on the basis of the embodiment corresponding to Fig. 1, a specific embodiment is used to describe in detail how the Chinese character string to be processed, mentioned in step S1, is obtained.
Referring to Fig. 2, Fig. 2 shows the specific implementation flow of step S1 of the embodiment of the present invention. The details are as follows:
S11: Obtain a character string to be processed that contains Chinese characters and non-Chinese characters.
In the embodiments of the present invention, the manner of obtaining the character string to be processed is not limited. It may be information entered by a user, a character string recognized by image scanning or image recognition, or a character string extracted from stored text information.
Further, the character string to be processed may be obtained directly from the policy information in the policy database. Policy information is the attribute information of the insurance policy generated when a user purchases an insurance product, for example home address information or employer information.
Specifically, the character string to be processed contains Chinese characters and non-Chinese characters, where non-Chinese characters are all characters other than Chinese characters, including English characters and Arabic numerals, for example punctuation marks, English words, English letters, Arabic numerals and special symbols such as "@".
For example, if the character string to be processed is the employer information in the policy information, "地址:罗湖区笋岗东路3012号中民时代广场B座" (Address: Block B, Zhongmin Times Square, No. 3012 Sungang East Road, Luohu District), the employer information contains other characters, namely a punctuation mark, an English character and Arabic numerals.
S12: Identify the m non-Chinese character strings in the character string to be processed, where a non-Chinese character string contains only consecutive adjacent non-Chinese characters, and m is zero or a positive integer.
Specifically, the Chinese characters and non-Chinese characters in the character string to be processed are identified according to the Unicode code point of each character.
Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique code to every character of every language, to meet the requirements of cross-language and cross-platform text conversion and processing. The code point range 0x4E00 to 0x9FA5 covers the commonly used Chinese characters.
The character string to be processed is traversed character by character from left to right, starting from the first character. If the Unicode code point of a character lies in the range 0x4E00 to 0x9FA5, the character is identified as a Chinese character; otherwise it is identified as a non-Chinese character. The m non-Chinese character strings are obtained from the non-Chinese characters identified. If m is zero, the character string to be processed is a Chinese character string and contains no non-Chinese characters.
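The traversal of step S12 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names `is_chinese` and `non_chinese_runs` are hypothetical, and the sample string is an assumed Chinese rendering of the address example. Only the code point range 0x4E00 to 0x9FA5 comes from the text.

```python
from itertools import groupby


def is_chinese(ch):
    """True if the code point lies in the range given in step S12."""
    return 0x4E00 <= ord(ch) <= 0x9FA5


def non_chinese_runs(text):
    """Collect maximal runs of consecutive non-Chinese characters."""
    # groupby walks the string left to right and groups adjacent characters
    # that share the same is_chinese() result.
    return ["".join(group)
            for chinese, group in groupby(text, key=is_chinese)
            if not chinese]


# Assumed sample address, for illustration only.
sample_address = "地址:罗湖区笋岗东路3012号中民时代广场B座"
print(non_chinese_runs(sample_address))
```

For the assumed sample address the sketch finds three non-Chinese character strings, so m = 3. A string made up only of Chinese characters gives an empty list, which corresponds to m = 0.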
S13: Using the m non-Chinese character strings as split points, split the character string to be processed into m+1 substrings, and determine each substring as a Chinese character string to be processed, where a substring contains only consecutive adjacent Chinese characters.
Specifically, according to the m non-Chinese character strings identified in step S12, the character string to be processed is split into m+1 substrings, each of which is a Chinese character string.
Continuing with the character string to be processed from step S11, "地址:罗湖区笋岗东路3012号中民时代广场B座", as an example: the non-Chinese character strings identified in step S12 are ":", "3012" and "B". Using these three non-Chinese character strings as split points, the character string to be processed is split into four substrings: "地址", "罗湖区笋岗东路", "号中民时代广场" and "座". These four substrings are the Chinese character strings that undergo word segmentation.
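The split of step S13 can be sketched with a regular expression in Python. This is an illustrative sketch only: the function name `split_on_non_chinese` is hypothetical, the sample string is an assumed Chinese rendering of the address example, and dropping empty substrings (which arise when the string starts or ends with a non-Chinese run) is an assumption of the sketch, not something the text specifies.

```python
import re

# One or more consecutive characters outside the range 0x4E00 to 0x9FA5.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fa5]+")


def split_on_non_chinese(text):
    """Return (Chinese substrings, non-Chinese strings) in original order."""
    non_chinese = NON_CHINESE_RUN.findall(text)
    # Assumption: empty pieces at the start or end of the string are dropped.
    chinese = [part for part in NON_CHINESE_RUN.split(text) if part]
    return chinese, non_chinese


# Assumed sample address, for illustration only.
sample_address = "地址:罗湖区笋岗东路3012号中民时代广场B座"
chinese_parts, non_chinese_parts = split_on_non_chinese(sample_address)
print(chinese_parts)
print(non_chinese_parts)
```

For the assumed sample address, three non-Chinese strings produce four Chinese substrings, matching the m+1 count in the text.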
In the embodiment corresponding to Fig. 2, the character string to be processed, which contains Chinese characters and non-Chinese characters, is split using the identified non-Chinese characters as split points. This first-pass segmentation yields m non-Chinese character strings and m+1 Chinese character strings, so the Chinese character strings in the character string to be processed can be filtered out accurately. Steps S2 to S7 can then be executed on the m+1 Chinese character strings, and the word segmentation results of the Chinese character strings, together with the non-Chinese character strings, are taken as the word segmentation result of the character string to be processed. This further improves the accuracy and efficiency of segmenting the character string to be processed.
On the basis of the embodiment corresponding to Fig. 2, if the character string to be processed is an address string stored in the policy information of the policy database, then after step S7 (determining the identified valid words as the word segmentation result of the Chinese character string) has been executed for each of two address strings, the method further includes a process of analyzing the degree of association between the policies corresponding to the policy information that contains the two address strings.
Referring to Fig. 3, Fig. 3 shows the specific implementation flow, provided by an embodiment of the present invention, of analyzing the degree of association between the policies corresponding to the policy information that contains two address strings. The details are as follows:
S81: For each address string, determine the features of the address string and their weights according to the word segmentation result of each Chinese character string in the address string.
Specifically, for each address string, after step S7 has been executed, the word segmentation results of the m+1 Chinese character strings in the address string and the m non-Chinese character strings are obtained. Each valid word in the word segmentation results is taken as a feature of the address string, and the number of times each feature occurs in the address string is taken as the weight of that feature.
S82: According to the features and weights of the two address strings, compare whether the two address strings have the same meaning.
Specifically, for each address string, the weight of each feature obtained in step S81 may be used as the term frequency of that feature, and the feature vector of the address string is built from the term frequencies of the features. The cosine similarity algorithm is then used to compute the similarity of the two address strings, and whether the two address strings have the same meaning is judged from that similarity.
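The cosine similarity of two term-frequency feature vectors can be sketched in Python as follows. This is a minimal sketch: the function name `cosine_similarity`, the dictionary representation of the feature vectors and the sample features are assumptions made for the example.

```python
import math


def cosine_similarity(features_a, features_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only features present in both addresses contribute to the dot product.
    dot = sum(weight * features_b.get(term, 0)
              for term, weight in features_a.items())
    norm_a = math.sqrt(sum(w * w for w in features_a.values()))
    norm_b = math.sqrt(sum(w * w for w in features_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0              # an empty vector is treated as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical feature dictionaries for two addresses.
address_a = {"罗湖区": 1, "笋岗": 1, "东路": 1}
address_b = {"罗湖区": 1, "笋岗": 1, "西路": 1}
print(cosine_similarity(address_a, address_b))
```

Two addresses with identical features give a similarity of about 1, and two addresses with no feature in common give 0.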
As another embodiment of the present invention, for each address string, after the features and weights have been obtained in step S81, the simhash algorithm may instead be used to obtain the binary string corresponding to the address string from its features and weights. The Hamming distance between the binary strings corresponding to the two address strings is then computed, and whether the two address strings have the same meaning is judged from the Hamming distance.
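A simhash fingerprint and the Hamming distance between two fingerprints can be sketched in Python as follows. This is a sketch under stated assumptions: the source does not specify the hash function or the fingerprint length, so the use of the first 64 bits of an MD5 digest, the constant `FINGERPRINT_BITS` and the function names are choices made for the example.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint length


def simhash(features):
    """Weighted simhash of a {feature: weight} dictionary."""
    totals = [0] * FINGERPRINT_BITS
    for term, weight in features.items():
        # Assumption: the first 8 bytes of MD5 serve as the 64-bit term hash.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[:8], "big")
        for bit in range(FINGERPRINT_BITS):
            # Add the weight where the hash bit is 1, subtract it otherwise.
            if (term_hash >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical feature dictionaries for two addresses.
address_a = {"罗湖区": 1, "笋岗": 1, "东路": 1}
address_b = {"罗湖区": 1, "笋岗": 1, "西路": 1}
print(hamming_distance(simhash(address_a), simhash(address_b)))
```

Identical feature dictionaries always give a distance of 0. The distance for differing dictionaries depends on the hash function chosen, so no particular value is stated here.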
If similarity is more than preset similarity threshold or Hamming distances are less than preset distance threshold, two are confirmed
The meaning of a address character string is identical, otherwise it is assumed that the meaning of two address character strings differs.
It should be noted that preset similarity threshold and preset distance threshold can according to the needs of practical application into
Row setting, is not limited herein.
It is understood that similarity is bigger or Hamming distances are smaller, then the meaning of two address character strings is closer,
Conversely, similarity is smaller or Hamming distances are bigger, then the meaning gap of two address character strings is bigger.
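The simhash variant described above can be sketched as follows. The source names only "simhash" and "Hamming distance"; the 64-bit fingerprint width, the use of MD5 as the per-feature hash, and the function names `simhash` and `hamming_distance` are assumptions chosen for this illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the source does not specify one


def simhash(weights):
    """Compute a 64-bit simhash fingerprint from {feature: weight}."""
    totals = [0] * FINGERPRINT_BITS
    for feature, weight in weights.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        feature_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(FINGERPRINT_BITS):
            # A set bit votes +weight, a clear bit votes -weight.
            if (feature_hash >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(fingerprint_a, fingerprint_b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(fingerprint_a ^ fingerprint_b).count("1")


# Hypothetical feature weights for one address character string.
fingerprint = simhash({"深圳市": 1, "福田区": 2})
```

Two address character strings would then be judged to have the same meaning when `hamming_distance` of their fingerprints falls below the preset distance threshold.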
It should be noted that, in the embodiment of the present invention, when comparing whether the meanings of two address character strings are the same, only the Chinese character strings are compared; whether the non-Chinese character strings are the same is not judged. This is because, when analysing the degree of association between two policies, the comparison of the Chinese character strings is already sufficient to reflect the similarity of the address character strings. When the Chinese character strings of two address character strings have the same meaning, the two policies can still be considered associated even if the non-Chinese character strings differ, for example because the house numbers are different.
In other embodiments, when the degree of association between two policies needs to be analysed from an exact match of the two address character strings, it is also possible, after the comparison of the Chinese character strings has determined that the meanings are the same, to further compare the contents of the non-Chinese character strings in the two address character strings by direct character comparison. If they are identical, the meanings of the two address character strings are confirmed to be the same; if not, the meanings are confirmed to be different.
S83: If the meanings of the two address character strings are the same, associate the policies corresponding to the policy information in which the two address character strings are located.
Specifically, if step S82 determines that the meanings of the two address character strings are the same, it is confirmed that an association exists between the policies corresponding to the policy information containing the two address character strings. The closer the meanings of the two address character strings, the higher the degree of association between the two policies is considered to be; the larger the difference in meaning, the lower the degree of association.
Associating policies helps the relevant staff of an insurance company accurately uncover potential relationships between policies, which makes it easier for them to analyse the policies and find possible insurance fraud risks.
In the embodiment corresponding to Fig. 3, after word segmentation is performed on the address character strings in the policy information, the features and weight values of each address character string are determined from the word segmentation results, and the meanings of two address character strings are compared according to their features and weight values. When the comparison shows that the meanings are the same, the policies corresponding to the policy information containing the two address character strings are associated, which helps the relevant staff of the insurance company accurately uncover potential relationships between policies, analyse the policies, and find possible insurance fraud risks.
On the basis of the above embodiments, a process for establishing the everyday words dictionary can also be executed before the pending Chinese character string is obtained in step S1.
Referring to Fig. 4, Fig. 4 shows the specific implementation flow for establishing the everyday words dictionary provided in an embodiment of the present invention, detailed as follows:
S01: Collect Chinese everyday expressions from the internet.
Specifically, the ways of collecting Chinese everyday expressions from the internet include downloading Chinese everyday expressions directly from the internet, such as common nouns and hot words; obtaining frequently searched words from internet search engines; obtaining frequently occurring Chinese terms from high-traffic Chinese websites; and obtaining Chinese everyday expressions from other Chinese word banks on the internet. The collection is not limited to these ways and can include any other channel for obtaining common Chinese terms from the internet.
S02: Obtain the insurance professional terms of the insurance industry.
Specifically, insurance professional terms are obtained in a targeted way for the insurance industry, including downloading the technical terms and nouns used in the insurance industry directly from the internet, or searching insurance-related websites for insurance-related words. For example, the insurance professional terms of Ping An Insurance Company of China include "Ping An Insurance One Account login", "Ping An Insurance telephone auto insurance", and "Ping An Insurance mall". The acquisition is not limited to these ways and can include obtaining insurance professional terms from various other channels.
It should be noted that there is no required execution order between step S01 and step S02; they can also be executed in parallel, which is not limited here.
S03: Establish the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.
Specifically, the Chinese everyday expressions collected in step S01 and the insurance professional terms obtained in step S02 are integrated into the everyday words dictionary.
Further, the integrated everyday words dictionary is cleaned up. The clean-up flow includes de-duplication, denoising, screening, filtering of sensitive words, repeated denoising, and normalization.
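Steps S01 to S03 can be sketched as a merge followed by a simple clean-up pass. The source does not define what denoising, screening, or normalization involve, so this sketch makes its own assumptions: normalization is taken to mean stripping surrounding whitespace, denoising to mean dropping empty entries, and the sensitive-word list is a hypothetical input. The function name `build_lexicon` is likewise illustrative.

```python
def build_lexicon(common_words, insurance_terms, sensitive_words=frozenset()):
    """Merge two word sources into one de-duplicated set of words.

    Assumed clean-up rules (not specified by the source):
    strip whitespace, drop empty entries, drop sensitive words.
    """
    lexicon = set()  # a set removes duplicates automatically
    for word in list(common_words) + list(insurance_terms):
        word = word.strip()
        if word and word not in sensitive_words:
            lexicon.add(word)
    return lexicon


# Hypothetical inputs standing in for the results of steps S01 and S02.
lexicon = build_lexicon(
    ["中国", " 平安 ", "中国", ""],
    ["保险", "车险"],
    sensitive_words={"车险"},
)
```

A set is used because the segmentation step only needs fast membership lookups; any other structure with efficient lookup would serve equally well.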
In the embodiment corresponding to Fig. 4, the everyday words dictionary combines Chinese everyday expressions from the network with the insurance professional terms of the insurance industry. This both broadens the range of words covered by the everyday words dictionary and, through the addition of insurance professional terms, improves the accuracy and validity of word matching when segmenting character strings related to the insurance field.
With continued reference to Fig. 4, after the everyday words dictionary is established in step S03 using the Chinese everyday expressions and the insurance professional terms, the everyday words dictionary can also be updated, as detailed below:
S04: Update the everyday words dictionary at every predetermined time interval.
Specifically, recent high-frequency Chinese hot words of the insurance industry are collected from the internet periodically and recorded in a preset temporary word list. At every predetermined time interval, each word in the temporary word list is looked up in the everyday words dictionary; if a word in the temporary word list does not exist in the everyday words dictionary, it is added to the everyday words dictionary. After every word in the temporary word list has been looked up, the contents of the temporary word list are cleared, which releases its storage space and allows the temporary word list to be reused.
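The update of step S04 can be sketched as below, assuming the everyday words dictionary is held as a Python set and the temporary word list as a Python list. The function name `update_lexicon` is hypothetical, and scheduling the call at the predetermined interval is left to the caller.

```python
def update_lexicon(lexicon, temporary_words):
    """Add unseen words from the temporary list, then empty the list.

    Returns the words that were actually added, in first-seen order.
    """
    added = []
    for word in temporary_words:
        if word not in lexicon:  # look the word up in the dictionary
            lexicon.add(word)
            added.append(word)
    temporary_words.clear()  # release the list so it can be reused
    return added


# Hypothetical state before one update cycle.
lexicon = {"保险"}
temporary_words = ["保险", "车险", "车险"]
added = update_lexicon(lexicon, temporary_words)
```

After the call, the temporary list is empty and ready to collect hot words for the next interval.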
It should be noted that the predetermined time interval can be one day or one week, but is not limited to these; it can be set according to the needs of the practical application, which is not limited here.
In the embodiment corresponding to Fig. 4, the everyday words dictionary is also updated at every predetermined time interval by supplementing it with high-frequency Chinese hot words of the insurance industry. This ensures that the dictionary remains highly sensitive to the popular keywords continually emerging in the insurance industry. Continually updating and improving the everyday words dictionary makes it increasingly comprehensive, which in turn keeps improving the accuracy and validity of word matching when segmenting character strings related to the insurance field.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Embodiment 2
Corresponding to the character string segmenting method of the foregoing embodiment, Fig. 5 shows a structural block diagram of the character string segmenting device provided in an embodiment of the present invention. For convenience of description, only the parts relevant to the embodiment of the present invention are shown.
Referring to Fig. 5, the character string segmenting device includes an acquisition module 51, a cutting module 52, a composite module 53, a circulation searching module 54, a first identification module 55, a second identification module 56, and a result determining module 57. Each function module is described in detail as follows:
Acquisition module 51, for obtaining a pending Chinese character string;
Cutting module 52, for carrying out single character segmentation on the pending Chinese character string to obtain n single characters $a_i$, where $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
Composite module 53, for combining $a_i$ with the adjacent single character $a_{i+1}$ to obtain the temporary word $a_i a_{i+1}$ if i is less than n;
Circulation searching module 54, for, if the temporary word $a_i a_{i+1}$ exists in the preset everyday words dictionary, combining the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain the temporary word $a_i a_{i+1} a_{i+2}$, and continuing to look up the temporary word $a_i a_{i+1} a_{i+2}$ in the everyday words dictionary, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the everyday words dictionary, where $k \in [i+1, n]$;
First identification module 55, for, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the everyday words dictionary, identifying $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as an effective word and starting again from the single character $a_k$: if k is equal to n, identifying $a_k$ as an effective word; if k is less than n, taking $a_k$ as $a_i$ and returning to composite module 53 to continue execution;
Second identification module 56, for identifying $a_i a_{i+1} a_{i+2} \ldots a_k$ as an effective word if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the everyday words dictionary and k = n;
Result determining module 57, for determining the identified effective words as the word segmentation result of the Chinese character string.
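The behaviour of modules 52 to 57 can be sketched as a single loop. This is one possible reading of the described procedure rather than the patent's own code: the function name `segment` is hypothetical, the everyday words dictionary is assumed to be a Python set named `lexicon`, and indices are 0-based instead of the 1-based $a_i$ notation.

```python
def segment(text, lexicon):
    """Split a Chinese character string into effective words.

    Starting at position i, the candidate is extended one character at a
    time for as long as the extended candidate is found in `lexicon`.
    """
    words = []
    n = len(text)
    i = 0
    while i < n:
        k = i + 1  # text[i:k] is the last accepted candidate (one character)
        # Extend while the longer temporary word exists in the dictionary.
        while k < n and text[i:k + 1] in lexicon:
            k += 1
        words.append(text[i:k])  # effective word a_i ... a_(k-1)
        i = k  # continue from the first character that did not fit
    return words


# Hypothetical dictionary and input string.
result = segment("中国平安保险", {"中国", "平安", "保险"})
```

Because extension stops at the first temporary word that is missing from the dictionary, a word of three or more characters is only recognised in this sketch if its shorter prefixes of at least two characters are also in the dictionary; otherwise the string falls back to single characters.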
Further, acquisition module 51 includes:
Character string acquiring unit 511, for obtaining a pending character string that contains Chinese characters and non-Chinese characters;
Non-Chinese character acquiring unit 512, for identifying the m non-Chinese character strings in the pending character string, where a non-Chinese character string contains only consecutive adjacent non-Chinese characters and m is zero or a positive integer;
Cutting unit 513, for splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut-off points, and determining each substring as a pending Chinese character string, where a substring contains only consecutive adjacent Chinese characters.
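The splitting performed by units 511 to 513 can be sketched with a regular expression. The source does not say how Chinese characters are recognised, so this sketch assumes the basic CJK Unified Ideographs range U+4E00 to U+9FFF; the function name `split_mixed` and the sample address are hypothetical.

```python
import re

# Assumed definition of a Chinese character: the basic CJK Unified
# Ideographs block. Extension blocks are not covered by this sketch.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_mixed(text):
    """Return (Chinese substrings, non-Chinese substrings) of `text`."""
    chinese = CHINESE_RUN.findall(text)
    # Splitting on the Chinese runs leaves the non-Chinese runs between
    # them; empty strings at either end are discarded.
    non_chinese = [part for part in CHINESE_RUN.split(text) if part]
    return chinese, non_chinese


# Hypothetical address string with two non-Chinese runs (m = 2).
chinese, non_chinese = split_mixed("深圳市福田区88号平安大厦A座")
```

In this example the two non-Chinese runs act as cut-off points and yield three Chinese substrings, matching the m+1 count described above; that count holds when the string starts and ends with Chinese characters.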
Further, the pending character string is an address character string in policy information, and the character string segmenting device further includes:
Characteristic determination module 581, for determining, for each address character string, the features and weight values of the address character string according to the word segmentation result of each Chinese character string in the address character string;
Comparison module 582, for comparing whether the meanings of two address character strings are the same according to the features and weight values of the two address character strings;
Relating module 583, for associating the policies corresponding to the policy information containing the two address character strings if the meanings of the two address character strings are the same.
Further, the character string segmenting device further includes:
First collection module 501, for collecting Chinese everyday expressions from the internet;
Second collection module 502, for obtaining the insurance professional terms of the insurance industry;
Dictionary establishing module 503, for establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.
Further, the character string segmenting device further includes:
Word library updating module 504, for updating the everyday words dictionary at every predetermined time interval.
For the process by which each module of the character string segmenting device provided in this embodiment realises its function, refer to the description of Embodiment 1 above; details are not repeated here.
Embodiment 3
This embodiment provides a computer readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the character string segmenting method of Embodiment 1; to avoid repetition, details are not repeated here. Alternatively, when the computer program is executed by a processor, it implements the functions of each module/unit of the character string segmenting device of Embodiment 2; to avoid repetition, details are not repeated here.
Embodiment 4
Fig. 6 is a schematic diagram of the terminal device provided in an embodiment of the present invention. As shown in Fig. 6, the terminal device 60 of this embodiment includes a processor 61, a memory 62, and a computer program 63, such as a character string segmenting program, that is stored in the memory 62 and can run on the processor 61. When executing the computer program 63, the processor 61 implements the steps of the character string segmenting method embodiments above, such as steps S1 to S7 shown in Fig. 1. Alternatively, when executing the computer program 63, the processor 61 implements the functions of each module/unit in the character string segmenting device embodiments above, such as the functions of modules 51 to 57 shown in Fig. 5.
Illustratively, the computer program 63 can be divided into one or more modules/units, which are stored in the memory 62 and executed by the processor 61 to carry out the present invention. The one or more modules/units can be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution of the computer program 63 in the terminal device 60. For example, the computer program 63 can be divided into an acquisition module, a cutting module, a composite module, a circulation searching module, a first identification module, a second identification module, and a result determining module. Each function module is described in detail as follows:
Acquisition module, for obtaining a pending Chinese character string;
Cutting module, for carrying out single character segmentation on the pending Chinese character string to obtain n single characters $a_i$, where $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
Composite module, for combining $a_i$ with the adjacent single character $a_{i+1}$ to obtain the temporary word $a_i a_{i+1}$ if i is less than n;
Circulation searching module, for, if the temporary word $a_i a_{i+1}$ exists in the preset everyday words dictionary, combining the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain the temporary word $a_i a_{i+1} a_{i+2}$, and continuing to look up the temporary word $a_i a_{i+1} a_{i+2}$ in the everyday words dictionary, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the everyday words dictionary, where $k \in [i+1, n]$;
First identification module, for, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the everyday words dictionary, identifying $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as an effective word and starting again from the single character $a_k$: if k is equal to n, identifying $a_k$ as an effective word; if k is less than n, taking $a_k$ as $a_i$ and returning to the composite module to continue execution;
Second identification module, for identifying $a_i a_{i+1} a_{i+2} \ldots a_k$ as an effective word if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the everyday words dictionary and k = n;
Result determining module, for determining the identified effective words as the word segmentation result of the Chinese character string.
Further, the acquisition module includes:
Character string acquiring unit, for obtaining a pending character string that contains Chinese characters and non-Chinese characters;
Non-Chinese character acquiring unit, for identifying the m non-Chinese character strings in the pending character string, where a non-Chinese character string contains only consecutive adjacent non-Chinese characters and m is zero or a positive integer;
Cutting unit, for splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut-off points, and determining each substring as a pending Chinese character string, where a substring contains only consecutive adjacent Chinese characters.
Further, the pending character string is an address character string in policy information, and the computer program 63 can also be divided into:
Characteristic determination module, for determining, for each address character string, the features and weight values of the address character string according to the word segmentation result of each Chinese character string in the address character string;
Comparison module, for comparing whether the meanings of two address character strings are the same according to the features and weight values of the two address character strings;
Relating module, for associating the policies corresponding to the policy information containing the two address character strings if the meanings of the two address character strings are the same.
Further, the computer program 63 can also be divided into:
First collection module, for collecting Chinese everyday expressions from the internet;
Second collection module, for obtaining the insurance professional terms of the insurance industry;
Dictionary establishing module, for establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.
Further, the computer program 63 can also be divided into:
Word library updating module, for updating the everyday words dictionary at every predetermined time interval.
The terminal device 60 can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 61 and the memory 62. Those skilled in the art will understand that Fig. 6 is only an example of the terminal device 60 and does not limit it; the terminal device 60 may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device 60 can also include input/output devices, network access devices, a bus, and so on.
The processor 61 can be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor can be a microprocessor or any conventional processor.
The memory 62 can be an internal storage unit of the terminal device 60, such as a hard disk or internal memory of the terminal device 60. The memory 62 can also be an external storage device of the terminal device 60, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) fitted to the terminal device 60. Further, the memory 62 can include both an internal storage unit of the terminal device 60 and an external storage device. The memory 62 is used to store the computer program and the other programs and data needed by the terminal device. The memory 62 can also be used to temporarily store data that has been output or is about to be output.
It will be clear to those skilled in the art that, for convenience and brevity of description, the division into the functional units and modules above is only an example. In practical applications, the above functions can be allocated to different functional units and modules as needed; that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
In addition, the functional units in each embodiment of the present invention can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented either in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, all or part of the flow of the method embodiments above can also be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer readable storage medium, and when executed by a processor it can implement the steps of each method embodiment above. The computer program includes computer program code, which can be in source code form, object code form, an executable file, or certain intermediate forms. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium can be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims (10)
1. A character string segmenting method, characterized in that the character string segmenting method includes:
obtaining a pending Chinese character string;
carrying out single character segmentation on the Chinese character string to obtain n single characters $a_i$, where $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
if i is less than n, combining $a_i$ with the adjacent single character $a_{i+1}$ to obtain the temporary word $a_i a_{i+1}$;
if the temporary word $a_i a_{i+1}$ exists in a preset everyday words dictionary, combining the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain the temporary word $a_i a_{i+1} a_{i+2}$, and continuing to look up the temporary word $a_i a_{i+1} a_{i+2}$ in the everyday words dictionary, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the everyday words dictionary, where $k \in [i+1, n]$;
if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the everyday words dictionary, identifying $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as an effective word and starting again from the single character $a_k$: if k is equal to n, identifying $a_k$ as an effective word; if k is less than n, taking $a_k$ as $a_i$ and continuing to execute the step of, if i is less than n, combining $a_i$ with the adjacent single character $a_{i+1}$ to obtain the temporary word $a_i a_{i+1}$;
if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the everyday words dictionary and k = n, identifying $a_i a_{i+1} a_{i+2} \ldots a_k$ as an effective word;
determining the identified effective words as the word segmentation result of the Chinese character string.
2. The character string segmenting method as described in claim 1, characterized in that obtaining the pending Chinese character string includes:
obtaining a pending character string that contains Chinese characters and non-Chinese characters;
identifying the m non-Chinese character strings in the pending character string, where the non-Chinese character string contains only consecutive adjacent non-Chinese characters and m is zero or a positive integer;
splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut-off points, and determining each substring as the pending Chinese character string, where the substring contains only consecutive adjacent Chinese characters.
3. The character string segmenting method as claimed in claim 2, characterized in that the pending character string is an address character string in policy information, and after the step of determining the identified effective words as the word segmentation result of the Chinese character string has been completed for each of two address character strings, the character string segmenting method further includes:
for each address character string, determining the features and weight values of the address character string according to the word segmentation result of each Chinese character string in the address character string;
comparing whether the meanings of the two address character strings are the same according to the features and weight values of the two address character strings;
if the meanings of the two address character strings are the same, associating the policies corresponding to the policy information containing the two address character strings.
4. The character string segmenting method as described in any one of claims 1 to 3, characterized in that, before obtaining the pending Chinese character string, the character string segmenting method further includes:
collecting Chinese everyday expressions from the internet;
obtaining the insurance professional terms of the insurance industry;
establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.
5. The character string segmenting method as claimed in claim 4, characterized in that, after establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms, the character string segmenting method further includes:
updating the everyday words dictionary at every predetermined time interval.
6. a kind of character string segments device, which is characterized in that the character string segments device and includes:
Acquisition module, for obtaining pending Chinese character string;
Cutting module obtains n single word a for carrying out single character segmentation to the Chinese character stringi, wherein i ∈ [1, n],
N is the number for the chinese character for including in the Chinese character string;
Composite module, if being less than n for i, by aiWith adjacent single word ai+1It is combined, obtains temporary word aiai+1;
Circulation searching module, if being used for the temporary word aiai+1Exist in preset everyday words dictionary, then by the temporary word
aiai+1With adjacent single word ai+2It is combined, obtains temporary word aiai+1ai+2, and use the temporary word aiai+1ai+2Continue
It is searched in the everyday words dictionary, until temporary word aiai+1ai+2...akUntil being not present in the everyday words dictionary,
Wherein, k ∈ [i+1, n];
First identification module, if being used for the temporary word aiai+1ai+2...akIt is not present, then will in the everyday words dictionary
aiai+1ai+2...ak-1It is identified as effective word, and from single word akStart, if k is equal to n, by akIt is identified as effective word, if k is small
In n, then by akAs ai, return to the composite module and continue to execute;
Second identification module, if being used for the temporary word aiai+1ai+2...akExist in the everyday words dictionary and k=n,
Then by aiai+1ai+2...akIt is identified as effective word;
As a result determining module, effective word for will identify that are determined as the word segmentation result of the Chinese character string.
7. character string as claimed in claim 6 segments device, which is characterized in that the acquisition module includes:
Character string acquiring unit, for obtaining the pending character string for including chinese character and non-chinese character;
Non- Chinese character acquiring unit, for identification m non-chinese character string in the pending character string, wherein the non-Chinese
Word character string only includes the non-chinese character of continuous adjacent, and m is zero or positive integer;
Cutting unit, for being m+1 by the pending string segmentation using the m non-chinese character string as cut-off
Each substring is determined as the pending Chinese character string, wherein the substring only wraps by substring
The chinese character containing continuous adjacent.
8. character string as claimed in claim 7 segments device, which is characterized in that the pending character string is in policy information
Address character string, character string participle device further includes:
Characteristic determination module, for being directed to each described address character string, according to each Chinese Character in the address character string
The word segmentation result for according with string, determines the feature and its weighted value of the address character string;
Comparison module compares two described address characters for the feature and its weighted value according to two described address character strings
Whether the meaning of string is identical;
Relating module will be where two described address character strings if the meaning for two described address character strings is identical
The corresponding declaration form of policy information is associated.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the character string segmentation method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the character string segmentation method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810030722.XA CN108363686A (en) | 2018-01-12 | 2018-01-12 | A kind of character string segmenting method, device, terminal device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108363686A (en) | 2018-08-03 |
Family
ID=63006134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810030722.XA Pending CN108363686A (en) | 2018-01-12 | 2018-01-12 | A kind of character string segmenting method, device, terminal device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108363686A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411583A (en) * | 2010-09-20 | 2012-04-11 | 阿里巴巴集团控股有限公司 | Text matching method and device |
CN105138514A (en) * | 2015-08-24 | 2015-12-09 | 昆明理工大学 | Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction |
CN105224610A (en) * | 2015-09-08 | 2016-01-06 | 方正国际软件有限公司 | The method and apparatus that a kind of address is compared |
CN106033416A (en) * | 2015-03-09 | 2016-10-19 | 阿里巴巴集团控股有限公司 | A string processing method and device |
Non-Patent Citations (1)
Title |
---|
范春晓: "《Web数据分析关键技术及解决方案》", 31 August 2017 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766539A (en) * | 2018-11-30 | 2019-05-17 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN111651984A (en) * | 2019-02-19 | 2020-09-11 | 北京京东尚科信息技术有限公司 | Method and device for processing article description text and computer readable storage medium |
CN110362827A (en) * | 2019-07-11 | 2019-10-22 | 腾讯科技(深圳)有限公司 | A kind of keyword extracting method, device and storage medium |
CN110362827B (en) * | 2019-07-11 | 2024-05-14 | 腾讯科技(深圳)有限公司 | Keyword extraction method, keyword extraction device and storage medium |
CN110688852A (en) * | 2019-09-27 | 2020-01-14 | 西安赢瑞电子有限公司 | Chinese character word frequency storage method |
CN111178061B (en) * | 2019-12-20 | 2023-03-10 | 沈阳雅译网络技术有限公司 | Multi-lingual word segmentation method based on code conversion |
CN111178061A (en) * | 2019-12-20 | 2020-05-19 | 沈阳雅译网络技术有限公司 | Multi-lingual word segmentation method based on code conversion |
CN111831869A (en) * | 2020-06-30 | 2020-10-27 | 深圳价值在线信息科技股份有限公司 | Method and device for checking duplicate of character string, terminal equipment and storage medium |
CN111831869B (en) * | 2020-06-30 | 2023-11-03 | 深圳价值在线信息科技股份有限公司 | Character string duplicate checking method, device, terminal equipment and storage medium |
WO2022121172A1 (en) * | 2020-12-10 | 2022-06-16 | 平安科技(深圳)有限公司 | Text error correction method and apparatus, electronic device, and computer readable storage medium |
CN113627722A (en) * | 2021-07-02 | 2021-11-09 | 湖北美和易思教育科技有限公司 | Simple answer scoring method based on keyword segmentation, terminal and readable storage medium |
CN113627722B (en) * | 2021-07-02 | 2024-04-02 | 湖北美和易思教育科技有限公司 | Simple answer scoring method based on keyword segmentation, terminal and readable storage medium |
CN113268988A (en) * | 2021-07-19 | 2021-08-17 | 中国平安人寿保险股份有限公司 | Text entity analysis method and device, terminal equipment and storage medium |
CN113268988B (en) * | 2021-07-19 | 2021-10-29 | 中国平安人寿保险股份有限公司 | Text entity analysis method and device, terminal equipment and storage medium |
CN113779990A (en) * | 2021-09-10 | 2021-12-10 | 中国联合网络通信集团有限公司 | Chinese word segmentation method, device, equipment and storage medium |
CN113779990B (en) * | 2021-09-10 | 2023-10-31 | 中国联合网络通信集团有限公司 | Chinese word segmentation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363686A (en) | A kind of character string segmenting method, device, terminal device and storage medium | |
US11544459B2 (en) | Method and apparatus for determining feature words and server | |
CN107544982B (en) | Text information processing method and device and terminal | |
CN110209660B (en) | Cheating group mining method and device and electronic equipment | |
CN107784110B (en) | Index establishing method and device | |
WO2021189977A1 (en) | Address coding method and apparatus, and computer device and computer-readable storage medium | |
CN112148305B (en) | Application detection method, device, computer equipment and readable storage medium | |
CN106909575B (en) | Text clustering method and device | |
CN108363729A (en) | A kind of string comparison method, device, terminal device and storage medium | |
CN110610196A (en) | Desensitization method, system, computer device and computer-readable storage medium | |
CN102867049B (en) | Chinese PINYIN quick word segmentation method based on word search tree | |
CN112084781B (en) | Standard term determining method, device and storage medium | |
CN111325030A (en) | Text label construction method and device, computer equipment and storage medium | |
CN106445918A (en) | Chinese address processing method and system | |
CN113568940A (en) | Data query method, device, equipment and storage medium | |
CN110717086A (en) | Mass data clustering analysis method and device | |
Rahmani et al. | Improving code example recommendations on informal documentation using bert and query-aware lsh: A comparative study | |
Kim et al. | IDAR: Fast supergraph search using DAG integration | |
CN113792170B (en) | Graph data dividing method and device and computer equipment | |
CN113468866B (en) | Method and device for analyzing non-standard JSON string | |
Mahmud et al. | An improved hashing approach for biological sequence to solve exact pattern matching problems | |
Zhang et al. | Geo-seq2seq: Twitter user geolocation on noisy data through sequence to sequence learning | |
CN115391551A (en) | Event detection method and device | |
CN113656466A (en) | Policy data query method, device, equipment and storage medium | |
CN113010642A (en) | Semantic relation recognition method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180803 |