CN108363686A

CN108363686A - Character string segmentation method, device, terminal device and storage medium

Info

Publication number: CN108363686A
Application number: CN201810030722.XA
Authority: CN
Inventors: 刘行行
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-01-12
Filing date: 2018-01-12
Publication date: 2018-08-03

Abstract

The invention discloses a character string segmentation method, device, terminal device and storage medium. The method includes: obtaining a Chinese character string to be processed; splitting the Chinese character string into n single characters a_i; combining a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1}; if the temporary word a_i a_{i+1} exists in a preset common-word dictionary, combining it with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to look up the growing temporary word in the common-word dictionary until a temporary word a_i a_{i+1} a_{i+2} ... a_k is not found there; identifying a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and, starting from the single character a_k, continuing to combine adjacent single characters and look them up. The technical solution does not require an initial word length to be set, avoids splitting common words by mistake, and improves the accuracy of character string segmentation. It is also simpler to implement and more efficient to execute, which effectively improves the generality and efficiency of character string segmentation.

Description

Character string segmentation method, device, terminal device and storage medium

Technical field

The present invention relates to the field of computer technology, and in particular to a character string segmentation method, device, terminal device and storage medium.

Background art

At present, traditional character string segmentation methods generally use the forward maximum matching algorithm. The algorithm first requires a maximum word length to be set. The character string to be recognized is then scanned from left to right according to that maximum word length, and a substring of the maximum word length is matched against the words in a dictionary. If there is no match, the length is shortened and the search continues, until a match is found or only a single character remains.

However, the maximum word length affects both the accuracy and the efficiency of segmentation. If the maximum word length is too short, long words are split by mistake, which reduces accuracy; if it is too long, the number of matching attempts increases significantly, which reduces execution efficiency. In practice, the maximum word length often has to be adjusted repeatedly to reach a satisfactory segmentation result, which makes segmentation inefficient.
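For comparison, the forward maximum matching baseline described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `forward_max_match`, the tiny dictionary and the example string 深圳市罗湖区 (Luohu District, Shenzhen) are assumptions chosen for the demonstration.

```python
def forward_max_match(text, dictionary, max_len):
    """Forward maximum matching with a fixed maximum word length (baseline)."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorten it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionary used only for this illustration.
dictionary = {"深圳", "深圳市", "罗湖", "罗湖区"}
long_enough = forward_max_match("深圳市罗湖区", dictionary, max_len=3)
too_short = forward_max_match("深圳市罗湖区", dictionary, max_len=2)
print(long_enough)  # ['深圳市', '罗湖区']
print(too_short)    # ['深圳', '市', '罗湖', '区']
```

With `max_len=2` the three-character words are split, which is the accuracy problem the background section describes.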

Summary of the invention

Embodiments of the present invention provide a character string segmentation method, device, terminal device and storage medium, to solve the problem of low accuracy and low efficiency of character string segmentation in the prior art.

In a first aspect, an embodiment of the present invention provides a character string segmentation method, including:

Obtaining a Chinese character string to be processed;

Splitting the Chinese character string into single characters to obtain n single characters a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

If i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1};

If the temporary word a_i a_{i+1} exists in a preset common-word dictionary, combining the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to look up the temporary word a_i a_{i+1} a_{i+2} in the common-word dictionary, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, where k ∈ [i+1, n];

If the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, identifying a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and continuing from the single character a_k: if k is equal to n, identifying a_k as a valid word; if k is less than n, taking a_k as a_i and returning to the step of, if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1};

If the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common-word dictionary and k = n, identifying a_i a_{i+1} a_{i+2} ... a_k as a valid word;

Determining the identified valid words as the segmentation result of the Chinese character string.

In a second aspect, an embodiment of the present invention provides a character string segmentation device, including:

An acquisition module, configured to obtain a Chinese character string to be processed;

A splitting module, configured to split the Chinese character string into single characters to obtain n single characters a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

A combination module, configured to, if i is less than n, combine a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1};

A loop lookup module, configured to, if the temporary word a_i a_{i+1} exists in a preset common-word dictionary, combine the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continue to look up the temporary word a_i a_{i+1} a_{i+2} in the common-word dictionary, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, where k ∈ [i+1, n];

A first identification module, configured to, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, identify a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and continue from the single character a_k: if k is equal to n, identify a_k as a valid word; if k is less than n, take a_k as a_i and return to the step of, if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1};

A second identification module, configured to, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common-word dictionary and k = n, identify a_i a_{i+1} a_{i+2} ... a_k as a valid word;

A result determination module, configured to determine the identified valid words as the segmentation result of the Chinese character string.

In a third aspect, an embodiment of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the character string segmentation method when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, where the computer program implements the steps of the character string segmentation method when executed by a processor.

Compared with the prior art, the character string segmentation method provided by the embodiments of the present invention has the following advantages. In the character string segmentation method, device, terminal device and storage medium provided by the embodiments, the obtained Chinese character string to be processed is split into single characters. Starting from the first single character, the character is combined with the adjacent single character into a temporary word, and the temporary word is looked up in the common-word dictionary. If the temporary word exists in the common-word dictionary, it is combined with the next adjacent single character into a new temporary word, which is again looked up in the common-word dictionary. As long as the lookup succeeds, combination and lookup continue, until the newest temporary word cannot be found in the common-word dictionary. At that point the previous temporary word is taken as a valid word. The single character left over after removing the valid word is then combined with the next adjacent single character, and lookup in the common-word dictionary continues, until all single characters of the Chinese character string have been processed. Because segmentation is performed by splitting into single characters and combining them, no initial word length needs to be set and common words are not split by mistake, which improves the accuracy of character string segmentation. In addition, compared with the traditional forward maximum matching algorithm, the technical solution of the embodiments is simpler to implement and more efficient to execute, which effectively improves the generality and efficiency of character string segmentation.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is an implementation flowchart of the character string segmentation method provided by Embodiment 1 of the present invention;

Fig. 2 is an implementation flowchart of step S1 in the character string segmentation method provided by Embodiment 1 of the present invention;

Fig. 3 is an implementation flowchart of analyzing, in the character string segmentation method provided by Embodiment 1 of the present invention, the degree of association between the insurance policies corresponding to the policy information containing two address character strings;

Fig. 4 is an implementation flowchart of establishing and updating the common-word dictionary in the character string segmentation method provided by Embodiment 1 of the present invention;

Fig. 5 is a schematic diagram of the character string segmentation device provided by Embodiment 2 of the present invention;

Fig. 6 is a schematic diagram of the terminal device provided by Embodiment 4 of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiment 1

Referring to Fig. 1, Fig. 1 shows the implementation flow of the character string segmentation method provided by an embodiment of the present invention. The method can be applied to the matching analysis of policy information in the insurance industry. The details are as follows:

S1: Obtain a Chinese character string to be processed.

In embodiments of the present invention, the way in which the Chinese character string to be processed is obtained is not limited. It may be a Chinese character string entered by a user, a Chinese character string recognized by image scanning or graphic recognition, or a Chinese character string extracted from stored text information.

Further, the Chinese character string is obtained from the policy information in a policy database. The policy information is the attribute information of the insurance policy generated when a user buys an insurance product, for example home address information and work unit information.

It should be noted that the character string segmentation method in the embodiments of the present invention segments Chinese character strings. If the character string to be segmented contains, in addition to Chinese characters, other characters such as English characters or Arabic numerals, the non-Chinese characters must be identified first, and segmentation is then applied to the several Chinese character strings separated by those non-Chinese characters.

S2: Split the Chinese character string to be processed into single characters to obtain n single characters a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string.

In embodiments of the present invention, the Chinese character string to be processed is split character by character into n Chinese characters a_i. The n characters can be stored as an array in the order in which they appear in the Chinese character string, each character being one element of the array. In other words, the single characters obtained by the split are arranged from left to right in the same order as in the Chinese character string.

For example, if the Chinese character string to be processed is "深圳市罗湖区" (Luohu District, Shenzhen), splitting it yields six single characters: a_1 = 深, a_2 = 圳, a_3 = 市, a_4 = 罗, a_5 = 湖, a_6 = 区.
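As a small illustration of step S2, the sketch below splits a string into single characters in Python. It assumes the example string is 深圳市罗湖区 (Luohu District, Shenzhen); the function name `split_characters` is hypothetical. The text numbers characters from 1, whereas Python lists are indexed from 0, so a_i corresponds to `chars[i - 1]`.

```python
def split_characters(text):
    """Return the single characters of `text` in their original left-to-right order."""
    return list(text)


chars = split_characters("深圳市罗湖区")
n = len(chars)
print(n, chars)
```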

S3: If i is less than n, combine a_i with the adjacent single character a_{i+1} to obtain the temporary word a_i a_{i+1}.

Specifically, starting from the first character a_1, a_1 is combined with a_2 to obtain the temporary word a_1 a_2.

It should be noted that when i = n, a_i is already the last character of the Chinese character string to be processed and has no adjacent single character, so no further combination into a new temporary word is possible.

S4: If the temporary word a_i a_{i+1} exists in the preset common-word dictionary, combine the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continue to look up the temporary word a_i a_{i+1} a_{i+2} in the common-word dictionary, until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, where k ∈ [i+1, n].

In embodiments of the present invention, the preset common-word dictionary contains common Chinese words, and the dictionary can be updated periodically.

If the temporary word obtained in step S3 exists in the common-word dictionary, the temporary word a_i a_{i+1} is combined with the adjacent single character a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}. If a_i a_{i+1} a_{i+2} still exists in the common-word dictionary, it is combined with a_{i+3} to obtain the temporary word a_i a_{i+1} a_{i+2} a_{i+3}, which is again looked up in the common-word dictionary. This continues until a temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, at which point the lookup ends.

It should be noted that k is greater than or equal to i+1 and less than or equal to n. When k = n, the temporary word a_i a_{i+1} a_{i+2} ... a_n being looked up has reached the last single character of the Chinese character string to be processed. Therefore, once the lookup of a_i a_{i+1} a_{i+2} ... a_n is complete, no further lookup is performed, whether or not the temporary word exists in the common-word dictionary.

Continuing with the Chinese character string "深圳市罗湖区" from step S2 and its six single characters a_1 = 深, a_2 = 圳, a_3 = 市, a_4 = 罗, a_5 = 湖, a_6 = 区: first, "深" and "圳" are combined into the temporary word "深圳" (Shenzhen), which is looked up in the common-word dictionary. Because "深圳" exists in the dictionary, it is combined with the adjacent "市" into the temporary word "深圳市" (Shenzhen City), which is looked up in turn. Because "深圳市" also exists in the dictionary, it is combined with the adjacent "罗" into the temporary word "深圳市罗". Because "深圳市罗" does not exist in the common-word dictionary, the lookup stops, and the current temporary word is "深圳市罗".

S5: If the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common-word dictionary, identify a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and continue from the single character a_k: if k is equal to n, identify a_k as a valid word; if k is less than n, take a_k as a_i and return to step S3.

Specifically, if the temporary word a_i a_{i+1} a_{i+2} ... a_k obtained in step S4 does not exist in the common-word dictionary, a_i a_{i+1} a_{i+2} ... a_{k-1} is identified as a valid word. Processing then continues from the single character a_k: if k is less than n, a_k is taken as a_i and step S3 is executed again.

When k = i+1, if a_i a_{i+1} does not exist in the common-word dictionary, the single character a_i is identified as a valid word, and there is no need to check whether a_i itself exists in the common-word dictionary.

When k = n, if a_i a_{i+1} a_{i+2} ... a_n does not exist in the common-word dictionary, a_k is already the last character of the Chinese character string to be processed and has no adjacent single character, so no further combination into a new temporary word is possible. The flow therefore does not return to step S3; a_k is identified directly as a valid word, and the flow jumps to step S7.

Continuing with the Chinese character string "深圳市罗湖区" from step S2: because the temporary word "深圳市罗" does not exist in the common-word dictionary, "深圳市" is identified as a valid word. Processing then starts again from the single character "罗": the flow returns to step S3, "罗" is combined with the adjacent "湖" into the new temporary word "罗湖" (Luohu), and lookup in the common-word dictionary continues.

S6: If the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common-word dictionary and k = n, identify a_i a_{i+1} a_{i+2} ... a_k as a valid word.

Specifically, if the temporary word a_i a_{i+1} a_{i+2} ... a_k obtained in step S4 exists in the common-word dictionary and k = n, the temporary word has reached the last single character of the Chinese character string to be processed and is itself a dictionary word. It is therefore identified directly as a valid word, which completes the identification of valid words in the Chinese character string to be processed.

Continuing with the Chinese character string "深圳市罗湖区" from step S2: after "深圳市" has been identified as a valid word, processing starts again from the single character "罗". According to step S3, "罗" is combined with the adjacent "湖" into the new temporary word "罗湖", which is looked up in the common-word dictionary. If "罗湖" exists in the dictionary, it is combined with the adjacent "区" into the temporary word "罗湖区" (Luohu District), which is looked up in turn. If "罗湖区" exists in the dictionary, then, because k = n at this point, the loop lookup ends and "罗湖区" is identified directly as a valid word.

S7: Determine the identified valid words as the segmentation result of the Chinese character string to be processed.

Specifically, the valid words identified in steps S5 and S6 are taken as the segmentation result of the Chinese character string.

Continuing with the Chinese character string "深圳市罗湖区" from step S2: the valid words identified in steps S5 and S6 are "深圳市" and "罗湖区". The segmentation result of the Chinese character string "深圳市罗湖区" is therefore "深圳市" and "罗湖区".
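Steps S2 to S7 can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `segment`, the use of a Python set as the common-word dictionary, and the four dictionary entries are all hypothetical, and the example string is assumed to be 深圳市罗湖区. Indices are 0-based here, so the last character is at position n - 1.

```python
def segment(text, dictionary):
    """Segment a Chinese string by growing a temporary word one character at a time."""
    chars = list(text)                      # S2: split into single characters
    n = len(chars)
    words = []
    i = 0
    while i < n:
        if i == n - 1:
            # The last character has no neighbour to combine with (S5, k = n).
            words.append(chars[i])
            break
        k = i + 1
        temp = chars[i] + chars[k]          # S3: first temporary word
        # S4: keep extending while the temporary word is a dictionary word
        # and there is still a character left to append.
        while temp in dictionary and k < n - 1:
            k += 1
            temp += chars[k]
        if temp in dictionary:
            # S6: the temporary word reaches the end of the string and is a word.
            words.append(temp)
            break
        # S5: the temporary word is not a word; everything before a_k is.
        words.append(temp[:-1])
        i = k                               # continue from a_k
    return words                            # S7: the segmentation result


# Hypothetical common-word dictionary for the worked example.
common_words = {"深圳", "深圳市", "罗湖", "罗湖区"}
result = segment("深圳市罗湖区", common_words)
print(result)  # ['深圳市', '罗湖区']
```

One consequence of the procedure as described is that growth stops at the first temporary word missing from the dictionary, so in this sketch a three-character word is only found if its two-character prefix is also a dictionary entry.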

In the embodiment corresponding to Fig. 1, the obtained Chinese character string to be processed is first split into single characters. Starting from the first single character, the character is combined with the adjacent single character into a temporary word, and the temporary word is looked up in the common-word dictionary. If the temporary word exists in the common-word dictionary, it is combined with the next adjacent single character into a new temporary word, which is again looked up in the common-word dictionary. As long as the lookup succeeds, combination and lookup continue, until the newest temporary word cannot be found in the common-word dictionary. At that point the previous temporary word is taken as a valid word. The single character left over after removing the valid word is then combined with the next adjacent single character, and lookup in the common-word dictionary continues, until all single characters of the Chinese character string have been processed. Because segmentation is performed by splitting into single characters and combining them, no initial word length needs to be set and common words are not split by mistake, which improves the accuracy of character string segmentation. In addition, compared with the traditional forward maximum matching algorithm, the technical solution of the embodiments is simpler to implement and more efficient to execute, which effectively improves the generality and efficiency of character string segmentation.

Next, on the basis of the embodiment corresponding to Fig. 1, a specific embodiment is used to describe in detail how the Chinese character string to be processed mentioned in step S1 is obtained.

Referring to Fig. 2, Fig. 2 shows the specific implementation flow of step S1 in an embodiment of the present invention. The details are as follows:

S11: Obtain a character string to be processed that contains Chinese characters and non-Chinese characters.

In embodiments of the present invention, the way in which the character string to be processed is obtained is not limited. It may be information entered by a user, a character string recognized by image scanning or graphic recognition, or a character string extracted from stored text information.

Further, the character string to be processed is obtained directly from the policy information in a policy database. The policy information is the attribute information of the insurance policy generated when a user buys an insurance product, for example home address information and work unit information.

Specifically, the character string to be processed contains Chinese characters and non-Chinese characters. Non-Chinese characters are all other characters, such as English characters and Arabic numerals; examples include punctuation marks, English words, English letters, Arabic numerals and special characters such as "@".

For example, if the character string to be processed is the work unit information "地址：罗湖区笋岗东路3012号中民时代广场B座" (Address: Block B, Zhongmin Times Square, No. 3012 Sungang East Road, Luohu District) in the policy information, the work unit information contains other characters, including a punctuation mark, an English character and Arabic numerals.

S12: Identify the m non-Chinese character strings in the character string to be processed, where a non-Chinese character string contains only consecutive adjacent non-Chinese characters and m is zero or a positive integer.

Specifically, the Chinese characters and non-Chinese characters in the character string to be processed are identified according to the Unicode code of each character.

Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique binary code to every character of every language, to meet the requirements of cross-language and cross-platform text conversion and processing. In Unicode, 0x4E00 to 0x9FA5 is the code range of Chinese characters.

The character string to be processed is traversed character by character from left to right, starting from the first character. If the Unicode code of a character is in the range 0x4E00 to 0x9FA5, the character is identified as a Chinese character; otherwise it is identified as a non-Chinese character. The m non-Chinese character strings are obtained from the identified non-Chinese characters. If m is zero, the character string to be processed is a Chinese character string and contains no non-Chinese characters.
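The traversal in step S12 can be sketched as follows. This is a minimal illustration: the function names `is_chinese` and `non_chinese_runs` are hypothetical, and the sample string is an assumed reconstruction of the work unit address used in step S11.

```python
def is_chinese(ch):
    """True if the character's code point lies in the range 0x4E00 to 0x9FA5."""
    return 0x4E00 <= ord(ch) <= 0x9FA5


def non_chinese_runs(text):
    """Collect maximal runs of consecutive non-Chinese characters, left to right."""
    runs = []
    current = ""
    for ch in text:
        if is_chinese(ch):
            if current:            # a Chinese character closes the current run
                runs.append(current)
                current = ""
        else:
            current += ch
    if current:                    # a run may extend to the end of the string
        runs.append(current)
    return runs


runs = non_chinese_runs("地址：罗湖区笋岗东路3012号中民时代广场B座")
m = len(runs)
print(m, runs)
```

The full-width colon "：" (U+FF1A) lies outside the range, so it is treated as a non-Chinese character just like the digits and the letter.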

S13: Using the m non-Chinese character strings as split points, split the character string to be processed into m+1 substrings, and determine each substring as a Chinese character string to be processed, where each substring contains only consecutive adjacent Chinese characters.

Specifically, the character string to be processed is split into m+1 substrings at the m non-Chinese character strings identified in step S12, and each substring is a Chinese character string.

Continuing with the character string to be processed "地址：罗湖区笋岗东路3012号中民时代广场B座" from step S11: the non-Chinese character strings identified in step S12 are "：", "3012" and "B". Using these three non-Chinese character strings as split points, the character string to be processed is divided into four substrings: "地址", "罗湖区笋岗东路", "号中民时代广场" and "座". These four substrings are the Chinese character strings to be segmented.
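One way to obtain the m separators and the m+1 substrings of step S13 in a single pass is a regular-expression split. The sketch below is an assumption about how this could be done with Python's standard `re` module, not the patent's implementation; the name `split_on_non_chinese` and the sample string are hypothetical.

```python
import re

# One or more consecutive characters outside the range 0x4E00 to 0x9FA5.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def split_on_non_chinese(text):
    """Return (separators, substrings): m non-Chinese runs and m + 1 substrings."""
    separators = NON_CHINESE.findall(text)
    substrings = NON_CHINESE.split(text)
    return separators, substrings


separators, substrings = split_on_non_chinese("地址：罗湖区笋岗东路3012号中民时代广场B座")
print(separators)
print(substrings)
```

When the string begins or ends with a non-Chinese run, `re.split` returns an empty string at that edge, so the count of m+1 substrings still holds; a caller would typically skip the empty ones before segmentation.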

In the embodiment corresponding to Fig. 2, the character string to be processed, which contains Chinese and non-Chinese characters, is split using the identified non-Chinese characters as split points. This first-pass split yields m non-Chinese character strings and m+1 Chinese character strings, so the Chinese character strings in the character string to be processed are filtered out accurately. Steps S2 to S7 can then be executed on the m+1 Chinese character strings, and the segmentation results of those Chinese character strings together with the non-Chinese character strings form the segmentation result of the character string to be processed. This further improves the segmentation accuracy and efficiency for the character string to be processed.
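Putting the two passes together, the sketch below first separates Chinese and non-Chinese runs and then applies the grow-and-lookup procedure of steps S2 to S7 to each Chinese run. It is an illustration under assumptions: the names `segment` and `segment_mixed`, the regular expression, the dictionary entries and the sample address are all hypothetical.

```python
import re

# A run of Chinese characters, or a run of anything else (range from step S12).
TOKEN = re.compile(r"[\u4e00-\u9fa5]+|[^\u4e00-\u9fa5]+")


def segment(text, dictionary):
    """Steps S2 to S7 for a string that contains only Chinese characters."""
    n = len(text)
    words = []
    i = 0
    while i < n:
        if i == n - 1:                       # last character stands alone
            words.append(text[i])
            break
        k = i + 1
        temp = text[i:k + 1]
        while temp in dictionary and k < n - 1:
            k += 1
            temp = text[i:k + 1]
        if temp in dictionary:               # reached the end with a valid word
            words.append(temp)
            break
        words.append(text[i:k])              # a_i ... a_{k-1} is a valid word
        i = k
    return words


def segment_mixed(text, dictionary):
    """Segment Chinese runs and keep non-Chinese runs as whole tokens."""
    result = []
    for match in TOKEN.finditer(text):
        piece = match.group()
        if "\u4e00" <= piece[0] <= "\u9fa5":
            result.extend(segment(piece, dictionary))
        else:
            result.append(piece)
    return result


# Hypothetical dictionary for the address example.
common_words = {"地址", "罗湖", "罗湖区", "笋岗", "东路", "中民", "时代", "广场"}
tokens = segment_mixed("地址：罗湖区笋岗东路3012号中民时代广场B座", common_words)
print(tokens)
```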

On the basis of the embodiment corresponding to Fig. 2, if the character strings to be processed are address character strings of policy information stored in a policy database, then after step S7 (determining the identified valid words as the segmentation result of the Chinese character string) has been completed for each of two address character strings, the method further includes analyzing the degree of association between the insurance policies corresponding to the policy information containing the two address character strings.

Referring to Fig. 3, Fig. 3 shows the specific implementation flow, provided by an embodiment of the present invention, of analyzing the degree of association between the insurance policies corresponding to the policy information containing two address character strings. The details are as follows:

S81: For each address character string, determine the features of the address character string and their weight values according to the segmentation result of each Chinese character string in the address character string.

Specifically, for each address character string, once step S7 has been completed, the segmentation results of the m+1 Chinese character strings and the m non-Chinese character strings of the address character string are available. Each valid word in the segmentation results is taken as a feature of the address character string, and the number of times each feature occurs in the address character string is taken as the weight value of that feature.
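A minimal sketch of step S81, assuming the segmentation result is available as one list of valid words per Chinese substring. The function name `address_features` and the sample words are hypothetical; `collections.Counter` is used here simply as a convenient way to count occurrences.

```python
from collections import Counter


def address_features(segmented_substrings):
    """Map each valid word (feature) to its occurrence count (weight value)."""
    weights = Counter()
    for words in segmented_substrings:
        weights.update(words)
    return weights


# Hypothetical segmentation results for the Chinese substrings of one address.
features = address_features([["地址"], ["罗湖区", "笋岗", "东路"], ["罗湖区", "广场"]])
print(features)
```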

S82: Compare whether the two address character strings have the same meaning, according to the features of the two address character strings and their weight values.

Specifically, for each address character string, the weight value of each feature obtained in step S81 can be used as the word frequency of that feature, and a feature vector of the address character string is built from the word frequencies of its features. The cosine similarity algorithm is then used to calculate the similarity of the two address character strings, and whether the two address character strings have the same meaning is judged from that similarity.
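The cosine similarity of two word-frequency vectors can be sketched as below. The vectors are represented as dictionaries from feature to word frequency, which is an assumption about the data layout; the function name `cosine_similarity` and the sample features are hypothetical.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(weight * freq_b.get(feature, 0) for feature, weight in freq_a.items())
    norm_a = math.sqrt(sum(w * w for w in freq_a.values()))
    norm_b = math.sqrt(sum(w * w for w in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                 # an empty vector is treated as dissimilar
    return dot / (norm_a * norm_b)


a = {"罗湖区": 1, "笋岗": 1}
b = {"罗湖区": 1, "东路": 1}
print(cosine_similarity(a, b))
```

Features missing from one address contribute zero to the dot product, so the two vectors do not need to be expanded to a shared vocabulary first.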

As another embodiment of the present invention, for each address character string, after the features and their weight values have been obtained in step S81, the simhash algorithm can be applied to the features and weight values to obtain a binary number string for the address character string. The Hamming distance between the binary number strings of the two address character strings is then calculated, and whether the two address character strings have the same meaning is judged from the Hamming distance.
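A common formulation of simhash is sketched below: each feature is hashed to a fixed number of bits, every bit position accumulates plus or minus the feature's weight, and the sign of each total gives one bit of the fingerprint. The 64-bit width and the use of MD5 as the per-feature hash are assumptions made for this illustration; the source does not specify either, and the function names are hypothetical.

```python
import hashlib

BITS = 64  # fingerprint width assumed for this sketch


def simhash(weights):
    """Weighted simhash fingerprint of a {feature: weight} mapping."""
    totals = [0] * BITS
    for feature, weight in weights.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:BITS // 8], "big")
        for bit in range(BITS):
            # A set bit votes +weight for this position, a clear bit votes -weight.
            if (value >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


fp_a = simhash({"罗湖区": 1, "笋岗": 1, "东路": 1})
fp_b = simhash({"罗湖区": 1, "笋岗": 1, "广场": 1})
print(hamming_distance(fp_a, fp_b))
```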

If the similarity is greater than a preset similarity threshold, or the Hamming distance is less than a preset distance threshold, the two address character strings are confirmed to have the same meaning; otherwise, they are considered to have different meanings.

It should be noted that the preset similarity threshold and the preset distance threshold can be set according to the needs of the actual application and are not limited here.

It can be understood that the greater the similarity or the smaller the Hamming distance, the closer the meanings of the two address character strings; conversely, the smaller the similarity or the greater the Hamming distance, the larger the difference between their meanings.

It should be noted that, in embodiments of the present invention, only the Chinese character strings are compared when judging whether two address character strings have the same meaning; whether the non-Chinese character strings have the same meaning is not judged. This is because, when analyzing the degree of association between two insurance policies, comparing the Chinese character strings is already enough to reflect the similarity of the address character strings. When the Chinese character strings of two address character strings have the same meaning, the two insurance policies can be considered associated even if the non-Chinese character strings differ, for example if the house numbers are different.

In other embodiments, when the degree of association between two insurance policies needs to be analyzed from an exact match of the two address character strings, a further check can follow once the comparison of the Chinese character strings has established that the two address character strings have the same meaning: the contents of the non-Chinese character strings in the two address character strings are compared directly, character by character. If they are the same, the two address character strings are confirmed to have the same meaning; if they differ, the two address character strings are confirmed to have different meanings.

S83: If the meanings of the two address character strings are the same, associate the policies corresponding to the policy information in which the two address character strings are located.

Specifically, if step S82 determines that the meanings of the two address character strings are the same, it is confirmed that an association exists between the policies corresponding to the policy information in which the two address character strings are located. The closer the meanings of the two address character strings, the higher the degree of association between the two policies is considered to be; the larger the gap between the meanings, the lower the degree of association between the two policies is considered to be.

Associating policies helps the relevant staff of the insurance company accurately uncover potential relationships between policies, which facilitates their analysis of the policies and the discovery of possible insurance fraud risks.

In the embodiment corresponding to Fig. 3, after word segmentation is performed on the address character strings in the policy information, the features and weight values of each address character string are determined according to the word segmentation result. Whether the meanings of two address character strings are the same is then compared according to their features and weight values, and when the comparison shows that the meanings are the same, the policies corresponding to the policy information in which the two address character strings are located are associated. This helps the relevant staff of the insurance company accurately uncover potential relationships between policies, which facilitates their analysis of the policies and the discovery of possible insurance fraud risks.

On the basis of the above embodiments, a process of establishing the everyday words dictionary may also be executed before the pending Chinese character string is obtained in step S1.

Referring to Fig. 4, Fig. 4 shows the specific implementation flow for establishing the everyday words dictionary provided by an embodiment of the present invention, detailed as follows:

S01: Collect Chinese everyday expressions from the internet.

Specifically, the ways of collecting Chinese everyday expressions from the internet include directly downloading Chinese everyday expressions from the internet, such as common nouns and hot words; obtaining frequently searched words from search engines on the internet; and obtaining frequently occurring Chinese words from Chinese websites with high traffic. Chinese everyday expressions may also be obtained from other Chinese word banks on the internet. The collection is not limited to these ways and may include any other channel for obtaining common Chinese words from the internet.

S02: Obtain the insurance professional terms of the insurance industry.

Specifically, insurance professional terms are obtained in a targeted way for the insurance industry, including directly downloading the technical terms and nouns involved in the insurance industry from the internet, or searching insurance-related websites for insurance-related words. For example, the insurance professional terms of Ping An Insurance of China include "Ping An Insurance One Account login", "Ping An Insurance telephone auto insurance", "Ping An Insurance mall", and so on. The acquisition is not limited to these ways and may also include obtaining the insurance professional terms of the insurance industry from various other channels.

It should be noted that there is no required order of execution between step S01 and step S02; the two steps may be executed in parallel, which is not limited here.

S03: Establish the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.

Specifically, the Chinese everyday expressions collected in step S01 and the insurance professional terms obtained in step S02 are integrated into the everyday words dictionary.

Further, the integrated everyday words dictionary is cleaned up. The clean-up flow includes deduplication, denoising, screening, sensitive-word filtering, repeated denoising, normalization, and so on.
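
The integration and clean-up of step S03 might look like the sketch below. It covers only whitespace trimming, removal of empty entries, sensitive-word filtering and deduplication; the function name, its parameters and the choice of a Python set as the dictionary are assumptions made for illustration, and the other clean-up steps named above are not modeled.

```python
def build_dictionary(common_words, insurance_terms, sensitive_words=frozenset()):
    """Merge both word sources into one set, which removes duplicates.

    sensitive_words is a hypothetical block list used for the filtering step.
    """
    dictionary = set()
    for word in list(common_words) + list(insurance_terms):
        word = word.strip()  # basic denoising: drop surrounding whitespace
        if not word or word in sensitive_words:
            continue  # skip empty entries and filtered words
        dictionary.add(word)
    return dictionary
```

Using a set keeps the later existence checks during word segmentation at constant average cost per lookup.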

In the embodiment corresponding to Fig. 4, the everyday words dictionary combines Chinese everyday expressions from the network with the insurance professional terms of the insurance industry. This both widens the vocabulary covered by the everyday words dictionary and, through the addition of insurance professional terms, improves the accuracy and validity of word matching when segmenting character strings in the insurance field.

With continued reference to Fig. 4, after the everyday words dictionary is established using the Chinese everyday expressions and the insurance professional terms as described in step S03, the everyday words dictionary may also be updated. Details are as follows:

S04: Update the everyday words dictionary at every predetermined time interval.

Specifically, recent high-frequency Chinese hot words of the insurance industry are collected periodically from the internet and recorded in a preset temporary word list. At every predetermined time interval, each word in the temporary word list is looked up in the everyday words dictionary; if a word in the temporary word list does not exist in the everyday words dictionary, it is added to the everyday words dictionary. After every word in the temporary word list has been looked up, the contents of the temporary word list are cleared to release its storage space and allow the list to be reused.

It should be noted that the predetermined time interval may be one day or one week, but is not limited to these values; it can be configured according to the needs of the practical application and is not limited here.
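
One update round of step S04 can be sketched as follows, assuming the everyday words dictionary is held as a Python set and the temporary word list as a Python list. The function name and the returned count are illustrative assumptions, and the scheduling at the predetermined time interval is left out.

```python
def update_dictionary(dictionary, temporary_words):
    """Add unseen hot words to the dictionary, then empty the temporary list.

    Returns the number of words newly added in this round.
    """
    added = 0
    for word in temporary_words:
        if word not in dictionary:
            dictionary.add(word)
            added += 1
    # Clear the list in place so that it can be refilled before the next round.
    temporary_words.clear()
    return added
```

A caller would invoke this once per interval, for example once a day or once a week, after new hot words have been recorded in the temporary word list.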

In the embodiment corresponding to Fig. 4, the everyday words dictionary is updated at every predetermined time interval, and high-frequency Chinese hot words of the insurance industry are added to it. This keeps the dictionary highly sensitive to the popular keywords that continually emerge in the insurance industry. By continually updating and improving the everyday words dictionary, its coverage keeps growing, which continually improves the accuracy and validity of word matching when segmenting character strings in the insurance field.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.

Embodiment 2

Corresponding to the character string segmenting method of the foregoing embodiment, Fig. 5 shows a structural block diagram of the character string segmenting device provided by an embodiment of the present invention. For convenience of description, only the parts relevant to the embodiment of the present invention are shown.

Referring to Fig. 5, the character string segmenting device includes an acquisition module 51, a cutting module 52, a composite module 53, a circulation searching module 54, a first identification module 55, a second identification module 56 and a result determining module 57. Each functional module is described in detail as follows:

Acquisition module 51, for obtaining a pending Chinese character string;

Cutting module 52, for performing single-character segmentation on the pending Chinese character string to obtain n single words a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

Composite module 53, for combining, if i is less than n, a_i with the adjacent single word a_{i+1} to obtain the temporary word a_i a_{i+1};

Circulation searching module 54, for combining, if the temporary word a_i a_{i+1} exists in the preset everyday words dictionary, the temporary word a_i a_{i+1} with the adjacent single word a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to search the everyday words dictionary with the temporary word a_i a_{i+1} a_{i+2}, until the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, where k ∈ [i+1, n];

First identification module 55, for identifying, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, a_i a_{i+1} a_{i+2} ... a_{k-1} as an effective word, and then starting from the single word a_k: if k is equal to n, identifying a_k as an effective word; if k is less than n, taking a_k as a_i and returning to the composite module 53 to continue execution;

Second identification module 56, for identifying, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the everyday words dictionary and k = n, a_i a_{i+1} a_{i+2} ... a_k as an effective word;

Result determining module 57, for determining the identified effective words as the word segmentation result of the Chinese character string.
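
The cooperation of modules 52 to 57 can be sketched in a single loop. This is a minimal illustration rather than the device itself: the function name is hypothetical, the everyday words dictionary is assumed to be a Python set of strings, and indices are 0-based where the description above uses 1-based subscripts.

```python
def segment(text, dictionary):
    """Segment a Chinese character string by greedy extension from each start."""
    chars = list(text)  # single-character segmentation (cutting module)
    n = len(chars)
    words = []
    i = 0
    while i < n:
        word = chars[i]
        k = i + 1
        # Keep appending the adjacent single word while the temporary word
        # is still found in the dictionary (composite and searching modules).
        while k < n and word + chars[k] in dictionary:
            word += chars[k]
            k += 1
        # Either the temporary word ending at chars[k] was not found, or the
        # end of the string was reached; in both cases `word` is effective.
        words.append(word)
        i = k  # continue from the single word that broke the match
    return words
```

With the hypothetical dictionary {"北京", "北京市", "朝阳", "朝阳区"}, the string "北京市朝阳区" is segmented into "北京市" and "朝阳区". Note that, as described, a longer word is reached only if every shorter temporary word starting at the same character is also found in the dictionary.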

Further, the acquisition module 51 includes:

Character string acquiring unit 511, for obtaining a pending character string that includes Chinese characters and non-Chinese characters;

Non-Chinese character acquiring unit 512, for identifying m non-Chinese character strings in the pending character string, where each non-Chinese character string contains only contiguous non-Chinese characters, and m is zero or a positive integer;

Cutting unit 513, for splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut points, and determining each substring as a pending Chinese character string, where each substring contains only contiguous Chinese characters.
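
The splitting performed by units 511 to 513 can be sketched with a regular expression. The sketch assumes that "Chinese character" means a code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which is a simplification made for this example; the function and constant names are hypothetical. Empty substrings, which arise when the pending character string begins or ends with non-Chinese characters, are dropped.

```python
import re

# Assumption: Chinese characters are those in U+4E00..U+9FFF.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]+")


def split_chinese_runs(text):
    """Split on runs of non-Chinese characters and keep the non-empty pieces."""
    return [piece for piece in NON_CHINESE_RUN.split(text) if piece]
```

For the address "深圳市福田区8号平安大厦B座", the two non-Chinese character strings "8" and "B" act as cut points (m = 2), which gives the three substrings "深圳市福田区", "号平安大厦" and "座".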

Further, the pending character string is an address character string in policy information, and the character string segmenting device further includes:

Characteristic determination module 581, for determining, for each address character string, the features and weight values of that address character string according to the word segmentation result of each Chinese character string in the address character string;

Comparison module 582, for comparing whether the meanings of two address character strings are the same according to the features and weight values of the two address character strings;

Relating module 583, for associating, if the meanings of the two address character strings are the same, the policies corresponding to the policy information in which the two address character strings are located.

Further, the character string segmenting device further includes:

First collection module 501, for collecting Chinese everyday expressions from the internet;

Second collection module 502, for obtaining the insurance professional terms of the insurance industry;

Dictionary establishing module 503, for establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.

Further, the character string segmenting device further includes:

Word library updating module 504, for updating the everyday words dictionary at every predetermined time interval.

For the process by which each module of the character string segmenting device provided in this embodiment realizes its function, refer to the description of Embodiment 1 above; details are not repeated here.

Embodiment 3

This embodiment provides a computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the character string segmenting method of Embodiment 1; to avoid repetition, details are not repeated here. Alternatively, when executed by a processor, the computer program implements the functions of each module/unit of the character string segmenting device of Embodiment 2; to avoid repetition, details are not repeated here.

Embodiment 4

Fig. 6 is a schematic diagram of the terminal device provided by an embodiment of the present invention. As shown in Fig. 6, the terminal device 60 of this embodiment includes a processor 61, a memory 62, and a computer program 63, such as a character string segmenting program, that is stored in the memory 62 and can run on the processor 61. When executing the computer program 63, the processor 61 implements the steps of the above character string segmenting method embodiments, such as steps S1 to S7 shown in Fig. 1. Alternatively, when executing the computer program 63, the processor 61 implements the functions of each module/unit in the above character string segmenting device embodiments, such as the functions of modules 51 to 57 shown in Fig. 5.

Illustratively, the computer program 63 can be divided into one or more modules/units, which are stored in the memory 62 and executed by the processor 61 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 63 in the terminal device 60. For example, the computer program 63 can be divided into an acquisition module, a cutting module, a composite module, a circulation searching module, a first identification module, a second identification module and a result determining module. Each functional module is described in detail as follows:

Acquisition module, for obtaining a pending Chinese character string;

Cutting module, for performing single-character segmentation on the pending Chinese character string to obtain n single words a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

Composite module, for combining, if i is less than n, a_i with the adjacent single word a_{i+1} to obtain the temporary word a_i a_{i+1};

Circulation searching module, for combining, if the temporary word a_i a_{i+1} exists in the preset everyday words dictionary, the temporary word a_i a_{i+1} with the adjacent single word a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to search the everyday words dictionary with the temporary word a_i a_{i+1} a_{i+2}, until the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, where k ∈ [i+1, n];

First identification module, for identifying, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, a_i a_{i+1} a_{i+2} ... a_{k-1} as an effective word, and then starting from the single word a_k: if k is equal to n, identifying a_k as an effective word; if k is less than n, taking a_k as a_i and returning to the composite module to continue execution;

Second identification module, for identifying, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the everyday words dictionary and k = n, a_i a_{i+1} a_{i+2} ... a_k as an effective word;

Result determining module, for determining the identified effective words as the word segmentation result of the Chinese character string.

Further, the acquisition module includes:

Character string acquiring unit, for obtaining a pending character string that includes Chinese characters and non-Chinese characters;

Non-Chinese character acquiring unit, for identifying m non-Chinese character strings in the pending character string, where each non-Chinese character string contains only contiguous non-Chinese characters, and m is zero or a positive integer;

Cutting unit, for splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut points, and determining each substring as a pending Chinese character string, where each substring contains only contiguous Chinese characters.

Further, the pending character string is an address character string in policy information, and the computer program 63 can also be divided into:

Characteristic determination module, for determining, for each address character string, the features and weight values of that address character string according to the word segmentation result of each Chinese character string in the address character string;

Comparison module, for comparing whether the meanings of two address character strings are the same according to the features and weight values of the two address character strings;

Relating module, for associating, if the meanings of the two address character strings are the same, the policies corresponding to the policy information in which the two address character strings are located.

Further, the computer program 63 can also be divided into:

First collection module, for collecting Chinese everyday expressions from the internet;

Second collection module, for obtaining the insurance professional terms of the insurance industry;

Dictionary establishing module, for establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.

Further, the computer program 63 can also be divided into:

Word library updating module, for updating the everyday words dictionary at every predetermined time interval.

The terminal device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 61 and the memory 62. Those skilled in the art will understand that Fig. 6 is only an example of the terminal device 60 and does not constitute a limitation on it; the terminal device 60 may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device 60 may also include input/output devices, network access devices, a bus, and so on.

The processor 61 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

The memory 62 may be an internal storage unit of the terminal device 60, such as a hard disk or memory of the terminal device 60. The memory 62 may also be an external storage device of the terminal device 60, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the terminal device 60. Further, the memory 62 may include both an internal storage unit of the terminal device 60 and an external storage device. The memory 62 is used to store the computer program and the other programs and data required by the terminal device. The memory 62 may also be used to temporarily store data that has been output or is about to be output.

Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division into the functional units and modules described above is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed; that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.

In addition, the functional units in each embodiment of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the above embodiment methods by instructing relevant hardware through a computer program. The computer program can be stored in a computer readable storage medium and, when executed by a processor, can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium can be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A character string segmenting method, characterized in that the character string segmenting method includes:

obtaining a pending Chinese character string;

performing single-character segmentation on the Chinese character string to obtain n single words a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

if the temporary word a_i a_{i+1} exists in a preset everyday words dictionary, combining the temporary word a_i a_{i+1} with the adjacent single word a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to search the everyday words dictionary with the temporary word a_i a_{i+1} a_{i+2}, until the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, where k ∈ [i+1, n];

if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, identifying a_i a_{i+1} a_{i+2} ... a_{k-1} as an effective word, and then starting from the single word a_k: if k is equal to n, identifying a_k as an effective word; if k is less than n, taking a_k as a_i and continuing to execute the step of, if i is less than n, combining a_i with the adjacent single word a_{i+1} to obtain the temporary word a_i a_{i+1};

if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the everyday words dictionary and k = n, identifying a_i a_{i+1} a_{i+2} ... a_k as an effective word;

2. The character string segmenting method as described in claim 1, characterized in that obtaining the pending Chinese character string includes:

obtaining a pending character string that includes Chinese characters and non-Chinese characters;

identifying m non-Chinese character strings in the pending character string, where each non-Chinese character string contains only contiguous non-Chinese characters, and m is zero or a positive integer;

splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut points, and determining each substring as the pending Chinese character string, where each substring contains only contiguous Chinese characters.

3. The character string segmenting method as claimed in claim 2, characterized in that the pending character string is an address character string in policy information, and after the step of determining the identified effective words as the word segmentation result of the Chinese character string has been executed for each of two address character strings, the character string segmenting method further includes:

for each address character string, determining the features and weight values of that address character string according to the word segmentation result of each Chinese character string in the address character string;

comparing whether the meanings of the two address character strings are the same according to the features and weight values of the two address character strings;

if the meanings of the two address character strings are the same, associating the policies corresponding to the policy information in which the two address character strings are located.

4. The character string segmenting method as described in any one of claims 1 to 3, characterized in that, before obtaining the pending Chinese character string, the character string segmenting method further includes:

collecting Chinese everyday expressions from the internet;

obtaining the insurance professional terms of the insurance industry;

establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms.

5. The character string segmenting method as claimed in claim 4, characterized in that, after establishing the everyday words dictionary using the Chinese everyday expressions and the insurance professional terms, the character string segmenting method further includes:

updating the everyday words dictionary at every predetermined time interval.

6. A character string segmenting device, characterized in that the character string segmenting device includes:

an acquisition module, for obtaining a pending Chinese character string;

a cutting module, for performing single-character segmentation on the Chinese character string to obtain n single words a_i, where i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

a circulation searching module, for combining, if the temporary word a_i a_{i+1} exists in a preset everyday words dictionary, the temporary word a_i a_{i+1} with the adjacent single word a_{i+2} to obtain the temporary word a_i a_{i+1} a_{i+2}, and continuing to search the everyday words dictionary with the temporary word a_i a_{i+1} a_{i+2}, until the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, where k ∈ [i+1, n];

a first identification module, for identifying, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the everyday words dictionary, a_i a_{i+1} a_{i+2} ... a_{k-1} as an effective word, and then starting from the single word a_k: if k is equal to n, identifying a_k as an effective word; if k is less than n, taking a_k as a_i and returning to the composite module to continue execution;

a second identification module, for identifying, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the everyday words dictionary and k = n, a_i a_{i+1} a_{i+2} ... a_k as an effective word;

7. The character string segmenting device as claimed in claim 6, characterized in that the acquisition module includes:

a non-Chinese character acquiring unit, for identifying m non-Chinese character strings in the pending character string, where each non-Chinese character string contains only contiguous non-Chinese characters, and m is zero or a positive integer;

a cutting unit, for splitting the pending character string into m+1 substrings using the m non-Chinese character strings as cut points, and determining each substring as the pending Chinese character string, where each substring contains only contiguous Chinese characters.

8. The character string segmenting device as claimed in claim 7, characterized in that the pending character string is an address character string in policy information, and the character string segmenting device further includes:

a characteristic determination module, for determining, for each address character string, the features and weight values of that address character string according to the word segmentation result of each Chinese character string in the address character string;

a comparison module, for comparing whether the meanings of two address character strings are the same according to the features and weight values of the two address character strings;

a relating module, for associating, if the meanings of the two address character strings are the same, the policies corresponding to the policy information in which the two address character strings are located.

9. A terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the character string segmenting method as described in any one of claims 1 to 5.

10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the character string segmenting method as described in any one of claims 1 to 5.