CN107992570A - String mining method and apparatus, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- CN107992570A CN107992570A CN201711230875.0A CN201711230875A CN107992570A CN 107992570 A CN107992570 A CN 107992570A CN 201711230875 A CN201711230875 A CN 201711230875A CN 107992570 A CN107992570 A CN 107992570A
- Authority
- CN
- China
- Prior art keywords
- string
- character string
- training
- trained
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Abstract
An embodiment of the present disclosure discloses a string mining method and apparatus, an electronic device, and a computer-readable storage medium. The string mining method includes: obtaining a training string data set, where the training string data set includes training string data and string feature data; training on the training string data set to obtain a target-string judgment model; and performing target-string judgment on a test string according to the target-string judgment model. The disclosure can improve the validity of string segmentation and of retrieval, and thereby effectively improve the service quality of merchants or service providers and enhance the user experience.
Description
Technical field
The present disclosure relates to the field of information processing, and in particular to a string mining method and apparatus, an electronic device, and a computer-readable storage medium.
Background technology
With the development of Internet technology, more and more merchants and service providers offer services to users through Internet platforms, striving to improve service quality, enhance the user experience, and win more orders, so as to raise the utilization of existing resources and create more value for merchants or service providers. However, when users currently use the retrieval services provided by merchants or service providers, the hit rate of the retrieval results cannot meet users' requirements, which weakens the user experience.
Summary of the invention
An embodiment of the present disclosure provides a string mining method and apparatus, an electronic device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a string mining method.
Specifically, the string mining method includes:
obtaining a training string data set, where the training string data set includes training string data and string feature data;
training on the training string data set to obtain a target-string judgment model; and
performing target-string judgment on a test string according to the target-string judgment model.
With reference to the first aspect, in a first implementation of the first aspect, obtaining the training string data in the training string data set includes:
obtaining historical string data;
taking data confirmed as target strings in the historical string data as training positive samples;
taking data confirmed as non-target strings in the historical string data as training negative samples; and
generating the training string data based on the training positive samples and training negative samples.
With reference to the first aspect, in a first implementation of the first aspect, the string feature data includes one or more of: the word-frequency score of a string w within a preset historical period, the mutual-information score of the string w, the information-entropy score of the string w, and whether the string w is a preset name.
With reference to the first aspect, in a first implementation of the first aspect, training on the training string data set to obtain the target-string judgment model includes:
obtaining feature weight values corresponding to the string feature data by training on the training string data; and
generating the target-string judgment model based on the weight values of the string feature data.
With reference to the first aspect, in a first implementation of the first aspect, obtaining feature weight values corresponding to the string feature data by training on the training string data includes:
training on the training string data set to obtain a feature-weight determination model; and
determining the feature weight values corresponding to the string feature data based on the feature-weight determination model.
With reference to the first aspect, in a first implementation of the first aspect, generating the target-string judgment model based on the weight values of the string feature data includes:
generating, according to the weight values of the string feature data, a probability calculation model for whether a string w is a target string; and
confirming a string whose probability meets a preset condition as a target string.
With reference to the first aspect, in a first implementation of the first aspect, the probability calculation model is expressed as:
p = 1 / (1 + exp(-sum_i λi·fi))
where fi denotes the i-th feature in the string feature data, λi denotes the weight value corresponding to the i-th feature fi, and p denotes the probability that the string is a target string.
With reference to the first aspect, in a first implementation of the first aspect, confirming a string whose probability meets a preset condition as a target string includes:
confirming a string whose probability is greater than a preset probability threshold as a target string.
With reference to the first aspect, in a first implementation of the first aspect, the test string is a string input within a preset historical period.
With reference to the first aspect and the first implementation of the first aspect, in a second implementation of the first aspect, the method further includes: performing a preset operation on the target string.
In a second aspect, an embodiment of the present disclosure provides a string mining apparatus.
Specifically, the string mining apparatus includes:
an acquisition module, configured to obtain a training string data set, where the training string data set includes training string data and string feature data;
a training module, configured to train on the training string data set to obtain a target-string judgment model; and
a judgment module, configured to perform target-string judgment on a test string according to the target-string judgment model.
With reference to the second aspect, in a first implementation of the second aspect, the acquisition module includes:
an acquisition submodule, configured to obtain historical string data;
a first confirmation submodule, configured to take data confirmed as target strings in the historical string data as training positive samples;
a second confirmation submodule, configured to take data confirmed as non-target strings in the historical string data as training negative samples; and
a first generation submodule, configured to generate the training string data based on the training positive samples and training negative samples.
With reference to the second aspect, in a first implementation of the second aspect, the string feature data includes one or more of: the word-frequency score of a string w within a preset historical period, the mutual-information score of the string w, the information-entropy score of the string w, and whether the string w is a preset name.
With reference to the second aspect, in a first implementation of the second aspect, the training module includes:
a training submodule, configured to obtain feature weight values corresponding to the string feature data by training on the training string data; and
a second generation submodule, configured to generate the target-string judgment model based on the weight values of the string feature data.
With reference to the second aspect, in a first implementation of the second aspect, the training submodule includes:
a training unit, configured to train on the training string data set to obtain a feature-weight determination model; and
a determination unit, configured to determine the feature weight values corresponding to the string feature data based on the feature-weight determination model.
With reference to the second aspect, in a first implementation of the second aspect, the second generation submodule includes:
a generation unit, configured to generate, according to the weight values of the string feature data, a probability calculation model for whether a string w is a target string; and
a confirmation unit, configured to confirm a string whose probability meets a preset condition as a target string.
With reference to the second aspect, in a first implementation of the second aspect, the probability calculation model is expressed as:
p = 1 / (1 + exp(-sum_i λi·fi))
where fi denotes the i-th feature in the string feature data, λi denotes the weight value corresponding to the i-th feature fi, and p denotes the probability that the string is a target string.
With reference to the second aspect, in a first implementation of the second aspect, the confirmation unit is configured to confirm a string whose probability is greater than a preset probability threshold as a target string.
With reference to the second aspect, in a first implementation of the second aspect, the test string is a string input within a preset historical period.
With reference to the second aspect and the first implementation of the second aspect, in a second implementation of the second aspect, the apparatus further includes: an execution module, configured to perform a preset operation on the target string.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory is configured to store one or more computer instructions that support the string mining apparatus in performing the string mining method of the first aspect, and the processor is configured to execute the computer instructions stored in the memory. The string mining apparatus may further include a communication interface for communication between the string mining apparatus and other devices or communication networks.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium for storing computer instructions used by the string mining apparatus, including the computer instructions involved when the string mining apparatus performs the string mining method of the first aspect.
The technical solutions provided by the embodiments of the present disclosure can bring the following beneficial effects:
by considering multiple string features and assigning a corresponding weight value to each feature, the above technical solution analyzes and determines whether a given string is a target string, for example whether it is a new string that can be added to a segmentation dictionary or a retrieval dictionary. This enriches the content of the segmentation dictionary or retrieval dictionary, improves the validity of string segmentation and of retrieval, and ultimately effectively improves the service quality of merchants or service providers and enhances the user experience.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
With reference to the accompanying drawings, further features, objects, and advantages of the present disclosure will become apparent from the following detailed description of non-limiting embodiments. In the drawings:
Fig. 1 shows a flow chart of the string mining method according to an embodiment of the present disclosure;
Fig. 2 shows a flow chart of step S101 in the embodiment of Fig. 1;
Fig. 3 shows a flow chart of step S102 in the embodiment of Fig. 1;
Fig. 4 shows a flow chart of step S301 in the embodiment of Fig. 3;
Fig. 5 shows a flow chart of step S302 in the embodiment of Fig. 3;
Fig. 6 shows a structural diagram of the string mining apparatus according to an embodiment of the present disclosure;
Fig. 7 shows a structural diagram of the acquisition module 601 in the embodiment of Fig. 6;
Fig. 8 shows a structural diagram of the training module 602 in the embodiment of Fig. 6;
Fig. 9 shows a structural diagram of the training submodule 801 in the embodiment of Fig. 8;
Fig. 10 shows a structural diagram of the second generation submodule 802 in the embodiment of Fig. 8;
Fig. 11 shows a structural diagram of the electronic device according to an embodiment of the present disclosure;
Fig. 12 shows a structural diagram of a computer system suitable for implementing the string mining method according to an embodiment of the present disclosure.
Detailed description of embodiments
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings, so that those skilled in the art can easily implement them. In addition, for the sake of clarity, parts unrelated to describing the exemplary embodiments are omitted from the drawings.
In the present disclosure, it should be understood that terms such as "comprising" or "having" are intended to indicate the presence of the features, numbers, steps, acts, components, parts, or combinations thereof disclosed in this specification, and are not intended to exclude the possibility that one or more other features, numbers, steps, acts, components, parts, or combinations thereof exist or are added.
It should also be noted that, in the absence of conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
In the technical solutions provided by the embodiments of the present disclosure, multiple string features are considered and a corresponding weight value is assigned to each feature, in order to analyze and determine whether a given string is a target string, for example whether it is a new string that can be added to a segmentation dictionary or a retrieval dictionary. This enriches the content of the segmentation dictionary or retrieval dictionary, improves the validity of string segmentation and of retrieval, and ultimately effectively improves the service quality of merchants or service providers and enhances the user experience.
The strings in the technical solutions of the present disclosure can be used for purposes such as retrieval, search, and matching; for convenience of description, the technical solutions are described in detail below taking retrieval as an example.
Fig. 1 shows a flow chart of the string mining method according to an embodiment of the present disclosure. As shown in Fig. 1, the string mining method includes the following steps S101-S103:
In step S101, a training string data set is obtained, where the training string data set includes training string data and string feature data;
In step S102, the training string data set is trained on to obtain a target-string judgment model;
In step S103, target-string judgment is performed on a test string according to the target-string judgment model.
Currently, when a user uses the retrieval service provided by a merchant or service provider, the merchant or service provider typically retrieves using, as the retrieval object, the string input by the user directly, or the words obtained after splitting that string according to a general-purpose dictionary. Since in many cases these retrieval objects are not present in the segmentation dictionary or retrieval dictionary, the retrieval results obtained from them naturally contain much noise and cannot accurately retrieve the content the user wants to see. The hit rate of the retrieval results thus cannot meet the user's requirements, which reduces the quality of the merchant's or service provider's service and weakens the user experience.
In view of this, this embodiment proposes a string mining method. The method considers multiple string features and, based on training string data, analyzes and determines whether a given string is a target string; the target strings so obtained can subsequently be added to the segmentation dictionary or retrieval dictionary, thereby enriching their content and improving string retrieval validity. Specifically, a training string data set is first obtained, where the training string data set includes training string data and string feature data; the training string data set is then trained on to obtain a target-string judgment model; finally, target-string judgment is performed on a test string according to the target-string judgment model. The determined target strings can subsequently be added to the segmentation dictionary or retrieval dictionary, which improves the validity of string segmentation, and in turn improves the hit rate of retrieval results, improves the quality of the merchant's or service provider's service, and enhances the user experience.
In an optional implementation of this embodiment, as shown in Fig. 2, step S101, i.e., the step of obtaining the training string data in the training string data set, includes steps S201-S204:
In step S201, historical string data is obtained;
In step S202, data confirmed as target strings in the historical string data is taken as training positive samples;
In step S203, data confirmed as non-target strings in the historical string data is taken as training negative samples;
In step S204, the training string data is generated based on the training positive samples and training negative samples.
In this embodiment, in order to obtain more accurate training data, historical string data within a preset historical period may first be obtained at random, where the length of the historical string data can be set according to the needs of the practical application; for example, for a retrieval service based on an Internet platform, the length can be set to 2-5 characters. The historical string data is then verified: data confirmed as target strings that can subsequently be added to the segmentation dictionary or retrieval dictionary is taken as training positive samples, while data confirmed as non-target strings that should not be added to the segmentation dictionary or retrieval dictionary is taken as training negative samples. Finally, the training positive samples and training negative samples are combined to form the training string data.
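The sampling procedure above can be sketched as follows; this is a minimal illustration under assumed data shapes, not the patent's implementation, and all names are hypothetical:

```python
# Minimal sketch of assembling the training set: verified historical
# strings become positive samples, the rest negative samples.

def build_training_set(history_strings, is_target):
    """history_strings: candidate strings (e.g. 2-5 characters);
    is_target: callable that encodes the manual verification step."""
    positives = [(s, 1) for s in history_strings if is_target(s)]
    negatives = [(s, 0) for s in history_strings if not is_target(s)]
    return positives + negatives

# Toy usage: two strings verified as dictionary-worthy, one not.
verified = {"火锅店", "奶茶"}
data = build_training_set(["火锅店", "奶茶", "的地得"], lambda s: s in verified)
```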
In an optional implementation of this embodiment, the string feature data includes one or more of: the word-frequency score of a string w within a preset historical period, the mutual-information score of the string w, the information-entropy score of the string w, and whether the string w is a preset name.
The above string feature data are features that, after repeated testing and verification, have been found to improve the validity of string segmentation and of retrieval.
The word-frequency score f1 of a string w within a preset historical period can be expressed as:
f1 = log(count(w))
where count(w) is the number of times a preset operation such as retrieval or query was performed for the string w within the preset historical period.
A higher word-frequency score f1 indicates that the string w is used more frequently.
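As a small illustration of f1, assuming a `counts` lookup built offline from query logs over the preset historical window (all names here are hypothetical):

```python
import math

def word_freq_score(w, counts):
    # f1 = log(count(w)), the log of how often w was retrieved/queried
    # in the preset historical window.
    return math.log(counts[w])

counts = {"奶茶": 1000, "的地得": 1}
```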
The mutual-information score f2 of a string w can be expressed as:
f2 = (1/(N-1)) · sum over i = 1..N-1 of log( p(ci, ci+1) / (p(ci)·p(ci+1)) )
where N denotes the length of the string w, c1c2…cN denote the characters in the string w, p(ci, ci+1) denotes the probability that the two characters ci and ci+1 co-occur, and p(ci) denotes the probability that the character ci appears.
A higher mutual-information score f2 indicates that the characters inside the string bind more tightly.
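One plausible reading of f2 is the average pointwise mutual information over adjacent character pairs; a sketch under that assumption, with toy probability tables standing in for corpus statistics:

```python
import math

def mutual_info_score(w, p_char, p_pair):
    """f2 as average pointwise mutual information over adjacent
    character pairs of w; p_char/p_pair are assumed probability
    lookups estimated offline from a corpus."""
    pairs = list(zip(w, w[1:]))
    pmi = [math.log(p_pair[(a, b)] / (p_char[a] * p_char[b]))
           for a, b in pairs]
    return sum(pmi) / len(pmi)

# Toy statistics: a pair co-occurring exactly as independence predicts
# scores 0; a pair co-occurring twice as often scores log(2) > 0.
p_char = {"a": 0.5, "b": 0.5}
independent = mutual_info_score("ab", p_char, {("a", "b"): 0.25})
tight = mutual_info_score("ab", p_char, {("a", "b"): 0.5})
```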
The information-entropy score f3 of a string w can be expressed as:
f3 = Hleft(w) + Hright(w)
Hleft(w) = -sum over a in A of p(aw|w)·log p(aw|w)
Hright(w) = -sum over b in B of p(wb|w)·log p(wb|w)
where Hleft(w) denotes the left information entropy, Hright(w) denotes the right information entropy, A denotes the set of left-neighbor characters of the string w, B denotes the set of right-neighbor characters of the string w, and p(aw|w) and p(wb|w) denote the conditional probabilities that the left-neighbor character and right-neighbor character appear, respectively.
A higher information-entropy score f3 indicates that the string w is used more flexibly in its external contexts.
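A sketch of f3 under the definitions above, assuming the neighbor distributions have already been estimated from a corpus (the inputs and names are illustrative):

```python
import math

def entropy(probs):
    # Shannon entropy H = -sum p*log(p) of a neighbor distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def info_entropy_score(left_probs, right_probs):
    """f3 = Hleft(w) + Hright(w). The inputs are the assumed
    conditional probabilities p(aw|w) and p(wb|w) of each observed
    left/right neighbor character of w."""
    return entropy(left_probs) + entropy(right_probs)
```

A string always flanked by the same neighbors scores 0; uniform neighbor distributions maximize the score, matching the "external flexibility" reading in the text.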
Whether the string w is a preset name is a feature drawn from human knowledge: if a string w appears as a preset name, it is likely to be a word that needs to be added to the dictionary. The preset name can be, for example, a merchant name, service-provider name, product name, or service name. This feature can be further divided into two features: whether the string w is a merchant name or service-provider name, and whether the string w is a product name or service name. Specifically, the feature of whether the string w is a merchant name or service-provider name can be expressed as:
f4 = 1 if w is a merchant name or service-provider name, and 0 otherwise.
The feature of whether the string w is a product name or service name can be expressed as:
f5 = 1 if w is a product name or service name, and 0 otherwise.
In an optional implementation of this embodiment, as shown in Fig. 3, step S102, i.e., the step of training on the training string data set to obtain the target-string judgment model, includes steps S301-S302:
In step S301, feature weight values corresponding to the string feature data are obtained by training on the training string data;
In step S302, the target-string judgment model is generated based on the weight values of the string feature data.
In this embodiment, a model-training method and an optimization algorithm are first used to train on the training string data set to obtain a set of feature weight values corresponding to the string feature data, and the target-string judgment model is then generated based on those weight values.
Those skilled in the art can choose the model-training method and optimization algorithm according to the needs of the practical application; the present disclosure does not specifically limit them.
In an optional implementation of this embodiment, as shown in Fig. 4, step S301, i.e., the step of obtaining feature weight values corresponding to the string feature data by training on the training string data, includes steps S401-S402:
In step S401, training is performed on the training string data set to obtain a feature-weight determination model;
In step S402, the feature weight values corresponding to the string feature data are determined based on the feature-weight determination model.
As mentioned above, the present disclosure considers many kinds of string feature data, and these feature data can characterize the necessity of performing a certain preset operation on a string, for example the necessity of adding it to the segmentation dictionary or retrieval dictionary. However, the individual features contribute differently to this necessity judgment; that is, when the above multiple feature data are used to characterize the necessity of performing a preset operation on a string, the weights of the different features should not be treated as equal, but should be differentiated.
Therefore, in this embodiment, a training model plus an optimization algorithm is used to determine the optimal weight distribution over the above multiple feature data. For example, based on the training string data set, a feature-weight determination model can be obtained using a logistic regression model, which is simple, efficient, and very widely used in practical machine-learning applications; this feature-weight determination model is then used to obtain the feature weight values corresponding to the above feature data, and by applying an optimization algorithm a set of optimal feature weight values corresponding to the feature data can usually be obtained.
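A minimal sketch of learning such differentiated per-feature weights with logistic regression via plain gradient descent; this is illustrative only (a real system would use a library implementation and the optimizer of choice), and all names are hypothetical:

```python
import math

def train_logistic(samples, lr=0.5, epochs=2000):
    """Gradient-descent training of a logistic regression, as one
    concrete way to learn per-feature weights.
    samples: list of (feature_vector, label) with label in {0, 1}.
    Bias term omitted for brevity; returns one weight per feature."""
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability
            for i in range(n):               # gradient step on each weight
                w[i] += lr * (y - p) * x[i]
    return w
```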
In an optional implementation of this embodiment, as shown in Fig. 5, step S302, i.e., the step of generating the target-string judgment model based on the weight values of the string feature data, includes steps S501-S502:
In step S501, a probability calculation model for whether a string w is a target string is generated according to the weight values of the string feature data;
In step S502, a string whose probability meets a preset condition is confirmed as a target string.
In this embodiment, the target-string judgment model may include the probability calculation model for whether a string w is a target string, and the part that judges whether the string w is a target string according to the obtained probability value. Specifically, the probability calculation model is first generated according to the weight values of the string feature data obtained above; the probability calculation model can be expressed as:
p = 1 / (1 + exp(-sum_i λi·fi))
where fi denotes the i-th feature in the string feature data, λi denotes the weight value corresponding to the i-th feature fi, and p denotes the probability that the string w is a target string. Then, whether the string w is a target string is judged according to its corresponding probability value. For example, a string whose probability value is greater than a preset probability threshold can be considered relatively important in itself, carrying more information, and relatively effective for preset operations such as retrieval and query; such strings can therefore be confirmed as target strings, and adding them to the segmentation dictionary or retrieval dictionary can improve the validity of string segmentation and in turn the hit rate of retrieval results.
The probability threshold can be set according to the needs of the practical application; the present disclosure does not specifically limit its value.
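Putting the judgment model together, assuming the logistic form implied by the logistic regression described above (the names and the 0.5 default threshold are illustrative):

```python
import math

def target_prob(features, weights):
    # p = 1 / (1 + exp(-sum_i lambda_i * f_i))
    z = sum(l * f for l, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def is_target(features, weights, threshold=0.5):
    # Confirm the string as a target string when p exceeds the preset
    # probability threshold; it would then be added to the dictionary.
    return target_prob(features, weights) > threshold
```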
In an optional implementation of this embodiment, the test string is a string input within a preset historical period. The test string is similar to the training strings; its length can be set according to the needs of the practical application, for example to 2-5 characters.
In an optional implementation of this embodiment, the method further includes the step of performing a preset operation on the target string, where the preset operation includes one or more of: adding the string to a dictionary file such as a segmentation dictionary or retrieval dictionary, performing retrieval, performing search, performing query, and performing matching.
When the preset operation is adding to a dictionary file such as a segmentation dictionary or retrieval dictionary, in an optional implementation of this embodiment, the method further includes the step of judging whether the test string already exists in the dictionary file. This step mainly determines whether the subsequent target-string judgment is necessary; that is, target-string judgment is performed only for test strings that do not already exist in the dictionary file.
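The pre-check can be sketched as a simple membership filter (names are hypothetical):

```python
def strings_to_judge(test_strings, dictionary):
    """Membership pre-check: only test strings not already present in
    the dictionary file proceed to target-string judgment."""
    return [s for s in test_strings if s not in dictionary]
```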
The following are apparatus embodiments of the present disclosure, which can be used to perform the method embodiments of the present disclosure.
Fig. 6 shows a structural diagram of the string mining apparatus according to an embodiment of the present disclosure. The apparatus can be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in Fig. 6, the string mining apparatus includes:
an acquisition module 601, configured to obtain a training string data set, where the training string data set includes training string data and string feature data;
a training module 602, configured to train on the training string data set to obtain a target-string judgment model; and
a judgment module 603, configured to perform target-string judgment on a test string according to the target-string judgment model.
Currently, when a user uses the retrieval service provided by a merchant or service provider, the merchant or service provider typically retrieves using, as the retrieval object, the string input by the user directly, or the words obtained after splitting that string according to a general-purpose dictionary. Since in many cases these retrieval objects are not present in the segmentation dictionary or retrieval dictionary, the retrieval results obtained from them naturally contain much noise and cannot accurately retrieve the content the user wants to see. The hit rate of the retrieval results thus cannot meet the user's requirements, which reduces the quality of the merchant's or service provider's service and weakens the user experience.
In this embodiment, a character string mining apparatus is proposed. By considering multiple character string features and analyzing the training character string data, the apparatus determines whether a given character string is a target string; the target strings so obtained can subsequently be added to the segmentation dictionary or retrieval dictionary, enriching their content and improving the validity of character string retrieval. Specifically, the acquisition module 601 first obtains a training character string data set, where the training character string data set includes training character string data and character string feature data; the training module 602 then trains on the training character string data set to obtain a target string judgment model; finally, the judgment module 603 performs target string judgement on a test character string according to the target string judgment model. The confirmed target strings can subsequently be added to the segmentation dictionary or retrieval dictionary, which improves the validity of string segmentation, raises the hit rate of retrieval results, improves the quality of the merchant's or service provider's service, and strengthens the user experience.
In an optional implementation of this embodiment, as shown in Fig. 7, the acquisition module 601 includes:
an acquisition submodule 701, configured to obtain history character string data;
a first confirmation submodule 702, configured to take the data confirmed as target strings in the history character string data as training positive samples;
a second confirmation submodule 703, configured to take the data confirmed as non-target strings in the history character string data as training negative samples;
a first generation submodule 704, configured to generate the training character string data based on the training positive samples and training negative samples.
In this embodiment, to obtain more accurate training data, the acquisition submodule 701 first randomly obtains history character string data within a preset historical time period, where the length of the history character strings can be configured according to the needs of the practical application; for a retrieval service based on an internet platform, for example, the length can be set to 2-5 characters. The history character string data is then verified: the first confirmation submodule 702 takes the data confirmed as target strings, i.e. strings worth subsequently adding to the segmentation dictionary or retrieval dictionary, as training positive samples; conversely, the second confirmation submodule 703 takes the data confirmed as non-target strings, i.e. strings not worth adding, as training negative samples. Finally, the first generation submodule 704 combines the training positive samples and training negative samples to form the training character string data.
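The sample-construction flow above can be sketched in Python. The `is_target` callable and the in-memory lists below are hypothetical stand-ins for the manual confirmation step and the history log; they are not part of the disclosure.

```python
def build_training_data(history_strings, is_target, min_len=2, max_len=5):
    """Split verified history strings into positive samples (confirmed
    target strings, label 1) and negative samples (label 0), keeping only
    strings within the suggested 2-5 character length, then combine them."""
    kept = [s for s in history_strings if min_len <= len(s) <= max_len]
    positives = [(s, 1) for s in kept if is_target(s)]
    negatives = [(s, 0) for s in kept if not is_target(s)]
    return positives + negatives

# 'is_target' stands in for manual verification of each history string.
samples = build_training_data(
    ["hotpot", "ab", "xy", "a"],
    is_target=lambda s: s in {"ab"})
```

Strings outside the configured length range ("hotpot", "a") are dropped before labeling, matching the 2-5 character setting suggested above.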
In an optional implementation of this embodiment, the character string feature data includes one or more of: the word frequency score of character string w in a preset historical time period, the mutual information score of character string w, the information entropy score of character string w, and whether character string w is a preset name.
The above character string features are those which, after repeated tests and verification, the disclosure has found capable of improving the validity of string segmentation and retrieval.
The word frequency score f1 of character string w in the preset historical time period can be expressed as:

f1 = log(count(w))

where count(w) is the number of times that, within the preset historical time period, a retrieval, query, or equivalent predetermined operation is performed for character string w.

A higher word frequency score f1 of character string w in the preset historical time period characterizes more frequent use of that string.
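As a minimal sketch, f1 could be computed from a log of retrieval/query operations; the list-based log below is an assumption for illustration only.

```python
import math

def word_frequency_score(w, operation_log):
    """f1 = log(count(w)), where count(w) is the number of times w was
    used for retrieval/query/equivalent operations in the preset
    historical time period (modeled here as a list of logged strings)."""
    count = operation_log.count(w)
    return math.log(count) if count > 0 else 0.0

log_entries = ["pizza", "pizza", "pizza", "sushi"]
f1 = word_frequency_score("pizza", log_entries)
```

Strings never used in the period are given a score of 0.0 here, a convention chosen for this sketch since log(0) is undefined.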
The mutual information score f2 of character string w can be expressed as:

f2 = Σi=1..N-1 log( p(ci, ci+1) / ( p(ci) · p(ci+1) ) )

where N represents the length of character string w, c1c2…cN represent the characters in character string w, p(ci, ci+1) represents the probability that the two characters ci and ci+1 co-occur, and p(ci) represents the probability that character ci appears.

A higher mutual information score f2 of character string w characterizes a tighter binding between the characters inside the string.
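A sketch of the f2 computation, assuming the character and character-pair probabilities have already been estimated from a corpus; the dict-based probability lookups are hypothetical.

```python
import math

def mutual_information_score(w, p_char, p_pair):
    """f2: sum of log(p(ci, ci+1) / (p(ci) * p(ci+1))) over the N-1
    adjacent character pairs of w; higher values mean the characters
    inside w bind more tightly."""
    return sum(math.log(p_pair[(ci, cj)] / (p_char[ci] * p_char[cj]))
               for ci, cj in zip(w, w[1:]))

# Toy probabilities, as if estimated from a corpus.
p_char = {"a": 0.1, "b": 0.2}
p_pair = {("a", "b"): 0.05}
f2 = mutual_information_score("ab", p_char, p_pair)
```

For the pair ("a", "b") this evaluates log(0.05 / 0.02), positive because the pair co-occurs more often than independence would predict.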
The information entropy score f3 of character string w can be expressed as:

f3 = Hleft(w) + Hright(w)

Hleft(w) = -Σa∈A p(aw|w) · log p(aw|w)

Hright(w) = -Σb∈B p(wb|w) · log p(wb|w)

where Hleft(w) represents the left information entropy, Hright(w) represents the right information entropy, A represents the set of left neighbor words of character string w, B represents the set of right neighbor words of character string w, and p(aw|w) and p(wb|w) represent the conditional probabilities of the left neighbor word and right neighbor word appearing, respectively.

A higher information entropy score f3 of character string w characterizes a greater external flexibility of the string.
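The f3 computation can be sketched from observed neighbor-word counts; the count dictionaries below are hypothetical examples.

```python
import math

def entropy(neighbor_counts):
    """Shannon entropy -sum(p * log p) over an observed neighbor-word
    count distribution."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbor_counts.values())

def information_entropy_score(left_neighbors, right_neighbors):
    """f3 = H_left(w) + H_right(w): higher when w appears in many
    different left and right contexts (greater external flexibility)."""
    return entropy(left_neighbors) + entropy(right_neighbors)

# w seen after two different left neighbors, before a single right one.
f3 = information_entropy_score({"x": 1, "y": 1}, {"z": 2})
```

A string with varied contexts on both sides scores higher; a string always flanked by the same words scores 0, suggesting it is only a fragment of a larger expression.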
The feature of whether character string w is a preset name belongs to artificial knowledge features: if a character string w appears as a preset name, it is very likely a word that needs to be added to the dictionary. The preset name may be, for example, a company name, a service provider name, a product name, or a service name. This feature can further be split into two features: whether character string w is a company name or service provider name, and whether character string w is a product name or service name. Specifically, the feature of whether character string w is a company name or service provider name can be expressed as:

f4 = 1 if w is a company name or service provider name, and f4 = 0 otherwise.

The feature of whether character string w is a product name or service name can be expressed as:

f5 = 1 if w is a product name or service name, and f5 = 0 otherwise.
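These two name features reduce to binary indicators against known-name lookup tables; the name sets below are hypothetical examples standing in for curated business data.

```python
def name_features(w, company_names, product_names):
    """f4 = 1 if w is a known company/service-provider name, else 0;
    f5 = 1 if w is a known product/service name, else 0.
    Both name sets are assumed to come from curated business data."""
    return (1.0 if w in company_names else 0.0,
            1.0 if w in product_names else 0.0)

f4, f5 = name_features("ACME", {"ACME"}, {"widget"})
```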
In an optional implementation of this embodiment, as shown in Fig. 8, the training module 602 includes:
a training submodule 801, configured to obtain feature weight values corresponding to the character string feature data through training based on the training character string data;
a second generation submodule 802, configured to generate the target string judgment model based on the weight values of the character string feature data.
In this embodiment, the training submodule 801 first uses a model-training method and optimization algorithm to obtain, through training on the training character string data, a set of feature weight values corresponding to the character string feature data, and the second generation submodule 802 then generates the target string judgment model based on these weight values of the character string feature data.
Those skilled in the art can choose the model-training method and optimization algorithm according to the needs of the practical application, and the disclosure imposes no specific limitation on them.
In an optional implementation of this embodiment, as shown in Fig. 9, the training submodule 801 includes:
a training unit 901, configured to train based on the training character string data set to obtain a feature weight determination model;
a determination unit 902, configured to determine the feature weight values corresponding to the character string feature data based on the feature weight determination model.
As mentioned above, the disclosure considers many kinds of character string feature data, which can be used to characterize the necessity of performing a certain predetermined operation on a character string, for example the necessity of adding it to the segmentation dictionary or retrieval dictionary. However, the above features contribute differently to that necessity judgement; that is, when multiple features are used to characterize the necessity of performing the predetermined operation on a character string, the weights of the different features should not be treated as equal but should be differentiated.

Therefore, in this embodiment, the training unit 901 determines the optimal weight distribution over the above features by training a model with an optimization algorithm. For example, based on the training character string data set, a feature weight determination model can be obtained using a logistic regression model, a simple, efficient, and very widely applied model in machine learning; the determination unit 902 then uses this feature weight determination model to obtain the feature weight values corresponding to the above features, and by iterating the optimization algorithm a set of optimal feature weight values corresponding to the feature data can usually be obtained.
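The weight-training step can be illustrated with a toy, self-contained logistic regression trained by gradient descent. This is only a sketch of one possible model-training method; any library implementation and the specific learning-rate/epoch values are equally valid choices.

```python
import math

def train_feature_weights(samples, lr=0.5, epochs=2000):
    """Minimal logistic-regression training by per-sample gradient
    descent: each sample is (feature_vector, label); returns one weight
    per feature.  A toy stand-in for the training unit, not a
    production trainer."""
    dim = len(samples[0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for features, label in samples:
            z = sum(l * f for l, f in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, f in enumerate(features):
                weights[i] += lr * (label - p) * f
    return weights

# Toy data: first feature is informative, second is a constant bias term.
data = [([3.0, 1.0], 1), ([2.5, 1.0], 1), ([0.5, 1.0], 0), ([0.2, 1.0], 0)]
weights = train_feature_weights(data)
```

After training, the weight on the informative first feature is positive, so strings with larger feature values receive higher target probabilities.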
In an optional implementation of this embodiment, as shown in Fig. 10, the second generation submodule 802 includes:
a generation unit 1001, configured to generate, according to the weight values of the character string feature data, a probability calculation model of character string w being a target string;
a confirmation unit 1002, configured to confirm character strings whose probability meets a preset condition as target strings.
In this embodiment, the target string judgment model may include a probability calculation model of character string w being a target string, together with the judgement, based on the obtained probability value, of whether character string w is a target string. Specifically, the generation unit 1001 first generates, according to the weight values of the character string feature data obtained above, the probability calculation model of character string w being a target string, which can be expressed as:

p = 1 / (1 + exp(-Σi λi fi))

where fi represents the i-th feature in the character string feature data, λi represents the weight value corresponding to the i-th feature fi, and p represents the probability value of character string w being a target string. The confirmation unit 1002 then judges, according to the probability value corresponding to a given character string w, whether w is a target string. For example, a character string whose probability value exceeds a preset probability threshold can be considered intrinsically important: it carries more information and is relatively effective for retrieval, query, and equivalent predetermined operations. Such character strings can therefore be confirmed as target strings, and adding them to the segmentation dictionary or retrieval dictionary improves the validity of string segmentation and, in turn, the hit rate of retrieval results.
The probability threshold can be configured according to the needs of the practical application, and the disclosure imposes no specific limitation on its value.
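The probability calculation and threshold judgement can be sketched together; the weight values and the 0.5 threshold below are illustrative, since the disclosure leaves the threshold to the practical application.

```python
import math

def target_probability(features, weights):
    """p = 1 / (1 + exp(-sum_i(lambda_i * f_i))): probability that the
    character string is a target string, given its feature values f_i
    and the trained weights lambda_i."""
    z = sum(l * f for l, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def is_target_string(features, weights, threshold=0.5):
    """Confirm the string as a target string when its probability
    exceeds the preset probability threshold."""
    return target_probability(features, weights) > threshold
```

A string whose weighted feature sum is 0 sits exactly at p = 0.5, so the threshold cleanly separates strings with net positive evidence from the rest.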
In an optional implementation of this embodiment, the test character string is a character string input within a preset historical time period. The test character string is similar to the training character strings: its length can be set according to the needs of the practical application, for example to 2-5 characters.
In an optional implementation of this embodiment, the apparatus further includes an execution module, configured to perform a predetermined operation on the target string, where the predetermined operation includes one or more of: adding the string to a dictionary file such as a segmentation dictionary or retrieval dictionary, performing retrieval, performing search, performing query, and performing matching.
When the predetermined operation is adding the string to a dictionary file such as a segmentation dictionary or retrieval dictionary, in an optional implementation of this embodiment the apparatus further includes a second judgment module, which can be configured to judge whether the test character string already exists in the dictionary file. The second judgment module mainly serves to determine whether the subsequent target string judgement is necessary; that is, the judgment module performs target string judgement only for test character strings that do not exist in the dictionary file.
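The pre-check performed by the second judgment module amounts to a membership test before the model is invoked; `judge` below is a hypothetical stand-in for the trained judgment model.

```python
def judge_if_new(test_string, dictionary, judge):
    """Run the (possibly expensive) target-string judgement only for
    test strings that are not already in the dictionary file; strings
    already present need no further judgement."""
    if test_string in dictionary:
        return None          # already in dictionary, skip judgement
    return judge(test_string)

result = judge_if_new("hotpot", {"noodles"}, judge=lambda s: True)
```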
The disclosure also discloses an electronic device. Figure 11 shows a structural block diagram of the electronic device according to an embodiment of the disclosure. As shown in Figure 11, the electronic device 1100 includes a memory 1101 and a processor 1102, where
the memory 1101 is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor 1102 to realize:
obtaining a training character string data set, where the training character string data set includes training character string data and character string feature data;
training on the training character string data set to obtain a target string judgment model;
performing target string judgement on a test character string according to the target string judgment model.
The one or more computer instructions can also be executed by the processor 1102 to realize the following.
Obtaining the training character string data in the training character string data set includes:
obtaining history character string data;
taking the data confirmed as target strings in the history character string data as training positive samples;
taking the data confirmed as non-target strings in the history character string data as training negative samples;
generating the training character string data based on the training positive samples and training negative samples.
The character string feature data includes one or more of: the word frequency score of character string w in a preset historical time period, the mutual information score of character string w, the information entropy score of character string w, and whether character string w is a preset name.
Training on the training character string data set to obtain a target string judgment model includes:
obtaining feature weight values corresponding to the character string feature data through training based on the training character string data;
generating the target string judgment model based on the weight values of the character string feature data.
Obtaining feature weight values corresponding to the character string feature data through training based on the training character string data includes:
training based on the training character string data set to obtain a feature weight determination model;
determining the feature weight values corresponding to the character string feature data based on the feature weight determination model.
Generating the target string judgment model based on the weight values of the character string feature data includes:
generating, according to the weight values of the character string feature data, a probability calculation model of character string w being a target string;
confirming character strings whose probability meets a preset condition as target strings.
The probability calculation model is expressed as:

p = 1 / (1 + exp(-Σi λi fi))

where fi represents the i-th feature in the character string feature data, λi represents the weight value corresponding to the i-th feature fi, and p represents the probability value of the character string being a target string.
Confirming character strings whose probability meets a preset condition as target strings includes:
confirming character strings whose probability exceeds a preset probability threshold as target strings.
The test character string is a character string input within a preset historical time period.
The instructions can further realize:
performing a predetermined operation on the target string.
Figure 12 is a structural schematic diagram of a computer system suitable for implementing the character string mining method according to an embodiment of the disclosure.
As shown in Figure 12, the computer system 1200 includes a central processing unit (CPU) 1201, which can perform the various processes of the embodiments shown in Figs. 1-5 above according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage portion 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores the various programs and data required for the operation of the system 1200. The CPU 1201, ROM 1202 and RAM 1203 are connected to each other through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the I/O interface 1205: an input portion 1206 including a keyboard, a mouse, etc.; an output portion 1207 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., as well as a loudspeaker, etc.; a storage portion 1208 including a hard disk, etc.; and a communication portion 1209 including a network interface card such as a LAN card, a modem, etc. The communication portion 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the driver 1210 as needed, so that the computer program read from it is installed into the storage portion 1208 as needed.
In particular, according to embodiments of the present disclosure, the methods described above with reference to Figs. 1-5 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, the computer program including program code for executing the character string mining method of Figs. 1-5. In such an embodiment, the computer program can be downloaded and installed from a network through the communication portion 1209, and/or installed from the removable medium 1211.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the disclosure. In this regard, each block in a flowchart or block diagram can represent a module, a program segment, or a part of code, which includes one or more executable instructions for realizing a specified logic function. It should also be noted that, in some alternative implementations, the functions marked in the blocks can occur in an order different from that marked in the drawings. For example, two blocks shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be realized by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules involved in the embodiments of the present disclosure can be realized by software or by hardware. The described units or modules can also be provided in a processor, and under certain conditions the names of these units or modules do not constitute a limitation of the units or modules themselves.
As another aspect, the disclosure also provides a computer-readable storage medium, which can be the computer-readable storage medium included in the apparatus described in the above embodiment, or a stand-alone computer-readable storage medium not assembled into a device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the methods described in the disclosure.
The above description is only the preferred embodiments of the disclosure and an explanation of the applied technical principles. Those skilled in the art should appreciate that the invention scope involved in the disclosure is not limited to technical solutions formed by the particular combinations of the above technical features, and should also cover, without departing from the inventive concept, other technical solutions formed by arbitrary combinations of the above technical features or their equivalent features, for example technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the disclosure.
The present disclosure discloses A1, a character string mining method, the method including: obtaining a training character string data set, where the training character string data set includes training character string data and character string feature data; training on the training character string data set to obtain a target string judgment model; and performing target string judgement on a test character string according to the target string judgment model.
A2, the method according to A1, where obtaining the training character string data in the training character string data set includes: obtaining history character string data; taking the data confirmed as target strings in the history character string data as training positive samples; taking the data confirmed as non-target strings in the history character string data as training negative samples; and generating the training character string data based on the training positive samples and training negative samples.
A3, the method according to A1, where the character string feature data includes one or more of: the word frequency score of character string w in a preset historical time period, the mutual information score of character string w, the information entropy score of character string w, and whether character string w is a preset name.
A4, the method according to A1, where training on the training character string data set to obtain a target string judgment model includes: obtaining feature weight values corresponding to the character string feature data through training based on the training character string data; and generating the target string judgment model based on the weight values of the character string feature data.
A5, the method according to A4, where obtaining feature weight values corresponding to the character string feature data through training based on the training character string data includes: training based on the training character string data set to obtain a feature weight determination model; and determining the feature weight values corresponding to the character string feature data based on the feature weight determination model.
A6, the method according to A4, where generating the target string judgment model based on the weight values of the character string feature data includes: generating, according to the weight values of the character string feature data, a probability calculation model of character string w being a target string; and confirming character strings whose probability meets a preset condition as target strings.
A7, the method according to A6, where the probability calculation model is expressed as:

p = 1 / (1 + exp(-Σi λi fi))

where fi represents the i-th feature in the character string feature data, λi represents the weight value corresponding to the i-th feature fi, and p represents the probability value of the character string being a target string.
A8, the method according to A6, where confirming character strings whose probability meets a preset condition as target strings includes: confirming character strings whose probability exceeds a preset probability threshold as target strings.
A9, the method according to A1, where the test character string is a character string input within a preset historical time period.
A10, the method according to A1, further including: performing a predetermined operation on the target string.
The present disclosure discloses B11, a character string mining apparatus, the apparatus including: an acquisition module, configured to obtain a training character string data set, where the training character string data set includes training character string data and character string feature data; a training module, configured to train on the training character string data set to obtain a target string judgment model; and a judgment module, configured to perform target string judgement on a test character string according to the target string judgment model.
B12, the apparatus according to B11, where the acquisition module includes: an acquisition submodule, configured to obtain history character string data; a first confirmation submodule, configured to take the data confirmed as target strings in the history character string data as training positive samples; a second confirmation submodule, configured to take the data confirmed as non-target strings in the history character string data as training negative samples; and a first generation submodule, configured to generate the training character string data based on the training positive samples and training negative samples.
B13, the apparatus according to B11, where the character string feature data includes one or more of: the word frequency score of character string w in a preset historical time period, the mutual information score of character string w, the information entropy score of character string w, and whether character string w is a preset name.
B14, the apparatus according to B11, where the training module includes: a training submodule, configured to obtain feature weight values corresponding to the character string feature data through training based on the training character string data; and a second generation submodule, configured to generate the target string judgment model based on the weight values of the character string feature data.
B15, the apparatus according to B14, where the training submodule includes: a training unit, configured to train based on the training character string data set to obtain a feature weight determination model; and a determination unit, configured to determine the feature weight values corresponding to the character string feature data based on the feature weight determination model.
B16, the apparatus according to B14, where the second generation submodule includes: a generation unit, configured to generate, according to the weight values of the character string feature data, a probability calculation model of character string w being a target string; and a confirmation unit, configured to confirm character strings whose probability meets a preset condition as target strings.
B17, the apparatus according to B16, where the probability calculation model is expressed as:

p = 1 / (1 + exp(-Σi λi fi))

where fi represents the i-th feature in the character string feature data, λi represents the weight value corresponding to the i-th feature fi, and p represents the probability value of the character string being a target string.
B18, the apparatus according to B16, where the confirmation unit is configured to confirm character strings whose probability exceeds a preset probability threshold as target strings.
B19, the apparatus according to B11, where the test character string is a character string input within a preset historical time period.
B20, the apparatus according to B11, further including: an execution module, configured to perform a predetermined operation on the target string.
The present disclosure discloses C21, an electronic device, including a memory and a processor, where the memory is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor to realize the method according to any one of A1-A10.
The disclosure also discloses D22, a computer-readable storage medium on which computer instructions are stored, where the computer instructions, when executed by a processor, realize the method according to any one of A1-A10.
Claims (10)
- A kind of 1. character string method for digging, it is characterised in that the described method includes:Training string data collection is obtained, wherein, the trained string data collection includes training string data and character string Characteristic;The trained string data collection is trained, obtains target string judgment models;Target string judgement is carried out to test character string according to the target string judgment models.
- 2. according to the method described in claim 1, it is characterized in that, the training string data that obtains concentrates acquisition training word String data is accorded with, including:Obtain history string data;The data of target string will be confirmed as in the history string data as training positive sample;The data of non-targeted character string will be confirmed as in the history string data as training negative sample;Based on the trained positive sample and training negative sample generation training string data.
- 3. according to the method described in claim 1, it is characterized in that, the character string characteristic includes:Character string w is default Word frequency score value in historical time section, the mutual information score value of character string w, the comentropy score value of character string w, character string w whether be One or more in preset name.
- 4. according to the method described in claim 1, it is characterized in that, described pair of trained string data collection is trained, obtain Target string judgment models, including:Feature weight value corresponding with character string characteristic is got based on the trained string data training;Weighted value generation target string judgment models based on the character string characteristic.
- 5. according to the method described in claim 4, it is characterized in that, described got and word based on training string data training The corresponding feature weight value of symbol string characteristic, including:It is trained based on the trained string data collection, obtains feature weight and determine model;Determine that model determines feature weight value corresponding with the character string characteristic based on the feature weight.
- 6. the according to the method described in claim 4, it is characterized in that, weighted value life based on the character string characteristic Into target string judgment models, including:The probability calculation model that character string w is target string is generated according to the weighted value of the character string characteristic;The character string that probability is met to preset condition confirms as target string.
- 7. according to the method described in claim 6, it is characterized in that, the probability calculation model is expressed as:<mrow> <mi>p</mi> <mo>=</mo> <mfrac> <mn>1</mn> <mrow> <mn>1</mn> <mo>+</mo> <mi>exp</mi> <mrow> <mo>(</mo> <mo>-</mo> <msub> <mo>&Sigma;</mo> <mi>i</mi> </msub> <msub> <mi>&lambda;</mi> <mi>i</mi> </msub> <msub> <mi>f</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow> </mfrac> </mrow>Wherein, fiRepresent the ith feature in character string characteristic, λiRepresent ith feature fiCorresponding weighted value, p are represented Character string is the probable value of target string.
- 8. A character string mining apparatus, the apparatus including: an acquisition module configured to obtain a training character string data set, wherein the training character string data set includes training character string data and character string characteristic data; a training module configured to train on the training character string data set to obtain a target character string judgment model; and a judgment module configured to perform target character string judgment on a test character string according to the target character string judgment model.
- 9. An electronic device, including a memory and a processor, wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to any one of claims 1-7.
- 10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711230875.0A CN107992570A (en) | 2017-11-29 | 2017-11-29 | Character string method for digging, device, electronic equipment and computer-readable recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107992570A (en) | 2018-05-04 |
Family
ID=62034309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711230875.0A Pending CN107992570A (en) | 2017-11-29 | 2017-11-29 | Character string method for digging, device, electronic equipment and computer-readable recording medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107992570A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102043845A (en) * | 2010-12-08 | 2011-05-04 | 百度在线网络技术(北京)有限公司 | Method and equipment for extracting core keywords based on query sequence cluster |
US8326861B1 (en) * | 2010-06-23 | 2012-12-04 | Google Inc. | Personalized term importance evaluation in queries |
CN104866496A (en) * | 2014-02-22 | 2015-08-26 | 腾讯科技(深圳)有限公司 | Method and device for determining morpheme significance analysis model |
CN104978356A (en) * | 2014-04-10 | 2015-10-14 | 阿里巴巴集团控股有限公司 | Synonym identification method and device |
CN106649666A (en) * | 2016-11-30 | 2017-05-10 | 浪潮电子信息产业股份有限公司 | Left-right recursion-based new word discovery method |
- 2017-11-29: CN application CN201711230875.0A filed; status: Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108933781A (en) * | 2018-06-19 | 2018-12-04 | 上海点融信息科技有限责任公司 | Method, apparatus and computer readable storage medium for processing character string |
CN108933781B (en) * | 2018-06-19 | 2021-07-02 | 上海点融信息科技有限责任公司 | Method, apparatus and computer-readable storage medium for processing character string |
CN112629821A (en) * | 2020-11-17 | 2021-04-09 | 中国移动通信集团江苏有限公司 | Optical cable position determining method and device, electronic equipment and storage medium |
CN112629821B (en) * | 2020-11-17 | 2023-10-27 | 中国移动通信集团江苏有限公司 | Method and device for determining optical cable position, electronic equipment and storage medium |
CN113361238A (en) * | 2021-05-21 | 2021-09-07 | 北京语言大学 | Method and device for automatically proposing question by recombining question types with language blocks |
CN113361238B (en) * | 2021-05-21 | 2022-02-11 | 北京语言大学 | Method and device for automatically proposing question by recombining question types with language blocks |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109783817B (en) | Text semantic similarity calculation model based on deep reinforcement learning | |
CN111444320B (en) | Text retrieval method and device, computer equipment and storage medium | |
Sun et al. | A new surrogate-assisted interactive genetic algorithm with weighted semisupervised learning | |
CN108363790A (en) | Method, apparatus, device and storage medium for assessment | |
CN107066449A (en) | Information-pushing method and device | |
CN107330115A (en) | Information recommendation method and device | |
CN105808590B (en) | Search engine implementation method, searching method and device | |
CN109271493A (en) | Language text processing method, device and storage medium | |
CN107870964A (en) | Sentence sorting method and system applied to an answer fusion system | |
CN107832305A (en) | Method and apparatus for generating information | |
US11347995B2 (en) | Neural architecture search with weight sharing | |
WO2023065859A1 (en) | Item recommendation method and apparatus, and storage medium | |
CN108804526A (en) | Interest determination system, interest determination method and storage medium | |
CN108154198A (en) | Knowledge base entity normalizing method, system, terminal and computer readable storage medium | |
CN111353033B (en) | Method and system for training text similarity model | |
CN107992570A (en) | Character string method for digging, device, electronic equipment and computer-readable recording medium | |
CN109388715A (en) | User data analysis method and device | |
CN107369052A (en) | User registration behavior prediction method, apparatus and electronic equipment | |
CN109903095A (en) | Data processing method, device, electronic equipment and computer readable storage medium | |
CN106445915A (en) | New word discovery method and device | |
CN108255706A (en) | Editing method, device, terminal device and storage medium for automatic test scripts | |
Krantsevich et al. | Stochastic tree ensembles for estimating heterogeneous effects | |
CN111831898A (en) | Sorting method and device, electronic equipment and readable storage medium | |
CN110046344A (en) | Method and terminal device for adding separators | |
CN107885879A (en) | Semantic analysis method, device, electronic equipment and computer-readable recording medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180504 |