CN109597987A - Text restoration method, apparatus, and electronic device

Info

Publication number: CN109597987A
Application number: CN201811248320.3A
Authority: CN
Inventors: 周书恒; 刘金星; 祝慧佳; 赵智源; 郭亚
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2019-04-09
Also published as: TW202016765A; TWI749349B; WO2020082890A1

Abstract

Embodiments of the present application relate to a text restoration method, apparatus, and electronic device. The text restoration method includes: obtaining a target text; performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word; matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text; inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and selecting, based on those confidences, the restored text of the target text from the at least one matched segmented text.

Description

Text restoration method, apparatus, and electronic device

Technical field

The present application relates to the technical field of network security, and in particular to a text restoration method, apparatus, and electronic device.

Background art

With the rise of the Internet, the ease of transmitting information has caused the volume of online information to grow geometrically. Users frequently receive spam sent over the Internet by gray- and black-market operators, such as promotional messages, fraud messages, and illegal advertisements. Network platforms generally intercept such spam. However, to get around the platforms' countermeasures, these operators now spread spam using split-character expressions, in which a Chinese character is written as its separate components. For example, a normal message advertising quick loans of "5000-10000w" may be rewritten so that each occurrence of the character "借" (borrow) appears as the two components "亻昔".

In view of this, how to restore variant text written with split characters back to normal text, so as to improve a network platform's ability to recognize spam, is the technical problem to be solved by the present application.

Summary of the invention

The purpose of the embodiments of the present application is to provide a text restoration method, apparatus, and electronic device that can restore variant text written with split characters back to normal text.

To achieve the above purpose, the embodiments of the present application are implemented as follows:

In a first aspect, a text restoration method is provided, comprising:

obtaining a target text;

performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;

matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;

inputting the at least one matched segmented text into a preset language model to obtain a confidence for the at least one matched segmented text; and

selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.

In a second aspect, a text restoration apparatus is provided, comprising:

an acquisition module, which obtains a target text;

a word segmentation module, which performs word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;

a matching module, which matches, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;

an evaluation module, which inputs the at least one matched segmented text into a preset language model to obtain a confidence for the at least one matched segmented text; and

a selection module, which selects, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.

In a third aspect, an electronic device is provided, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the following:

obtaining a target text;

In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:

obtaining a target text;

As can be seen from the technical solutions provided above, the embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word. These characters are treated as the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. A preset language model then evaluates the confidence of each matched segmented text, and the best one is selected, based on confidence, as the restored text of the target text. The scheme can effectively restore variant text written with split characters to normal text and can improve a network platform's ability to recognize spam.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the steps of the text restoration method provided by an embodiment of the present application;

Fig. 2 is a flow diagram of the text restoration method provided by an embodiment of the present application in a practical application;

Fig. 3 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of the logical structure of the text restoration apparatus provided by an embodiment of the present application.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.

As noted above, gray- and black-market operators now send spam written with split characters to evade supervision by network platforms. In view of this, the present application aims to provide a technical solution that restores variant text written with split characters back to normal text, which can improve a network platform's ability to recognize spam.

Fig. 1 is a flowchart of a text restoration method according to one embodiment of the present application. The method of Fig. 1 may be executed by a text restoration apparatus and includes:

Step S102: obtain a target text.

For step S102:

The embodiments of the present application do not specifically limit the source of the target text.

As an example, the target text may be a text message sent by a user and obtained from a social network platform.

For example, reviews and chat messages sent by users may be obtained from an online shopping platform.

It should be understood that any information object that a network platform needs to monitor can serve as the target text.

Step S104: perform word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word.

For step S104:

This embodiment may use any existing segmentation method to segment the target text and thereby determine the characters in the target text that cannot form a word.

As an example, the characters determined as unable to form a word may include any of: a Chinese character, a component of a Chinese character, or a radical of a Chinese character. Such characters are very likely part of a split-character expression and are the key objects of the subsequent split-character recognition.
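The idea in step S104 can be sketched as follows. This is a minimal illustration only: the patent allows any existing segmentation method, and the forward-maximum-matching segmenter, the tiny `VOCAB` lexicon, and the function names below are assumptions made for this example. Tokens that the lexicon does not recognize are treated as the characters that cannot form a word.

```python
from typing import List, Set

# Hypothetical toy lexicon; a production system would rely on a full segmenter.
VOCAB: Set[str] = {"需要", "我", "手机号"}


def segment(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: emit the longest lexicon word at each
    position, falling back to a single character when none matches."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def unsegmentable(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Tokens absent from the lexicon are candidates for split-character matching."""
    return [t for t in tokens if t not in vocab]


tokens = segment("需要亻昔款力口我手机号", VOCAB)
print(tokens)
print(unsegmentable(tokens, VOCAB))
```

With this toy lexicon, "需要", "我", and "手机号" come out as words, while the remaining single characters are flagged for matching.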

Step S106: based on a split-character sample set, match the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text.

For step S106:

The split-character sample set contains preset split-character expressions. These may be expressions for whole words, for example the split forms of the product names "花呗" and "借呗", in which "呗" is written as "口贝", and "亻昔款" corresponding to "借款" (loan); or expressions for a single Chinese character, for example "亻昔" corresponding to "借" and "口贝" corresponding to "呗".

In this step, the split-character sample set is used to match the characters in the segmented text that cannot form a word and restore them to their normally expressed form.

Specifically, characters that cannot form a word and are adjacent in the row direction of the segmented text may be matched.

For example, suppose the segmented text is a lottery advertisement in which "彩" is written as "采彡" and "赚" (earn) is written as "贝兼", and the split-character sample set records that "采彡" corresponds to "彩" and "贝兼" corresponds to "赚". The characters "采", "彡", "月", "贝", "兼", and "$" are those not recognized as words in the segmented text. Matching the adjacent characters among them against the split-character sample set yields a matched segmented text that reads, roughly, "lottery, earn a million a month".
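Row-direction matching can be sketched as follows. The patent does not specify a matching algorithm, so this is only one possible reading: every place where a table entry occurs is located, and each maximal set of non-overlapping replacements produces one candidate. The table contents, the input strings (reconstructed from the examples in this document), and the function names are assumptions for illustration. The subset enumeration is exponential and only suitable for short texts.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, restored character)


def find_spans(text: str, table: Dict[str, str]) -> List[Span]:
    """Locate every occurrence of a split-character sequence in the text."""
    return [(i, i + len(parts), whole)
            for i in range(len(text))
            for parts, whole in table.items()
            if text.startswith(parts, i)]


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def apply_spans(text: str, spans: Tuple[Span, ...]) -> str:
    """Rebuild the text, replacing each chosen span with its restored character."""
    pieces, pos = [], 0
    for start, end, whole in spans:
        pieces += [text[pos:start], whole]
        pos = end
    return "".join(pieces) + text[pos:]


def candidates(text: str, table: Dict[str, str]) -> List[str]:
    """One candidate per maximal set of non-overlapping replacements."""
    spans = find_spans(text, table)
    results = []
    for r in range(len(spans), -1, -1):
        for subset in combinations(spans, r):
            if any(overlaps(a, b) for a, b in combinations(subset, 2)):
                continue  # chosen replacements collide with each other
            if any(all(not overlaps(s, t) for t in subset)
                   for s in spans if s not in subset):
                continue  # another replacement could still be added
            results.append(apply_spans(text, subset))
    return results


print(candidates("六合采彡月贝兼百万", {"采彡": "彩", "贝兼": "赚"}))
print(candidates("需要亻昔款，力口我手机号", {"亻昔": "借", "力口": "加", "口我": "哦"}))
```

When two table entries overlap, as "力口" and "口我" do in the second call, the sketch emits one candidate per choice, which is why the language-model step that follows is needed to pick between them.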

Similarly, characters that cannot form a word and are adjacent in the column direction of the segmented text may also be matched.

For example, suppose the segmented text spans two lines: the first line reads "add mobile number xx for low-自 arbitrage", and the character "心" sits on the second line directly below "自".

Matching the column-adjacent characters "自" and "心" against the split-character sample set yields "息" (interest), so the matched segmented text is "add mobile number xx for low-interest arbitrage".
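Column-direction matching can be sketched in the same spirit. The sketch assumes that the two lines are laid out in a fixed-width grid, so that a character's index in its line equals its column; the table keyed by (upper, lower) pairs, the reconstructed input strings, and the function name are all hypothetical.

```python
from typing import Dict, Tuple


def merge_columns(upper: str, lower: str,
                  table: Dict[Tuple[str, str], str]) -> str:
    """Combine vertically stacked components into one character.

    For each column, if the (upper, lower) pair is a known split form,
    the upper character is replaced by the restored character; the
    lower line is otherwise discarded.
    """
    restored = list(upper)
    for col, pair in enumerate(zip(upper, lower)):
        whole = table.get(pair)
        if whole is not None:
            restored[col] = whole
    return "".join(restored)


line_1 = "加手机号xx可低自套利"
line_2 = " " * 8 + "心"  # "心" sits directly below "自" (column 8)
print(merge_columns(line_1, line_2, {("自", "心"): "息"}))
```

Real messages mix full-width and half-width characters, so a practical implementation would need to compute display columns rather than string indices.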

Step S108: input the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text.

For step S108:

It should be understood that a matched segmented text determined from the split-character sample set is not necessarily the correct restored text, so a preset language model is used to evaluate its confidence. The confidence of a matched segmented text reflects how accurately it restores the target text.

It should be understood that the preset language model can be configured flexibly according to the actual application scenario; the embodiments of the present application do not specifically limit it.

As an example, suppose the scheme of the embodiments is used to restore spam on the network that is written with split characters. The preset language model may be obtained by training on a spam sample set. After the at least one matched segmented text is input, the model scores the confidence of each matched segmented text against the evaluation criteria for spam. The higher the confidence score of a matched segmented text, the more likely it is spam, and the higher the corresponding restoration accuracy.

Alternatively, the preset language model may use correct sentence expression as its evaluation criterion when scoring the confidence of the at least one matched segmented text, for example scoring based on the correct "subject, predicate, object" sentence structure. Again, the higher the confidence score of a matched segmented text, the higher the corresponding restoration accuracy.

Because the preset language model can be implemented in more than one way, further examples are omitted here.
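As one hedged illustration of a model that rewards well-formed expression, the sketch below uses a character bigram model with add-one smoothing. The patent does not name a model type; the `BigramModel` class, the two-sentence training corpus, and the use of average log-probability as the "confidence" are assumptions chosen to keep the example small.

```python
import math
from collections import Counter
from typing import Iterable


class BigramModel:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence)
            self.unigrams.update(chars)
            self.bigrams.update(zip(chars, chars[1:]))
        # One extra slot stands in for any character never seen in training.
        self.vocab_size = len(self.unigrams) + 1

    def score(self, text: str) -> float:
        """Average log-probability per character; higher means more fluent."""
        chars = ["<s>"] + list(text)
        if len(chars) == 1:
            return 0.0
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / (len(chars) - 1)


# Hypothetical training sentences standing in for a real sample set.
model = BigramModel(["需要借款加我手机号", "低息借款加我微信"])
print(model.score("需要借款加我手机号") > model.score("需要借款力哦手机号"))
```

A candidate containing character pairs that never occur in the training data, such as "力哦", receives a lower score than one built from familiar pairs.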

Step S110: based on the confidence of the at least one matched segmented text, select the restored text of the target text from the at least one matched segmented text.

For step S110:

In this step, the matched segmented text with the highest confidence may be chosen from the at least one matched segmented text as the restored text of the target text.

In the embodiments of the present application, word segmentation is first performed on the target text to determine the characters that cannot form a word. These characters are treated as the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. A preset language model then evaluates the confidence of each matched segmented text, and the best one is selected, based on confidence, as the restored text of the target text. The scheme can effectively restore variant text written with split characters to normal text and can improve a network platform's ability to recognize spam.

The flow of the text restoration method of the embodiments in a practical application is described in detail below.

The main flow of the text restoration method of the embodiments includes:

Step 1: obtain the target text.

In this step, a target text sent by a user may be obtained from a social network platform (such as messaging software or online shopping software).

As an example, suppose the content of the target text is "需要亻昔款，力口我手机号" (intended meaning: "need a loan, add my mobile number"). This target text is clearly spam written with split characters.

Step 2: determine the segmented text.

In this step, word segmentation is performed on "需要亻昔款，力口我手机号". For ease of understanding, the segments are separated by spaces, so the resulting segmented text is "需要 亻 昔 款 ， 力 口 我 手机号".

It should be understood that, in the above target text, "需要" (need), "我" (I), and "手机号" (mobile number) are recognized as words, while "亻", "昔", "款", "力", and "口" are characters that cannot form a word.

Step 3: split-character matching.

In this step, a split-character table resource is used to perform split-character matching on the above segmented text. Here "亻昔" can be matched to "借", "力口" can be matched to "加", and "口我" can be matched to "哦". Based on the split-character table resource, the following two matched segmented texts are finally obtained:

The first is "需要借款，加我手机号" ("need a loan, add my mobile number");

The second is "需要借款，力哦手机号", which is not a meaningful sentence.

Step 4: confidence evaluation.

In this step, the two matched segmented texts from step 3 are input into the preset language model to compute the confidence P1 of the first matched segmented text and the confidence P2 of the second.

The preset language model may be a classification model trained on spam samples of illegal lending.

For example, features common to illegal lending may be used as the input vector of the preset language model, and the model is trained on spam samples to continuously optimize the weights of the input vector.

After the two matched segmented texts are input into the trained preset language model, the first clearly carries a feature common to illegal lending, "add my mobile number", and therefore obtains the higher confidence from the classification model.

It should be noted that the embodiments of the present application do not specifically limit the function used by the preset language model; any function usable for classification is applicable to the preset language model of the embodiments.

Step 5: probability comparison.

In this step, the confidence of the first matched segmented text is compared with that of the second (P1 > P2). The candidate with the larger confidence is more likely to be the correct restored text.

Step 6: output the restored text.

In this step, based on the comparison result of step 5 (P1 > P2), the restored text finally output is the first matched segmented text, "need a loan, add my mobile number".
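Steps 4 to 6 can be sketched with a toy feature-based classifier. The phrase features, weights, and bias below are invented for illustration and are not values from the patent, which says only that the weights are learned from spam samples. The logistic function is used here as one example of a classification function.

```python
import math
from typing import Dict, List

# Hypothetical hand-set weights; a real model would learn them from spam samples.
FEATURE_WEIGHTS: Dict[str, float] = {"借款": 1.2, "加我": 1.5, "手机号": 0.8}
BIAS = -2.0


def confidence(text: str) -> float:
    """Logistic score over the phrase features present in the text."""
    z = BIAS + sum(w for phrase, w in FEATURE_WEIGHTS.items() if phrase in text)
    return 1.0 / (1.0 + math.exp(-z))


def select_restored(candidates: List[str]) -> str:
    """Step 5 and step 6: keep the candidate with the highest confidence."""
    return max(candidates, key=confidence)


first = "需要借款，加我手机号"
second = "需要借款，力哦手机号"
p1, p2 = confidence(first), confidence(second)
print(p1 > p2)
print(select_restored([first, second]))
```

Because only the first candidate contains the phrase "加我" (add me), it receives the larger score and is returned as the restored text.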

In summary, the text restoration method of the embodiments can identify the characters in the target text that are written in split form and match and restore them. In a specific implementation, word segmentation is first performed on the target text, so that only the characters that cannot form a word are treated as objects of split-character matching, which effectively reduces the number of matches and improves matching accuracy. A language model is then used to select the best matched segmented text as the restored text of the target text. The whole scheme is computationally simple and occupies relatively few processing resources, and is therefore particularly suitable for network platforms that need to recognize spam written with split characters.

Fig. 3 is a schematic structural diagram of an electronic device according to one embodiment of the present application. Referring to Fig. 3, at the hardware level the electronic device includes a processor and optionally also an internal bus, a network interface, and storage. The storage may include memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. The electronic device may of course also include hardware required by other services.

The processor, network interface, and storage may be interconnected by the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of illustration, Fig. 3 uses only one double-headed arrow, which does not mean there is only one bus or one type of bus.

The storage is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The storage may include memory and non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it, forming the text restoration apparatus at the logical level. The processor executes the program stored in the storage, specifically to perform the following operations:

Obtain target text；

The text restoration method disclosed in the embodiment shown in Fig. 1 may be applied in, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the storage; the processor reads the information in the storage and completes the steps of the above method in combination with its hardware.

The electronic device can also perform the method shown in Fig. 1 and implement the functions of the text restoration apparatus in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.

Of course, besides a software implementation, the electronic device of the present application does not exclude other implementations, such as logic devices or a combination of software and hardware. That is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.

The embodiments of the present application also propose a computer-readable storage medium that stores one or more programs. The one or more programs include instructions which, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to perform the method of the embodiment shown in Fig. 1, and specifically to perform the following method:

obtaining a target text;

It should be understood that the computer-readable storage medium of the present application, when its program is executed, can implement the functions of the text restoration apparatus in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.

Fig. 4 is a schematic structural diagram of a text restoration apparatus 400 according to one embodiment of the present application, comprising:

an acquisition module 410, which obtains a target text;

a word segmentation module 420, which performs word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;

a matching module 430, which matches, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;

an evaluation module 440, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text; and

a selection module 450, which selects, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
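The module structure of apparatus 400 can be sketched as a small class that wires the five modules together. The class name, the field names, and the choice to model each module as a plain callable are assumptions for illustration; the patent describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextRestorationApparatus:
    """Hypothetical composition of modules 410 to 450 as plain callables."""

    acquire: Callable[[], str]               # acquisition module 410
    segment: Callable[[str], List[str]]      # word segmentation module 420
    match: Callable[[List[str]], List[str]]  # matching module 430
    evaluate: Callable[[str], float]         # evaluation module 440

    def run(self) -> str:
        target = self.acquire()
        tokens = self.segment(target)
        candidates = self.match(tokens)
        # Selection module 450: keep the candidate with the highest confidence.
        return max(candidates, key=self.evaluate)
```

Keeping each module behind a callable makes it easy to swap in a different segmenter or language model without touching the selection logic.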

The embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word. These characters are treated as the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. A preset language model then evaluates the confidence of each matched segmented text, and the best one is selected, based on confidence, as the restored text of the target text. The scheme can effectively restore variant text written with split characters to normal text and can improve a network platform's ability to recognize spam.

Optionally, as one embodiment, the matching module 430 is specifically configured to:

match, based on a split-character sample resource, the characters that cannot form a word and are adjacent in the row direction of the segmented text.

Optionally, as one embodiment, the matching module 430 is specifically configured to:

match, based on a split-character sample resource, the characters that cannot form a word and are adjacent in the column direction of the segmented text.

Optionally, as one embodiment, the selection module 450 is specifically configured to:

choose, from the at least one matched segmented text, the one with the highest confidence as the restored text of the target text.

Optionally, as one embodiment, the characters in the segmented text that cannot form a word include any of: a Chinese character, a component of a Chinese character, or a radical of a Chinese character.

Optionally, as one embodiment, the preset language model is obtained by training on a spam sample set.

Optionally, as one embodiment, the acquisition module 410 is specifically configured to:

obtain, from a social network platform, the target text sent by a user.

It should be understood that the text restoration apparatus of the embodiments can perform the method of Fig. 1 and implement the functions of that method in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.

Those skilled in the art will understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired result. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The above are merely embodiments of this specification and are not intended to limit it. For those skilled in the art, this specification may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within the scope of its claims.

Claims

1. A text restoration method, comprising:

obtaining a target text;

performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;

matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;

inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text; and

selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.

2. The text restoration method according to claim 1, wherein

matching, based on a split-character sample resource, the characters in the segmented text that cannot form a word comprises:

matching, based on the split-character sample resource, the characters that cannot form a word and are adjacent in the row direction of the segmented text.

3. The text restoration method according to claim 1, wherein

based on a split-character sample resource, the characters that cannot form a word and are adjacent in the column direction of the segmented text are matched.

4. The text restoration method according to claim 1, wherein

selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text comprises:

choosing, from the at least one matched segmented text, the one with the highest confidence as the restored text of the target text.

5. The text restoration method according to claim 1, wherein

the characters in the segmented text that cannot form a word include any of: a Chinese character, a component of a Chinese character, or a radical of a Chinese character.

6. The text restoration method according to claim 1, wherein

the preset language model is obtained by training on a spam sample set.

7. The text restoration method according to claim 1, wherein

obtaining the target text comprises:

8. A text restoration apparatus, comprising:

an acquisition module, which obtains a target text;

a word segmentation module, which performs word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;

a matching module, which matches, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;

an evaluation module, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text; and

a selection module, which selects, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.

9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the following:

obtaining a target text;

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:

obtaining a target text;