CN109597987A - Text restoration method, apparatus, and electronic device - Google Patents

Text restoration method, apparatus, and electronic device

Info

Publication number
CN109597987A
Authority
CN
China
Prior art keywords
text
word segmentation
matching
character
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811248320.3A
Other languages
Chinese (zh)
Inventor
周书恒
刘金星
祝慧佳
赵智源
郭亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811248320.3A
Publication of CN109597987A
Priority to TW108127355A (TWI749349B)
Priority to PCT/CN2019/103103 (WO2020082890A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Abstract

The embodiments of the present application relate to a text restoration method, apparatus, and electronic device. The text restoration method includes: obtaining a target text; performing word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word; matching, based on a split-character sample set, the characters that cannot form a word in the segmented text to obtain at least one matched segmented text; inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and selecting, based on those confidences, the restored text of the target text from the at least one matched segmented text.

Description

Text restoration method, apparatus, and electronic device
Technical field
The embodiments of the present application relate to the technical field of network security, and in particular to a text restoration method, apparatus, and electronic device.
Background
With the rise of the internet, the convenience of transmitting information has caused the volume of online information to grow geometrically. Users frequently receive spam sent over the internet by gray and black market operators, such as promotional messages, fraud messages, and illegal advertisements. Network platforms generally intercept such spam. To get around the platforms' various countermeasures, however, these operators now spread spam using split-character expressions, in which a Chinese character is written as its separate components. For example, a normal message reading "I offer lightning loans and can force-open a loan of 5000-10000w" is rewritten so that each occurrence of 借 ("borrow") appears as the two components 亻 and 昔.
In view of this, how to restore variant text written with split characters to normal text, so as to improve a network platform's ability to recognize spam, is the technical problem to be solved by the present application.
Summary of the invention
The purpose of the embodiments of the present application is to provide a text restoration method, apparatus, and electronic device that can restore variant text written with split characters to normal text.
To achieve this purpose, the embodiments of the present application are implemented as follows.
In a first aspect, a text restoration method is provided, including:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
matching, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
selecting, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
In a second aspect, a text restoration apparatus is provided, including:
an obtaining module, configured to obtain a target text;
a segmentation module, configured to perform word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
a matching module, configured to match, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
an evaluation module, configured to input the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
a selection module, configured to select, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, performs the following steps:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
matching, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
selecting, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
In a fourth aspect, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the following steps:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
matching, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
selecting, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
As the technical solution above shows, the embodiments of the present application first perform word segmentation on the target text to identify the characters that cannot form a word. These characters are treated as the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. A preset language model then evaluates the confidence of each matched segmented text, and the best one is selected by confidence as the restored text of the target text. The scheme can effectively restore variant text written with split characters to normal text, which improves a network platform's ability to recognize spam.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the steps of the text restoration method provided by the embodiments of the present application;
Fig. 2 is a flow diagram of the text restoration method provided by the embodiments of the present application in a practical application;
Fig. 3 is a schematic diagram of the hardware structure of the electronic device provided by the embodiments of the present application;
Fig. 4 is a schematic diagram of the logical structure of the text restoration apparatus provided by the embodiments of the present application.
Detailed description
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.
As noted above, gray and black market operators now send spam written with split characters in order to evade supervision by network platforms. The present application therefore aims to provide a technical solution that restores variant text written with split characters to normal text, improving a network platform's ability to recognize spam.
Fig. 1 is a flowchart of a text restoration method according to one embodiment of the present application. The method of Fig. 1 may be performed by a text restoration apparatus and includes the following steps.
Step S102: obtain a target text.
For step S102:
The embodiments of the present application place no specific limit on the source of the target text.
As an example, the target text may be text information sent by a user and obtained from a social networking platform.
For example, review information or chat messages sent by users may be obtained from an online shopping platform.
It should be understood that any information object that a network platform needs to supervise can serve as the target text.
Step S104: perform word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word.
For step S104:
This embodiment may use any existing word segmentation method to segment the target text and thereby identify the characters in the target text that cannot form a word.
As an example, a character that cannot form a word may be any of the following: a Chinese character, a component of a Chinese character, or a radical of a Chinese character. Such characters are highly likely to be part of a split-character expression and are the key objects of the subsequent split-character recognition.
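Step S104 can be sketched as follows. The patent allows any existing segmentation method; this sketch swaps in greedy forward maximum matching purely for illustration, and the vocabulary `VOCAB`, the function names, and the sample string are all assumptions rather than anything specified in the source.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation (a stand-in for any real segmenter).

    Characters that start no vocabulary word are emitted as
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def leftover_characters(tokens, vocab):
    """Return (index, token) pairs for single characters that are not words."""
    return [(i, t) for i, t in enumerate(tokens)
            if len(t) == 1 and t not in vocab]


# Hypothetical vocabulary and input, chosen only to illustrate the idea.
VOCAB = {"需要", "我", "手机号"}
tokens = forward_max_match("需要亻昔款力口我手机号", VOCAB)
leftovers = leftover_characters(tokens, VOCAB)
print(tokens)
print([t for _, t in leftovers])
```

Only the leftover characters are passed on to split-character matching, which keeps the number of match attempts small.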
Step S106: based on a split-character sample set, match the characters that cannot form a word in the segmented text to obtain at least one matched segmented text.
For step S106:
The split-character sample set contains preset split-character expressions. Some entries target whole words: for example, "花口贝" corresponds to "花呗" (Huabei), "借口贝" corresponds to "借呗" (Jiebei), "亻昔款" corresponds to "借款" (loan), and "亻昔钱" corresponds to "借钱" (borrow money). Other entries target a single Chinese character: for example, "亻昔" corresponds to "借" (borrow) and "口贝" corresponds to "呗".
In this step, the split-character sample set is used to match the characters that cannot form a word in the segmented text, restoring them to normally expressed information.
Specifically, characters that cannot form a word and are adjacent in the row direction of the segmented text may be matched.
For example, suppose the segmented text is "六合 采 彡 月 贝 兼 百万", and the split-character sample set records that "采彡" corresponds to "彩" and "贝兼" corresponds to "赚". The characters "采", "彡", "月", "贝", and "兼" are identified as characters that cannot form a word in the segmented text, so the adjacent characters among them are matched against the split-character sample set, and the matched segmented text is "六合彩月赚百万" (roughly, "Mark Six lottery, earn a million a month").
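The row-direction matching in this example can be sketched as a greedy left-to-right merge. This is a minimal illustration, not the patent's implementation: `SPLIT_TABLE` and `merge_adjacent` are hypothetical names, the table holds only the two entries from the example, and for simplicity every single-character token is treated as a merge candidate.

```python
# Hypothetical split-character table: joined components -> restored character.
SPLIT_TABLE = {"采彡": "彩", "贝兼": "赚"}


def merge_adjacent(tokens, table, max_parts=3):
    """Merge runs of adjacent single-character tokens found in the table.

    At each position the longest matching run (up to max_parts tokens) wins;
    tokens that match nothing are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(tokens):
        merged = False
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            window = tokens[i:i + n]
            key = "".join(window)
            if all(len(t) == 1 for t in window) and key in table:
                out.append(table[key])
                i += n
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out


segmented = ["六合", "采", "彡", "月", "贝", "兼", "百万"]
matched = merge_adjacent(segmented, SPLIT_TABLE)
print("".join(matched))
```

A greedy merge returns a single candidate. When table entries overlap, several candidates are possible and each would need to be scored separately.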
Similarly, characters that cannot form a word and are adjacent in the column direction of the segmented text may also be matched.
For example, suppose the segmented text spans two lines: "add cell-phone number xx, low 自 cash-out available
心", where "心" sits directly below "自".
Based on the split-character sample set, the column-adjacent characters "自" and "心" are then matched to "息", and the matched segmented text is "add cell-phone number xx, low-interest (低息) cash-out available".
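Column-direction matching can be sketched by treating the text as a grid of characters and checking vertically adjacent pairs. This assumes that vertical adjacency means the same character index on consecutive lines, which is a simplification; `merge_vertical`, the table entry, and the sample lines are illustrative assumptions.

```python
def merge_vertical(lines, table):
    """Merge vertically adjacent characters whose pair appears in the table.

    The upper character is replaced by the restored character and the lower
    one is blanked; lines left empty afterwards are dropped.
    """
    grid = [list(line) for line in lines]
    for r in range(len(grid) - 1):
        for c in range(min(len(grid[r]), len(grid[r + 1]))):
            key = grid[r][c] + grid[r + 1][c]
            if key in table:
                grid[r][c] = table[key]
                grid[r + 1][c] = " "
    rows = ["".join(row).rstrip() for row in grid]
    return [row for row in rows if row.strip()]


# Hypothetical two-line message: "心" sits directly under "自".
lines = ["可低自套现", "  心"]
restored_lines = merge_vertical(lines, {"自心": "息"})
print(restored_lines)
```

Real messages may mix full-width and half-width characters, so a production version would likely need display-width-aware column alignment rather than plain index comparison.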
Step S108: input the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text.
For step S108:
It should be understood that a matched segmented text determined from the split-character sample set is not necessarily the correct restored text, so a preset language model is used to evaluate its confidence. The confidence of a matched segmented text reflects how accurately that text restores the original.
It should be understood that the preset language model can be set flexibly according to the actual application scenario, and the embodiments of the present application place no specific limit on it.
As an example, suppose the scheme is used to restore spam written with split characters on a network. The preset language model can then be trained on a spam sample set. After the at least one matched segmented text is input, the model scores the confidence of each matched segmented text against the evaluation criteria for spam. The higher the confidence score, the more likely the text is spam and the higher the corresponding restoration accuracy.
Alternatively, the preset language model may use the expression patterns of correct sentences as its evaluation criterion when scoring. For example, it may score each matched segmented text against the correct subject-predicate-object sentence structure. The higher the confidence score, the higher the corresponding restoration accuracy.
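As one possible stand-in for a language model that rewards well-formed sentences, the sketch below uses an add-one-smoothed bigram model and reports the average log-probability per transition as the confidence. The patent does not prescribe this model; the class name, the tiny training corpus, and the two candidates are assumptions made for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one-smoothed bigram model over word tokens (illustrative only)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            padded = ["<s>"] + sentence + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def confidence(self, tokens):
        """Average log-probability per bigram; higher means more fluent."""
        padded = ["<s>"] + tokens + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(padded) - 1)


# Hypothetical training sentences, already segmented into words.
corpus = [
    ["需要", "借款", "加", "我", "手机号"],
    ["借款", "加", "我", "微信"],
]
model = BigramModel(corpus)
fluent = ["需要", "借款", "加", "我", "手机号"]
garbled = ["需要", "借款", "力", "哦", "手机号"]
print(model.confidence(fluent) > model.confidence(garbled))
```

Averaging over transitions keeps scores comparable between candidates of different lengths, which matters because merging split characters changes the token count.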
Because the preset language model can be implemented in more than one way, further examples are not given here.
Step S110: based on the confidence of each matched segmented text, select the restored text of the target text from the at least one matched segmented text.
For step S110:
In this step, the matched segmented text with the highest confidence may be selected as the restored text of the target text.
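Selecting the restored text amounts to taking the candidate with the maximum confidence. In the sketch below, `select_restored` is a hypothetical helper, and the two confidence values are placeholders invented for the example rather than outputs of any real model.

```python
def select_restored(candidates, confidence):
    """Return (best_candidate, best_score) under the given scoring function."""
    if not candidates:
        raise ValueError("at least one matched segmented text is required")
    scored = [(confidence(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Placeholder confidences standing in for a real language model's output.
placeholder_scores = {
    "need to borrow money, add my cell-phone number": 0.92,
    "need to borrow money, power cell-phone number": 0.31,
}
best, score = select_restored(list(placeholder_scores), placeholder_scores.get)
print(best, score)
```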
In the embodiments of the present application, word segmentation is first performed on the target text to identify the characters that cannot form a word. These characters are treated as the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. A preset language model then evaluates the confidence of each matched segmented text, and the best one is selected by confidence as the restored text of the target text. The scheme can effectively restore variant text written with split characters to normal text, which improves a network platform's ability to recognize spam.
The flow of the text restoration method in a practical application is described in detail below.
The main flow of the text restoration method of the embodiments of the present application includes:
Step 1: obtain the target text.
In this step, a target text sent by a user may be obtained from a social networking platform (such as communication software or online shopping software).
As an example, suppose the content of the target text is "需要亻昔款，力口我手机号" (literally, "need 亻昔 money, 力口 my cell-phone number"). This target text is clearly spam written with split characters.
Step 2: determine the segmented text.
In this step, word segmentation is performed on "需要亻昔款，力口我手机号". For ease of understanding, the segments are separated by spaces, giving the segmented text "需要 亻 昔 款 ， 力 口 我 手机号".
It should be understood that "需要" (need), "我" (me), and "手机号" (cell-phone number) are identified as words in the target text, whereas "亻", "昔", "款", "力", and "口" are characters that cannot form a word.
Step 3: split-character matching.
In this step, split-character matching is performed on the segmented text using a split-character table resource: "亻昔" can be matched to "借" (borrow), "力口" can be matched to "加" (add), and "口我" can be matched to "哦". The split-character table resource therefore yields the following two matched segmented texts:
The first is "需要借款，加我手机号" ("need to borrow money, add my cell-phone number");
The second is "需要借款，力哦手机号" (glossed as "need to borrow money, power cell-phone number").
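Because "力口" and "口我" overlap on "口", a single greedy pass cannot produce both candidates. One way to obtain them, sketched below under stated assumptions, is to enumerate every combination of non-overlapping merges and keep only the results in which no table entry can still be applied. The table entries (including "口我" mapping to "哦"), the function names, and the filtering rule are illustrative choices, not details given in the source.

```python
def expand(tokens, table, max_parts=2):
    """Yield every token list reachable by merging adjacent single characters."""
    if not tokens:
        yield []
        return
    # Option 1: leave the first token untouched.
    for rest in expand(tokens[1:], table, max_parts):
        yield [tokens[0]] + rest
    # Option 2: merge the first n single-character tokens if the table has them.
    for n in range(2, min(max_parts, len(tokens)) + 1):
        window = tokens[:n]
        key = "".join(window)
        if all(len(t) == 1 for t in window) and key in table:
            for rest in expand(tokens[n:], table, max_parts):
                yield [table[key]] + rest


def has_pending_merge(tokens, table, max_parts=2):
    """True if some run of adjacent single characters could still be merged."""
    return any(
        "".join(tokens[i:i + n]) in table
        and all(len(t) == 1 for t in tokens[i:i + n])
        for i in range(len(tokens))
        for n in range(2, max_parts + 1)
    )


def candidates(tokens, table):
    """Keep only fully merged results, mirroring the two candidates in the text."""
    return [c for c in expand(tokens, table) if not has_pending_merge(c, table)]


# Hypothetical table and segmented input (punctuation omitted for brevity).
table = {"亻昔": "借", "力口": "加", "口我": "哦"}
segmented = ["需要", "亻", "昔", "款", "力", "口", "我", "手机号"]
result = candidates(segmented, table)
for candidate in result:
    print("".join(candidate))
```

The enumeration is exponential in the worst case, but it only branches at characters that cannot form a word, which are typically few per message.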
Step 4: confidence evaluation.
In this step, the two matched segmented texts from step 3 are input into the preset language model to compute the confidence P1 of "need to borrow money, add my cell-phone number" (需要借款，加我手机号) and the confidence P2 of "need to borrow money, power cell-phone number" (需要借款，力哦手机号).
The preset language model may be a classification model trained on spam samples of illegal lending.
For example, features common to illegal lending messages may be used as the input vector of the preset language model, and the model is trained on spam samples so that the weights applied to the input vector are continually optimized.
After the two matched segmented texts are input into the trained preset language model, the former clearly carries a feature common to illegal lending messages, "add my cell-phone number", so the classification model gives it the higher confidence.
It should be noted that the embodiments of the present application place no specific limit on the function used by the preset language model. Any function used for classification may be applied to the preset language model of the embodiments of the present application.
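A classification-style confidence of this kind can be sketched with a tiny logistic regression over binary keyword features, trained by stochastic gradient ascent. Everything here is an assumption made for illustration: the feature list, the four training samples, the hyperparameters, and the function names. The patent only states that common features form the input vector and that training optimizes their weights.

```python
import math

# Hypothetical features that are common in illegal lending spam.
FEATURES = ["借款", "加我", "手机号"]


def featurize(text):
    """Binary input vector: 1.0 if the feature substring occurs in the text."""
    return [1.0 if f in text else 0.0 for f in FEATURES]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=200, lr=0.5):
    """Fit logistic-regression weights by stochastic gradient ascent."""
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = label - p
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


def confidence(text, weights, bias):
    x = featurize(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented training data: label 1 = lending spam, label 0 = normal message.
samples = [
    ("需要借款加我手机号", 1),
    ("低息借款加我微信", 1),
    ("今天天气很好", 0),
    ("我的手机号换了", 0),
]
weights, bias = train(samples)
p1 = confidence("需要借款加我手机号", weights, bias)
p2 = confidence("需要借款力哦手机号", weights, bias)
print(p1 > p2)
```

The first candidate scores higher because it contains the "加我" feature, whose weight can only increase during training on these samples.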
Step 5: probability comparison.
In this step, the confidence of the first matched segmented text is compared with the confidence of the second matched segmented text (P1 > P2). The candidate with the larger confidence is more likely to be the correct restored text.
Step 6: output the restored text.
In this step, based on the comparison result of step 5 (P1 > P2), the restored text finally output is "need to borrow money, add my cell-phone number".
In summary, the text restoration method of the embodiments of the present application can identify the characters in the target text that are written with split characters and match and restore them. In a specific implementation, word segmentation is first performed on the target text so that only the characters that cannot form a word are treated as objects of split-character matching, which effectively reduces the number of match attempts and improves matching accuracy. A language model is then used to select the best matched segmented text as the restored text of the target text. The whole scheme is computationally simple and occupies relatively few processing resources, so it is particularly suitable for network platforms that need to recognize spam written with split characters.
Fig. 3 is a schematic structural diagram of an electronic device according to one embodiment of the present application. Referring to Fig. 3, at the hardware level the electronic device includes a processor and optionally also an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one magnetic disk storage. The electronic device may of course also include hardware required by other services.
The processor, the network interface, and the memory may be connected to one another by the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, Fig. 3 shows only one bidirectional arrow, but this does not mean that there is only one bus or only one type of bus.
The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming the text restoration apparatus at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
matching, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
selecting, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
The text restoration method disclosed in the embodiment shown in Fig. 1 may be applied to the processor or implemented by the processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, each step of the method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
The electronic device may also perform the method shown in Fig. 1 and implement the functions of the text restoration apparatus in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.
Of course, in addition to a software implementation, the electronic device of the present application does not exclude other implementations, such as logic devices or a combination of software and hardware. That is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
The embodiments of the present application also provide a computer-readable storage medium storing one or more programs. The one or more programs include instructions which, when executed by a portable electronic device that includes multiple application programs, cause the portable electronic device to perform the method of the embodiment shown in Fig. 1, and specifically to perform the following:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
matching, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
selecting, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
It should be understood that the computer-readable storage medium of the present application, when its programs are executed by a processor, can implement the functions of the text restoration apparatus in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.
Fig. 4 is a schematic structural diagram of a text restoration apparatus 400 according to one embodiment of the present application, including:
an obtaining module 410, configured to obtain a target text;
a segmentation module 420, configured to perform word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
a matching module 430, configured to match, based on a split-character sample set, the characters that cannot form a word in the segmented text, to obtain at least one matched segmented text;
an evaluation module 440, configured to input the at least one matched segmented text into a preset language model to obtain a confidence for each matched segmented text; and
a selection module 450, configured to select, based on the confidence of each matched segmented text, the restored text of the target text from the at least one matched segmented text.
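The five modules can be wired together roughly as below. This is a structural sketch only: the class name, the callable signatures, and the stub implementations passed in at the end are assumptions used to show the data flow between modules 410 to 450, not the apparatus as claimed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextRestorationApparatus:
    obtain: Callable[[], str]                      # obtaining module 410
    segment: Callable[[str], List[str]]            # segmentation module 420
    match: Callable[[List[str]], List[List[str]]]  # matching module 430
    evaluate: Callable[[List[str]], float]         # evaluation module 440

    def choose(self, candidates: List[List[str]]) -> List[str]:
        """Selection module 450: keep the candidate with the highest confidence."""
        return max(candidates, key=self.evaluate)

    def run(self) -> str:
        tokens = self.segment(self.obtain())
        candidates = self.match(tokens)
        return "".join(self.choose(candidates))


# Stub modules, invented purely to exercise the wiring.
apparatus = TextRestorationApparatus(
    obtain=lambda: "需要亻昔款",
    segment=lambda text: ["需要", "亻", "昔", "款"],
    match=lambda tokens: [tokens, ["需要", "借", "款"]],
    evaluate=lambda tokens: -float(len(tokens)),
)
restored = apparatus.run()
print(restored)
```

Passing each module in as a callable lets a different segmenter, split-character table, or language model be swapped in without touching the control flow, in line with the patent's statement that these components are not limited to one implementation.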
The embodiment of the present application carries out word segmentation processing to target text first, determines the character that can not form participle, these The character that participle can not be formed carries out matching reduction as the matched object that divides by means of characters, and segments text after obtaining at least one matching. Later, the assessment of confidence level is carried out to participle text after at least one matching by preset language model, and is selected based on confidence level It is excellent to filter out after optimal matching participle text as target text and go back original text.The scheme of the embodiment of the present application can be effective The variation text of dividing by means of characters expression is reduced into normal text, the network platform can be improved to the recognition capability of junk information.
Optionally, as one embodiment, the matching module 430 is specifically configured to:
Based on the split-character sample resource, match the characters in the segmented text that cannot form a word segment and are adjacent in the row direction.
Optionally, as one embodiment, the matching module 430 is specifically configured to:
Based on the split-character sample resource, match the characters in the segmented text that cannot form a word segment and are adjacent in the column direction.
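As an illustration of the row-direction and column-direction matching described above, the following Python sketch assumes the text is held as a list of lines, that "row direction" means neighbouring positions on one line, and that "column direction" means the same position on consecutive lines. This reading, the function name `adjacent_matches`, and the sample tables are assumptions made for illustration and are not taken from the application.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]
Hit = Tuple[Tuple[int, int], str]

def adjacent_matches(lines: List[str], unsegmentable: Set[str],
                     split_table: Dict[Pair, str], direction: str) -> List[Hit]:
    """Find adjacent unsegmentable character pairs listed in the split table.

    Returns ((row, col), restored_char) for the first character of each pair.
    For brevity, unsegmentable characters are identified by value, not by
    position in the segmented text.
    """
    d_row, d_col = (0, 1) if direction == "row" else (1, 0)
    hits = []
    for r, line in enumerate(lines):
        for c, ch in enumerate(line):
            r2, c2 = r + d_row, c + d_col
            # Skip positions whose neighbour falls outside the text.
            if r2 >= len(lines) or c2 >= len(lines[r2]):
                continue
            pair = (ch, lines[r2][c2])
            if set(pair) <= unsegmentable and pair in split_table:
                hits.append(((r, c), split_table[pair]))
    return hits

# Hypothetical sample tables: "力"+"口" side by side for "加",
# "山" above "石" for "岩".
ROW_TABLE = {("力", "口"): "加"}
COLUMN_TABLE = {("山", "石"): "岩"}

row_hits = adjacent_matches(["力口微信"], {"力", "口"}, ROW_TABLE, "row")
column_hits = adjacent_matches(["山", "石"], {"山", "石"}, COLUMN_TABLE, "column")
```

Keeping separate tables per direction is one possible design; a single table tagged with the layout of each split would serve equally well.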
Optionally, as one embodiment, the selection module 450 is specifically configured to:
Select, from the at least one matched segmented text, the one with the highest confidence as the restored text of the target text.
Optionally, as one embodiment, the characters in the segmented text that cannot form a word segment include any one of: a Chinese character, a side component (pianpang) of a Chinese character, and a radical (bushou) of a Chinese character.
Optionally, as one embodiment, the preset language model is obtained by training on a spam (junk information) sample set.
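The confidence evaluation and highest-confidence selection described above could, for example, be realized with a small statistical language model. The sketch below uses a bigram model with add-one smoothing trained on a toy spam sample set; the application does not state which kind of language model is used, so the bigram model, the class name `BigramModel`, and the sample data are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List

class BigramModel:
    """Bigram language model with add-one smoothing (an assumed model type)."""

    def __init__(self, samples: Iterable[List[str]]):
        self.unigrams = Counter()  # counts of tokens used as left context
        self.bigrams = Counter()   # counts of adjacent token pairs
        for tokens in samples:
            padded = ["<s>"] + list(tokens) + ["</s>"]
            self.unigrams.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(set(self.unigrams) | {"</s>"})

    def confidence(self, tokens: List[str]) -> float:
        """Average log-probability per transition; higher means more fluent."""
        padded = ["<s>"] + list(tokens) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(padded) - 1)

# Hypothetical spam sample set, already segmented into words.
SPAM_SAMPLES = [["加", "微信", "领", "红包"], ["加", "微信", "咨询"]]
model = BigramModel(SPAM_SAMPLES)

# Candidate matched segmented texts: unmerged versus merged split characters.
candidates = [["力", "口", "微信"], ["加", "微信"]]
best = max(candidates, key=model.confidence)
```

Because the merged candidate reuses word sequences seen in the spam samples, it receives the higher average log-probability and is selected. Averaging over transitions keeps candidates of different lengths comparable.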
Optionally, as one embodiment, the acquisition module 410 is specifically configured to:
Obtain, from an online social platform, the target text sent by a user.
It should be understood that the text restoration apparatus of this embodiment can perform the method of Fig. 1 and implement the functions of that method in the embodiments shown in Fig. 1 and Fig. 2, which are not repeated here.
Those skilled in the art will understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above are merely embodiments of this specification and are not intended to limit it. For those skilled in the art, this specification may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within the scope of its claims.

Claims (10)

1. A text restoration method, comprising:
Obtaining a target text;
Performing word segmentation on the target text to obtain the segmented text of the target text, the segmented text containing characters that cannot form a word segment;
Matching, based on a split-character sample set, the characters in the segmented text that cannot form a word segment, to obtain at least one matched segmented text;
Inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
Selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
2. The text restoration method according to claim 1, wherein
Matching, based on the split-character sample resource, the characters in the segmented text that cannot form a word segment comprises:
Matching, based on the split-character sample resource, the characters in the segmented text that cannot form a word segment and are adjacent in the row direction.
3. The text restoration method according to claim 1, wherein
Matching, based on the split-character sample resource, the characters in the segmented text that cannot form a word segment comprises:
Matching, based on the split-character sample resource, the characters in the segmented text that cannot form a word segment and are adjacent in the column direction.
4. The text restoration method according to claim 1, wherein
Selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text comprises:
Selecting, from the at least one matched segmented text, the one with the highest confidence as the restored text of the target text.
5. The text restoration method according to claim 1, wherein
The characters in the segmented text that cannot form a word segment include any one of: a side component (pianpang) of a Chinese character, a radical (bushou) of a Chinese character, and a Chinese character.
6. The text restoration method according to claim 1, wherein
The preset language model is obtained by training on a spam (junk information) sample set.
7. The text restoration method according to claim 1, wherein
Obtaining a target text comprises:
Obtaining, from an online social platform, the target text sent by a user.
8. A text restoration apparatus, comprising:
An acquisition module, which obtains a target text;
A word segmentation module, which performs word segmentation on the target text to obtain the segmented text of the target text, the segmented text containing characters that cannot form a word segment;
A matching module, which, based on a split-character sample set, matches the characters in the segmented text that cannot form a word segment, to obtain at least one matched segmented text;
An evaluation module, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
A selection module, which, based on the confidence of the at least one matched segmented text, selects the restored text of the target text from the at least one matched segmented text.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs:
Obtaining a target text;
Performing word segmentation on the target text to obtain the segmented text of the target text, the segmented text containing characters that cannot form a word segment;
Matching, based on a split-character sample set, the characters in the segmented text that cannot form a word segment, to obtain at least one matched segmented text;
Inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
Selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
Obtaining a target text;
Performing word segmentation on the target text to obtain the segmented text of the target text, the segmented text containing characters that cannot form a word segment;
Matching, based on a split-character sample set, the characters in the segmented text that cannot form a word segment, to obtain at least one matched segmented text;
Inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
Selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
CN201811248320.3A 2018-10-25 2018-10-25 A kind of text restoring method, device and electronic equipment Pending CN109597987A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201811248320.3A CN109597987A (en) 2018-10-25 2018-10-25 A kind of text restoring method, device and electronic equipment
TW108127355A TWI749349B (en) 2018-10-25 2019-08-01 Text restoration method, device, electronic equipment and computer readable storage medium
PCT/CN2019/103103 WO2020082890A1 (en) 2018-10-25 2019-08-28 Text restoration method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811248320.3A CN109597987A (en) 2018-10-25 2018-10-25 A kind of text restoring method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN109597987A true CN109597987A (en) 2019-04-09

Family

ID=65957463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811248320.3A Pending CN109597987A (en) 2018-10-25 2018-10-25 A kind of text restoring method, device and electronic equipment

Country Status (3)

Country Link
CN (1) CN109597987A (en)
TW (1) TWI749349B (en)
WO (1) WO2020082890A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020082890A1 (en) * 2018-10-25 2020-04-30 阿里巴巴集团控股有限公司 Text restoration method and apparatus, and electronic device
WO2024007827A1 (en) * 2022-07-07 2024-01-11 马上消费金融股份有限公司 Word segmentation method and apparatus for text, and computer device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040409B (en) * 2021-11-11 2023-06-06 中国联合网络通信集团有限公司 Short message identification method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140013221A1 (en) * 2010-12-24 2014-01-09 Peking University Founder Group Co., Ltd. Method and device for filtering harmful information
CN105574090A (en) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 Sensitive word filtering method and system
CN106156017A (en) * 2015-03-23 2016-11-23 北大方正集团有限公司 Information identifying method and information identification system
CN106874253A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 Recognize the method and device of sensitive information
CN107239447A (en) * 2017-06-05 2017-10-10 厦门美柚信息科技有限公司 Junk information recognition methods and device, system

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167367A (en) * 1997-08-09 2000-12-26 National Tsing Hua University Method and device for automatic error detection and correction for computerized text files
US7257564B2 (en) * 2003-10-03 2007-08-14 Tumbleweed Communications Corp. Dynamic message filtering
US8396927B2 (en) * 2004-12-21 2013-03-12 Alcatel Lucent Detection of unwanted messages (spam)
CN101876968A (en) * 2010-05-06 2010-11-03 复旦大学 Method for carrying out harmful content recognition on network text and short message service
CN102231873A (en) * 2011-06-22 2011-11-02 中兴通讯股份有限公司 Method and system for monitoring garbage message and monitor processing apparatus
CN102999533A (en) * 2011-09-19 2013-03-27 腾讯科技(深圳)有限公司 Textspeak identification method and system
CN103874033B (en) * 2012-12-12 2017-11-24 上海粱江通信系统股份有限公司 A kind of method that irregular refuse messages are identified based on Chinese word segmentation
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN107357778B (en) * 2017-06-22 2020-10-30 达而观信息科技(上海)有限公司 Method and system for identifying and verifying deformed words
CN109597987A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 A kind of text restoring method, device and electronic equipment

Also Published As

Publication number Publication date
TW202016765A (en) 2020-05-01
TWI749349B (en) 2021-12-11
WO2020082890A1 (en) 2020-04-30

Similar Documents

Publication Publication Date Title
CN106709345A (en) Deep learning method-based method and system for deducing malicious code rules and equipment
CN103605691B (en) Device and method used for processing issued contents in social network
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN107657056A (en) Method and apparatus based on artificial intelligence displaying comment information
CN106874253A (en) Recognize the method and device of sensitive information
CN110334357A (en) A kind of method, apparatus, storage medium and electronic equipment for naming Entity recognition
CN109597987A (en) A kind of text restoring method, device and electronic equipment
CN106844413A (en) The method and device of entity relation extraction
CN102567534B (en) Interactive product user generated content intercepting system and intercepting method for the same
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN106951571A (en) A kind of method and apparatus for giving application mark label
CN109800292A (en) The determination method, device and equipment of question and answer matching degree
CN110427628A (en) Web assets classes detection method and device based on neural network algorithm
CN107491536A (en) A kind of examination question method of calibration, examination question calibration equipment and electronic equipment
CN110110332A (en) Text snippet generation method and equipment
CN108920446A (en) A kind of processing method of Engineering document
CN109359198A (en) A kind of file classification method and device
CN105528618A (en) Short image text identification method and device based on social network
CN112328657A (en) Feature derivation method, feature derivation device, computer equipment and medium
CN106844671A (en) medical literature intelligent processing method and system
CN110347841A (en) A kind of method, apparatus, storage medium and the electronic equipment of document content classification
CN106383857A (en) Information processing method and electronic equipment
CN113887202A (en) Text error correction method and device, computer equipment and storage medium
CN108090044A (en) The recognition methods of contact method and device
Zhou et al. Virtual data augmentation: A robust and general framework for fine-tuning pre-trained models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190409