CN109299233A - Text data processing method, device, computer equipment and storage medium - Google Patents
- Publication number: CN109299233A
- Application number: CN201811093274.4A
- Authority: CN (China)
- Prior art keywords: text data, data, cleaning, target, obtains
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention discloses a text data processing method, device, computer equipment and storage medium, applied in the big data field and in particular to big data acquisition and processing. The method comprises: obtaining a data cleaning request that includes a channel identifier and a cleaning time; determining, based on the channel identifier, the target corpus corresponding to that channel identifier, where the target corpus includes at least one piece of original text data and each piece of original text data carries a time identifier; querying a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval; determining the original text data whose time identifier falls within the target time interval as the text data to be cleaned; querying a rule database based on the channel identifier to obtain the target cleaning rule corresponding to the channel identifier; and cleaning the text data to be cleaned with the target cleaning rule to obtain target plain text data. The method can effectively improve the efficiency and quality of text data cleaning.
Description
Technical field
The present invention relates to the technical field of big data processing, and in particular to a text data processing method, device, computer equipment and storage medium.
Background
In technical fields such as speech recognition and OCR text recognition, a large amount of text data from a specific domain must be collected to train a language model dedicated to that domain, so as to guarantee the recognition accuracy of the trained language model in that domain. At present, text data for language model training is mainly collected and cleaned manually, a process that is time-consuming, inefficient and error-prone. Moreover, training a Chinese language model requires pure Chinese text data, so manual collection and cleaning must also remove all non-Chinese data from the text. This is likewise time-consuming and inefficient, and its accuracy cannot be guaranteed.
Summary of the invention
The embodiments of the present invention provide a text data processing method, device, computer equipment and storage medium, to solve the problems of low efficiency and high error rate in manually collecting and cleaning text data.
A text data processing method, comprising:
obtaining a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
determining, based on the channel identifier, the target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
querying a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval;
determining the original text data whose time identifier falls within the target time interval as the text data to be cleaned;
querying a rule database based on the channel identifier to obtain the target cleaning rule corresponding to the channel identifier;
cleaning the text data to be cleaned with the target cleaning rule to obtain target plain text data.
A text data processing device, comprising:
a data cleaning request obtaining module, for obtaining a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
an original text data obtaining module, for determining, based on the channel identifier, the target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
a target time interval obtaining module, for querying a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval;
a to-be-cleaned text data obtaining module, for determining the original text data whose time identifier falls within the target time interval as the text data to be cleaned;
a target cleaning rule obtaining module, for querying a rule database based on the channel identifier to obtain the target cleaning rule corresponding to the channel identifier;
a target plain text data obtaining module, for cleaning the text data to be cleaned with the target cleaning rule to obtain target plain text data.
A computer equipment, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the above text data processing method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the above text data processing method when executed by a processor.
In the above text data processing method, device, computer equipment and storage medium, the corresponding target corpus is determined from the channel identifier in the data cleaning request, so that at least one piece of original text data carrying a time identifier is obtained, which improves the efficiency of acquiring original text data. The target time interval is determined from the channel identifier and cleaning time in the data cleaning request, and the text data to be cleaned is determined from that interval and the time identifier carried by each piece of original text data. This helps avoid repeatedly cleaning original text data in the target corpus that has already been cleaned, which improves the efficiency of text data cleaning. Cleaning the text data to be cleaned with the target cleaning rule corresponding to the channel identifier yields target plain text data quickly and without manual intervention, which effectively improves the efficiency and quality of text cleaning.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application environment of the text data processing method in one embodiment of the invention;
Fig. 2 is a flow chart of the text data processing method in one embodiment of the invention;
Fig. 3 is another flow chart of the text data processing method in one embodiment of the invention;
Fig. 4 is another flow chart of the text data processing method in one embodiment of the invention;
Fig. 5 is another flow chart of the text data processing method in one embodiment of the invention;
Fig. 6 is another flow chart of the text data processing method in one embodiment of the invention;
Fig. 7 is another flow chart of the text data processing method in one embodiment of the invention;
Fig. 8 is a schematic diagram of the text data processing device in one embodiment of the invention;
Fig. 9 is a schematic diagram of the computer equipment in one embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The text data processing method provided by the embodiments of the present invention can be applied in the application environment shown in Fig. 1. Specifically, the method is applied in a text data processing system that includes a client and a server, as shown in Fig. 1. The client communicates with the server over a network to clean text data automatically, so that target plain text data can be obtained quickly and in batches. This saves the time and labor cost of manual cleaning and improves cleaning efficiency and cleaning quality. The client, also called the user terminal, is a program that corresponds to the server and provides local services to the user. The client may be installed on, but is not limited to, personal computers, notebook computers, smartphones, tablet computers and portable wearable devices. The server may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, a text data processing method is provided. Taking the method as applied to the server in Fig. 1 as an example, it includes the following steps:
S201: obtain a data cleaning request; the data cleaning request includes a channel identifier and a cleaning time.
The data cleaning request is a request to clean text data automatically. Specifically, the user sends it through the client to the server of the text data processing system, and the server performs the corresponding text cleaning based on it. The channel identifier identifies the source channel of the text data to be cleaned. In this embodiment, a source channel can be understood as one of the category channels preset on a website, including but not limited to news, finance, entertainment, sports and education. The cleaning time is the cutoff time of the text data that the current data cleaning request covers. It may be the current system date at the moment the client triggers the data cleaning request, or a time that the user sets manually through the client.
In this embodiment, the data cleaning configuration interface of the client displays a channel identifier input box, a cleaning time input box and a confirm button. The user can type the channel identifier to be cleaned directly into the channel identifier input box, or select it from the drop-down list associated with that input box. The cleaning time input box shows the current system date by default and provides a manual selection button, so the user can either accept the default current system date as the cleaning time or click the manual selection button to choose a cleaning time. After choosing the channel identifier and the cleaning time, the user clicks the confirm button to trigger the data cleaning request, which the server then receives.
S202: determine, based on the channel identifier, the target corpus corresponding to the channel identifier; the target corpus includes at least one piece of original text data, and each piece of original text data carries a time identifier.
The target corpus is the corpus that stores the original text data corresponding to the channel identifier. In this embodiment, multiple corpora are stored in advance in the database of the text data processing system. Each corpus stores the original text data of one source channel, so each corpus is associated with one channel identifier. Specifically, the server queries the database with the channel identifier and takes the corpus associated with that identifier as the target corpus, which makes the determination simple and fast.
Original text data is the unprocessed text data stored in a corpus. Specifically, it can be text data associated with a channel identifier that a crawler tool has crawled from the various source channels of relevant websites. For example, web page content crawled by a crawler tool from the "sports" source channel of the Sina website is stored as a piece of original text data in the corpus corresponding to the sports channel identifier.
Each piece of original text data carries a time identifier, which may be the time at which the original text data was stored in the corpus corresponding to the channel identifier. Specifically, when each piece of original text data is stored in the corpus, the current system time is obtained through a system time function (such as time()) and attached to the original text data as its time identifier.
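The storage step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class names `OriginalText` and `Corpus`, the `store` method and the optional `now` parameter are hypothetical, and Python's `time.time()` stands in for the system time function mentioned in the text.

```python
import time
from dataclasses import dataclass, field


@dataclass
class OriginalText:
    content: str
    time_id: float  # storage time in seconds since the epoch (assumed unit)


@dataclass
class Corpus:
    channel_id: str                      # one corpus per source channel
    items: list = field(default_factory=list)

    def store(self, content, now=None):
        # Stamp the item with the current system time unless the caller
        # supplies an explicit time (useful for tests and back-filling).
        stamp = time.time() if now is None else now
        item = OriginalText(content, stamp)
        self.items.append(item)
        return item


sports = Corpus("sports")
sports.store("page content crawled from the sports channel")
```

Stamping at storage time, rather than at crawl or publication time, is what lets a later cleaning request select items purely by comparing time identifiers with cleaning times.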
S203: query the data cleaning record table according to the channel identifier and the cleaning time to determine the target time interval.
The data cleaning record table is a data table preconfigured on the server for recording the information in data cleaning requests. In the order in which data cleaning requests are received, the server records the channel identifier and cleaning time of every request in this table, so that it can determine whether the original text data in the corpus corresponding to each channel identifier has been cleaned.
In this embodiment, querying the data cleaning record table according to the channel identifier and the cleaning time to determine the target time interval specifically includes the following. The server queries the data cleaning record table with the channel identifier to determine the most recent cleaning time of the corresponding target corpus (that is, the cleaning time carried by the previous data cleaning request for that channel). It then determines the target time interval from that most recent cleaning time and the cleaning time carried by the current data cleaning request. The target time interval starts at the most recent cleaning time and ends at the cleaning time carried by the current request. It identifies the original text data in the target corpus that has not yet been cleaned, which helps avoid repeatedly cleaning data that has already been cleaned and thereby improves the efficiency of text data cleaning.
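A minimal sketch of this lookup, under stated assumptions: the record table is modelled as an in-memory list of `(channel_id, cleaning_time)` tuples, the function name `target_interval` is hypothetical, and a start of `None` is used to mean that the channel has never been cleaned (the text does not say how that case is handled).

```python
from datetime import datetime


def target_interval(records, channel_id, cleaning_time):
    """Return (start, end) for this request and record the request.

    records: list of (channel_id, cleaning_time) tuples in arrival order.
    """
    # Most recent cleaning time previously recorded for this channel.
    earlier = [t for ch, t in records if ch == channel_id]
    start = max(earlier) if earlier else None  # None: never cleaned (assumption)
    # Record the current request so the next one starts where this one ends.
    records.append((channel_id, cleaning_time))
    return start, cleaning_time


records = [("sports", datetime(2018, 9, 1)), ("news", datetime(2018, 9, 5))]
interval = target_interval(records, "sports", datetime(2018, 9, 10))
# interval == (datetime(2018, 9, 1), datetime(2018, 9, 10))
```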
S204: determine the original text data whose time identifier falls within the target time interval as the text data to be cleaned.
In this embodiment, each piece of original text data stored in the target corpus carries a time identifier recording when it was stored there, and the target time interval identifies the original text data in the target corpus that has not yet been cleaned. The server can therefore directly take all original text data whose time identifier falls within the target time interval as the text data to be cleaned, which helps improve cleaning efficiency. The text data to be cleaned is the text data stored in the target corpus that has not yet been cleaned. Because the target time interval starts at the most recent cleaning time and ends at the cleaning time carried by the current data cleaning request, original text data whose time identifier falls before the start of this interval has already been cleaned. Treating it as text data to be cleaned would cause repeated cleaning and reduce the efficiency of text data cleaning.
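The selection can be sketched as a simple filter. The function name `select_to_clean` is hypothetical, and treating the interval as half-open, `(start, end]`, is an assumption made here so that an item stamped exactly at the previous cleaning time is not cleaned twice; the text does not specify the boundary behaviour.

```python
def select_to_clean(items, start, end):
    """Pick the texts whose time identifier lies in (start, end].

    items: iterable of (time_id, text) pairs.
    start: previous cleaning time, or None if the corpus was never cleaned.
    end:   cleaning time carried by the current request.
    """
    return [
        text
        for time_id, text in items
        if (start is None or time_id > start) and time_id <= end
    ]


corpus = [(5, "old page"), (12, "new page A"), (18, "new page B"), (25, "future page")]
pending = select_to_clean(corpus, start=10, end=20)
# pending == ["new page A", "new page B"]
```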
S205: query the rule database based on the channel identifier to obtain the target cleaning rule corresponding to the channel identifier.
The rule database stores the different cleaning rules. Each cleaning rule is tailored to the content of a source channel and is therefore associated with one channel identifier, so that the rule database can later be queried with a channel identifier to obtain the corresponding target cleaning rule.
In this embodiment, the target cleaning rule corresponding to a channel identifier includes at least two feature cleaning rules and a feature cleaning order. A feature cleaning rule is a rule for cleaning one particular kind of feature in the text data to be cleaned for that channel. In this embodiment, the feature cleaning rules include but are not limited to a special tag cleaning rule, a digit cleaning rule, a punctuation cleaning rule and a foreign-language cleaning rule. The feature cleaning order defines the sequence in which the feature cleaning rules are applied to the text data to be cleaned, and can be understood as the priority among the feature cleaning rules. Fixing this order helps ensure cleaning quality. Because each feature cleaning rule targets one kind of feature, and a given piece of content in the text data to be cleaned may hit two or more feature cleaning rules, that content could be cleaned by any of them. If the order among the rules were not fixed, cleaning errors could result and cleaning quality would suffer.
S206: clean the text data to be cleaned with the target cleaning rule to obtain target plain text data.
The target plain text data is the pure text data obtained after the text data to be cleaned has been cleaned with the target cleaning rule. Specifically, the server calls the feature cleaning rules one by one, in the feature cleaning order given in the target cleaning rule, to clean the text data to be cleaned and obtain the target plain text data. The process needs no manual intervention and effectively improves the efficiency and quality of text cleaning.
Further, after the target plain text data is obtained, the text data processing method also includes storing the target plain text data in the training text database corresponding to the channel identifier. A training text database is a database that stores training text data. Each training text database corresponds to one channel identifier and stores only the target plain text data for that channel, so that a target Chinese language model for the channel can later be trained directly on the target plain text data in that database. This improves the accuracy with which the target Chinese language model recognizes text data to be recognized from the same channel. Text data to be recognized refers to data on which text recognition needs to be performed.
In the text data processing method provided by this embodiment, the corresponding target corpus is determined from the channel identifier in the data cleaning request, so that at least one piece of original text data carrying a time identifier is obtained, which improves the efficiency of acquiring original text data. The target time interval is determined from the channel identifier and cleaning time in the data cleaning request, and the text data to be cleaned is determined from that interval and the time identifier carried by each piece of original text data. This helps avoid repeatedly cleaning original text data in the target corpus that has already been cleaned, which improves the efficiency of text data cleaning. Cleaning the text data to be cleaned with the target cleaning rule corresponding to the channel identifier yields target plain text data quickly and without manual intervention, which effectively improves the efficiency and quality of text cleaning.
In one embodiment, the target cleaning rule corresponding to the channel identifier includes at least two feature cleaning rules and a feature cleaning order, and the feature cleaning rules include but are not limited to a special tag cleaning rule, a digit cleaning rule, a punctuation cleaning rule and a foreign-language cleaning rule. As shown in Fig. 3, cleaning the text data to be cleaned with the target cleaning rule to obtain target plain text data includes:
S301: perform tag cleaning on the text data to be cleaned with the special tag cleaning rule to obtain first text data.
The special tag cleaning rule is the rule for cleaning the special tags present in the text data to be cleaned. Special tags include but are not limited to hyperlink addresses, URLs and HTML tags. A special tag consists of a label and a character string, and the character string is composed of at least one of digits, letters and punctuation marks. Because of this composition, a special tag may also hit other feature cleaning rules. When the text data to be cleaned is cleaned, the special tag cleaning rule is therefore applied first to obtain the first text data, which ensures that the subsequent cleaning operations run smoothly and safeguards the efficiency and quality of data cleaning.
In this embodiment, the process by which the server performs tag cleaning on the text data to be cleaned with the special tag cleaning rule to obtain the first text data specifically includes: using the regular expression corresponding to the special tag cleaning rule to match the special tags that appear in the text data to be cleaned; if the label of a special tag is matched, the character string following the label is deleted first and the label itself is deleted afterwards, so that all special tags are removed from the text data to be cleaned. For example, for the URL "http://120.77.246.207/index.aspx?objid=F3BFA010-60E9-4F63-BDED-B22782EC0513&pagecode=RE", the character string following the recognized "http" label is deleted first and the "http" label is deleted afterwards. This avoids the errors that can easily arise when the character string has already been altered by another cleaning step before the label is cleaned.
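A minimal sketch of regex-based tag cleaning, with several assumptions: the two patterns below are illustrative and are not the patent's regular expressions, the function name `clean_special_tags` is hypothetical, and a URL is assumed to end at the first whitespace or Chinese character. Each pattern removes a label and its trailing string in a single match, which has the same net effect as deleting the string and then the label.

```python
import re

# An HTML tag: "<", anything except ">", then ">".
HTML_TAG_RE = re.compile(r"<[^>]+>")
# A URL: scheme label plus everything up to whitespace or a CJK ideograph.
URL_RE = re.compile(r"(?:https?|ftp)://[^\s\u4e00-\u9fff]+")


def clean_special_tags(text):
    # HTML tags go first: a URL inside an attribute (href="...") disappears
    # together with its tag, so the URL pattern cannot eat the closing ">".
    text = HTML_TAG_RE.sub("", text)
    return URL_RE.sub("", text)


raw = '详情见<a href="http://x.cn/a?id=1">这里</a>'
first_text = clean_special_tags(raw)
# first_text == "详情见这里"
```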
S302: perform digit cleaning on the first text data with the digit cleaning rule to obtain second text data.
The digit cleaning rule is the rule for cleaning the digits present in the text data to be cleaned. To meet the training needs of the subsequent Chinese language model, pure Chinese text data must be collected. The digit cleaning rule is therefore applied to the first text data to convert the digits that appear in it into Chinese characters.
In this embodiment, digit cleaning of the first text data with the digit cleaning rule can be done in two ways. The first converts a number into its Chinese numeral reading, for example 123 into "一百二十三" (one hundred and twenty-three). The second converts the number digit by digit, for example 123 into "一二三" (one two three). When the digit cleaning rule is applied to the first text data, the cleaning way is chosen according to the applicable premise of each number that appears in the first text data, which improves the efficiency and quality of digit cleaning.
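The two conversion ways can be sketched as below. The function names, the `by_value` switch and the 0 to 9999 range limit are assumptions made for a compact example; the text does not describe how the applicable premise of a number is detected, so the caller chooses the mode here, and long numbers or numbers with a leading zero fall back to digit-by-digit reading.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_digit_by_digit(number):
    # "123" -> "一二三"
    return "".join(DIGITS[int(d)] for d in number)


def read_as_value(number):
    # "123" -> "一百二十三"; supports 0..9999 only (assumed limit).
    n = int(number)
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n == 0:
        return DIGITS[0]
    digits = str(n)
    out, pending_zero = [], False
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            pending_zero = True  # emit one "零" only if a nonzero digit follows
            continue
        if pending_zero:
            out.append(DIGITS[0])
            pending_zero = False
        out.append(DIGITS[d] + UNITS[len(digits) - 1 - i])
    return "".join(out)


def clean_numbers(text, by_value=True):
    def convert(match):
        s = match.group()
        if by_value and len(s) <= 4 and not (len(s) > 1 and s[0] == "0"):
            return read_as_value(s)
        return read_digit_by_digit(s)

    return re.sub(r"[0-9]+", convert, text)


second_text = clean_numbers("共123人")
# second_text == "共一百二十三人"
```

This sketch writes 10 as "一十" rather than the colloquial "十"; a production rule would likely special-case that reading.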
Note that the digits cleaned in the first text data are those that remain after tag cleaning. Such digits may carry punctuation marks that have a specific meaning within the number. If the punctuation cleaning rule were applied directly, the digits left after symbol cleaning might no longer match the digits in the first text data, which would degrade cleaning quality. Digit cleaning is therefore performed on the first text data first, and step S303 then performs symbol cleaning, to guarantee the efficiency and quality of text data cleaning.
S303: perform symbol cleaning on the second text data with the punctuation cleaning rule to obtain third text data.
The punctuation cleaning rule is the rule for cleaning the punctuation marks present in the text data to be cleaned. In this embodiment, symbol cleaning of the second text data can be done in two ways: deletion and replacement. Specifically, the server stores a punctuation configuration table in advance. The table holds multiple configuration records, and each record includes a punctuation mark, a cleaning way and an applicable premise. The cleaning way is either deletion or replacement, and the applicable premise is the condition under which that cleaning way applies to the punctuation mark. For the replacement cleaning way, the configuration record also stores the corresponding substitute, which replaces the punctuation mark during symbol cleaning.
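One way to model such a configuration table is sketched below. All names are hypothetical, and each record's `pattern` field folds the punctuation mark and its applicable premise into one regular expression, which is a simplification of the three-field record described above. The colon records are illustrative: a colon between two Chinese numerals is treated as a ratio and replaced with "比" (an assumed substitute), and any other colon is deleted.

```python
import re
from dataclasses import dataclass

NUMERAL = "[零一二三四五六七八九十百千]"


@dataclass
class PunctRecord:
    pattern: str          # punctuation mark plus its applicable premise, as a regex
    action: str           # "delete" or "replace"
    substitute: str = ""  # only used when action == "replace"


# Records are tried in order, so the more specific premise comes first.
CONFIG = [
    PunctRecord(rf"(?<={NUMERAL})[:：](?={NUMERAL})", "replace", "比"),
    PunctRecord(r"[:：]", "delete"),
]


def clean_punctuation(text, config=CONFIG):
    for record in config:
        replacement = record.substitute if record.action == "replace" else ""
        text = re.sub(record.pattern, replacement, text)
    return text


third_text = clean_punctuation("比分为三:二")
# third_text == "比分为三比二"
```

The premise is written against Chinese numerals because, in the order described in this embodiment, digits have already been converted by the time symbol cleaning runs.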
In this embodiment, the process of performing symbol cleaning on the second text data with the punctuation cleaning rule to obtain the third text data specifically includes: using the regular expression corresponding to the punctuation cleaning rule to match each sentence in the second text data that contains a punctuation mark against the applicable premises configured for that punctuation mark; if a match succeeds, the punctuation mark is cleaned with the cleaning way that corresponds to the matched premise, yielding the third text data. The process needs no manual intervention, which safeguards cleaning efficiency, reduces the cleaning error rate and improves cleaning quality. For example, for the punctuation mark ":", if it is used under the applicable premise "a pause after an introductory phrase, or a prompt for what follows, or a summary of what precedes", the mark is deleted; if it is used under the applicable premise "in mathematical language, the ratio between two quantities", the mark is replaced, that is, ":" is converted into the Chinese word for "ratio".
S304: Perform foreign language cleaning on the third text data using the foreign language cleaning rule to obtain the target plain text data.
The foreign language cleaning rule is the rule for cleaning foreign-language content present in the text data to be cleaned. In the present embodiment, foreign languages include, but are not limited to, English, French and Japanese. The process in which the server performs foreign language cleaning on the third text data using the foreign language cleaning rule to obtain the target plain text data specifically includes: using the regular expression corresponding to the foreign language cleaning rule to match foreign-language content present in the third text data and, if such content is matched, deleting it, thereby achieving the purpose of foreign language cleaning.
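The symbol cleaning and foreign language cleaning steps described above can be sketched with regular expressions as follows. This is a minimal sketch and not the patented implementation: the rule table `COLON_RULES`, the substitute "比" for the ratio premise, the character ranges in `FOREIGN_PATTERN` and both function names are assumptions made for illustration.

```python
import re

# Hypothetical configuration records for the colon. Each record pairs an
# "applicable premise" (expressed as a regular expression) with a cleaning way.
COLON_RULES = [
    # Ratio premise: a colon between two digits is replaced by "比" (assumed substitute).
    (re.compile(r"(?<=\d)\s*[:：]\s*(?=\d)"), "比"),
    # Prompt/summary premise, used here as the fallback: any other colon is deleted.
    (re.compile(r"[:：]"), ""),
]

# Assumed character classes: basic and extended Latin letters (English, French)
# and hiragana/katakana (Japanese kana). Kanji are not matched because they
# share code points with Chinese characters.
FOREIGN_PATTERN = re.compile(r"[A-Za-z\u00C0-\u024F\u3040-\u30FF]+")


def clean_colons(text: str) -> str:
    """Apply the colon rules in order; earlier premises take priority."""
    for pattern, substitute in COLON_RULES:
        text = pattern.sub(substitute, text)
    return text


def clean_foreign(text: str) -> str:
    """Delete every run of characters matched as foreign-language content."""
    return FOREIGN_PATTERN.sub("", text)
```

With these assumed rules, `clean_colons("比分是3:2")` keeps the ratio meaning while `clean_colons("注意:明天开会")` simply drops the colon. A production rule set would need further premises, for example to keep clock times such as 12:30 from being read as ratios.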
In the text data processing method provided by the present embodiment, the special tag cleaning rule, the digital cleaning rule, the punctuation mark cleaning rule and the foreign language cleaning rule are applied in turn to perform label cleaning, digital cleaning, symbol cleaning and foreign language cleaning on the text data to be cleaned, so as to obtain the target plain text data. The text data to be cleaned is thus cleaned quickly and without manual intervention, which ensures both the efficiency and the quality of text data cleaning. In addition, performing the cleaning in the feature cleaning order specified by the target cleaning rule fixes the order in which the cleaning rules are applied. When content in the text data to be cleaned is subject to several cleaning rules at once, this avoids having to determine the cleaning order manually, which would increase labor cost and reduce efficiency, and it also avoids errors caused by cleaning in the wrong order.
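Why the feature cleaning order matters can be illustrated with a small ordered pipeline. The sketch below is illustrative only: `run_pipeline` and the three stand-in cleaners are hypothetical and far simpler than the cleaning rules of the embodiment, and the punctuation stand-in is assumed to turn commas into spaces.

```python
import re
from typing import Callable, Sequence

Cleaner = Callable[[str], str]


def strip_tags(text: str) -> str:
    """Stand-in for label cleaning: remove anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def join_thousands(text: str) -> str:
    """Stand-in for digital cleaning: drop commas used as thousands separators."""
    return re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", text)


def commas_to_space(text: str) -> str:
    """Stand-in for symbol cleaning: treat every remaining comma as a pause."""
    return re.sub(r"[,，]", " ", text)


def run_pipeline(text: str, cleaners: Sequence[Cleaner]) -> str:
    """Apply the cleaners strictly in the given feature cleaning order."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


raw = "<p>售价1,000元,包邮</p>"
correct = run_pipeline(raw, [strip_tags, join_thousands, commas_to_space])
wrong = run_pipeline(raw, [strip_tags, commas_to_space, join_thousands])
```

In the correct order the number survives intact, whereas running the punctuation stand-in first splits "1,000" into "1 000", which is the kind of error the fixed order is meant to prevent.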
In one embodiment, as shown in figure 4, after the step of performing label cleaning on the text data to be cleaned using the special tag cleaning rule to obtain the first text data, and before the step of performing digital cleaning on the first text data using the digital cleaning rule to obtain the second text data, the text data processing method further includes:
S401: Query the brand database based on the channel identifier to obtain the target brand data corresponding to the channel identifier.
The brand database is a database for storing the brand data corresponding to different source channels. The target brand data is the brand data stored in the brand database that corresponds to the channel identifier. Brand data is the brand data of the field to which a source channel belongs, and includes brand names. A brand name may be composed of digits, letters, symbols and Chinese characters; 361°, for example, is composed of digits and a symbol. In the sports channel, 361° is brand data corresponding to the channel identifier, whereas in the education channel it is not.
S402: Match the first text data against the target brand data.
In the present embodiment, the server matches the first text data against the target brand data using a fuzzy matching algorithm to determine whether the first text data contains the target brand data corresponding to the channel identifier, so that subsequent processing can branch according to the matching result. The fuzzy matching algorithm includes, but is not limited to, string matching algorithms such as the KMP (Knuth-Morris-Pratt) algorithm and the BM (Boyer-Moore) algorithm.
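As an illustration of the KMP algorithm named above, the following sketch finds every occurrence of a brand name in a text. It is a textbook formulation of KMP and not code from the patent; the function names are hypothetical.

```python
from typing import List


def build_failure_table(pattern: str) -> List[int]:
    """table[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text: str, pattern: str) -> List[int]:
    """Return the start index of every (possibly overlapping) match."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    hits: List[int] = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]
    return hits
```

A non-empty result for a brand name such as "361°" would correspond to a successful match in S402; an empty list would correspond to an unsuccessful one.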
S403: If the first text data is successfully matched with the target brand data, perform exclusion processing on the successfully matched first text data, and then perform digital cleaning on the exclusion-processed first text data using the digital cleaning rule to obtain the second text data.
In the present embodiment, a successful match between the first text data and the target brand data indicates that the first text data contains the target brand data. Because a brand name is a common term in its field, converting or deleting it may damage the integrity of the text data. The server therefore needs to perform exclusion processing on the first text data, so that the text content corresponding to the target brand data in the exclusion-processed first text data carries an exclusion tag, and text content carrying the exclusion tag is left unprocessed when other feature cleaning rules are subsequently applied to the text data.
Specifically, after exclusion processing has been performed on the successfully matched first text data, none of the subsequent feature cleaning rules cleans the text content carrying the exclusion tag in the first text data, but the text content that does not carry the exclusion tag still has to be cleaned with the other feature cleaning rules. The server therefore performs digital cleaning on the exclusion-processed first text data using the digital cleaning rule, converting the numbers that appear in the first text data into Chinese-character form so as to achieve the digital cleaning effect.
S404: If the first text data is not successfully matched with the target brand data, perform digital cleaning on the first text data using the digital cleaning rule to obtain the second text data.
In the present embodiment, an unsuccessful match between the first text data and the target brand data indicates that the first text data does not contain the target brand data. The digital cleaning rule is then applied directly to the first text data, converting the numbers that appear in the first text data into Chinese-character form so as to achieve the digital cleaning effect.
In the text data processing method provided by the present embodiment, the brand database is queried according to the channel identifier to determine whether there is target brand data that requires exclusion processing, which helps to preserve the integrity of the text data in subsequent cleaning. When the first text data is successfully matched with the target brand data, exclusion processing is performed on the successfully matched first text data, so that subsequent cleaning with other feature cleaning rules (such as the digital cleaning rule) does not also clean the text content corresponding to the target brand data and thereby damage the integrity of the first text data. This improves the accuracy of text data cleaning and avoids cleaning errors.
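One possible way to realize exclusion processing is to split the text on the brand names and run the cleaner only on the remaining segments. The sketch below assumes this approach; the embodiment only requires that brand text carry an exclusion tag, so `clean_with_exclusions` and the stand-in cleaner `digits_to_hanzi` are hypothetical.

```python
import re
from typing import Callable, Iterable

_DIGIT_TABLE = str.maketrans("0123456789", "零一二三四五六七八九")


def digits_to_hanzi(text: str) -> str:
    """Stand-in digital cleaner: replace each digit with its Chinese numeral."""
    return text.translate(_DIGIT_TABLE)


def clean_with_exclusions(
    text: str, brands: Iterable[str], cleaner: Callable[[str], str]
) -> str:
    """Apply `cleaner` to everything except occurrences of the brand names."""
    # Longer names first, so that a brand containing another brand wins.
    names = sorted((b for b in brands if b), key=len, reverse=True)
    if not names:
        return cleaner(text)
    splitter = re.compile("(" + "|".join(re.escape(n) for n in names) + ")")
    # With one capturing group, re.split places the matched brand names at
    # the odd indices; those segments play the role of excluded text.
    parts = splitter.split(text)
    return "".join(p if i % 2 else cleaner(p) for i, p in enumerate(parts))
```

For the sports channel, `clean_with_exclusions("361°推出3款新鞋", ["361°"], digits_to_hanzi)` leaves the brand name untouched while still converting the quantity "3".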
In one embodiment, as shown in figure 5, performing digital cleaning on the first text data using the digital cleaning rule to obtain the second text data includes:
S501: Extract a digit string from the first text data, and use a regular expression matching algorithm to judge whether the digit string is a thousands-separated number.
In the present embodiment, the first text data may be the text data obtained after label cleaning is performed on the text data to be cleaned using the special tag cleaning rule, or the text data obtained after exclusion processing is performed on first text data that was successfully matched with the target brand data. A string matching algorithm may be used to extract digit strings from the first text data, so that digital cleaning can subsequently be performed on them.
The regular expression matching algorithm is an algorithm that performs string matching based on regular expressions. A regular expression (often abbreviated in code as regex, regexp or RE) uses a single string to describe and match a set of strings that conform to a certain syntactic rule, and is typically used to search for or replace text that fits a given pattern. A thousands-separated number is a number in which a comma (the thousands separator) is inserted every three digits so that the value is easier to read, for example 1,000,000.
In the present embodiment, the server is preconfigured with a regular expression that can match the thousands separator in a digit string. When digital cleaning is performed on the first text data using the digital cleaning rule in the feature cleaning order of the target cleaning rule, this regular expression is first applied to the digit strings extracted from the first text data. If a digit string is found to contain a thousands separator, it is identified as a thousands-separated number; otherwise it is identified as not being a thousands-separated number. It should be understood that, because digital cleaning precedes symbol cleaning in the feature cleaning order, judging first whether a digit string is a thousands-separated number allows digital cleaning to proceed according to the result. This prevents the "," in a thousands-separated number from being mistaken for a comma and cleaned, as would happen if symbol cleaning with the punctuation mark cleaning rule were performed first, which would corrupt the number; cleaning quality is thereby guaranteed.
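The extraction and the thousands-separator test of S501 might look like the following sketch. The two patterns and the function names are assumptions; the patent does not disclose the actual regular expressions.

```python
import re
from typing import List

# Assumed pattern for a digit string: digits, optionally joined by "," or ".".
DIGIT_STRING = re.compile(r"[0-9]+(?:[,.][0-9]+)*")
# Assumed pattern for a thousands-separated number: 1-3 leading digits
# followed by one or more ",ddd" groups.
THOUSANDS = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")


def extract_digit_strings(text: str) -> List[str]:
    """Return every digit string found in the text, in order of appearance."""
    return DIGIT_STRING.findall(text)


def is_thousands_separated(digit_string: str) -> bool:
    """True only if the whole string is a well-formed thousands-separated number."""
    return THOUSANDS.fullmatch(digit_string) is not None
```

Using `fullmatch` rather than `search` means that a malformed string such as "12,34" is not accepted as a thousands-separated number.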
S502: If the digit string is a thousands-separated number, remove the thousands separators from it, and perform Chinese-numeral conversion on the number obtained after the thousands separators are removed, to obtain the second text data.
In the present embodiment, if the server determines that the digit string is a thousands-separated number, the "," in the digit string is a thousands separator. The server removes the thousands separators and performs Chinese-numeral conversion on the resulting number to obtain the second text data, thereby achieving the purpose of digital cleaning. For example, for the thousands-separated number 1,000,000, the number after the thousands separators are removed is 1000000; Chinese-numeral conversion is then performed on 1000000, which is converted into "一百万" (one million).
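Chinese-numeral conversion after removing the thousands separators could be sketched as below. The converter is an assumed, simplified implementation limited to non-negative integers below 10^12; the names `int_to_hanzi` and `clean_thousands_number` are hypothetical.

```python
DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]  # units inside a four-digit section
BIG_UNITS = ["", "万", "亿"]          # units between four-digit sections


def _section_to_hanzi(section: int) -> str:
    """Convert 1..9999, inserting a single 零 for each internal run of zeros."""
    out, gap = "", False
    for pos in range(3, -1, -1):
        digit = section // 10 ** pos % 10
        if digit == 0:
            gap = bool(out)
        else:
            out += (DIGITS[0] if gap else "") + DIGITS[digit] + SMALL_UNITS[pos]
            gap = False
    return out


def int_to_hanzi(number: int) -> str:
    """Convert 0 <= number < 10**12 to Chinese numerals."""
    if not 0 <= number < 10 ** 12:
        raise ValueError("this sketch only handles 0 <= number < 10**12")
    if number == 0:
        return DIGITS[0]
    sections = []
    while number:
        sections.append(number % 10000)  # least significant section first
        number //= 10000
    out, gap = "", False
    for index in range(len(sections) - 1, -1, -1):
        section = sections[index]
        if section == 0:
            gap = bool(out)
            continue
        # A skipped section or a section below 1000 needs a leading 零.
        if out and (gap or section < 1000):
            out += DIGITS[0]
        out += _section_to_hanzi(section) + BIG_UNITS[index]
        gap = False
    return out


def clean_thousands_number(digit_string: str) -> str:
    """S502 sketch: strip the separators, then convert to Chinese numerals."""
    return int_to_hanzi(int(digit_string.replace(",", "")))
```

The sketch writes 10 as "一十" rather than the colloquial "十"; a production converter would probably special-case that form.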
S503: If the digit string is not a thousands-separated number, use the regular expression matching algorithm to judge whether the digit string is a decimal number.
Specifically, if the server determines that the digit string is not a thousands-separated number, the digit string does not contain the thousands separator ",". The regular expression matching algorithm is then used to judge whether the digit string is a decimal number. A decimal number is a number that contains the decimal point ".", such as 123.45.
In the present embodiment, the server is preconfigured with a regular expression that can match the decimal point in a digit string. When digital cleaning is performed on the first text data using the digital cleaning rule, this regular expression is applied to the digit strings extracted from the first text data. If a digit string is found to contain a decimal point, it is identified as a decimal number; otherwise it is identified as not being a decimal number. It should be understood that, because digital cleaning precedes symbol cleaning in the feature cleaning order, judging first whether a digit string is a decimal number allows digital cleaning to proceed according to the result. This prevents the decimal point "." from being mistaken for a full stop and cleaned, as would happen if symbol cleaning with the punctuation mark cleaning rule were performed first, which would corrupt the number; cleaning quality is thereby guaranteed.
S504: If the digit string is a decimal number, perform Chinese-numeral conversion on the digits before the decimal point, convert the digits after the decimal point digit by digit, and replace the decimal point with a Chinese character, to obtain the second text data.
In the present embodiment, if the server determines that the digit string is a decimal number, the "." in the digit string is a decimal point. The server then performs Chinese-numeral conversion on the digits before the decimal point, converts the digits after the decimal point digit by digit, and replaces the decimal point with the Chinese character "点" (point), to obtain the second text data, thereby achieving the purpose of digital cleaning. For example, when the decimal number 123.45 is cleaned, Chinese-numeral conversion is first performed on 123 to obtain "一百二十三", 45 is converted digit by digit to obtain "四五", and the decimal point is then replaced with "点", yielding "一百二十三点四五".
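The three operations of S504 can be sketched as follows. To keep the example self-contained, the integer part is handled by a deliberately small stand-in converter that only covers 0 to 9999; both function names are hypothetical.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def small_int_to_hanzi(number: int) -> str:
    """Stand-in Chinese-numeral conversion for 0 <= number < 10000."""
    if not 0 <= number < 10000:
        raise ValueError("this sketch only handles 0..9999")
    if number == 0:
        return DIGITS[0]
    out, gap = "", False
    for pos in range(3, -1, -1):
        digit = number // 10 ** pos % 10
        if digit == 0:
            gap = bool(out)  # remember an internal zero, e.g. 105
        else:
            out += (DIGITS[0] if gap else "") + DIGITS[digit] + UNITS[pos]
            gap = False
    return out


def decimal_to_hanzi(digit_string: str) -> str:
    """Convert a decimal number such as "123.45" as described in S504."""
    whole, fraction = digit_string.split(".")
    # Integer part: full numeral conversion; fraction: digit by digit;
    # the decimal point itself becomes "点".
    return (
        small_int_to_hanzi(int(whole))
        + "点"
        + "".join(DIGITS[int(ch)] for ch in fraction)
    )
```

Reading the fraction digit by digit matters: converting "45" as a whole number would give "四十五" rather than the intended "四五".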
S505: If the digit string is not a decimal number, use the regular expression matching algorithm to judge whether the digit string is a Chinese quantifier.
Specifically, if the server determines that the digit string is not a decimal number, the digit string does not contain the decimal point ".". The regular expression matching algorithm is then used to judge whether the digit string is a Chinese quantifier, that is, whether the digit string is followed by a preconfigured Chinese unit, so that subsequent digital cleaning can proceed according to the result. A Chinese unit is a Chinese measure word. Specifically, the Chinese units preconfigured on the server include, but are not limited to: point, hour, class, li, hao, yuan, block, jiao, ge, platform, face, block, select, item, drop, piece, cun (inch), meter, chi, ten, hundred, ten thousand, hundred million, million, thousand, gram, ton, bottle, box, cup, case, bucket, tank, group, pair, bundle, portion, ticket, time, part and person.
S506: If the digit string is a Chinese quantifier, perform Chinese-numeral conversion on the digit string to obtain the second text data.
In the present embodiment, if the server determines that the digit string is followed by a Chinese unit, the digit string is a quantifier expressing a quantity (that is, a Chinese quantifier). Chinese-numeral conversion is then performed on the digit string to obtain the second text data, thereby achieving the purpose of digital cleaning; for example, "123块" is converted into "一百二十三块".
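Detecting a digit string that is followed by a preconfigured Chinese unit might be done as in the sketch below. The unit list is a small illustrative subset chosen for this example, not the list configured in the embodiment, and the names are hypothetical.

```python
import re
from typing import List, Tuple

# Illustrative subset of Chinese units (measure words); an assumption.
CHINESE_UNITS = ["元", "块", "角", "个", "台", "米", "克", "吨", "瓶", "箱", "人"]

# A run of digits immediately followed by one of the configured units.
QUANTIFIER = re.compile(r"([0-9]+)(" + "|".join(CHINESE_UNITS) + r")")


def find_quantifiers(text: str) -> List[Tuple[str, str]]:
    """Return (digit string, unit) pairs for every Chinese quantifier found."""
    return [(m.group(1), m.group(2)) for m in QUANTIFIER.finditer(text)]
```

Each digit string returned here would then be passed to Chinese-numeral conversion, while the unit is kept unchanged.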
S507: If the digit string is not a Chinese quantifier, use the regular expression matching algorithm to judge whether the digit string is a numeric code.
If the server determines that the digit string is not a Chinese quantifier, the digit string is not a quantifier expressing a quantity. The regular expression matching algorithm is then used to judge whether the digit string is a numeric code. Numeric codes here include, but are not limited to, identity card numbers, mobile phone numbers, organization codes, contract numbers and other numbers generated according to preset numbering rules. Because such numeric codes have a fixed length and follow a specific format, a regular expression that can determine whether a digit string is a numeric code can be configured according to the length or specific format of the digit string. For example, an identity card number has eighteen digits, consisting of a seventeen-digit body code and a one-digit check code, arranged from left to right as follows: a six-digit address code, an eight-digit date-of-birth code, a three-digit sequence code and a one-digit check code. The server matches the digit strings extracted from the first text data against the regular expression for numeric codes; if the match succeeds, the digit string is a numeric code.
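The format-based test for numeric codes, together with the digit-by-digit conversion applied to them in the next step, can be sketched as follows. Both patterns are assumptions: the identity card pattern only encodes the 6 + 8 + 3 + 1 layout described above (additionally allowing "X" as the check character), and the mobile pattern simply accepts eleven digits starting with 1.

```python
import re

# 6-digit address code, 8-digit date of birth, 3-digit sequence code, check code.
ID_CARD = re.compile(r"[0-9]{6}[0-9]{8}[0-9]{3}[0-9Xx]")
# Simplified mobile number format: eleven digits starting with 1.
MOBILE = re.compile(r"1[0-9]{10}")

_DIGIT_TABLE = str.maketrans("0123456789", "零一二三四五六七八九")


def is_numeric_code(digit_string: str) -> bool:
    """True if the whole string matches one of the configured code formats."""
    return any(p.fullmatch(digit_string) for p in (ID_CARD, MOBILE))


def digits_to_hanzi(digit_string: str) -> str:
    """Digit-by-digit conversion: each digit becomes one Chinese numeral."""
    return digit_string.translate(_DIGIT_TABLE)
```

Codes are read out digit by digit because their value as a quantity is meaningless; converting a phone number as one large integer would produce an unreadable numeral.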
S508: If the digit string is a numeric code, convert the numeric code digit by digit to obtain the second text data.
In the present embodiment, if the server determines that the digit string is a numeric code, it directly converts the numeric code digit by digit to obtain the second text data, thereby achieving the purpose of digital cleaning. For example, the mobile phone number 12345678911 is converted into "一二三四五六七八九一一".
S509: If the digit string is not a numeric code, perform Chinese-numeral conversion on the digit string to obtain the second text data.
Specifically, if the server determines that the digit string is not a numeric code, it performs Chinese-numeral conversion on the digit string to obtain the second text data, thereby achieving the purpose of digital cleaning and yielding purer text data.
In the text data processing method provided by the present embodiment, thousands-separated number cleaning, decimal number cleaning, Chinese quantifier cleaning, numeric code cleaning and so on are performed in turn on the digit strings extracted from the first text data, converting the numbers in the first text data into Chinese characters, which helps to obtain purer target plain text data subsequently. Moreover, cleaning the extracted digit strings in a fixed order prevents a later cleaning operation from undoing the effect of an earlier one, thereby ensuring cleaning quality.
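The decision order of S501 to S509 can be summarized as a cascade of tests, sketched below as a classifier that only reports which branch applies. The patterns, the unit subset and the function name `classify` are assumptions made for illustration.

```python
import re

THOUSANDS = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")
DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
NUMERIC_CODE = re.compile(r"[0-9]{17}[0-9Xx]|1[0-9]{10}")
UNITS = ("元", "块", "个", "米", "克", "人")  # illustrative subset


def classify(digit_string: str, following_text: str = "") -> str:
    """Return the branch of the digital cleaning cascade that applies.

    `following_text` is the text right after the digit string; it is only
    used to test for a Chinese unit.
    """
    if THOUSANDS.fullmatch(digit_string):
        return "thousands"   # S502: strip separators, convert as a number
    if DECIMAL.fullmatch(digit_string):
        return "decimal"     # S504: convert both parts, "." becomes "点"
    if following_text.startswith(UNITS):
        return "quantifier"  # S506: convert as a number
    if NUMERIC_CODE.fullmatch(digit_string):
        return "code"        # S508: convert digit by digit
    return "plain"           # S509: convert as a number
```

Because the tests run in a fixed order, an eleven-digit string followed by a unit is treated as a quantity rather than as a phone number, mirroring the priority of the steps.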
In one embodiment, as shown in fig. 6, before the step of obtaining the data cleansing request, the text data processing method further includes:
S601: Obtain a data crawling task, where the data crawling task includes a task type and a file identifier.
The data crawling task is a task for triggering the server to crawl data. The task type defines the type of the current data crawling task, and may specifically be a timed task or a real-time task. The file identifier is an identifier that uniquely identifies a crawler file.
In the present embodiment, the server creates different crawler files in advance, each crawler file corresponding to one file identifier, and stores each crawler file in association with its file identifier in a database, so that the corresponding crawler file can subsequently be obtained according to the file identifier.
Specifically, the server may create the crawler file corresponding to a channel identifier based on the Scrapy framework. For example, if the Scrapy framework is used to crawl the major categories and subcategories on the navigation page of the Sina website, the sub-links within the subcategories, and the news content of the pages those sub-links point to, and finally to save the results locally, the crawler file is created through the following steps:
(1) Create a Scrapy project, for example with the command "scrapy startproject XX", to determine the channel from which data is to be crawled, where XX may be a channel such as news, finance, entertainment, sports or education.
(2) Write the item file, that is, define the fields to be crawled according to the data content that needs to be crawled. An Item is a container for the crawled data; the main goal is to extract structured data from unstructured data sources such as web pages. Scrapy provides the Item class to meet this need. Its usage is similar to that of a Python dictionary, and it provides an additional protection mechanism to avoid undefined-field errors caused by misspelling. For example, the following attributes need to be obtained from the website to be crawled (here, Sina News): the major-category URL and major-category title; the subcategory URL and subcategory title; the news URL and news title; and the news headline and news content.
(3) Write the crawler file according to the Scrapy project and the item file, and store the crawler file in the server database. The crawler file includes the spider file (which crawls the data and defines the classes used for crawling), the pipelines file (which stores the item data) and the settings file (which holds the main settings).
S602: If the task type is a real-time task, trigger the crawler tool to execute the crawler file corresponding to the file identifier, and obtain original text data.
In the present embodiment, if the server recognizes that the task type in the data crawling task is a real-time task, it obtains the corresponding crawler file directly through the file identifier in the data crawling task and immediately triggers the crawler tool to execute that crawler file, so as to crawl the corresponding original text data from the website to which the crawler file points.
S603: If the task type is a timed task, trigger the time monitoring tool, so that when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier, and original text data is obtained.
In the present embodiment, if the server recognizes that the task type in the data crawling task is a timed task, it also needs to obtain the timed trigger time in the data crawling task, which is the time at which the server is triggered to execute the data crawling task. The time monitoring tool is a tool for monitoring the current system time, and may be the Time Watch tool.
Specifically, if the task type is a timed task, the time monitoring tool installed on the server is triggered to monitor the current system time in real time; when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier, so as to crawl the corresponding original text data from the website to which the crawler file points.
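The branching between real-time and timed tasks in S602 and S603 could be sketched as follows. Everything here is hypothetical: the `CrawlTask` structure, the task type strings and the use of `threading.Timer` as a stand-in for the time monitoring tool are assumptions, and `run_crawler` represents whatever actually executes the crawler file.

```python
import threading
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class CrawlTask:
    task_type: str                        # "realtime" or "timed" (assumed values)
    file_id: str                          # identifier of the crawler file
    trigger_time: Optional[datetime] = None


def delay_seconds(task: CrawlTask, now: datetime) -> float:
    """Seconds to wait before the crawler file should be executed."""
    if task.task_type == "realtime":
        return 0.0
    if task.task_type == "timed" and task.trigger_time is not None:
        return max(0.0, (task.trigger_time - now).total_seconds())
    raise ValueError("unknown task type or missing trigger time")


def dispatch(
    task: CrawlTask,
    run_crawler: Callable[[str], object],
    now: Optional[datetime] = None,
) -> Optional[threading.Timer]:
    """Run the crawler at once, or schedule it for the timed trigger time."""
    delay = delay_seconds(task, now or datetime.now())
    if delay == 0:
        run_crawler(task.file_id)
        return None
    timer = threading.Timer(delay, run_crawler, args=(task.file_id,))
    timer.start()
    return timer
```

Returning the timer lets the caller cancel a scheduled crawl; a long-running service would more likely rely on a dedicated scheduler than on in-process timers.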
S604: According to the hierarchical storage folder corresponding to the crawler file, store the original text data in the last-level folder of the hierarchical storage folder.
The hierarchical storage folder corresponding to the crawler file is the folder, determined in the server on the basis of the crawler file, in which the various original text data are stored. In the present embodiment, the hierarchical storage folder corresponding to the crawler file comprises three storage levels: channel project, major category and subcategory.
In the present embodiment, the original text data is stored in the last-level folder of the hierarchical storage folder corresponding to the crawler file. For example, if the hierarchical storage folder comprises the three levels channel project, major category and subcategory, the original text data is stored in the corresponding subcategory folder according to its position or category level in the website to which the crawler file points. Original text data is thus stored by category, so that original text data of a specific field can be obtained for training during subsequent language model training, giving the trained language model higher recognition accuracy.
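Storing a crawled text in the last-level folder of a three-level hierarchy can be sketched with `pathlib`. The function name, the folder names and the file name are hypothetical; the embodiment does not prescribe a file layout beyond the three levels.

```python
import tempfile
from pathlib import Path


def store_original_text(
    root: Path, channel: str, major: str, minor: str, name: str, text: str
) -> Path:
    """Write `text` under root/channel/major/minor and return the file path."""
    folder = root / channel / major / minor  # last-level (subcategory) folder
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(text, encoding="utf-8")
    return target


# Usage example in a temporary directory, so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    saved = store_original_text(
        Path(tmp), "news", "sports", "football", "article_001.txt", "比赛结果……"
    )
    relative = saved.relative_to(tmp).as_posix()
    content = saved.read_text(encoding="utf-8")
```

Keeping the channel as the top-level folder makes it straightforward to collect all texts of one channel later, which is what the channel-specific model training relies on.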
In the text data processing method provided by the present embodiment, when the task type in the data crawling task is a real-time task, the crawler file corresponding to the file identifier in the data crawling task is triggered immediately to crawl data, so that the corresponding original text data can be obtained quickly and in real time. Because the crawler files are created and stored in advance, only the corresponding file identifier needs to be supplied in order to locate the corresponding crawler file quickly and crawl data with it, which helps to improve crawling efficiency. When the task type in the data crawling task is a timed task, the time monitoring tool triggers, at the set time, the crawler file corresponding to the file identifier to crawl data; the process requires no manual intervention, which also helps to improve crawling efficiency. The original text data is stored in the last-level folder of the hierarchical storage folder corresponding to the crawler file, so that it is stored by category and a more targeted language model can be trained subsequently.
In one embodiment, as shown in fig. 7, after the step of obtaining the target plain text data, the text data processing method further includes:
S701: Obtain a model training request, where the model training request includes a channel identifier.
The model training request is a request for triggering the server to perform language model training. The channel identifier is an identifier for identifying the source channel of the text data that needs to be cleaned. It should be understood that the channel identifier in the model training request is used to determine the source of the text data required for training the language model.
S702: Obtain the target plain text data corresponding to the channel identifier from the training text database corresponding to the channel identifier.
The training text database is a database for storing training text data. It should be understood that each training text database corresponds to one channel identifier, so that it stores only the target plain text data corresponding to that channel identifier. After obtaining the model training request, the server obtains the corresponding target plain text data from the corresponding training text database based on the channel identifier in the model training request, so as to perform model training with that target plain text data. The target plain text data is the plain text data obtained through steps S201-S206.
S703: Perform word segmentation on the target plain text data to obtain at least two target word segments.
In the present embodiment, the server performs Chinese word segmentation on the target plain text data using a preconfigured Chinese word segmentation tool to obtain at least two target word segments. Chinese word segmentation tools include, but are not limited to, the jieba, SnowNLP, THULAC (THU Lexical Analyzer for Chinese) and NLPIR word segmentation tools. For example, segmenting the sentence "The scenery of West Lake in Hangzhou is very good; it is a tourist attraction that draws large numbers of visitors every year!" with the SnowNLP word segmentation tool yields target word segments such as "Hangzhou / West Lake / scenery / very / good / is / tourism / scenic spot / every year / attract / large numbers / come / play / of / tourists".
S704: Perform model training on the at least two target word segments using an N-gram model to obtain the target Chinese language model.
N-gram is a statistical language model algorithm commonly used in large-vocabulary continuous speech recognition. Using the collocation information between adjacent words in context, it can compute the sentence with the maximum probability when a continuous, space-free pinyin string needs to be converted into a string of Chinese characters (that is, a sentence). The conversion to Chinese characters is thus automatic and needs no manual selection by the user, which avoids the coincident-code problem caused by many Chinese characters sharing the same pinyin. N-gram is based on the Markov assumption: the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word, and the probability of a whole sentence is the product of the occurrence probabilities of its words. Maximum likelihood estimation (Maximum Likelihood Estimate) is an estimation method: given that a sample has been observed, the parameter value that makes the observed sample most probable is taken as the estimate of the true parameter value, rather than values under which the sample would be less probable.
Specifically, the server first uses maximum likelihood estimation (Maximum Likelihood Estimate) to calculate the word sequence probability of each target word segment, that is,

$$P(W_n \mid W_1 W_2 \ldots W_{n-1}) = \frac{C(W_1 W_2 \ldots W_n)}{C(W_1 W_2 \ldots W_{n-1})}$$

where $W_n$ is the n-th target word segment; $(W_1 W_2 \ldots W_n)$ is the word sequence formed by n target word segments; $C(W_1 W_2 \ldots W_n)$ is the frequency of the word sequence $(W_1 W_2 \ldots W_n)$ in the target plain text data; $(W_1 W_2 \ldots W_{n-1})$ is the word sequence formed by n-1 target word segments; $C(W_1 W_2 \ldots W_{n-1})$ is the frequency of the word sequence $(W_1 W_2 \ldots W_{n-1})$ in the target plain text data; and $P(W_n \mid W_1 W_2 \ldots W_{n-1})$ is the probability that the n-th target word segment appears after the word sequence formed by the preceding n-1 target word segments. Then, based on the Markov assumption, the word sequence probabilities of the target word segments are processed to form the target Chinese language model. In the target Chinese language model so formed, the occurrence of the n-th target word segment depends only on the preceding n-1 target word segments and is unrelated to any other word, and the probability of a whole sentence is the product of the occurrence probabilities of its target word segments. In the present embodiment, the target Chinese language model is formed from the product of the word sequence probabilities of the target word segments, that is,

$$P(T) = P(W_1 W_2 W_3 \ldots W_n) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_1 W_2 \ldots W_{n-1})$$

so that, in the subsequent model recognition process, the corresponding recognition result is obtained based on the word sequence probability of each target word segment.
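For the simplest case N = 2, the maximum likelihood estimate and the product of conditional probabilities can be sketched as follows. This is an illustrative bigram model, not the patented training procedure: the sentence boundary markers, the absence of smoothing and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

START, END = "<s>", "</s>"  # assumed sentence boundary markers


def train_bigram(sentences: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """MLE bigram model: P(w | prev) = C(prev, w) / C(prev)."""
    context_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for tokens in sentences:
        padded = [START] + tokens + [END]
        context_counts.update(padded[:-1])          # every word that has a successor
        pair_counts.update(zip(padded, padded[1:]))
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


def sentence_probability(
    model: Dict[Tuple[str, str], float], tokens: List[str]
) -> float:
    """Probability of a sentence as the product of its bigram probabilities."""
    padded = [START] + tokens + [END]
    probability = 1.0
    for pair in zip(padded, padded[1:]):
        probability *= model.get(pair, 0.0)  # unseen pairs get probability 0
    return probability


# Tiny already-segmented corpus used only to exercise the functions.
corpus = [["杭州", "西湖"], ["杭州", "风景"]]
model = train_bigram(corpus)
```

Because unseen word pairs receive probability zero, a practical model would add smoothing; the sketch leaves it out to stay close to the plain frequency ratio in the formula above.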
In the text data processing method provided by the present embodiment, the corresponding target plain text data is first obtained according to the channel identifier in the model training request, so that the target Chinese language model subsequently trained on this target plain text data produces more accurate recognition results. The reason is that the training of the target Chinese language model is based on maximum likelihood estimation and the Markov assumption, under which the occurrence of the N-th target word segment in the target plain text data depends only on the preceding N-1 target word segments and is unrelated to any other word. As a result, the frequencies of the word sequences formed by each target word segment and the target word segments preceding it differ between the target plain text data of different channels, so the resulting target Chinese language model achieves higher recognition accuracy in its own channel.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a kind of text data processing device, text data processing equipment and above-described embodiment are provided
Text data processing method corresponds.As shown in figure 8, text data processing equipment includes data cleansing request mould
Block 801, urtext data acquisition module 802, object time section obtains module 803, text data to be cleaned obtains module
804, target cleaning rule obtains module 805 and target plain text data obtains module 806.Each functional module is described in detail such as
Under:
The data cleaning request module 801 is configured to acquire a data cleaning request, the data cleaning request including a channel identifier and a cleaning time.
The original text data acquisition module 802 is configured to determine, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier.
The target time interval acquisition module 803 is configured to query a data cleaning record table according to the channel identifier and the cleaning time, and to determine a target time interval.
The to-be-cleaned text data acquisition module 804 is configured to determine the original text data whose time identifier falls within the target time interval as text data to be cleaned.
The target cleaning rule acquisition module 805 is configured to query a rule database based on the channel identifier, and to acquire a target cleaning rule corresponding to the channel identifier.
The target plain text data acquisition module 806 is configured to clean the text data to be cleaned using the target cleaning rule, to obtain target plain text data.
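The cooperation of modules 801 to 806 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `RawText`, `target_interval`, `select_texts_to_clean` and `handle_cleaning_request` are hypothetical, the corpus, record table and rule database are modelled as plain dictionaries keyed by channel identifier, and the rule "the target time interval runs from the channel's last recorded cleaning time to the requested cleaning time" is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RawText:
    text: str
    timestamp: datetime  # the time identifier carried by each original text


def target_interval(records, channel_id, cleaning_time):
    """Assumed rule: (last recorded cleaning time of the channel, requested time]."""
    previous = [t for t in records.get(channel_id, []) if t < cleaning_time]
    start = max(previous) if previous else datetime.min
    return start, cleaning_time


def select_texts_to_clean(corpus, interval):
    """Keep the original texts whose time identifier lies inside the interval."""
    start, end = interval
    return [raw for raw in corpus if start < raw.timestamp <= end]


def handle_cleaning_request(channel_id, cleaning_time, corpora, records, rules):
    corpus = corpora[channel_id]                                    # module 802
    interval = target_interval(records, channel_id, cleaning_time)  # module 803
    pending = select_texts_to_clean(corpus, interval)               # module 804
    clean = rules[channel_id]                                       # module 805
    return [clean(raw.text) for raw in pending]                     # module 806
```

Selecting by interval means that texts already covered by an earlier cleaning run are not cleaned again when a new request arrives for the same channel.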
Preferably, the target cleaning rule includes a special tag cleaning rule, a digit cleaning rule, a punctuation mark cleaning rule and a foreign language cleaning rule.
The target plain text data acquisition module 806 includes a tag cleaning unit, a digit cleaning unit, a symbol cleaning unit and a foreign language cleaning unit.
The tag cleaning unit is configured to perform tag cleaning on the text data to be cleaned using the special tag cleaning rule, to obtain first text data.
The digit cleaning unit is configured to perform digit cleaning on the first text data using the digit cleaning rule, to obtain second text data.
The symbol cleaning unit is configured to perform symbol cleaning on the second text data using the punctuation mark cleaning rule, to obtain third text data.
The foreign language cleaning unit is configured to perform foreign language cleaning on the third text data using the foreign language cleaning rule, to obtain the target plain text data.
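The four units above form a fixed chain (tags, then digits, then punctuation, then foreign text). The sketch below shows one way such a chain could look. The regular expressions are placeholders chosen for illustration and are not the rules stored in the patent's rule database; in particular, the digit stage is reduced to a digit-by-digit replacement, punctuation is assumed to be replaced by a space, and "foreign language" is assumed to mean runs of Latin letters.

```python
import re

DIGITS = "零一二三四五六七八九"


def clean_tags(text):
    """Tag cleaning (assumed rule): drop anything that looks like an HTML/XML tag."""
    return re.sub(r"<[^>]+>", "", text)


def clean_digits(text):
    """Digit cleaning (simplified stand-in): read every ASCII digit one by one."""
    return re.sub(r"[0-9]", lambda m: DIGITS[int(m.group())], text)


def clean_punctuation(text):
    """Symbol cleaning (assumed rule): turn non-word, non-space characters into spaces."""
    return re.sub(r"[^\w\s]", " ", text)


def clean_foreign(text):
    """Foreign language cleaning (assumed rule): drop runs of Latin letters."""
    return re.sub(r"[A-Za-z]+", "", text)


def clean(text):
    # The order mirrors the units: first, second, third text data, then plain text.
    for stage in (clean_tags, clean_digits, clean_punctuation, clean_foreign):
        text = stage(text)
    return " ".join(text.split())  # collapse the whitespace left behind
```

Running the tag stage first keeps markup such as `<h1>` from being mistaken for digits or foreign words by the later stages.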
Preferably, after the tag cleaning unit and before the digit cleaning unit, the text data processing device further includes a brand data acquisition unit, a data matching processing unit, a first matching result processing unit and a second matching result processing unit.
The brand data acquisition unit is configured to query a brand database based on the channel identifier, and to acquire target brand data corresponding to the channel identifier.
The data matching processing unit is configured to perform matching processing between the first text data and the target brand data.
The first matching result processing unit is configured to, if the first text data matches the target brand data successfully, perform exclusion processing on the successfully matched first text data, and then perform digit cleaning on the first text data after exclusion processing using the digit cleaning rule, to obtain the second text data.
The second matching result processing unit is configured to, if the first text data does not match the target brand data, perform digit cleaning on the first text data using the digit cleaning rule, to obtain the second text data.
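A possible reading of the exclusion processing is that brand names containing digits are shielded from digit cleaning. The sketch below follows that reading and is an assumption, not the patent's stated mechanism: `clean_digits_except_brands` is a hypothetical helper, the brand list stands in for the target brand data, and the digit stage is again simplified to digit-by-digit replacement.

```python
import re

DIGITS = "零一二三四五六七八九"


def clean_digits(text):
    """Simplified digit cleaning: read every ASCII digit one by one."""
    return re.sub(r"[0-9]", lambda m: DIGITS[int(m.group())], text)


def clean_digits_except_brands(text, brands):
    """Digit-clean the text while leaving matched brand names untouched."""
    matched = {brand for brand in brands if brand in text}
    if not matched:
        # No brand matched: clean the whole first text data.
        return clean_digits(text)
    # Split on the matched brand names (longest first). The capture group
    # makes re.split keep the brand names as separate pieces.
    names = sorted(matched, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(name) for name in names) + ")"
    pieces = re.split(pattern, text)
    return "".join(p if p in matched else clean_digits(p) for p in pieces)
```

For example, with the brand list `["58同城"]`, the digits inside the brand name are preserved while the count later in the sentence is still converted.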
Preferably, the digit cleaning unit includes a character string extraction subunit, a thousands-separated number cleaning subunit, a decimal number judgment subunit, a decimal number cleaning subunit, a Chinese quantifier judgment subunit, a Chinese quantifier cleaning subunit, a digit-wise number judgment subunit, a digit-wise number cleaning subunit and a non-digit-wise number cleaning subunit.
The character string extraction subunit is configured to extract a digit string from the first text data, and to judge, using a regular expression matching algorithm, whether the digit string is a thousands-separated number.
The thousands-separated number cleaning subunit is configured to, if the digit string is a thousands-separated number, remove the thousands separators from the thousands-separated number and perform Chinese-numeral conversion on the number after removal of the thousands separators, to obtain the second text data.
The decimal number judgment subunit is configured to, if the digit string is not a thousands-separated number, judge, using the regular expression matching algorithm, whether the digit string is a decimal number.
The decimal number cleaning subunit is configured to, if the digit string is a decimal number, perform Chinese-numeral conversion on the digits before the decimal separator, convert the digits after the decimal separator digit by digit, and replace the decimal separator with a Chinese character, to obtain the second text data.
The Chinese quantifier judgment subunit is configured to, if the digit string is not a decimal number, judge, using the regular expression matching algorithm, whether the digit string is a Chinese quantifier.
The Chinese quantifier cleaning subunit is configured to, if the digit string is a Chinese quantifier, perform Chinese-numeral conversion on the digit string, to obtain the second text data.
The digit-wise number judgment subunit is configured to, if the digit string is not a Chinese quantifier, judge, using the regular expression matching algorithm, whether the digit string is a digit-wise number.
The digit-wise number cleaning subunit is configured to, if the digit string is a digit-wise number, convert the digit-wise number digit by digit, to obtain the second text data.
The non-digit-wise number cleaning subunit is configured to, if the digit string is not a digit-wise number, perform Chinese-numeral conversion on the digit string, to obtain the second text data.
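The cascade of judgments above can be sketched as a single classification function. The sketch is illustrative only. The source does not give the regular expressions, so the following are assumptions: a digit string counts as a Chinese quantifier when it is directly followed by a character from the hypothetical `MEASURE_WORDS` list, and it counts as a digit-wise number when it starts with 0 or has five or more digits. The helper `int_to_chinese` handles values below 10**12 and falls back to digit-by-digit reading beyond that.

```python
import re

DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]
BIG_UNITS = ["", "万", "亿"]
MEASURE_WORDS = "个元人次件台"  # hypothetical list of measure words

THOUSANDS = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")
DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
# Extraction pattern: try the most specific shapes first.
DIGIT_STRING = re.compile(
    THOUSANDS.pattern + r"(?![0-9])|" + DECIMAL.pattern + r"|[0-9]+"
)


def read_digits(s):
    """Digit-by-digit reading, e.g. '138' -> '一三八'."""
    return "".join(DIGITS[int(c)] for c in s)


def section_to_chinese(n):
    """Reading of 0 < n < 10000, before the leading '一十' is shortened."""
    out, zero_pending = "", False
    for power in (3, 2, 1, 0):
        d = n // 10 ** power % 10
        if d == 0:
            zero_pending = bool(out)  # a zero is only read between non-zero digits
        else:
            out += ("零" if zero_pending else "") + DIGITS[d] + SMALL_UNITS[power]
            zero_pending = False
    return out


def int_to_chinese(n):
    """Chinese-numeral reading of a non-negative integer below 10**12."""
    if n == 0:
        return "零"
    if n >= 10 ** 12:
        return read_digits(str(n))  # out of range for this sketch
    sections = []
    while n:
        sections.append(n % 10000)
        n //= 10000
    out = ""
    for i in reversed(range(len(sections))):
        sec = sections[i]
        if out and sec < 1000 and not out.endswith("零"):
            out += "零"
        if sec:
            out += section_to_chinese(sec) + BIG_UNITS[i]
    out = out.rstrip("零")
    return out[1:] if out.startswith("一十") else out


def normalize_number(match):
    s = match.group()
    following = match.string[match.end():match.end() + 1]
    if THOUSANDS.fullmatch(s):                    # thousands-separated number
        return int_to_chinese(int(s.replace(",", "")))
    if DECIMAL.fullmatch(s):                      # decimal number
        whole, frac = s.split(".")
        return int_to_chinese(int(whole)) + "点" + read_digits(frac)
    if following and following in MEASURE_WORDS:  # Chinese quantifier (assumed test)
        return int_to_chinese(int(s))
    if s.startswith("0") or len(s) >= 5:          # digit-wise number (assumed test)
        return read_digits(s)
    return int_to_chinese(int(s))                 # everything else


def clean_digits(text):
    return DIGIT_STRING.sub(normalize_number, text)
```

The order of the checks matters: a thousands-separated number such as 1,234 would otherwise be split at the comma and read as two unrelated numbers.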
Preferably, before the data cleaning request module 801, the text data processing device further includes a data crawling task acquisition unit, a real-time text data acquisition unit, a timed text data acquisition unit and a text data storage unit.
The data crawling task acquisition unit is configured to acquire a data crawling task, the data crawling task including a task type and a file identifier.
The real-time text data acquisition unit is configured to, if the task type is a real-time task, trigger a crawler tool to execute the crawler file corresponding to the file identifier, to obtain original text data.
The timed text data acquisition unit is configured to, if the task type is a timed task, trigger a time monitoring tool, so that when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier, to obtain original text data.
The text data storage unit is configured to establish a classified storage folder corresponding to the crawler file, and to store the original text data in the last-level folder of the classified storage folder.
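The dispatch between real-time and timed crawling tasks can be sketched with the Python standard library. Everything concrete here is an assumption for illustration: the task is a dictionary with hypothetical keys `type`, `file_id` and `trigger_time`, crawler files are modelled as callables in a registry, the standard `sched` scheduler stands in for the time monitoring tool, and the file identifier is assumed to double as the nested folder path of the classified storage folder.

```python
import sched
import time
from pathlib import Path


def run_crawler(file_id, crawlers, storage_root):
    """Run the crawler registered under file_id and store what it returns."""
    raw_texts = crawlers[file_id]()
    # Assumed layout: the file identifier is the nested folder path, and the
    # texts are written into its last-level folder.
    folder = Path(storage_root) / file_id
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "raw.txt").write_text("\n".join(raw_texts), encoding="utf-8")
    return raw_texts


def dispatch(task, crawlers, storage_root, scheduler):
    """Run a real-time task immediately; queue a timed task for its trigger time."""
    args = (task["file_id"], crawlers, storage_root)
    if task["type"] == "realtime":
        run_crawler(*args)
    elif task["type"] == "timed":
        scheduler.enterabs(task["trigger_time"], 1, run_crawler, args)
    else:
        raise ValueError(f"unknown task type: {task['type']}")


# Example wiring: the standard-library scheduler plays the time monitoring tool.
scheduler = sched.scheduler(time.time, time.sleep)
```

Calling `scheduler.run()` blocks until every queued timed task has reached its trigger time and has been executed.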
Preferably, after the target plain text data acquisition module 806, the text data processing device further includes a model training request unit, a plain text data acquisition unit, a target word segment acquisition unit and a language model acquisition unit.
The model training request unit is configured to acquire a model training request, the model training request including a channel identifier.
The plain text data acquisition unit is configured to acquire, from a training text database corresponding to the channel identifier, the target plain text data corresponding to the channel identifier.
The target word segment acquisition unit is configured to perform word segmentation on the target plain text data, to obtain at least two target word segments.
The language model acquisition unit is configured to perform model training on the at least two target word segments using an N-gram model, to obtain a target Chinese language model.
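The training step can be illustrated with a count-based N-gram model. This is a minimal sketch rather than the patent's training procedure: the function name `train_ngram` is hypothetical, the input is assumed to be already segmented (one list of target word segments per sentence), the sentence boundary markers `<s>` and `</s>` are a common convention added here, and no smoothing is applied.

```python
from collections import Counter


def train_ngram(sentences, n=2):
    """Estimate P(word | previous n-1 words) by relative frequency."""
    ngrams, contexts = Counter(), Counter()
    for words in sentences:
        # Pad so that the first real word also has a full context.
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}
```

Training one such model per channel identifier, each on that channel's target plain text data, gives the per-channel models described above.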
For specific limitations on the text data processing device, reference may be made to the limitations on the text data processing method above, which are not repeated here. Each module in the above text data processing device may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 9. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store the data formed while the processor executes the computer program that implements the text data processing method of the above embodiments, including but not limited to the target plain text data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a text data processing method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text data processing method of the above embodiments, such as steps S201 to S206 shown in Fig. 2, or the steps shown in Fig. 3 to Fig. 7. Alternatively, when executing the computer program, the processor implements the functions of the modules/units in the above embodiment of the text data processing device, such as the functions of the data cleaning request module 801, the original text data acquisition module 802, the target time interval acquisition module 803, the to-be-cleaned text data acquisition module 804, the target cleaning rule acquisition module 805 and the target plain text data acquisition module 806 shown in Fig. 8. To avoid repetition, details are not described here again.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the text data processing method of the above embodiments, such as steps S201 to S206 shown in Fig. 2, or the steps shown in Fig. 3 to Fig. 7. To avoid repetition, details are not described here again. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units in the above embodiment of the text data processing device, such as the functions of the data cleaning request module 801, the original text data acquisition module 802, the target time interval acquisition module 803, the to-be-cleaned text data acquisition module 804, the target cleaning rule acquisition module 805 and the target plain text data acquisition module 806 shown in Fig. 8. To avoid repetition, details are not described here again.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It is clear to those skilled in the art that, for convenience and brevity of description, the division into the above functional units and modules is used only as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims (10)
1. A text data processing method, characterized by comprising:
acquiring a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
determining, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
querying a data cleaning record table according to the channel identifier and the cleaning time, to determine a target time interval;
determining the original text data whose time identifier falls within the target time interval as text data to be cleaned;
querying a rule database based on the channel identifier, to acquire a target cleaning rule corresponding to the channel identifier;
cleaning the text data to be cleaned using the target cleaning rule, to obtain target plain text data.
2. The text data processing method according to claim 1, characterized in that the target cleaning rule includes a special tag cleaning rule, a digit cleaning rule, a punctuation mark cleaning rule and a foreign language cleaning rule;
the cleaning the text data to be cleaned using the target cleaning rule, to obtain target plain text data, comprises:
performing tag cleaning on the text data to be cleaned using the special tag cleaning rule, to obtain first text data;
performing digit cleaning on the first text data using the digit cleaning rule, to obtain second text data;
performing symbol cleaning on the second text data using the punctuation mark cleaning rule, to obtain third text data;
performing foreign language cleaning on the third text data using the foreign language cleaning rule, to obtain the target plain text data.
3. The text data processing method according to claim 2, characterized in that, after the step of performing tag cleaning on the text data to be cleaned using the special tag cleaning rule to obtain first text data, and before the step of performing digit cleaning on the first text data using the digit cleaning rule to obtain second text data, the text data processing method further comprises:
querying a brand database based on the channel identifier, to acquire target brand data corresponding to the channel identifier;
performing matching processing between the first text data and the target brand data;
if the first text data matches the target brand data successfully, performing exclusion processing on the successfully matched first text data, and then performing digit cleaning on the first text data after exclusion processing using the digit cleaning rule, to obtain the second text data;
if the first text data does not match the target brand data, performing digit cleaning on the first text data using the digit cleaning rule, to obtain the second text data.
4. The text data processing method according to claim 2, characterized in that the performing digit cleaning on the first text data using the digit cleaning rule, to obtain second text data, comprises:
extracting a digit string from the first text data, and judging, using a regular expression matching algorithm, whether the digit string is a thousands-separated number;
if the digit string is a thousands-separated number, removing the thousands separators from the thousands-separated number, and performing Chinese-numeral conversion on the number after removal of the thousands separators, to obtain the second text data;
if the digit string is not a thousands-separated number, judging, using the regular expression matching algorithm, whether the digit string is a decimal number;
if the digit string is a decimal number, performing Chinese-numeral conversion on the digits before the decimal separator, converting the digits after the decimal separator digit by digit, and replacing the decimal separator with a Chinese character, to obtain the second text data;
if the digit string is not a decimal number, judging, using the regular expression matching algorithm, whether the digit string is a Chinese quantifier;
if the digit string is a Chinese quantifier, performing Chinese-numeral conversion on the digit string, to obtain the second text data;
if the digit string is not a Chinese quantifier, judging, using the regular expression matching algorithm, whether the digit string is a digit-wise number;
if the digit string is a digit-wise number, converting the digit-wise number digit by digit, to obtain the second text data;
if the digit string is not a digit-wise number, performing Chinese-numeral conversion on the digit string, to obtain the second text data.
5. The text data processing method according to claim 1, characterized in that, before the step of acquiring a data cleaning request, the text data processing method further comprises:
acquiring a data crawling task, the data crawling task including a task type and a file identifier;
if the task type is a real-time task, triggering a crawler tool to execute the crawler file corresponding to the file identifier, to obtain original text data;
if the task type is a timed task, triggering a time monitoring tool, so that when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier, to obtain original text data;
according to a classified storage folder corresponding to the crawler file, storing the original text data in the last-level folder of the classified storage folder.
6. The text data processing method according to claim 1, characterized in that, after the step of obtaining target plain text data, the text data processing method further comprises:
acquiring a model training request, the model training request including a channel identifier;
acquiring, from a training text database corresponding to the channel identifier, the target plain text data corresponding to the channel identifier;
performing word segmentation on the target plain text data, to obtain at least two target word segments;
performing model training on the at least two target word segments using an N-gram model, to obtain a target Chinese language model.
7. A text data processing device, characterized by comprising:
a data cleaning request module, configured to acquire a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
an original text data acquisition module, configured to determine, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
a target time interval acquisition module, configured to query a data cleaning record table according to the channel identifier and the cleaning time, to determine a target time interval;
a to-be-cleaned text data acquisition module, configured to determine the original text data whose time identifier falls within the target time interval as text data to be cleaned;
a target cleaning rule acquisition module, configured to query a rule database based on the channel identifier, to acquire a target cleaning rule corresponding to the channel identifier;
a target plain text data acquisition module, configured to clean the text data to be cleaned using the target cleaning rule, to obtain target plain text data.
8. The text data processing device according to claim 7, characterized in that the target cleaning rule includes a special tag cleaning rule, a digit cleaning rule, a punctuation mark cleaning rule and a foreign language cleaning rule;
the target plain text data acquisition module comprises:
a tag cleaning unit, configured to perform tag cleaning on the text data to be cleaned using the special tag cleaning rule, to obtain first text data;
a digit cleaning unit, configured to perform digit cleaning on the first text data using the digit cleaning rule, to obtain second text data;
a symbol cleaning unit, configured to perform symbol cleaning on the second text data using the punctuation mark cleaning rule, to obtain third text data;
a foreign language cleaning unit, configured to perform foreign language cleaning on the third text data using the foreign language cleaning rule, to obtain the target plain text data.
9. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the text data processing method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the text data processing method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811093274.4A CN109299233B (en) | 2018-09-19 | 2018-09-19 | Text data processing method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299233A (en) | 2019-02-01 |
CN109299233B (en) | 2024-03-01 |
Family
ID=65163361
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201811093274.4A | Active | CN109299233B (en) | 2018-09-19 | 2018-09-19 | Text data processing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299233B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096626A (en) * | 2019-03-18 | 2019-08-06 | 平安普惠企业管理有限公司 | Processing method, device, equipment and the storage medium of contract text data |
CN111191421A (en) * | 2019-12-30 | 2020-05-22 | 出门问问信息科技有限公司 | Text processing method and device, computer storage medium and electronic equipment |
CN111797078A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Data cleaning method, model training method, device, storage medium and equipment |
CN112199364A (en) * | 2020-10-16 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Data cleaning method and device, electronic equipment and storage medium |
CN112287638A (en) * | 2020-10-28 | 2021-01-29 | 云账户技术(天津)有限公司 | Digital display method and device |
CN113064885A (en) * | 2020-12-29 | 2021-07-02 | 中国移动通信集团贵州有限公司 | Data cleaning method and device |
CN117648635B (en) * | 2024-01-30 | 2024-05-03 | 深圳昂楷科技有限公司 | Sensitive information classification and classification method and system and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361064A (en) * | 2014-11-04 | 2015-02-18 | 中国银行股份有限公司 | Data cleaning method for data files and data files processing method |
WO2016101690A1 (en) * | 2014-12-22 | 2016-06-30 | 国家电网公司 | Time sequence analysis-based state monitoring data cleaning method for power transmission and transformation device |
US20180052888A1 (en) * | 2016-08-17 | 2018-02-22 | International Business Machines Corporation | Result set optimization for a search query |
CN107784070A (en) * | 2017-09-15 | 2018-03-09 | 平安科技(深圳)有限公司 | A kind of method, apparatus and equipment for improving data cleansing efficiency |
CN108052665A (en) * | 2017-12-29 | 2018-05-18 | 深圳市中易科技有限责任公司 | A kind of data cleaning method and device based on distributed platform |
CN108446362A (en) * | 2018-03-13 | 2018-08-24 | 平安普惠企业管理有限公司 | Data cleansing processing method, device, computer equipment and storage medium |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096626A (en) * | 2019-03-18 | 2019-08-06 | 平安普惠企业管理有限公司 | Processing method, device, equipment and the storage medium of contract text data |
CN111797078A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Data cleaning method, model training method, device, storage medium and equipment |
CN111191421A (en) * | 2019-12-30 | 2020-05-22 | 出门问问信息科技有限公司 | Text processing method and device, computer storage medium and electronic equipment |
CN111191421B (en) * | 2019-12-30 | 2023-09-12 | 出门问问创新科技有限公司 | Text processing method and device, computer storage medium and electronic equipment |
CN112199364A (en) * | 2020-10-16 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Data cleaning method and device, electronic equipment and storage medium |
CN112287638A (en) * | 2020-10-28 | 2021-01-29 | 云账户技术(天津)有限公司 | Digital display method and device |
CN112287638B (en) * | 2020-10-28 | 2022-12-09 | 云账户技术(天津)有限公司 | Digital display method and device |
CN113064885A (en) * | 2020-12-29 | 2021-07-02 | 中国移动通信集团贵州有限公司 | Data cleaning method and device |
CN117648635B (en) * | 2024-01-30 | 2024-05-03 | 深圳昂楷科技有限公司 | Sensitive information classification and classification method and system and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109299233B (en) | 2024-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
CN109299233A (en) | Text data processing method, device, computer equipment and storage medium | |
CN108287858B (en) | Semantic extraction method and device for natural language | |
CN111222305B (en) | Information structuring method and device | |
CN109766438A (en) | Biographic information extracting method, device, computer equipment and storage medium | |
CN112084381A (en) | Event extraction method, system, storage medium and equipment | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
CN110532563A (en) | The detection method and device of crucial paragraph in text | |
CN111767716A (en) | Method and device for determining enterprise multilevel industry information and computer equipment | |
CN110096572B (en) | Sample generation method, device and computer readable medium | |
CN111325018B (en) | Domain dictionary construction method based on web retrieval and new word discovery | |
CN106030568B (en) | Natural language processing system, natural language processing method and natural language processing program | |
CN107526721B (en) | Ambiguity elimination method and device for comment vocabularies of e-commerce products | |
CN110008473B (en) | Medical text named entity identification and labeling method based on iteration method | |
CN110929520A (en) | Non-named entity object extraction method and device, electronic equipment and storage medium | |
CN111078839A (en) | Structured processing method and processing device for referee document | |
Wang et al. | Mongolian named entity recognition with bidirectional recurrent neural networks | |
CN110968664A (en) | Document retrieval method, device, equipment and medium | |
CN111563212A (en) | Inner chain adding method and device | |
CN108345694B (en) | Document retrieval method and system based on theme database | |
CN110222340B (en) | Training method of book figure name recognition model, electronic device and storage medium | |
CN110874408B (en) | Model training method, text recognition device and computing equipment | |
CN110941713B (en) | Self-optimizing financial information block classification method based on topic model | |
CN115130475A (en) | Extensible universal end-to-end named entity identification method | |
CN114842982A (en) | Knowledge expression method, device and system for medical information system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||