CN109299233A - Text data processing method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN109299233A
Authority
CN
China
Prior art keywords
text data
data
cleaning
target
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811093274.4A
Other languages
Chinese (zh)
Other versions
CN109299233B (en
Inventor
黄锦伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811093274.4A priority Critical patent/CN109299233B/en
Publication of CN109299233A publication Critical patent/CN109299233A/en
Application granted granted Critical
Publication of CN109299233B publication Critical patent/CN109299233B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention discloses a text data processing method, device, computer equipment and storage medium, applied in the big data field and in particular to big data acquisition and processing. The method comprises: obtaining a data cleaning request, the data cleaning request including a channel identifier and a cleaning time; determining, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier; querying a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval; determining the original text data whose time identifier falls within the target time interval as text data to be cleaned; querying a rule database based on the channel identifier to obtain a target cleaning rule corresponding to the channel identifier; and cleaning the text data to be cleaned using the target cleaning rule to obtain target plain-text data. The method can effectively improve the efficiency and quality of text data cleaning.

Description

Text data processing method, device, computer equipment and storage medium
Technical field
The present invention relates to the technical field of big data processing, and in particular to a text data processing method, device, computer equipment and storage medium.
Background
In technical fields such as speech recognition and OCR text recognition, a large amount of text data from a specific domain must be collected to train a language model dedicated to that domain, so as to guarantee the recognition accuracy of the trained language model in that domain. In current language model training, text data is mainly collected and cleaned manually, a process that is time-consuming, inefficient, and error-prone. Moreover, training a Chinese language model requires pure Chinese text data. When pure Chinese text data is collected and cleaned manually, all non-Chinese data in the text must be cleaned out, which is likewise time-consuming and inefficient, and its accuracy cannot be guaranteed.
Summary of the invention
Embodiments of the present invention provide a text data processing method, device, computer equipment and storage medium, to solve the problems of low efficiency and high error rate in manually collecting and cleaning text data.
A text data processing method, comprising:
obtaining a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
determining, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
querying a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval;
determining the original text data whose time identifier falls within the target time interval as text data to be cleaned;
querying a rule database based on the channel identifier to obtain a target cleaning rule corresponding to the channel identifier;
cleaning the text data to be cleaned using the target cleaning rule to obtain target plain-text data.
A text data processing device, comprising:
a data cleaning request acquisition module, configured to obtain a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
an original text data acquisition module, configured to determine, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
a target time interval acquisition module, configured to query a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval;
a to-be-cleaned text data acquisition module, configured to determine the original text data whose time identifier falls within the target time interval as text data to be cleaned;
a target cleaning rule acquisition module, configured to query a rule database based on the channel identifier to obtain a target cleaning rule corresponding to the channel identifier;
a target plain-text data acquisition module, configured to clean the text data to be cleaned using the target cleaning rule to obtain target plain-text data.
A computer equipment, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above text data processing method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the above text data processing method when executed by a processor.
In the above text data processing method, device, computer equipment and storage medium, the corresponding target corpus is determined according to the channel identifier in the data cleaning request, so that at least one piece of original text data carrying a time identifier is obtained, which improves the efficiency of acquiring original text data. The target time interval is determined according to the channel identifier and the cleaning time in the data cleaning request, and the text data to be cleaned is determined according to the target time interval and the time identifier carried by each piece of original text data. This helps avoid repeatedly cleaning original text data in the target corpus that has already been cleaned, thereby improving the efficiency of text data cleaning. The text data to be cleaned is cleaned according to the target cleaning rule corresponding to the channel identifier, so the target plain-text data can be obtained quickly. The process requires no manual intervention and can effectively improve the efficiency and quality of text cleaning.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without any creative effort.
Fig. 1 is a schematic diagram of an application environment of the text data processing method in an embodiment of the present invention;
Fig. 2 is a flow chart of the text data processing method in an embodiment of the present invention;
Fig. 3 is another flow chart of the text data processing method in an embodiment of the present invention;
Fig. 4 is another flow chart of the text data processing method in an embodiment of the present invention;
Fig. 5 is another flow chart of the text data processing method in an embodiment of the present invention;
Fig. 6 is another flow chart of the text data processing method in an embodiment of the present invention;
Fig. 7 is another flow chart of the text data processing method in an embodiment of the present invention;
Fig. 8 is a schematic diagram of the text data processing device in an embodiment of the present invention;
Fig. 9 is a schematic diagram of the computer equipment in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The text data processing method provided by the embodiments of the present invention can be applied in the application environment shown in Fig. 1. Specifically, the text data processing method is applied in a text data processing system that includes a client and a server, as shown in Fig. 1. The client communicates with the server over a network to clean text data automatically, so that batches of target plain-text data can be obtained quickly. This saves the time and labor cost of manual cleaning and improves cleaning efficiency and quality. The client, also called the user terminal, is the program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, personal computers, laptops, smartphones, tablet computers, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, a text data processing method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:
S201: Obtain a data cleaning request, the data cleaning request including a channel identifier and a cleaning time.
The data cleaning request is a request to clean text data automatically. Specifically, the user sends it through the client to the server of the text data processing system, so that the server performs the corresponding text cleaning based on the request. The channel identifier identifies the source channel of the text data to be cleaned. In this embodiment, a source channel can be understood as a category channel preset by a website, including but not limited to news, finance, entertainment, sports, and education. The cleaning time is the cutoff time of the text data cleaning specified by this data cleaning request. It may be the current system date at the moment the client triggers the data cleaning request, or a time that the user sets through the client.
In this embodiment, the data cleaning configuration interface of the client displays a channel identifier input box, a cleaning time input box, and a confirm button. The user can type the channel identifier to be cleaned directly into the channel identifier input box, or select it from a drop-down list associated with that input box. The cleaning time input box shows the current system date by default and provides a manual selection button: the user can either accept the default current system date as the cleaning time or click the manual selection button to set the cleaning time. After the channel identifier and the cleaning time have been chosen, clicking the confirm button triggers the data cleaning request, which the server then receives.
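The request described in S201 can be pictured as a small record with two fields. The following is a minimal sketch, not the patented implementation: the names `DataCleaningRequest`, `channel_id`, `cleaning_time`, and `build_request` are hypothetical, and the default of "now" stands in for the current system date that the interface shows by default.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DataCleaningRequest:
    """Hypothetical container for the two fields named in S201."""
    channel_id: str
    # Defaults to the current system time when the user sets nothing.
    cleaning_time: datetime = field(default_factory=datetime.now)


def build_request(channel_id: str,
                  cleaning_time: Optional[datetime] = None) -> DataCleaningRequest:
    """Build a request from the values entered on the configuration interface."""
    if not channel_id:
        raise ValueError("channel_id must not be empty")
    if cleaning_time is None:
        return DataCleaningRequest(channel_id)
    return DataCleaningRequest(channel_id, cleaning_time)
```

A user-chosen cutoff would be passed explicitly, for example `build_request("sports", datetime(2018, 9, 19))`.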
S202: Determine, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier.
The target corpus is the corpus that stores the original text data corresponding to the channel identifier. In this embodiment, multiple corpora are stored in advance in the database of the text data processing system. Each corpus stores the original text data of one source channel, so each corpus is associated with one channel identifier. Specifically, the server queries the database according to the channel identifier and takes the corpus corresponding to that channel identifier as the target corpus, a simple and fast lookup.
Original text data is unprocessed text data stored in a corpus. It may be text data associated with the channel identifier that a crawler tool has crawled from the various source channels of relevant websites. For example, a crawler tool crawls the web page content under the "sports" source channel of the Sina website, and that content is stored as a piece of original text data in the corpus corresponding to the sports channel identifier.
Each piece of original text data carries a time identifier, which may be the time at which the original text data was stored in the corpus corresponding to the channel identifier. Specifically, when a piece of original text data is stored in the corpus, the current system time is obtained through a system time function (such as time()), and the original text data carries that system time as its time identifier.
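As an illustration of how a corpus entry might receive its time identifier on storage, here is a minimal in-memory sketch. The `corpora` dictionary, the record layout, and `store_original_text` are assumptions made for the example; the source only states that one corpus exists per channel and that a system time function such as time() supplies the identifier.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stand-in for the database: channel_id -> corpus.
corpora = defaultdict(list)


def store_original_text(channel_id, text, now=None):
    """Append crawled text to the channel's corpus, stamped with its storage time."""
    record = {
        "text": text,
        # The time identifier is the system time at the moment of storage.
        "time_id": time.time() if now is None else now,
    }
    corpora[channel_id].append(record)
    return record
```

The optional `now` argument exists only so that the behaviour can be tested deterministically.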
S203: Query the data cleaning record table according to the channel identifier and the cleaning time to determine the target time interval.
The data cleaning record table is a data table preconfigured on the server for recording the information in data cleaning requests. In the order in which data cleaning requests are received, the server records the channel identifier and cleaning time of every request in the data cleaning record table, so that it can determine whether the original text data in the corpus corresponding to each channel identifier has already been cleaned.
In this embodiment, determining the target time interval proceeds as follows. The server queries the data cleaning record table according to the channel identifier to determine the most recent cleaning time of the corresponding target corpus (that is, the cleaning time carried by the previous data cleaning request). It then determines the target time interval from that most recent cleaning time and the cleaning time carried by the current data cleaning request. The target time interval starts at the most recent cleaning time and ends at the cleaning time carried by the current request. The target time interval thus identifies the original text data in the target corpus that has not yet been cleaned, which helps avoid repeatedly cleaning data that has already been cleaned and so avoids reducing the efficiency of text data cleaning.
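The interval lookup in S203 can be sketched as follows. This is an illustrative assumption rather than the patent's code: `cleaning_records` is a hypothetical in-memory stand-in for the data cleaning record table, and falling back to `datetime.min` when a channel has never been cleaned is a choice made for this example, since the source does not describe that case.

```python
from datetime import datetime

# Hypothetical record table: (channel_id, cleaning_time) in order of arrival.
cleaning_records = []


def target_interval(channel_id, cleaning_time):
    """Return (start, end) for this request and record the request afterwards."""
    previous = [t for c, t in cleaning_records if c == channel_id]
    # Start at the most recent earlier cleaning time for this channel.
    # Assumption: with no earlier record, everything stored so far is in scope.
    start = max(previous) if previous else datetime.min
    cleaning_records.append((channel_id, cleaning_time))
    return start, cleaning_time
```

Two successive requests for the same channel therefore produce adjacent intervals, so no stored text is selected twice.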
S204: Determine the original text data whose time identifier falls within the target time interval as the text data to be cleaned.
In this embodiment, each piece of original text data stored in the target corpus carries a time identifier that records when it was stored in the target corpus, and the target time interval identifies the original text data in the target corpus that has not yet been cleaned. The server can therefore directly take all original text data whose time identifier falls within the target time interval as the text data to be cleaned, which helps improve cleaning efficiency. The text data to be cleaned is the text data in the target corpus that has not yet been cleaned. Because the target time interval starts at the most recent cleaning time and ends at the cleaning time carried by the current data cleaning request, original text data whose time identifier lies outside the interval has already been cleaned. Treating it as text data to be cleaned could cause repeated cleaning and reduce the efficiency of text data cleaning.
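The selection in S204 amounts to a filter over the corpus. The sketch below assumes each record is a dictionary with a `time_id` field (a hypothetical layout) and treats the interval as open at the start and closed at the end, so that an item stored exactly at the previous cutoff is not cleaned twice. The source does not specify the boundary handling, so this is an assumption.

```python
def select_text_to_clean(corpus, start, end):
    """Keep records whose time identifier lies in the interval (start, end]."""
    return [record for record in corpus if start < record["time_id"] <= end]
```

Any comparable time representation works here (numbers or `datetime` objects), as long as the corpus and the interval use the same one.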
S205: Query the rule database based on the channel identifier to obtain the target cleaning rule corresponding to the channel identifier.
The rule database stores the different cleaning rules. Each cleaning rule corresponds to the content of a source channel and therefore to one channel identifier, so the rule database can be queried by channel identifier to obtain the target cleaning rule for that channel.
In this embodiment, the target cleaning rule corresponding to the channel identifier includes at least two feature cleaning rules and a feature cleaning order. A feature cleaning rule is a rule for cleaning one kind of feature in the text data to be cleaned that corresponds to the channel identifier. Feature cleaning rules include but are not limited to a special tag cleaning rule, a digit cleaning rule, a punctuation cleaning rule, and a foreign-language cleaning rule. The feature cleaning order defines the sequence in which the feature cleaning rules are applied to the text data to be cleaned and can be understood as the priority among them. Fixing this order helps guarantee cleaning quality. Because each feature cleaning rule targets one kind of feature, a given fragment of the text data to be cleaned may hit two or more feature cleaning rules and could be cleaned by any of them. If the order between those rules were not defined, cleaning errors would result and cleaning quality would suffer.
S206: Clean the text data to be cleaned using the target cleaning rule to obtain the target plain-text data.
The target plain-text data is the clean text data obtained after the text data to be cleaned has been cleaned with the target cleaning rule. Specifically, the server calls the feature cleaning rules one after another, following the feature cleaning order in the target cleaning rule, to clean the text data to be cleaned and obtain the target plain-text data. The process requires no manual intervention and can effectively improve the efficiency and quality of text cleaning.
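Steps S205 and S206 together can be sketched as a lookup that returns an ordered list of rule functions, followed by a loop that applies them in that order. Everything named here is hypothetical: `RULE_DATABASE`, `drop_urls`, and `drop_digits` are simplified placeholders, not the feature cleaning rules of the patent.

```python
import re
from typing import Callable, Dict, List

CleaningRule = Callable[[str], str]


# Simplified stand-ins for real feature cleaning rules (illustrative only).
def drop_urls(text: str) -> str:
    return re.sub(r"https?://[!-~]+", "", text)  # run of printable ASCII after the scheme


def drop_digits(text: str) -> str:
    return re.sub(r"[0-9]+", "", text)


# Hypothetical rule database: the list order is the feature cleaning order.
RULE_DATABASE: Dict[str, List[CleaningRule]] = {
    "sports": [drop_urls, drop_digits],
}


def get_target_cleaning_rule(channel_id: str) -> List[CleaningRule]:
    if channel_id not in RULE_DATABASE:
        raise KeyError(f"no cleaning rule configured for channel {channel_id!r}")
    return RULE_DATABASE[channel_id]


def clean(texts: List[str], channel_id: str) -> List[str]:
    """Apply the channel's rules to every text, strictly in the configured order."""
    rules = get_target_cleaning_rule(channel_id)
    cleaned = []
    for text in texts:
        for rule in rules:
            text = rule(text)
        cleaned.append(text)
    return cleaned
```

Storing the order in the data rather than in the code keeps the priority between rules explicit per channel, which is the property the text emphasises.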
Further, after the target plain-text data is obtained, the text data processing method also includes storing the target plain-text data in the training text database corresponding to the channel identifier. A training text database is a database for storing training text data. Each training text database corresponds to one channel identifier and stores only the target plain-text data for that channel identifier. The target plain-text data in the training text database can then be used directly to train a target Chinese language model corresponding to the channel identifier, which improves the accuracy with which that model recognizes text data to be recognized for the same channel identifier. Text data to be recognized refers to data on which text recognition needs to be performed.
In the text data processing method provided by this embodiment, the corresponding target corpus is determined according to the channel identifier in the data cleaning request, so that at least one piece of original text data carrying a time identifier is obtained, which improves the efficiency of acquiring original text data. The target time interval is determined according to the channel identifier and the cleaning time in the data cleaning request, and the text data to be cleaned is determined according to the target time interval and the time identifier carried by each piece of original text data. This helps avoid repeatedly cleaning original text data that has already been cleaned, thereby improving the efficiency of text data cleaning. The text data to be cleaned is cleaned according to the target cleaning rule corresponding to the channel identifier, so the target plain-text data can be obtained quickly. The process requires no manual intervention and can effectively improve the efficiency and quality of text cleaning.
In one embodiment, the target cleaning rule corresponding to the channel identifier includes at least two feature cleaning rules and a feature cleaning order, and the feature cleaning rules include but are not limited to a special tag cleaning rule, a digit cleaning rule, a punctuation cleaning rule, and a foreign-language cleaning rule. As shown in Fig. 3, cleaning the text data to be cleaned using the target cleaning rule to obtain the target plain-text data comprises:
S301: Perform tag cleaning on the text data to be cleaned using the special tag cleaning rule to obtain first text data.
The special tag cleaning rule is the rule for cleaning special tags present in the text data to be cleaned. Special tags include but are not limited to hyperlink addresses, URLs, and HTML tags. A special tag consists of a label and a character string, and the character string is composed of at least one of digits, letters, and punctuation marks. Because of this composition, a special tag may hit other feature cleaning rules. When cleaning the text data to be cleaned, the special tag cleaning rule is therefore applied first to obtain the first text data, which ensures that the subsequent cleaning operations proceed smoothly and safeguards the efficiency and quality of data cleaning.
In this embodiment, performing tag cleaning with the special tag cleaning rule to obtain the first text data proceeds as follows. A regular expression corresponding to the special tag cleaning rule is matched against the special tags that occur in the text data to be cleaned. If the label of a special tag is matched, the character string following the label is deleted first and the label itself is deleted afterwards, so that all special tags in the text data to be cleaned are removed. For example, for the URL "http://120.77.246.207/index.aspx?objid=F3BFA010-60E9-4F63-BDED-B22782EC0513&pagecode=RE", the character string following the recognized "http" label is deleted first and the "http" label is deleted afterwards, which cleans the special tag. This avoids the errors that easily arise when the character string has already been altered by the time the label is cleaned.
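A regular-expression pass of this kind might look like the sketch below. The patterns `URL_RE` and `HTML_TAG_RE` are assumptions written for this example, not the expressions used in the patent. Each pattern matches the label and the string that follows it in one go, which has the same net effect as deleting the string first and the label afterwards.

```python
import re

# Scheme label plus the run of ASCII URL characters that follows it.
URL_RE = re.compile(
    r"(?:https?|ftp)://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+",
    re.IGNORECASE,
)
# Any HTML-style tag, including its attributes.
HTML_TAG_RE = re.compile(r"<[^>]+>")


def clean_special_tags(text: str) -> str:
    """Remove HTML tags and URLs before any other feature cleaning rule runs."""
    text = HTML_TAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    return text
```

Restricting the URL tail to ASCII characters means that Chinese text written directly after a URL, without a separating space, is left in place.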
S302: Perform digit cleaning on the first text data using the digit cleaning rule to obtain second text data.
The digit cleaning rule is the rule for cleaning digits present in the text data to be cleaned. To meet the training needs of the subsequent Chinese language model, pure Chinese text data must be collected. The digit cleaning rule is therefore applied to the first text data to convert the digits that occur in it into Chinese characters.
In this embodiment, digit cleaning of the first text data has two modes. The first reads the number as a value in Chinese numerals, for example converting 123 into "一百二十三" (one hundred twenty-three). The second converts the number digit by digit, for example converting 123 into "一二三" (one two three). When the digit cleaning rule is applied, the cleaning mode is selected according to the applicable premise of each number that occurs in the first text data, which improves the efficiency and quality of digit cleaning.
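The two conversion modes can be sketched as below. The selection logic is an assumption: the source says only that the mode depends on an "applicable premise", so this example uses a caller-supplied flag and falls back to digit-by-digit reading for numbers longer than four digits or with a leading zero. The function names are hypothetical, and `read_as_value` covers integers below 10000 only.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_digit_by_digit(number: str) -> str:
    """'123' -> '一二三'."""
    return "".join(DIGITS[int(ch)] for ch in number)


def read_as_value(number: str) -> str:
    """Read an integer below 10000 as a value, e.g. '123' -> '一百二十三'."""
    n = int(number)
    if n == 0:
        return DIGITS[0]
    s = str(n)
    out = []
    pending_zero = False
    for i, ch in enumerate(s):
        d = int(ch)
        if d == 0:
            pending_zero = True  # emit a single 零 only if more digits follow
            continue
        if pending_zero and out:
            out.append(DIGITS[0])
        pending_zero = False
        out.append(DIGITS[d] + UNITS[len(s) - 1 - i])
    return "".join(out)


def clean_digits(text: str, by_value: bool = True) -> str:
    """Replace every run of ASCII digits with Chinese characters."""
    def convert(match):
        num = match.group()
        has_leading_zero = len(num) > 1 and num[0] == "0"
        if by_value and len(num) <= 4 and not has_leading_zero:
            return read_as_value(num)
        return read_digit_by_digit(num)
    return re.sub(r"[0-9]+", convert, text)
```

With these definitions, "共123人" becomes "共一百二十三人" by value and "共一二三人" digit by digit. Note that this simple reader renders 10 as "一十" rather than the more idiomatic "十".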
Note that the digits cleaned in the first text data are those that remain after tag cleaning. Digits in the first text data may carry punctuation marks that have a specific meaning within a number. If the punctuation cleaning rule were applied to them directly, the numbers remaining after symbol cleaning might no longer match the numbers in the first text data, which would degrade the quality of text cleaning. The digit cleaning rule must therefore be applied to the first text data first, followed by symbol cleaning in step S303, to safeguard the efficiency and quality of text data cleaning.
S303: Perform symbol cleaning on the second text data using the punctuation cleaning rule to obtain third text data.
The punctuation cleaning rule is the rule for cleaning punctuation marks present in the text data to be cleaned. In this embodiment, symbol cleaning of the second text data has two modes, deletion and replacement. Specifically, the server stores a punctuation configuration table in advance. The table holds multiple configuration records, each of which includes a punctuation mark, a cleaning mode, and an applicable premise. The cleaning mode is either deletion or replacement, and the applicable premise is the condition under which that cleaning mode applies to the punctuation mark. For the replacement mode, the configuration record also stores the corresponding substitute, which replaces the punctuation mark during symbol cleaning.
In this embodiment, symbol cleaning with the punctuation cleaning rule to obtain the third text data proceeds as follows. A regular expression corresponding to the punctuation cleaning rule is used to match each sentence of the second text data that contains a punctuation mark against the applicable premises for that punctuation mark. If a premise matches, the punctuation mark is cleaned with the cleaning mode associated with that premise, yielding the third text data. The process requires no manual intervention, which safeguards cleaning efficiency, lowers the error rate, and improves cleaning quality. For example, if the punctuation mark ":" is used under the premise "marks a pause after a cue word, introduces the text that follows, or sums up the text that precedes", it is deleted. If it is used under the premise "in mathematical language, expresses the ratio between two quantities", it is replaced, and ":" is converted into "比" (ratio).
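A punctuation configuration table of this kind can be sketched as an ordered list of records. The table contents below are assumptions covering only the colon example from the text. The "ratio" premise is approximated as "a colon standing between two numerals", and Chinese numerals are included because digit cleaning has already run by this step.

```python
import re

NUMERAL = "0-9零一二三四五六七八九十百千"

# Hypothetical configuration records:
# (pattern encoding the mark and its applicable premise, cleaning mode, substitute)
PUNCTUATION_CONFIG = [
    # Ratio premise: colon between two numerals -> replace with 比.
    (re.compile(f"(?<=[{NUMERAL}])[:：](?=[{NUMERAL}])"), "replace", "比"),
    # Every other colon (pause after a cue word, introducing text) -> delete.
    (re.compile(r"[:：]"), "delete", ""),
]


def clean_punctuation(text: str) -> str:
    """Apply each configuration record in order, most specific premise first."""
    for pattern, mode, substitute in PUNCTUATION_CONFIG:
        text = pattern.sub(substitute if mode == "replace" else "", text)
    return text
```

The record order matters: the specific ratio premise has to run before the general deletion record, otherwise the colon would already be gone when the ratio premise is checked.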
S304: Perform foreign-language cleaning on the third text data using the foreign-language cleaning rule to obtain the target plain-text data.
The foreign-language cleaning rule is the rule for cleaning foreign-language text present in the text data to be cleaned. Foreign languages include but are not limited to English, French, and Japanese. In this embodiment, foreign-language cleaning proceeds as follows: a regular expression corresponding to the foreign-language cleaning rule is matched against the foreign-language text present in the third text data, and any matched foreign-language text is deleted.
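One way such a regular expression might be written is by Unicode range, as in the sketch below. The chosen ranges are an assumption for this example: ASCII letters, accented Latin letters (U+00C0 to U+024F), and Japanese kana (U+3040 to U+30FF). Kanji share code points with Chinese characters, so a range-based rule like this one cannot remove them.

```python
import re

# ASCII letters, accented Latin letters, Hiragana and Katakana (assumed ranges).
FOREIGN_RE = re.compile(r"[A-Za-z\u00C0-\u024F\u3040-\u30FF]+")


def clean_foreign(text: str) -> str:
    """Delete runs of Latin-script letters and Japanese kana."""
    return FOREIGN_RE.sub("", text)
```

Because the rule runs last, any digits or punctuation that belonged to foreign-language fragments have already been handled by the earlier steps.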
In the text data processing method provided by this embodiment, the special tag cleaning rule, digit cleaning rule, punctuation cleaning rule, and foreign-language cleaning rule are applied in turn to perform tag cleaning, digit cleaning, symbol cleaning, and foreign-language cleaning on the text data to be cleaned and obtain the target plain-text data. The text data to be cleaned is thus cleaned quickly and without manual intervention, which safeguards the efficiency and quality of text data cleaning. In addition, because the rules are applied in the feature cleaning order defined in the target cleaning rule, the cleaning sequence is fixed in advance. When a fragment of the text data to be cleaned is covered by several cleaning rules at once, there is no need to decide the order manually, which would raise labor costs and lower efficiency, and cleaning errors caused by applying the rules in the wrong order are avoided.
In one embodiment, as shown in figure 4, carrying out label to text data to be cleaned using special tag cleaning rule After the step of cleaning, the first text data of acquisition, and digital cleaning is carried out to the first text data using digital cleaning rule, Before the step of obtaining the second text data, text data processing method further includes, comprising:
S401: brand database is inquired based on channel identication, obtains target branding data corresponding with channel identication.
The brand database stores the brand data corresponding to different source channels. The target brand data is the brand data stored in the brand database that corresponds to the channel identifier. Brand data is the brand data of the field to which a source channel belongs, and includes brand names. A brand name may consist of digits, letters, symbols and Chinese characters; 361°, for example, consists of digits and a symbol. In the sports channel, 361° is brand data corresponding to the channel identifier; in the education channel, it is not.
S402: Match the first text data against the target brand data.
In this embodiment, the server uses a fuzzy matching algorithm to match the first text data against the target brand data, determining whether the first text data contains target brand data corresponding to the channel identifier, so that subsequent processing can branch on the matching result. The fuzzy matching algorithm includes, but is not limited to, string matching algorithms such as the KMP (Knuth-Morris-Pratt) algorithm and the BM (Boyer-Moore) algorithm.
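As an illustration of the matching step, the following is a minimal sketch of KMP substring search. Note that KMP as commonly defined finds exact occurrences of a pattern; the function names and the sample strings are illustrative and do not come from the patent.

```python
def build_failure(pattern):
    """For each prefix of pattern, length of its longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


# A brand name from the target brand data is searched for in the first text data.
print(kmp_find("我买了361°的鞋", "361°"))
```

A non-negative result corresponds to a successful match (S403); -1 corresponds to the unsuccessful branch.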
S403: If the first text data matches the target brand data, perform exclusion processing on the matched first text data, then perform number cleaning on the exclusion-processed first text data using the number cleaning rule to obtain the second text data.
In this embodiment, a successful match between the first text data and the target brand data means that the first text data contains target brand data. Because a brand name is an established term in its field, converting or deleting it may damage the integrity of the text data. The server therefore performs exclusion processing on the first text data, so that the text content corresponding to the target brand data carries an exclusion tag. When other feature cleaning rules are later applied to the text data, text content carrying the exclusion tag is left unprocessed.
Specifically, once the first text data has matched the target brand data and exclusion processing has been performed, none of the subsequent feature cleaning rules cleans the text content that carries the exclusion tag. Text content without the exclusion tag still has to be cleaned by the other feature cleaning rules. The server therefore applies the number cleaning rule to the exclusion-processed first text data, converting the digits that appear in it into Chinese characters and thereby achieving number cleaning.
S404: If the first text data does not match the target brand data, perform number cleaning on the first text data using the number cleaning rule to obtain the second text data.
In this embodiment, an unsuccessful match between the first text data and the target brand data means that the first text data contains no target brand data. The number cleaning rule is then applied to the first text data directly, converting the digits that appear in it into Chinese characters and thereby achieving number cleaning.
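The two branches above can be sketched as follows. This is a minimal sketch, assuming the exclusion tag is implemented by temporarily swapping each brand name for a private-use Unicode character that later rules cannot match; the patent does not specify how the tag is represented. `clean_numbers` is a simplified digit-by-digit stand-in for the number cleaning rule.

```python
DIGIT_MAP = str.maketrans("0123456789", "零一二三四五六七八九")


def clean_numbers(text):
    # Simplified stand-in for the number cleaning rule (digit-by-digit replacement).
    return text.translate(DIGIT_MAP)


def clean_with_brand_exclusion(text, brands):
    """Protect brand names found in text, clean the rest, then restore the brand names.

    Assumes the input contains no private-use characters (U+E000 and up).
    """
    restore = {}
    # Longer names first, so a brand containing another brand is protected as a whole.
    for i, brand in enumerate(sorted(brands, key=len, reverse=True)):
        if brand in text:                  # matched: exclusion processing (S403)
            token = chr(0xE000 + i)        # placeholder acting as the exclusion tag
            text = text.replace(brand, token)
            restore[token] = brand
    text = clean_numbers(text)             # unmatched text is cleaned directly (S404)
    for token, brand in restore.items():
        text = text.replace(token, brand)
    return text


print(clean_with_brand_exclusion("361°新款鞋售价299元", ["361°"]))  # sports channel
print(clean_with_brand_exclusion("361°新款鞋售价299元", []))        # education channel
```

With the sports-channel brand list the brand name survives unchanged; with an empty list its digits are cleaned like any other number.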
In the text data processing method provided by this embodiment, the brand database is queried by channel identifier to determine whether there is target brand data that requires exclusion processing, which helps preserve the integrity of the subsequent text data cleaning. When the first text data matches the target brand data, exclusion processing is performed on the matched first text data. This prevents later cleaning with other feature cleaning rules (such as the number cleaning rule) from also cleaning the text content corresponding to the target brand data and damaging the integrity of the first text data. The accuracy of text data cleaning is thereby improved and cleaning errors are avoided.
In one embodiment, as shown in Fig. 5, performing number cleaning on the first text data using the number cleaning rule to obtain the second text data includes:
S501: Extract a digit string from the first text data, and use a regular expression matching algorithm to judge whether the digit string is a thousands-separated number.
In this embodiment, the first text data may be the text data obtained after tag cleaning is performed on the text data to be cleaned using the special tag cleaning rule, or the text data obtained after exclusion processing is performed on first text data that matched the target brand data. A string matching algorithm can be used to extract digit strings from the first text data so that number cleaning can then be performed on them.
The regular expression matching algorithm performs string matching based on regular expressions. A regular expression (often abbreviated in code as regex, regexp or RE) uses a single string to describe and match a set of strings that follow a given syntactic rule, and is commonly used to search for and replace text that fits a pattern. A thousands-separated number is a number in which a comma (the thousands separator) is inserted every three digits to make the value easier to read, for example 1,000,000.
In this embodiment, the server is preconfigured with a regular expression that matches the thousands separator in a digit string. When number cleaning is performed on the first text data using the number cleaning rule, in accordance with the feature cleaning order in the target cleaning rule, this regular expression is first applied to the digit strings extracted from the first text data. If a digit string is found to contain a thousands separator, it is identified as a thousands-separated number; otherwise it is not. Because number cleaning precedes symbol cleaning in the feature cleaning order, the server first judges whether a digit string is a thousands-separated number and then cleans it according to the result. This avoids the error that would occur if the punctuation cleaning rule ran first, mistook the "," in a thousands-separated number for an ordinary comma and removed it, and so guarantees cleaning quality.
S502: If the digit string is a thousands-separated number, remove the thousands separators from it and convert the resulting number to Chinese numerals to obtain the second text data.
In this embodiment, if the server determines that the digit string is a thousands-separated number, the "," in the digit string is a thousands separator. The server removes the thousands separators and converts the resulting number to Chinese numerals to obtain the second text data, thereby achieving number cleaning. For example, removing the separators from the thousands-separated number 1,000,000 gives 1000000, which is then converted to Chinese numerals, i.e., 一百万 (one million).
S503: If the digit string is not a thousands-separated number, use the regular expression matching algorithm to judge whether the digit string is a decimal number.
Specifically, if the server determines that the digit string is not a thousands-separated number, the digit string contains no "," thousands separator. The regular expression matching algorithm is then used to judge whether the digit string is a decimal number, i.e., a number containing the decimal point ".", such as 123.45.
In this embodiment, the server is preconfigured with a regular expression that matches the decimal point in a digit string. When number cleaning is performed on the first text data using the number cleaning rule, this regular expression is applied to the digit strings extracted from the first text data. If a digit string is found to contain a decimal point, it is identified as a decimal number; otherwise it is not. Because number cleaning precedes symbol cleaning in the feature cleaning order, the server first judges whether a digit string is a decimal number and then cleans it according to the result. This avoids the error that would occur if the punctuation cleaning rule ran first, mistook the decimal point "." for a full stop and removed it, and so guarantees cleaning quality.
S504: If the digit string is a decimal number, convert the digits before the decimal point to Chinese numerals, convert the digits after the decimal point digit by digit, and replace the decimal point with a Chinese character, to obtain the second text data.
In this embodiment, if the server determines that the digit string is a decimal number, the "." in the digit string is a decimal point. The digits before the decimal point are converted to Chinese numerals, the digits after it are converted digit by digit, and the decimal point itself is replaced with the Chinese character 点 ("point"), yielding the second text data and achieving number cleaning. For the decimal number 123.45, for example, 123 is converted to 一百二十三, 45 is converted digit by digit to 四五, and the decimal point is replaced with 点, giving 一百二十三点四五.
S505: If the digit string is not a decimal number, use the regular expression matching algorithm to judge whether the digit string is a Chinese quantifier.
Specifically, if the server determines that the digit string is not a decimal number, the digit string contains no "." decimal point. The regular expression matching algorithm is then used to judge whether the digit string is a Chinese quantifier, that is, whether it is followed by a preconfigured Chinese unit, so that subsequent number cleaning can proceed according to the result. A Chinese unit is a Chinese unit word or measure word. The units preconfigured on the server include, but are not limited to (rendered here in English or pinyin): fen, hour, class, li, mao, yuan, kuai, jiao, ge, tai, mian, kuai, xuan, item, drop, piece, cun, meter, chi, ten, hundred, ten thousand, hundred million, million, thousand, gram, ton, bottle, box, cup, case, bucket, tank, group, pair, bundle, portion, ticket, time, part and person.
S506: If the digit string is a Chinese quantifier, convert the digit string to Chinese numerals to obtain the second text data.
In this embodiment, if the server determines that the digit string is followed by a Chinese unit, the digit string expresses a quantity (that is, it is a Chinese quantifier). The digit string is then converted to Chinese numerals to obtain the second text data, achieving number cleaning. For example, "123块" is converted to "一百二十三块".
S507: If the digit string is not a Chinese quantifier, use the regular expression matching algorithm to judge whether the digit string is an identifier number.
If the server determines that the digit string is not a Chinese quantifier, the digit string does not express a quantity, and the regular expression matching algorithm is used to judge whether it is an identifier number. Identifier numbers here include, but are not limited to, numbers generated according to preset numbering rules, such as ID card numbers, mobile phone numbers, organization codes and contract numbers. Because such numbers have a fixed length and follow a specific format, a regular expression that recognizes them can be configured from the length or format of the digit string. For example, an ID card number has eighteen digits, made up of a seventeen-digit body code and a one-digit check code, arranged from left to right as a six-digit address code, an eight-digit date-of-birth code, a three-digit sequence code and a one-digit check code. The server matches the digit strings extracted from the first text data against the regular expression for identifier numbers; a successful match means the digit string is an identifier number.
S508: If the digit string is an identifier number, convert it digit by digit to obtain the second text data.
In this embodiment, if the server determines that the digit string is an identifier number, it converts the number digit by digit to obtain the second text data, achieving number cleaning. For example, the mobile phone number 12345678911 is converted to "一二三四五六七八九一一".
S509: If the digit string is not an identifier number, convert the digit string to Chinese numerals to obtain the second text data.
Specifically, if the server determines that the digit string is not an identifier number, it converts the digit string to Chinese numerals to obtain the second text data, achieving number cleaning and yielding purer text data.
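The decision cascade of steps S501 to S509 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: `TOKEN_RE`, `ID_RE` (11-digit mobile numbers and 18-digit ID numbers only) and the `UNITS` subset are hypothetical, and numbers longer than twelve digits fall back to digit-by-digit reading.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = set("元块角毛分个台片寸米尺克吨瓶盒杯箱桶罐组双份次件人")  # assumed subset of units
TOKEN_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\d+\.\d+|\d+")   # digit-string extraction
ID_RE = re.compile(r"1\d{10}|\d{18}")                       # assumed identifier formats


def read_digits(s):
    """Digit-by-digit conversion, e.g. 45 -> 四五."""
    return "".join(DIGITS[int(c)] for c in s)


def read_section(n):
    """Read 1..9999 by value; 10 is read formally as 一十."""
    out, gap = "", False
    for power, unit in ((1000, "千"), (100, "百"), (10, "十"), (1, "")):
        d = n // power % 10
        if d == 0:
            gap = bool(out)            # remember an inner zero
        else:
            out += ("零" if gap else "") + DIGITS[d] + unit
            gap = False
    return out


def read_value(s):
    """Conversion to Chinese numerals by value, e.g. 1000000 -> 一百万."""
    n = int(s)
    if n == 0:
        return "零"
    if len(s) > 12:                    # beyond this sketch's range (assumption)
        return read_digits(s)
    out, gap = "", False
    for power, unit in ((10**8, "亿"), (10**4, "万"), (1, "")):
        sec = n // power % 10000
        if sec == 0:
            gap = bool(out)
            continue
        if out and (gap or sec < 1000):
            out += "零"
        out += read_section(sec) + unit
        gap = False
    return out


def convert(match):
    token, text, end = match.group(), match.string, match.end()
    if "," in token:                               # S502: thousands-separated number
        return read_value(token.replace(",", ""))
    if "." in token:                               # S504: decimal number
        whole, frac = token.split(".")
        return read_value(whole) + "点" + read_digits(frac)
    if end < len(text) and text[end] in UNITS:     # S506: followed by a Chinese unit
        return read_value(token)
    if ID_RE.fullmatch(token):                     # S508: identifier number
        return read_digits(token)
    return read_value(token)                       # S509: any other number


def clean_numbers(text):
    return TOKEN_RE.sub(convert, text)


for sample in ("1,000,000", "123.45", "123块", "12345678911"):
    print(sample, "->", clean_numbers(sample))
```

The checks run in the same order as the steps, so a digit string is classified by the first test it passes.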
In the text data processing method provided by this embodiment, the digit strings extracted from the first text data are cleaned in turn as thousands-separated numbers, decimal numbers, Chinese quantifiers and identifier numbers, so that the digits in the first text data are converted to Chinese characters, which helps obtain purer target plain text data later. Cleaning the extracted digit strings in this fixed sequence also prevents a later cleaning operation from undoing the effect of an earlier one, thereby ensuring cleaning quality.
In one embodiment, as shown in Fig. 6, before the step of obtaining the data cleaning request, the text data processing method further includes:
S601: Obtain a data crawling task, the data crawling task including a task type and a file identifier.
The data crawling task is a task that triggers the server to crawl data. The task type defines the type of the current data crawling task and is either a timed task or a real-time task. The file identifier uniquely identifies a crawler file.
In this embodiment, the server creates different crawler files in advance. Each crawler file corresponds to one file identifier, and the crawler file and its file identifier are stored together in a database so that the crawler file can later be retrieved by its file identifier.
Specifically, the server can create the crawler file corresponding to a channel identifier based on the Scrapy framework. For example, to use Scrapy to crawl the sub-links under every major category and subcategory of the Sina navigation page, together with the news content of the sub-link pages, and finally save the results locally, the crawler file is created as follows:
(1) Create a Scrapy project, for example with the command "scrapy startproject XX", to determine the channel whose data is to be crawled, where XX can be a channel such as news, finance, entertainment, sports or education.
(2) Write the item file, i.e., define the fields to crawl according to the data content required. An Item is the container that holds crawled data; its main purpose is to extract structured data from unstructured sources such as web pages. Scrapy provides the Item class for this purpose. It is used much like a Python dictionary and adds a protection mechanism that guards against undefined-field errors caused by misspelling. For example, the following properties are to be obtained from the website to be crawled (Sina News here): the URL and title of the news major category; the URL and title of the news subcategory; the URL and title of the news item; and the headline and content of the news item.
(3) Write the crawler file based on the Scrapy project and the item file, and store the crawler file in the server's database. The crawler file includes the spider file (which crawls the data and defines the class that does so), the pipelines file (which stores the item data) and the settings file (which holds the main settings).
S602: If the task type is a real-time task, trigger the crawler tool to execute the crawler file corresponding to the file identifier and obtain the original text data.
In this embodiment, if the server identifies the task type in the data crawling task as a real-time task, it retrieves the corresponding crawler file using the file identifier in the task and immediately triggers the crawler tool to execute that file, crawling the corresponding original text data from the website the crawler file points to.
S603: If the task type is a timed task, trigger the time monitoring tool so that, when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier and obtain the original text data.
In this embodiment, if the server identifies the task type in the data crawling task as a timed task, it also needs to obtain the timed trigger time from the task, i.e., the time at which the server is to be triggered to execute the data crawling task. The time monitoring tool is a tool that monitors the current system time, and can be the Time Watch tool.
Specifically, if the task type is a timed task, the time monitoring tool installed on the server is triggered to monitor the current system time in real time. When the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier and crawl the corresponding original text data from the website the crawler file points to.
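The dispatch between real-time and timed tasks can be sketched as below. This is an illustration only: the task is modeled as a plain dictionary with hypothetical keys (`task_type`, `file_id`, `trigger_time`), `run_crawler` is a placeholder for triggering the crawler tool, and a one-shot `threading.Timer` is swapped in for the time monitoring tool named in the text.

```python
import threading
from datetime import datetime


def run_crawler(file_id):
    # Placeholder: a real system would look up and execute the crawler file here.
    print(f"running crawler file {file_id}")


def dispatch(task, run=run_crawler, now=None):
    """Run a real-time task immediately; schedule a timed task for its trigger time."""
    now = now or datetime.now()
    if task["task_type"] == "realtime":            # S602
        run(task["file_id"])
        return None
    if task["task_type"] == "timed":               # S603
        delay = max(0.0, (task["trigger_time"] - now).total_seconds())
        timer = threading.Timer(delay, run, args=(task["file_id"],))
        timer.start()
        return timer
    raise ValueError(f"unknown task type: {task['task_type']!r}")


dispatch({"task_type": "realtime", "file_id": "news"})
```

Returning the timer lets the caller cancel or wait for a scheduled crawl; a polling monitor, as the text describes, would serve the same purpose.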
S604: According to the tiered storage folders corresponding to the crawler file, store the original text data in the last-level folder of the tiered storage folders.
The tiered storage folders corresponding to a crawler file are the folders on the server, determined from the crawler file, in which the various original text data are stored. In this embodiment, they comprise three levels of storage folder: channel project, major category and subcategory.
In this embodiment, the original text data is stored in the last-level folder of the tiered storage folders corresponding to the crawler file. For example, where the tiered storage folders comprise the three levels of channel project, major category and subcategory, the original text data is stored in the corresponding subcategory according to its position or classification level on the website the crawler file points to. The original text data is thus stored by category, so that original text data from a specific field can be retrieved for subsequent language model training and the resulting language model achieves higher recognition accuracy.
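A minimal sketch of the three-level storage layout follows. The function name, the `.txt` suffix and the sample folder names are assumptions for illustration; the patent only specifies the channel project, major category and subcategory levels.

```python
from pathlib import Path


def store_raw_text(root, channel, category, subcategory, name, text):
    """Write one crawled document into <root>/<channel>/<category>/<subcategory>/."""
    folder = Path(root) / channel / category / subcategory
    folder.mkdir(parents=True, exist_ok=True)   # create missing levels on demand
    path = folder / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

A call such as `store_raw_text("corpus", "news", "domestic", "society", "a1", text)` places the document in the last-level folder, so a later training job can read a single subcategory on its own.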
In the text data processing method provided by this embodiment, when the task type in the data crawling task is a real-time task, the crawler file corresponding to the file identifier in the task is triggered immediately, so the corresponding original text data is obtained quickly and in real time. Because the crawler files are created and stored in advance, uploading the file identifier is enough to locate the corresponding crawler file and crawl data with it, which improves crawling efficiency. When the task type in the data crawling task is a timed task, the time monitoring tool triggers the crawler file corresponding to the file identifier at the scheduled time without manual intervention, which also improves crawling efficiency. The original text data is stored in the last-level folder of the tiered storage folders corresponding to the crawler file, so that it is stored by category and a more targeted language model can be trained later.
In one embodiment, as shown in Fig. 7, after the step of obtaining the target plain text data, the text data processing method further includes:
S701: Obtain a model training request, the model training request including a channel identifier.
The model training request is a request that triggers the server to train a language model. The channel identifier identifies the source channel of the text data that needs to be cleaned. The channel identifier in the model training request is used to determine the source of the text data needed to train the language model.
S702: Obtain the target plain text data corresponding to the channel identifier from the training text database corresponding to the channel identifier.
The training text database is a database that stores training text data. Each training text database corresponds to one channel identifier and stores only the target plain text data corresponding to that channel identifier. After obtaining the model training request, the server uses the channel identifier in the request to retrieve the corresponding target plain text data from the corresponding training text database and trains the model with it. The target plain text data is the plain text data obtained through steps S201-S206.
S703: Perform word segmentation on the target plain text data to obtain at least two target words.
In this embodiment, the server uses a preconfigured Chinese word segmentation tool to segment the target plain text data into at least two target words. Chinese word segmentation tools include, but are not limited to, jieba, SnowNLP, THULAC (THU Lexical Analyzer for Chinese) and NLPIR. For example, using SnowNLP to segment the Chinese sentence meaning "The scenery of West Lake in Hangzhou is very good; it is a tourist attraction that draws large numbers of visitors who come to play every year!" yields target words corresponding to "Hangzhou / West Lake / scenery / very / good / is / tourism / scenic spot / every year / attracts / large numbers / come / play / of / tourists".
S704: Train an N-gram model on the at least two target words to obtain the target Chinese language model.
N-gram is a statistical language-model algorithm commonly used in large-vocabulary continuous speech recognition. It exploits collocation information between adjacent words in context: when a continuous pinyin string without spaces needs to be converted into a Chinese character string (a sentence), it can compute the sentence with the highest probability, so conversion to Chinese characters is automatic, requires no manual selection by the user, and avoids the homophone problem in which many Chinese characters share the same pinyin. N-gram rests on the Markov assumption: the occurrence of the N-th word depends only on the preceding N-1 words and on no other words, and the probability of a whole sentence is the product of the occurrence probabilities of its words. Maximum likelihood estimation (Maximum Likelihood Estimate) is an estimation method that takes, as the estimate of the true value, the parameter value under which the observed sample has the highest probability of occurring.
Specifically, the server first uses maximum likelihood estimation (Maximum Likelihood Estimate) to calculate the word sequence probability of each target word, i.e., P(W_n | W_1 W_2 … W_{n-1}) = C(W_1 W_2 … W_n) / C(W_1 W_2 … W_{n-1}), where W_n is the n-th target word; (W_1 W_2 … W_n) is the word sequence formed by n target words; C(W_1 W_2 … W_n) is the frequency of the word sequence (W_1 W_2 … W_n) in the target plain text data; (W_1 W_2 … W_{n-1}) is the word sequence formed by n-1 target words; C(W_1 W_2 … W_{n-1}) is the frequency of the word sequence (W_1 W_2 … W_{n-1}) in the target plain text data; and P(W_n | W_1 W_2 … W_{n-1}) is the probability that the n-th target word appears after the word sequence formed by the preceding n-1 target words. Then, based on the Markov assumption, the word sequence probabilities of the target words are processed to form the target Chinese language model. In the resulting model, the occurrence of the n-th target word depends only on the preceding n-1 target words and on no other words, and the probability of a whole sentence is the product of the occurrence probabilities of its target words. In this embodiment, the target Chinese language model is formed from the product of the word sequence probabilities of the target words, i.e., P(T) = P(W_1 W_2 W_3 … W_n) = P(W_1) P(W_2 | W_1) P(W_3 | W_1 W_2) … P(W_n | W_1 W_2 … W_{n-1}), so that in subsequent recognition the result is obtained from the word sequence probability of each target word.
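The count-based estimate above can be sketched for a small corpus of already segmented sentences. This is a minimal sketch under stated assumptions: the `<s>` start padding is added for illustration and is not mentioned in the text, the context count is taken as the number of times the context is followed by some word, and no smoothing is applied, so unseen word sequences receive probability zero.

```python
from collections import Counter


def train_ngram(sentences, n=2):
    """Count n-grams and their (n-1)-word contexts over segmented sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words)   # assumed start-of-sentence padding
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def prob(word, context, model):
    """Maximum likelihood estimate: C(context + word) / C(context)."""
    ngram_counts, context_counts = model
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]


def sentence_prob(words, model, n=2):
    """Sentence probability as the product of the conditional word probabilities."""
    padded = ["<s>"] * (n - 1) + list(words)
    p = 1.0
    for i in range(n - 1, len(padded)):
        p *= prob(padded[i], padded[i - n + 1:i], model)
    return p


corpus = [["杭州", "西湖", "风景"], ["杭州", "西湖", "游客"], ["杭州", "游客"]]
model = train_ngram(corpus, n=2)
print(prob("西湖", ["杭州"], model))
print(sentence_prob(["杭州", "西湖", "风景"], model))
```

Because the counts come from one channel's corpus only, the same sentence can score differently under models trained for different channels.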
In the text data processing method provided by this embodiment, the target plain text data is first obtained according to the channel identifier in the model training request, so that the target Chinese language model subsequently trained on that data produces more accurate recognition results. The reason is that the target Chinese language model is trained using maximum likelihood estimation and the Markov assumption, under which the occurrence of the N-th target word in the target plain text data depends only on the preceding N-1 target words and on no other words. As a result, the frequencies of the word sequences formed by each target word and the target words before it differ between the target plain text data of different channels, and the resulting target Chinese language model achieves higher recognition accuracy in its own channel.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process is determined by its function and internal logic, and does not limit the implementation of the embodiments of the present invention in any way.
In one embodiment, a text data processing device is provided, which corresponds one-to-one to the text data processing method of the above embodiments. As shown in Fig. 8, the text data processing device includes a data cleaning request acquisition module 801, an original text data acquisition module 802, a target time period acquisition module 803, a to-be-cleaned text data acquisition module 804, a target cleaning rule acquisition module 805 and a target plain text data acquisition module 806. The functional modules are described in detail as follows:
The data cleaning request acquisition module 801 is configured to obtain a data cleaning request, the data cleaning request including a channel identifier and a cleaning time.
The original text data acquisition module 802 is configured to determine, based on the channel identifier, the target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier.
The target time period acquisition module 803 is configured to query the data cleaning record table according to the channel identifier and the cleaning time to determine the target time period.
The to-be-cleaned text data acquisition module 804 is configured to determine the original text data whose time identifier falls within the target time period as the text data to be cleaned.
The target cleaning rule acquisition module 805 is configured to query the rule database based on the channel identifier to obtain the target cleaning rule corresponding to the channel identifier.
The target plain text data acquisition module 806 is configured to clean the text data to be cleaned using the target cleaning rule to obtain the target plain text data.
Preferably, the target cleaning rule includes a special tag cleaning rule, a number cleaning rule, a punctuation cleaning rule and a foreign-language cleaning rule.
The target plain text data acquisition module 806 includes a tag cleaning unit, a number cleaning unit, a symbol cleaning unit and a foreign-language cleaning unit.
The tag cleaning unit is configured to perform tag cleaning on the text data to be cleaned using the special tag cleaning rule to obtain the first text data.
The number cleaning unit is configured to perform number cleaning on the first text data using the number cleaning rule to obtain the second text data.
The symbol cleaning unit is configured to perform symbol cleaning on the second text data using the punctuation cleaning rule to obtain the third text data.
The foreign-language cleaning unit is configured to perform foreign-language cleaning on the third text data using the foreign-language cleaning rule to obtain the target plain text data.
Preferably, after the number cleaning unit, the text data processing device further includes a brand data acquisition unit, a data matching unit, a first matching result processing unit and a second matching result processing unit.
The brand data acquisition unit is configured to query the brand database based on the channel identifier to obtain the target brand data corresponding to the channel identifier.
The data matching unit is configured to match the first text data against the target brand data.
The first matching result processing unit is configured to, if the first text data matches the target brand data, perform exclusion processing on the matched first text data and then perform number cleaning on the exclusion-processed first text data using the number cleaning rule to obtain the second text data.
The second matching result processing unit is configured to, if the first text data does not match the target brand data, perform number cleaning on the first text data using the number cleaning rule to obtain the second text data.
Preferably, the digit cleaning unit includes a character string extraction subunit, a thousands-separated number cleaning subunit, a decimal number judgment subunit, a decimal number cleaning subunit, a Chinese quantifier judgment subunit, a Chinese quantifier cleaning subunit, a numerical digit judgment subunit, a numerical digit cleaning subunit and a non-numerical-digit cleaning subunit.
The character string extraction subunit is configured to extract a digit string from the first text data and to judge, using a regular expression matching algorithm, whether the digit string is a thousands-separated number.
The thousands-separated number cleaning subunit is configured to, if the digit string is a thousands-separated number, remove the thousands separators from it and perform Chinese-numeral conversion on the resulting number to obtain second text data.
The decimal number judgment subunit is configured to, if the digit string is not a thousands-separated number, judge using a regular expression matching algorithm whether the digit string is a decimal number.
The decimal number cleaning subunit is configured to, if the digit string is a decimal number, perform Chinese-numeral conversion on the number before the decimal separator, convert the digits after the decimal separator digit by digit, and replace the decimal separator with a Chinese character, to obtain second text data.
The Chinese quantifier judgment subunit is configured to, if the digit string is not a decimal number, judge using a regular expression matching algorithm whether the digit string is a Chinese quantifier.
The Chinese quantifier cleaning subunit is configured to, if the digit string is a Chinese quantifier, perform Chinese-numeral conversion on the digit string to obtain second text data.
The numerical digit judgment subunit is configured to, if the digit string is not a Chinese quantifier, judge using a regular expression matching algorithm whether the digit string is a numerical digit.
The numerical digit cleaning subunit is configured to, if the digit string is a numerical digit, convert the numerical digit digit by digit to obtain second text data.
The non-numerical-digit cleaning subunit is configured to, if the digit string is not a numerical digit, perform Chinese-numeral conversion on the digit string to obtain second text data.
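The subunits above form a cascade of regular expression tests applied in a fixed order. The Python sketch below illustrates that cascade under several assumptions: the patterns, the sample measure words, the rule that leading-zero or long strings are read digit by digit, and the use of "点" for the decimal separator are all illustrative choices, not taken from the text.

```python
import re

CN_DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]
SECTIONS = ["", "万", "亿"]

# All patterns are assumptions chosen for illustration.
THOUSANDS_RE = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")
DECIMAL_RE = re.compile(r"[0-9]+\.[0-9]+")
QUANTIFIER_RE = re.compile(r"([0-9]+)([个台件元])")   # sample measure words
DIGITWISE_RE = re.compile(r"0[0-9]+|[0-9]{7,}")       # e.g. codes, phone numbers
EXTRACT_RE = re.compile(
    r"[0-9]{1,3}(?:,[0-9]{3})+(?![0-9])|[0-9]+\.[0-9]+|[0-9]+[个台件元]?"
)


def digits_to_cn(s: str) -> str:
    """Digit-by-digit conversion, e.g. '075' -> '零七五'."""
    return "".join(CN_DIGITS[int(c)] for c in s)


def _section_to_cn(n: int) -> str:
    """Convert 1..9999 to Chinese numerals, e.g. 105 -> '一百零五'."""
    out, zero_pending = "", False
    for power in range(3, -1, -1):
        d = n // 10 ** power % 10
        if d == 0:
            zero_pending = bool(out)
        else:
            if zero_pending:
                out += "零"
                zero_pending = False
            out += CN_DIGITS[d] + UNITS[power]
    return out


def int_to_cn(n: int) -> str:
    """Chinese-numeral conversion for 0 <= n < 10**12; larger values go digit by digit."""
    if n == 0:
        return "零"
    if n >= 10 ** 12:
        return digits_to_cn(str(n))
    sections = []
    while n:
        sections.append(n % 10000)
        n //= 10000
    out, zero_pending = "", False
    for idx in range(len(sections) - 1, -1, -1):
        sec = sections[idx]
        if sec == 0:
            zero_pending = bool(out)
            continue
        if out and (zero_pending or sec < 1000):
            out += "零"
        out += _section_to_cn(sec) + SECTIONS[idx]
        zero_pending = False
    return out


def convert_digit_string(s: str) -> str:
    """Apply the tests in the order described; expects a string matched by EXTRACT_RE."""
    if THOUSANDS_RE.fullmatch(s):                      # thousands-separated number
        return int_to_cn(int(s.replace(",", "")))
    if DECIMAL_RE.fullmatch(s):                        # decimal number
        whole, frac = s.split(".")
        return int_to_cn(int(whole)) + "点" + digits_to_cn(frac)
    m = QUANTIFIER_RE.fullmatch(s)                     # number plus measure word
    if m:
        return int_to_cn(int(m.group(1))) + m.group(2)
    if DIGITWISE_RE.fullmatch(s):                      # read digit by digit
        return digits_to_cn(s)
    return int_to_cn(int(s))                           # everything else


def clean_digits(first_text: str) -> str:
    """Extract each digit string and replace it with its converted form."""
    return EXTRACT_RE.sub(lambda m: convert_digit_string(m.group()), first_text)
```

Because each test runs only when the previous one fails, the ordering decides how ambiguous strings are read: a string with commas is never tested as a decimal number.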
Preferably, before the data cleaning request acquisition module 801, the text data processing device further includes a data crawling task acquisition unit, a real-time text data acquisition unit, a timed text data acquisition unit and a text data storage unit.
The data crawling task acquisition unit is configured to obtain a data crawling task, the data crawling task including a task type and a file identifier.
The real-time text data acquisition unit is configured to, if the task type is a real-time task, trigger a crawler tool to execute the crawler file corresponding to the file identifier to obtain original text data.
The timed text data acquisition unit is configured to, if the task type is a timed task, trigger a time monitoring tool so that, when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier to obtain original text data.
The text data storage unit is configured to store the original text data, according to the classification storage folder corresponding to the crawler file, in the last-level folder of that classification storage folder.
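The dispatch between real-time and timed crawling tasks, followed by storage in the last-level classification folder, could be sketched in Python as follows. The `CrawlTask` type, the task type labels `"realtime"` and `"timed"`, and both helper functions are hypothetical. The crawler tool and the time monitoring tool are not named in the text, so neither is modeled here.

```python
import os
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class CrawlTask:
    """Hypothetical data crawling task: a task type plus a file identifier."""
    task_type: str                           # "realtime" or "timed" (assumed labels)
    file_id: str                             # identifies the crawler file to execute
    trigger_time: Optional[datetime] = None  # only carried by timed tasks


def seconds_until_run(task: CrawlTask, now: datetime) -> float:
    """Return how long to wait before executing the crawler file."""
    if task.task_type == "realtime":
        return 0.0
    if task.task_type == "timed":
        if task.trigger_time is None:
            raise ValueError("a timed task must carry a trigger time")
        # A trigger time already in the past runs immediately.
        return max(0.0, (task.trigger_time - now).total_seconds())
    raise ValueError(f"unknown task type: {task.task_type!r}")


def store_raw_text(root: str, category_path: List[str], file_name: str, text: str) -> str:
    """Write original text data into the last-level folder of a category path."""
    folder = os.path.join(root, *category_path)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

In a running service the returned delay might be handed to a scheduler, for example `threading.Timer(delay, run_crawler).start()` from the standard library, where `run_crawler` stands for whatever callable executes the crawler file.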
Preferably, after the target plain text data acquisition module 806, the text data processing device further includes a model training request unit, a plain text data acquisition unit, a target word segment acquisition unit and a language model acquisition unit.
The model training request unit is configured to obtain a model training request, the model training request including a channel identifier.
The plain text data acquisition unit is configured to obtain, from the training text database corresponding to the channel identifier, the target plain text data corresponding to the channel identifier.
The target word segment acquisition unit is configured to perform word segmentation on the target plain text data to obtain at least two target word segments.
The language model acquisition unit is configured to perform model training on the at least two target word segments using an N-gram model to obtain a target Chinese language model.
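As an illustration of the N-gram training step, the Python sketch below estimates a bigram model (N = 2) by maximum likelihood from sentences that are already segmented. The choice of N, the absence of smoothing, the sentence boundary markers and the function name are assumptions; the text specifies neither the segmenter nor the model order.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def train_bigram(sentences: Iterable[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(current | previous) from segmented sentences (no smoothing)."""
    context_counts: Counter = Counter()
    pair_counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens in sentences:
        # Boundary markers let the model learn sentence starts and ends.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            pair_counts[prev][cur] += 1
    return {
        prev: {cur: count / context_counts[prev] for cur, count in nxt.items()}
        for prev, nxt in pair_counts.items()
    }
```

A practical language model would add smoothing so that unseen word pairs do not receive zero probability. That is omitted here to keep the counting step visible.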
For specific limitations on the text data processing device, refer to the limitations on the text data processing method above; they are not repeated here. Each module in the above text data processing device may be implemented wholly or partly in software, hardware, or a combination thereof. Each module may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Figure 9. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data generated while the processor executes the computer program implementing the text data processing method of the above embodiments, including but not limited to target plain text data. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a text data processing method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text data processing method of the above embodiments, such as steps S201-S206 shown in Fig. 2, or the steps shown in Fig. 3 to Fig. 7. Alternatively, when executing the computer program, the processor implements the functions of each module/unit of the text data processing device in this embodiment, such as the functions of the data cleaning request acquisition module 801, the original text data acquisition module 802, the target time interval acquisition module 803, the to-be-cleaned text data acquisition module 804, the target cleaning rule acquisition module 805 and the target plain text data acquisition module 806 shown in Fig. 8. To avoid repetition, they are not described again here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the text data processing method of the above embodiments, such as steps S201-S206 shown in Fig. 2, or the steps shown in Fig. 3 to Fig. 7. To avoid repetition, they are not described again here. Alternatively, when executed by a processor, the computer program implements the functions of each module/unit of the above text data processing device in this embodiment, such as the functions of the data cleaning request acquisition module 801, the original text data acquisition module 802, the target time interval acquisition module 803, the to-be-cleaned text data acquisition module 804, the target cleaning rule acquisition module 805 and the target plain text data acquisition module 806 shown in Fig. 8. To avoid repetition, they are not described again here.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It will be clear to those skilled in the art that, for convenience and brevity of description, only the division into the above functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to perform all or part of the functions described above.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

1. A text data processing method, characterized by comprising:
obtaining a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
determining, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
querying a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval;
determining the original text data whose time identifier falls within the target time interval as text data to be cleaned;
querying a rule database based on the channel identifier to obtain a target cleaning rule corresponding to the channel identifier;
cleaning the text data to be cleaned using the target cleaning rule to obtain target plain text data.
2. The text data processing method according to claim 1, characterized in that the target cleaning rule includes a special tag cleaning rule, a digit cleaning rule, a punctuation mark cleaning rule and a foreign language cleaning rule;
the cleaning of the text data to be cleaned using the target cleaning rule to obtain target plain text data comprises:
performing tag cleaning on the text data to be cleaned using the special tag cleaning rule to obtain first text data;
performing digit cleaning on the first text data using the digit cleaning rule to obtain second text data;
performing symbol cleaning on the second text data using the punctuation mark cleaning rule to obtain third text data;
performing foreign language cleaning on the third text data using the foreign language cleaning rule to obtain the target plain text data.
3. The text data processing method according to claim 2, characterized in that, after the step of performing tag cleaning on the text data to be cleaned using the special tag cleaning rule to obtain first text data, and before the step of performing digit cleaning on the first text data using the digit cleaning rule to obtain second text data, the text data processing method further comprises:
querying a brand database based on the channel identifier to obtain target brand data corresponding to the channel identifier;
matching the first text data against the target brand data;
if the first text data matches the target brand data, performing exclusion processing on the matched first text data, and then performing digit cleaning on the first text data after exclusion processing using the digit cleaning rule to obtain second text data;
if the first text data does not match the target brand data, performing digit cleaning on the first text data using the digit cleaning rule to obtain second text data.
4. The text data processing method according to claim 2, characterized in that the performing of digit cleaning on the first text data using the digit cleaning rule to obtain second text data comprises:
extracting a digit string from the first text data, and judging, using a regular expression matching algorithm, whether the digit string is a thousands-separated number;
if the digit string is a thousands-separated number, removing the thousands separators from the thousands-separated number, and performing Chinese-numeral conversion on the number after the thousands separators are removed, to obtain second text data;
if the digit string is not a thousands-separated number, judging, using a regular expression matching algorithm, whether the digit string is a decimal number;
if the digit string is a decimal number, performing Chinese-numeral conversion on the number before the decimal separator, converting the digits after the decimal separator digit by digit, and replacing the decimal separator with a Chinese character, to obtain second text data;
if the digit string is not a decimal number, judging, using a regular expression matching algorithm, whether the digit string is a Chinese quantifier;
if the digit string is a Chinese quantifier, performing Chinese-numeral conversion on the digit string to obtain second text data;
if the digit string is not a Chinese quantifier, judging, using a regular expression matching algorithm, whether the digit string is a numerical digit;
if the digit string is a numerical digit, converting the numerical digit digit by digit to obtain second text data;
if the digit string is not a numerical digit, performing Chinese-numeral conversion on the digit string to obtain second text data.
5. The text data processing method according to claim 1, characterized in that, before the step of obtaining a data cleaning request, the text data processing method further comprises:
obtaining a data crawling task, the data crawling task including a task type and a file identifier;
if the task type is a real-time task, triggering a crawler tool to execute the crawler file corresponding to the file identifier to obtain original text data;
if the task type is a timed task, triggering a time monitoring tool so that, when the current system time reaches the timed trigger time carried in the data crawling task, the crawler tool is triggered to execute the crawler file corresponding to the file identifier to obtain original text data;
storing the original text data, according to the classification storage folder corresponding to the crawler file, in the last-level folder of the classification storage folder.
6. The text data processing method according to claim 1, characterized in that, after the step of obtaining target plain text data, the text data processing method further comprises:
obtaining a model training request, the model training request including a channel identifier;
obtaining, from the training text database corresponding to the channel identifier, the target plain text data corresponding to the channel identifier;
performing word segmentation on the target plain text data to obtain at least two target word segments;
performing model training on the at least two target word segments using an N-gram model to obtain a target Chinese language model.
7. A text data processing device, characterized by comprising:
a data cleaning request acquisition module, configured to obtain a data cleaning request, the data cleaning request including a channel identifier and a cleaning time;
an original text data acquisition module, configured to determine, based on the channel identifier, a target corpus corresponding to the channel identifier, the target corpus including at least one piece of original text data, each piece of original text data carrying a time identifier;
a target time interval acquisition module, configured to query a data cleaning record table according to the channel identifier and the cleaning time to determine a target time interval;
a to-be-cleaned text data acquisition module, configured to determine the original text data whose time identifier falls within the target time interval as text data to be cleaned;
a target cleaning rule acquisition module, configured to query a rule database based on the channel identifier to obtain a target cleaning rule corresponding to the channel identifier;
a target plain text data acquisition module, configured to clean the text data to be cleaned using the target cleaning rule to obtain target plain text data.
8. The text data processing device according to claim 7, characterized in that the target cleaning rule includes a special tag cleaning rule, a digit cleaning rule, a punctuation mark cleaning rule and a foreign language cleaning rule;
the target plain text data acquisition module comprises:
a tag cleaning unit, configured to perform tag cleaning on the text data to be cleaned using the special tag cleaning rule to obtain first text data;
a digit cleaning unit, configured to perform digit cleaning on the first text data using the digit cleaning rule to obtain second text data;
a symbol cleaning unit, configured to perform symbol cleaning on the second text data using the punctuation mark cleaning rule to obtain third text data;
a foreign language cleaning unit, configured to perform foreign language cleaning on the third text data using the foreign language cleaning rule to obtain the target plain text data.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the text data processing method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the text data processing method according to any one of claims 1 to 6.
CN201811093274.4A 2018-09-19 2018-09-19 Text data processing method, device, computer equipment and storage medium Active CN109299233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811093274.4A CN109299233B (en) 2018-09-19 2018-09-19 Text data processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811093274.4A CN109299233B (en) 2018-09-19 2018-09-19 Text data processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109299233A true CN109299233A (en) 2019-02-01
CN109299233B CN109299233B (en) 2024-03-01

Family

ID=65163361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811093274.4A Active CN109299233B (en) 2018-09-19 2018-09-19 Text data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109299233B (en)

Also Published As

Publication number Publication date
CN109299233B (en) 2024-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant