CN104462396B - Character string processing method and device - Google Patents

Character string processing method and device

Info

Publication number
CN104462396B
Authority
CN
China
Prior art keywords
character string
dictionary
unique mark
mark
present
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410758617.XA
Other languages
Chinese (zh)
Other versions
CN104462396A (en
Inventor
赵立贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201410758617.XA priority Critical patent/CN104462396B/en
Publication of CN104462396A publication Critical patent/CN104462396A/en
Application granted granted Critical
Publication of CN104462396B publication Critical patent/CN104462396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a character string processing method and device. The character string processing method includes: obtaining a character string that records multiple pieces of dimension information; parsing the character string to obtain an object corresponding to the character string; generating a unique identifier of the object corresponding to the character string; looking up whether the unique identifier exists in a first dictionary, where the first dictionary is a cache unit holding preset identifiers, and the preset identifiers are the identifiers of objects corresponding to character strings in log information that has already been stored; if the unique identifier does not exist in the first dictionary, storing the object corresponding to the unique identifier; and if the unique identifier exists in the first dictionary, not storing the object corresponding to the unique identifier. The invention solves the prior-art problem of character string data wasting storage space and reduces the amount of data stored.

Description

Character string processing method and device
Technical field
The present invention relates to the field of data processing, and in particular to a character string processing method and device.
Background art
With the rapid development of the Internet, the volume of data generated online keeps growing, and the processing load grows with it. How to process and analyze this data quickly and effectively has become an urgent problem for Internet companies.
Internet data is usually presented as character strings. Taking advertising log information as an example, the advertising log data obtained by monitoring contains multiple dimensions, such as ReferrerUrl, Cookie, IP and UserAgent. A ReferrerUrl string can be parsed into information such as the advertisement's source domain name, source path and source parameters; a UserAgent string can be parsed into operating system information, browser information, device information and whether the client is a mobile device. Once these strings are obtained, they are usually stored first so that further data analysis can be performed.
However, existing storage processes usually store every character string as it is, whether it duplicates an earlier string or contains information that never needs to be parsed, and this inevitably wastes storage space.
For example, for advertisement dimension analysis the following two UserAgent strings provide identical dimension information, yet because their device model details differ they are treated as different UserAgent strings.
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D167
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651
A UserAgent string carries a wide variety of information, but advertisement dimension analysis needs only the basics: operating system, browser, device information and whether the client is a mobile device. The other information in a UserAgent string (such as the browser version, .NET version and device model) does not need to be parsed.
As the above shows, current advertisement display and redirect monitoring logs contain dimension information that does not need to be parsed. Storing the dimension information directly therefore both hinders advertisement dimension analysis and wastes storage space.
No effective solution has yet been proposed for the prior-art problem of character string data wasting storage space.
Summary of the invention
The main object of the present invention is to provide a character string processing method and device that solve the prior-art problem of character string data wasting storage space.
To achieve this object, according to one aspect of the embodiments of the present invention, a character string processing method is provided. The character string processing method according to the present invention includes: obtaining a character string that records multiple pieces of dimension information; parsing the character string to obtain an object corresponding to the character string; generating a unique identifier of the object corresponding to the character string; looking up whether the unique identifier exists in a first dictionary, where the first dictionary is a cache unit holding preset identifiers, and the preset identifiers are the identifiers of objects corresponding to character strings in log information that has already been stored; if the unique identifier does not exist in the first dictionary, storing the object corresponding to the unique identifier; and if the unique identifier exists in the first dictionary, not storing the object corresponding to the unique identifier.
Further, after the character string recording multiple pieces of dimension information is obtained, and before the character string is parsed to obtain the corresponding object, the method also includes: looking up whether the character string exists in a second dictionary; if the character string exists in the second dictionary, filtering out the character string; and if the character string does not exist in the second dictionary, caching the character string in the second dictionary. Parsing the character string to obtain the corresponding object then includes: parsing the character string determined not to exist in the second dictionary, and obtaining the object corresponding to that character string.
Further, storing the object corresponding to the unique identifier if the unique identifier does not exist in the first dictionary includes: if the unique identifier does not exist in the first dictionary, marking the state of the corresponding object as new; determining whether the number of objects marked as new has reached a preset threshold; and if it has, storing the objects marked as new by inserting them into a database.
Further, before looking up whether the unique identifier exists in the first dictionary, the method also includes: caching the identifiers of the objects in the database in the first dictionary. After the lookup, if the unique identifier does not exist in the first dictionary, the unique identifier is cached in the first dictionary.
Further, generating the unique identifier of the object corresponding to the character string includes: computing the hash values of the multiple pieces of dimension information in the character string; and obtaining the unique identifier from the hash values of the multiple pieces of dimension information.
To achieve this object, according to another aspect of the embodiments of the present invention, a character string processing device is provided. The character string processing device according to the present invention includes: an acquiring unit, configured to obtain a character string that records multiple pieces of dimension information; a parsing unit, configured to parse the character string and obtain an object corresponding to the character string; a generating unit, configured to generate a unique identifier of the object corresponding to the character string; a first lookup unit, configured to look up whether the unique identifier exists in a first dictionary, where the first dictionary is a cache unit holding preset identifiers, and the preset identifiers are the identifiers of objects corresponding to character strings in log information that has already been stored; and a storage unit, configured to store the object corresponding to the unique identifier if the unique identifier does not exist in the first dictionary, and not to store that object if the unique identifier exists in the first dictionary.
Further, the character string processing device also includes: a second lookup unit, configured to look up whether the character string exists in a second dictionary after the character string recording multiple pieces of dimension information is obtained and before the character string is parsed to obtain the corresponding object; a filtering unit, configured to filter out the character string if it exists in the second dictionary; and a first caching unit, configured to cache the character string in the second dictionary if it does not exist there. The parsing unit includes a parsing module, configured to parse the character string determined not to exist in the second dictionary and obtain the object corresponding to that character string.
Further, the storage unit includes: a marking module, configured to mark the state of the object corresponding to the unique identifier as new if the unique identifier does not exist in the first dictionary; a judging module, configured to determine whether the number of objects marked as new has reached a preset threshold; and a storage module, configured to store the objects marked as new by inserting them into a database if that number has reached the preset threshold.
Further, the character string processing device also includes a second caching unit, configured to cache the identifiers of the objects in the database in the first dictionary before the lookup of the unique identifier in the first dictionary. The second caching unit is also configured, after the lookup, to cache the unique identifier in the first dictionary if it does not exist there.
Further, the generating unit includes: a computing module, configured to compute the hash values of the multiple pieces of dimension information in the character string; and a determining module, configured to obtain the unique identifier from the hash values of the multiple pieces of dimension information.
According to the embodiments of the present invention, a character string recording multiple pieces of dimension information is obtained and parsed to obtain the corresponding object; a unique identifier of that object is generated and looked up in the first dictionary; the object is stored if the identifier does not exist in the first dictionary and is not stored if it does. As a result, repeated character strings, and character strings that differ only in parts that do not need to be parsed, parse to the same object, and identical objects are stored only once. This solves the prior-art problem of character string data wasting storage space and reduces the amount of data stored.
Brief description of the drawings
The accompanying drawings, which form part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of a character string processing method according to an embodiment of the present invention;
Fig. 2 is a flow chart of a preferred character string processing method according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of a character string processing device according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
An embodiment of the present invention provides a character string processing method.
Fig. 1 is a flow chart of a character string processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain a character string that records multiple pieces of dimension information.
The character string may be extracted from log information, and the log information may be a monitored advertising log. The character string records information reflecting multiple dimension indicators, that is, multiple pieces of dimension information. For example, the UserAgent string in advertising log information contains characters reflecting operating system information, browser information, device information, whether the client is a mobile device, and so on.
Step S104: parse the character string to obtain the object corresponding to the character string.
The object corresponding to the character string is a class that represents, as attributes, the dimension information in the string that needs to be parsed. If two character strings differ only in characters that do not need to be parsed, the objects parsed from them are the same object. Taking UserAgent strings again as an example, advertisement analysis needs the operating system, browser, device information and whether the client is a mobile device, so the object corresponding to a UserAgent string can be a class whose attributes are the operating system information, browser information, device information and mobile-device flag obtained by parsing the string. Specifically, the following two UserAgent strings parse to the same object:
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D167
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651
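As a non-limiting illustration, the parsing of step S104 can be sketched in Python as follows. The class name UserAgentInfo, the function parse_user_agent and the matching rules are assumptions made only for this sketch; the embodiment does not prescribe a particular parser, and a practical parser would need far more thorough rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAgentInfo:
    """Holds only the dimensions needed for analysis; versions and builds are dropped."""
    os: str
    browser: str
    device: str
    is_mobile: bool


def parse_user_agent(ua: str) -> UserAgentInfo:
    # Hypothetical, deliberately crude rules chosen for this sketch only.
    device = "iPhone" if "iPhone" in ua else "Other"
    os_name = "iOS" if re.search(r"iPhone OS|CPU OS", ua) else "Other"
    browser = "WebKit" if "AppleWebKit" in ua else "Other"
    is_mobile = "Mobile" in ua
    return UserAgentInfo(os_name, browser, device, is_mobile)


UA_1 = ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) "
        "AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D167")
UA_2 = ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) "
        "AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651")

obj_1 = parse_user_agent(UA_1)
obj_2 = parse_user_agent(UA_2)
# The raw strings differ, but the parsed objects compare equal
# because the differing parts are never read by the parser.
same_object = obj_1 == obj_2
```

Because the dataclass compares by field values, two strings that differ only in unparsed parts yield equal objects, which is the property the later deduplication steps rely on.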
Step S106: generate the unique identifier of the object corresponding to the character string.
The unique identifier of an object uniquely identifies that object. It may be, for example, a hash value of the object computed with a preset algorithm.
Preferably, generating the unique identifier of the object corresponding to the character string includes: computing the hash values of the multiple pieces of dimension information in the character string; and obtaining the unique identifier from the hash values of the multiple pieces of dimension information. Here the multiple pieces of dimension information are the ones in the character string that need to be parsed; their hash value is computed and used as the unique identifier of the object.
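As a non-limiting illustration of step S106, the identifier can be derived by hashing the parsed dimensions together. The embodiment does not name a hash algorithm; SHA-1 from Python's hashlib, the function name unique_id and the field separator are assumptions of this sketch.

```python
import hashlib


def unique_id(os_name: str, browser: str, device: str, is_mobile: bool) -> str:
    # Join the fields with a separator that is assumed not to occur inside
    # any field, so that ("ab", "c") and ("a", "bc") hash differently.
    payload = "\x1f".join([os_name, browser, device, "1" if is_mobile else "0"])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


id_a = unique_id("iOS", "WebKit", "iPhone", True)
id_b = unique_id("iOS", "WebKit", "iPhone", True)      # same dimensions as id_a
id_c = unique_id("Android", "WebKit", "Nexus", True)   # different dimensions
```

Equal dimensions always give the same identifier, so the identifier can stand in for the object in cache lookups. A hash can in principle collide, so a deployment that cannot tolerate collisions would need an additional equality check on the objects themselves.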
Step S108: look up whether the unique identifier exists in the first dictionary. The first dictionary is a cache unit holding preset identifiers, where the preset identifiers are the identifiers of objects corresponding to character strings in log information that has already been stored.
The first dictionary caches the identifiers of the objects stored in the database, and the database stores each object together with its identifier. Before the lookup begins, the identifiers of the objects are read from the database and cached locally in the first dictionary. Checking whether the first dictionary contains an identifier equal to the unique identifier then determines whether the database already holds the object corresponding to the character string.
It should be noted that, in this embodiment, the unique identifier is generated in the same way as the identifiers cached in the first dictionary and the identifiers stored in the database.
Step S110: if the unique identifier does not exist in the first dictionary, store the object corresponding to the unique identifier.
Step S112: if the unique identifier exists in the first dictionary, do not store the object corresponding to the unique identifier.
After the unique identifier of the object corresponding to the character string is generated, it is looked up in the first dictionary. If it exists there, the object corresponding to the unique identifier is not stored; otherwise the object is stored and the unique identifier is cached in the first dictionary so that subsequent character strings can be processed.
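Steps S108 to S112 can be sketched as follows, as a non-limiting illustration. The first dictionary is modeled here as a Python set and the database as a plain list; both choices, and the name store_if_new, are assumptions of this sketch.

```python
first_dictionary = set()   # cached identifiers of objects already stored
stored_objects = []        # stand-in for the database


def store_if_new(identifier, obj):
    """Store obj only when its identifier is not yet cached (S108 to S112)."""
    if identifier in first_dictionary:   # S108: lookup in the first dictionary
        return False                     # S112: already stored, do nothing
    stored_objects.append(obj)           # S110: store the object
    first_dictionary.add(identifier)     # cache the identifier for later strings
    return True
```

A set gives constant-time membership checks on average, which is why the lookup is done against the cache rather than against the database for every incoming string.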
According to the embodiments of the present invention, a character string recording multiple pieces of dimension information is obtained and parsed to obtain the corresponding object; a unique identifier of that object is generated and looked up in the first dictionary; the object is stored if the identifier does not exist in the first dictionary and is not stored if it does. As a result, repeated character strings, and character strings that differ only in parts that do not need to be parsed, parse to the same object, and identical objects are stored only once. This solves the prior-art problem of character string data wasting storage space and reduces the amount of data stored.
In this embodiment, only the valuable dimension information (for a UserAgent: browser, operating system, device information and whether the client is a mobile device) is parsed, and fast deduplication of dimension information is achieved by checking whether the cache already contains the same parsed dimension information.
Preferably, after the character string recording multiple pieces of dimension information is obtained, and before the character string is parsed to obtain the corresponding object, the method also includes: looking up whether the character string exists in a second dictionary; if the character string exists in the second dictionary, filtering out the character string; and if the character string does not exist in the second dictionary, caching the character string in the second dictionary. Parsing the character string to obtain the corresponding object then includes: parsing the character string determined not to exist in the second dictionary, and obtaining the object corresponding to that character string.
The second dictionary is a cache unit holding the character strings that have already been matched. Because parsing a character string is relatively slow, this embodiment uses the second dictionary to match character strings before they are parsed: the second dictionary is checked for the obtained character string. If the string is present, it is filtered out, meaning that an identical string has already been processed and this one need not be parsed or stored again. If it is not present, the string is cached in the second dictionary and then parsed.
According to this embodiment, when the volume of character strings is large and contains repeats, the second dictionary is used to preprocess the strings and discard repeats directly. In other words, caching the raw dimension strings reduces the number of strings that enter the dimension parsing flow, which gives a fast, simple outer layer of deduplication; matching objects on the valuable dimension information after parsing then gives true dimension deduplication. This reduces the amount of string parsing and improves data processing efficiency.
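The outer filtering layer described above can be sketched as follows, as a non-limiting illustration. The function slow_parse is a hypothetical stand-in for the expensive parser, and parse_calls exists only to make the number of parses observable in this sketch.

```python
second_dictionary = set()   # raw strings that have already been processed
parse_calls = []            # records every string that reached the parser


def slow_parse(raw):
    # Hypothetical stand-in for a slow parser: keep the text before the first "/".
    parse_calls.append(raw)
    return raw.split("/")[0]


def process(raw):
    """Return the parsed object, or None when the exact string was seen before."""
    if raw in second_dictionary:
        return None                 # filtered out: no parsing, no storage
    second_dictionary.add(raw)      # remember the string for later repeats
    return slow_parse(raw)
```

This layer only catches byte-for-byte repeats. Strings that differ in unparsed parts still reach the parser and are deduplicated later by the identifier lookup in the first dictionary.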
Preferably, storing the object corresponding to the unique identifier if the unique identifier does not exist in the first dictionary includes: if the unique identifier does not exist in the first dictionary, marking the state of the corresponding object as new; determining whether the number of objects marked as new has reached a preset threshold; and if it has, storing the objects marked as new by inserting them into the database.
When it is determined that the unique identifier does not exist in the first dictionary, that is, when the corresponding object has not yet been stored in the database, the object is marked as new, and it is determined whether the number of objects marked as new has reached the preset threshold. If it has, the objects marked as new are inserted into the database in a batch. Batch storage avoids the frequent database access that storing objects one by one would cause.
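As a non-limiting illustration, the threshold-based batch insertion can be sketched as follows. The class name BatchWriter, the list standing in for the database and the flushes counter are assumptions of this sketch.

```python
class BatchWriter:
    """Collects objects in the "new" state and writes them in batches."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = []    # objects currently marked as new
        self.database = []   # stand-in for the real database
        self.flushes = 0     # number of batch inserts performed

    def add_new(self, obj):
        self.pending.append(obj)
        # Write only once the number of new objects reaches the threshold.
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        self.database.extend(self.pending)   # one batch insert
        self.pending.clear()                 # these objects are no longer new
        self.flushes += 1
```

In practice a final flush would also be needed when input ends, since objects below the threshold would otherwise remain unwritten; the embodiment does not describe that case, so it is left out here.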
Preferably, before looking up whether the unique identifier exists in the first dictionary, the method also includes: caching the identifiers of the objects in the database in the first dictionary. After the lookup, if the unique identifier does not exist in the first dictionary, the unique identifier is cached in the first dictionary.
In this embodiment, before the lookup, all identifiers in the database are first cached in the first dictionary. Because the first dictionary is a cache unit, looking up the unique identifier there is much more efficient than looking it up in the database.
When it is determined that the unique identifier does not exist in the first dictionary, it is cached in the first dictionary so that it serves as one of the dictionary's identifiers in subsequent lookups.
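Preloading the first dictionary can be sketched as follows, as a non-limiting illustration that uses an in-memory SQLite database from Python's standard library. The table name ua_object, its columns and the function names are assumptions of this sketch; the embodiment does not specify a database or schema.

```python
import sqlite3

# Build a small in-memory database that already holds one stored object.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ua_object (id TEXT PRIMARY KEY, os TEXT, browser TEXT, "
    "device TEXT, is_mobile INTEGER)"
)
conn.execute("INSERT INTO ua_object VALUES ('id-1', 'iOS', 'WebKit', 'iPhone', 1)")
conn.commit()


def load_first_dictionary(connection):
    """Read every stored identifier once and keep it in a local set."""
    rows = connection.execute("SELECT id FROM ua_object")
    return {row[0] for row in rows}


first_dictionary = load_first_dictionary(conn)


def seen_before(identifier):
    """Check the cache; cache the identifier when it is not there yet."""
    if identifier in first_dictionary:
        return True
    first_dictionary.add(identifier)
    return False
```

After the one-time load, every lookup is answered from memory, and the database is touched again only when new objects are inserted.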
The character string processing method of this embodiment is described below with reference to Fig. 2, taking UserAgent strings as an example.
As shown in Fig. 2, the method includes the following steps:
Step S202: initialize the second dictionary. The second dictionary caches UserAgent strings, and may also cache the correspondence between each UserAgent string and the UserAgent object obtained by parsing it. Before the second dictionary is first used, it is empty.
Step S204: initialize the first dictionary. The first dictionary caches the unique identifiers of UserAgent objects, or the correspondence between those unique identifiers and the UserAgent objects. Before the first dictionary is first used, it may already hold identifiers, read from the database, of objects corresponding to UserAgent strings.
Step S206: obtain a UserAgent string.
Step S208: determine whether the string exists in the second dictionary. If it does, end the flow; otherwise perform step S210.
Step S210: parse the UserAgent string to obtain the corresponding UserAgent object. Specifically, the UserAgent string is parsed into a UserAgent object with attributes such as operating system information, browser information, device information and whether the client is a mobile device.
Step S212: generate the unique identifier of the UserAgent object. The unique identifier is uniquely determined by the hash value of four attributes: operating system information, browser information, device information and whether the client is a mobile device.
Step S214: determine whether the unique identifier exists in the first dictionary. If it does, discard the object corresponding to the unique identifier and end the flow; otherwise perform step S216.
Step S216: set the state of the UserAgent object to new, then add the unique identifier of the UserAgent object to the first dictionary.
Step S218: determine whether the number of UserAgent objects in the first dictionary whose state is new has reached the batch size for database insertion. If not, return directly; if so, perform step S220.
Step S220: insert all UserAgent objects in the first dictionary whose state is new into the database, and change the state of those UserAgent objects to existing.
In this embodiment, the second dictionary exists because UserAgent parsing is relatively slow: caching UserAgent strings ensures that identical UserAgent strings are parsed only once, so reducing the number of parses provides the first layer of deduplication of dimension information. The lookup in the first dictionary is the second layer and the true UserAgent deduplication. Together they achieve fast deduplicated storage of character strings.
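As a non-limiting illustration, the flow of steps S202 to S220 can be put together in one sketch. The class name UserAgentDeduplicator, the crude parsing rules, the SHA-1 hash and the returned status strings are all assumptions of this sketch, not features prescribed by the embodiment.

```python
import hashlib


class UserAgentDeduplicator:
    """Two-layer deduplication following steps S202 to S220 (illustrative only)."""

    def __init__(self, batch_size, known_ids=()):
        self.second_dictionary = set()           # S202: raw strings, initially empty
        self.first_dictionary = set(known_ids)   # S204: identifiers, possibly preloaded
        self.new_objects = []                    # objects whose state is "new"
        self.database = []                       # stand-in for the real database
        self.batch_size = batch_size

    @staticmethod
    def parse(ua):
        # S210: hypothetical, crude parsing into the four needed attributes.
        os_name = "iOS" if "iPhone OS" in ua else "Other"
        browser = "WebKit" if "AppleWebKit" in ua else "Other"
        device = "iPhone" if "iPhone" in ua else "Other"
        return (os_name, browser, device, "Mobile" in ua)

    @staticmethod
    def identify(obj):
        # S212: hash of the four attributes (SHA-1 is an assumed choice).
        text = "\x1f".join(str(field) for field in obj)
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def process(self, ua):
        if ua in self.second_dictionary:               # S208: exact repeat
            return "duplicate string"
        self.second_dictionary.add(ua)
        obj = self.parse(ua)                           # S210
        uid = self.identify(obj)                       # S212
        if uid in self.first_dictionary:               # S214: same object already known
            return "duplicate object"
        self.first_dictionary.add(uid)                 # S216: mark as new, cache identifier
        self.new_objects.append(obj)
        if len(self.new_objects) >= self.batch_size:   # S218: batch size reached?
            self.database.extend(self.new_objects)     # S220: batch insert
            self.new_objects.clear()                   # state changes to existing
        return "new object"
```

With a batch size of 1, feeding the two example UserAgent strings stores a single object: the first string is new, the second is rejected by the first dictionary, and a repeat of either string is rejected earlier by the second dictionary without being parsed.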
An embodiment of the present invention also provides a character string processing device, whose functions may be implemented by computer equipment. It should be noted that the character string processing device of this embodiment may be used to carry out the character string processing method provided by the embodiments of the present invention, and that the method may likewise be carried out by the device.
Fig. 3 is a schematic diagram of a character string processing device according to an embodiment of the present invention. As shown in Fig. 3, the device includes: an acquiring unit 10, a parsing unit 20, a generating unit 30, a first lookup unit 40 and a storage unit 50.
The acquiring unit 10 is configured to obtain a character string that records multiple pieces of dimension information.
The character string may be extracted from log information, and the log information may be a monitored advertising log. The character string records information reflecting multiple dimension indicators, that is, multiple pieces of dimension information. For example, the UserAgent string in advertising log information contains characters reflecting operating system information, browser information, device information, whether the client is a mobile device, and so on.
The parsing unit 20 is configured to parse the character string and obtain the object corresponding to the character string.
The object corresponding to the character string is a class that represents, as attributes, the dimension information in the string that needs to be parsed. If two character strings differ only in characters that do not need to be parsed, the objects parsed from them are the same object. Taking UserAgent strings again as an example, advertisement analysis needs the operating system, browser, device information and whether the client is a mobile device, so the object corresponding to a UserAgent string can be a class whose attributes are the operating system information, browser information, device information and mobile-device flag obtained by parsing the string. Specifically, the following two UserAgent strings parse to the same object:
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D167
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651
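As an illustration only, the following Python sketch shows how two such strings can map to one object. The `UaDimensions` type, the `parse_user_agent` function and its regular expressions are assumptions made for this example and are not taken from the embodiment; a real parser would need to cover far more UserAgent formats.

```python
import re
from typing import NamedTuple


class UaDimensions(NamedTuple):
    """Hypothetical object holding only the dimensions that need parsing."""
    os: str
    browser: str
    device: str
    is_mobile: bool


def parse_user_agent(ua: str) -> UaDimensions:
    # Device: first token inside the first parenthesis, e.g. "iPhone".
    device_match = re.search(r"\(([^;)]+)", ua)
    device = device_match.group(1).strip() if device_match else "unknown"
    # OS: keep only the iOS major/minor version, e.g. "7_1" becomes "iOS 7.1".
    os_match = re.search(r"CPU (?:iPhone )?OS (\d+(?:_\d+)*)", ua)
    os_name = "iOS " + os_match.group(1).replace("_", ".") if os_match else "unknown"
    # Browser engine name only; its build number is deliberately ignored.
    browser = "AppleWebKit" if "AppleWebKit" in ua else "unknown"
    is_mobile = "Mobile" in ua
    return UaDimensions(os_name, browser, device, is_mobile)


UA_A = ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) "
        "AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D167")
UA_B = ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) "
        "AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651")

# The strings differ only in build numbers, which are not parsed.
same_object = parse_user_agent(UA_A) == parse_user_agent(UA_B)
```

Because the engine and build numbers are not among the parsed dimensions in this sketch, both strings yield equal `UaDimensions` values.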
The generating unit 30 is configured to generate a unique identifier of the object corresponding to the character string.
The unique identifier of an object uniquely identifies that object; it can be, for example, a hash value of the object calculated according to a preset algorithm.
Preferably, the generating unit 30 includes: a calculating module, configured to calculate a hash value of the multiple pieces of dimension information in the character string; and a determining module, configured to obtain the unique identifier from the hash value of the multiple pieces of dimension information. Here, the multiple pieces of dimension information are those that need to be parsed from the character string; their hash value is calculated and used as the unique identifier of the object.
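A minimal sketch of such an identifier, assuming SHA-256 as the preset algorithm (the embodiment does not name one) and a hypothetical `unique_id` helper. The field separator is also an assumption, chosen so that different splits of the same characters do not produce the same input to the hash.

```python
import hashlib
from typing import Iterable


def unique_id(dimensions: Iterable[object]) -> str:
    """Hash the parsed dimension values into one identifier (illustrative)."""
    # Join with the ASCII unit separator so ("a", "bc") and ("ab", "c")
    # produce different byte sequences before hashing.
    canonical = "\x1f".join(str(value) for value in dimensions)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


id_one = unique_id(("iOS 7.1", "AppleWebKit", "iPhone", True))
id_two = unique_id(("iOS 7.1", "AppleWebKit", "iPhone", True))
id_other = unique_id(("iOS 8.0", "AppleWebKit", "iPhone", True))
```

Equal dimension values always give equal identifiers, which is what lets the identifier stand in for the object during lookup.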
The first lookup unit 40 is configured to look up whether the unique identifier exists in a first dictionary, where the first dictionary is a cache unit in which preset identifiers are cached, and a preset identifier is the identifier of an object corresponding to a character string in log information that has already been stored.
The first dictionary caches the identifiers of the objects stored in the database, where objects are stored in the database according to the correspondence between each object and its identifier. Before the lookup of the unique identifier in the first dictionary begins, the first dictionary reads the object identifiers from the database and caches them locally. Thus, by looking up whether an identifier identical to the unique identifier exists in the first dictionary, it can be determined whether the object corresponding to the character string already exists in the database.
It should be noted that, in the embodiment of the present invention, the unique identifier of the object is generated in the same way as the identifiers cached in the first dictionary and the identifiers stored in the database.
The storage unit 50 is configured to store the object corresponding to the unique identifier if the unique identifier does not exist in the first dictionary, and not to store the object corresponding to the unique identifier if the unique identifier exists in the first dictionary.
After the unique identifier of the object corresponding to the character string is generated, the unique identifier is looked up in the first dictionary. If the unique identifier exists in the first dictionary, the object corresponding to the unique identifier is not stored; otherwise, the object corresponding to the unique identifier is stored and the unique identifier is cached in the first dictionary so that subsequent character strings can be processed.
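The lookup-then-store decision can be sketched as follows. This is an assumption-laden illustration: the first dictionary is modeled as a Python `set`, the database as a plain `dict`, and `store_if_new` is a hypothetical name.

```python
def store_if_new(obj_id: str, obj: object, first_dict: set, storage: dict) -> bool:
    """Store obj only when its identifier is not yet cached.

    Returns True when the object was stored, False when it was skipped.
    """
    if obj_id in first_dict:
        # Identifier already known: the object is already stored.
        return False
    storage[obj_id] = obj
    # Cache the identifier so later strings with the same object are skipped.
    first_dict.add(obj_id)
    return True


first_dict = {"id-1"}                 # identifiers already in the database
storage = {"id-1": "existing object"}
stored_again = store_if_new("id-1", "duplicate", first_dict, storage)
stored_new = store_if_new("id-2", "new object", first_dict, storage)
```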
According to the embodiment of the present invention, a character string used for recording multiple pieces of dimension information is obtained; the character string is parsed to obtain the corresponding object; a unique identifier of that object is generated; and the first dictionary is searched for the unique identifier. If the unique identifier does not exist in the first dictionary, the object corresponding to the unique identifier is stored; if it exists, the object is not stored. Thus, for repeated character strings, or character strings that differ only in parts that do not need to be parsed, the parsed objects are the same object, and identical objects are stored only once. This solves the prior-art problem of character string data wasting storage space and achieves the effect of reducing the amount of stored data.
Preferably, the character string processing apparatus further includes: a second lookup unit, configured to look up whether the character string exists in a second dictionary after the character string used for recording multiple pieces of dimension information is obtained and before the character string is parsed to obtain the corresponding object; a filtering unit, configured to filter out the character string if it exists in the second dictionary; and a first cache unit, configured to cache the character string in the second dictionary if it does not exist there. The parsing unit includes: a parsing module, configured to parse the character string determined not to exist in the second dictionary, to obtain the object corresponding to that character string.
The second dictionary is a cache unit that caches the character strings that have already been matched. Because parsing a character string is relatively slow, this embodiment uses the second dictionary to match character strings before they are parsed: the second dictionary is searched for the obtained character string. If it exists, the character string is filtered out, since an identical character string has already been processed and this one does not need to be parsed and stored again. If it does not exist, the character string is cached in the second dictionary and then parsed.
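The pre-parse filter can be sketched as a generator. This is only an illustration under the assumption that the second dictionary is an in-memory `set` of raw strings; `filter_new_strings` is a hypothetical name.

```python
from typing import Iterable, Iterator


def filter_new_strings(strings: Iterable[str], second_dict: set) -> Iterator[str]:
    """Yield only strings not seen before, caching each one as it passes."""
    for raw in strings:
        if raw in second_dict:
            # Identical string already handled: skip the slow parse.
            continue
        second_dict.add(raw)
        yield raw


second_dict = set()
to_parse = list(filter_new_strings(["ua-a", "ua-b", "ua-a", "ua-a"], second_dict))
```

Only the strings that survive this filter are handed to the parser, so each distinct raw string is parsed at most once.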
According to the embodiment of the present invention, when the volume of character string data is very large and repeated character strings exist, the second dictionary is used to preprocess the character strings so that repeated character strings are removed directly. That is, caching the character strings that carry the dimension information reduces the number of character strings entering the dimension parsing flow, which provides a fast, simple outer-layer deduplication filter. Matching on the objects that hold the valuable dimension information obtained after parsing then achieves the true dimension deduplication. This reduces the amount of character string data to be parsed and improves data processing efficiency.
Preferably, the storage unit includes: a marking module, configured to mark the state of the object corresponding to the unique identifier as a newly-added state if the unique identifier does not exist in the first dictionary; a judging module, configured to judge whether the number of objects marked as being in the newly-added state reaches a preset threshold; and a storage module, configured to store the objects marked as being in the newly-added state by inserting them into the database if it is judged that their number reaches the preset threshold.
When it is determined that the unique identifier does not exist in the first dictionary, that is, the object corresponding to the unique identifier is not stored in the database, the object is marked as being in the newly-added state, and it is judged whether the number of objects marked as being in the newly-added state reaches the preset threshold. If so, the objects marked as being in the newly-added state are inserted into the database in a batch. This realizes batch storage and avoids the frequent database access caused by storing objects one by one.
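A sketch of threshold-triggered batch insertion, using Python's built-in `sqlite3` module purely as a stand-in database. The `BatchStore` class, the `ua_object` table and its columns are assumptions made for this example.

```python
import sqlite3


class BatchStore:
    """Collects newly-added objects and inserts them in batches (illustrative)."""

    def __init__(self, conn: sqlite3.Connection, threshold: int) -> None:
        self.conn = conn
        self.threshold = threshold
        self.pending = []  # objects currently marked as newly added
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ua_object (id TEXT PRIMARY KEY, payload TEXT)"
        )

    def mark_new(self, obj_id: str, payload: str) -> None:
        self.pending.append((obj_id, payload))
        # Write to the database only once enough objects have accumulated.
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO ua_object (id, payload) VALUES (?, ?)", self.pending
            )
        self.pending.clear()


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM ua_object").fetchone()[0]
```

In this sketch a caller would also invoke `flush()` once at the end of a run so that a final partial batch below the threshold is not left unwritten; the embodiment does not describe that step.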
Preferably, the character string processing apparatus further includes: a second cache unit, configured to cache the identifiers of the objects in the database in the first dictionary before the first dictionary is searched for the unique identifier. The second cache unit is further configured to cache the unique identifier in the first dictionary if, after the first dictionary is searched, the unique identifier is found not to exist there.
In the embodiment of the present invention, before the first dictionary is searched for the unique identifier, all identifiers in the database are first cached in the first dictionary. Because the first dictionary is a cache unit, looking up the unique identifier in the first dictionary is far more efficient than looking it up in the database.
When it is determined that the unique identifier does not exist in the first dictionary, the unique identifier is cached in the first dictionary so that it serves as one of the identifiers in the first dictionary for subsequent lookups.
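Preloading the first dictionary can be sketched as a single query whose result is held in memory. The example again assumes `sqlite3` as the database and a hypothetical `ua_object` table with an `id` column; `load_first_dict` is an illustrative name.

```python
import sqlite3


def load_first_dict(conn: sqlite3.Connection) -> set:
    """Read every stored object identifier once and cache it in memory."""
    return {row[0] for row in conn.execute("SELECT id FROM ua_object")}


def is_known(obj_id: str, first_dict: set) -> bool:
    # Membership test against the in-memory cache; no database round trip.
    return obj_id in first_dict
```

After the initial load, each lookup is a set membership test, and identifiers of newly stored objects are added to the same set so that the cache stays consistent with the database.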
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, mobile terminal, server, network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk or optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (10)

  1. A character string processing method, characterized by comprising:
    obtaining a character string used for recording multiple pieces of dimension information, wherein the multiple pieces of dimension information in the character string are the dimension information that needs to be parsed from the character string;
    parsing the character string to obtain an object corresponding to the character string;
    generating a unique identifier of the object corresponding to the character string;
    looking up whether the unique identifier exists in a first dictionary, wherein the first dictionary is a cache unit in which preset identifiers are cached, and a preset identifier is the identifier of an object corresponding to a character string in stored log information;
    if the unique identifier does not exist in the first dictionary, storing the object corresponding to the unique identifier; and
    if the unique identifier exists in the first dictionary, not storing the object corresponding to the unique identifier.
  2. The character string processing method according to claim 1, characterized in that
    after obtaining the character string used for recording multiple pieces of dimension information, and before parsing the character string to obtain the object corresponding to the character string, the character string processing method further comprises: looking up whether the character string exists in a second dictionary; if the character string exists in the second dictionary, filtering out the character string; and if the character string does not exist in the second dictionary, caching the character string in the second dictionary,
    and parsing the character string to obtain the object corresponding to the character string comprises: parsing the character string determined not to exist in the second dictionary, to obtain an object corresponding to the determined character string.
  3. The character string processing method according to claim 1, characterized in that, if the unique identifier does not exist in the first dictionary, storing the object corresponding to the unique identifier comprises:
    if the unique identifier does not exist in the first dictionary, marking the state of the object corresponding to the unique identifier as a newly-added state;
    judging whether the number of objects marked as being in the newly-added state reaches a preset threshold; and
    if it is judged that the number of objects marked as being in the newly-added state reaches the preset threshold, storing the objects marked as being in the newly-added state by inserting them into a database.
  4. The character string processing method according to claim 3, characterized in that
    before looking up whether the unique identifier exists in the first dictionary, the character string processing method further comprises: caching the identifiers of the objects in the database in the first dictionary,
    wherein, after looking up whether the unique identifier exists in the first dictionary, if the unique identifier does not exist in the first dictionary, the unique identifier is cached in the first dictionary.
  5. The character string processing method according to any one of claims 1 to 4, characterized in that generating the unique identifier of the object corresponding to the character string comprises:
    calculating a hash value of the multiple pieces of dimension information in the character string;
    obtaining the unique identifier from the hash value of the multiple pieces of dimension information.
  6. A character string processing apparatus, characterized by comprising:
    an acquiring unit, configured to obtain a character string used for recording multiple pieces of dimension information, wherein the multiple pieces of dimension information in the character string are the dimension information that needs to be parsed from the character string;
    a parsing unit, configured to parse the character string to obtain an object corresponding to the character string;
    a generating unit, configured to generate a unique identifier of the object corresponding to the character string;
    a first lookup unit, configured to look up whether the unique identifier exists in a first dictionary, wherein the first dictionary is a cache unit in which preset identifiers are cached, and a preset identifier is the identifier of an object corresponding to a character string in stored log information;
    a storage unit, configured to store the object corresponding to the unique identifier if the unique identifier does not exist in the first dictionary, and not to store the object corresponding to the unique identifier if the unique identifier exists in the first dictionary.
  7. The character string processing apparatus according to claim 6, characterized in that
    the character string processing apparatus further comprises: a second lookup unit, configured to look up whether the character string exists in a second dictionary after the character string used for recording multiple pieces of dimension information is obtained and before the character string is parsed to obtain the object corresponding to the character string; a filtering unit, configured to filter out the character string if the character string exists in the second dictionary; and a first cache unit, configured to cache the character string in the second dictionary if the character string does not exist in the second dictionary,
    and the parsing unit comprises: a parsing module, configured to parse the character string determined not to exist in the second dictionary, to obtain an object corresponding to the determined character string.
  8. The character string processing apparatus according to claim 6, characterized in that the storage unit comprises:
    a marking module, configured to mark the state of the object corresponding to the unique identifier as a newly-added state if the unique identifier does not exist in the first dictionary;
    a judging module, configured to judge whether the number of objects marked as being in the newly-added state reaches a preset threshold; and
    a storage module, configured to store the objects marked as being in the newly-added state by inserting them into a database if it is judged that the number of objects marked as being in the newly-added state reaches the preset threshold.
  9. The character string processing apparatus according to claim 8, characterized in that
    the character string processing apparatus further comprises: a second cache unit, configured to cache the identifiers of the objects in the database in the first dictionary before the first dictionary is searched for the unique identifier,
    and the second cache unit is further configured to cache the unique identifier in the first dictionary if, after the first dictionary is searched for the unique identifier, the unique identifier does not exist in the first dictionary.
  10. The character string processing apparatus according to any one of claims 6 to 9, characterized in that the generating unit comprises:
    a calculating module, configured to calculate a hash value of the multiple pieces of dimension information in the character string;
    a determining module, configured to obtain the unique identifier from the hash value of the multiple pieces of dimension information.
CN201410758617.XA 2014-12-10 2014-12-10 Character string processing method and device Active CN104462396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410758617.XA CN104462396B (en) 2014-12-10 2014-12-10 Character string processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410758617.XA CN104462396B (en) 2014-12-10 2014-12-10 Character string processing method and device

Publications (2)

Publication Number Publication Date
CN104462396A CN104462396A (en) 2015-03-25
CN104462396B true CN104462396B (en) 2017-12-19

Family

ID=52908431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410758617.XA Active CN104462396B (en) 2014-12-10 2014-12-10 Character string processing method and device

Country Status (1)

Country Link
CN (1) CN104462396B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765833B (en) * 2015-04-13 2018-06-19 天脉聚源(北京)传媒科技有限公司 A kind of generation method and device of word association table
CN104765831B (en) * 2015-04-13 2018-06-19 天脉聚源(北京)传媒科技有限公司 A kind of generation of dictionary sheet and its application process and device
CN106503024A (en) * 2015-09-08 2017-03-15 北京国双科技有限公司 Log information processing method and device
CN108255877B (en) * 2016-12-29 2020-11-24 北京国双科技有限公司 Storage method and device of referee document
CN108255867A (en) * 2016-12-29 2018-07-06 北京国双科技有限公司 Unique mark processing method and processing device
CN108932305A (en) * 2018-06-12 2018-12-04 北京顶象技术有限公司 A kind of data processing method, device, electronic equipment and storage medium
CN110737644B (en) * 2019-10-12 2023-06-23 招商局金融科技有限公司 Method, device and computer readable storage medium for integrating customer information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620608A (en) * 2008-07-04 2010-01-06 全国组织机构代码管理中心 Information collection method and system
CN102968498A (en) * 2012-12-05 2013-03-13 华为技术有限公司 Method and device for processing data
CN103593440A (en) * 2013-11-15 2014-02-19 北京国双科技有限公司 Method and device for reading and writing log file

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5579195B2 (en) * 2008-12-22 2014-08-27 グーグル インコーポレイテッド Asynchronous distributed deduplication for replicated content addressable storage clusters

Also Published As

Publication number Publication date
CN104462396A (en) 2015-03-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Character string processing method and device

Effective date of registration: 20190531

Granted publication date: 20171219

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right
PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20171219