CN108628817A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN108628817A
Authority
CN
China
Prior art keywords
data
length
acquisition system
same type
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710153501.7A
Other languages
Chinese (zh)
Other versions
CN108628817B (en)
Inventor
冯博渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710153501.7A
Publication of CN108628817A
Application granted
Publication of CN108628817B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Abstract

An embodiment of the present invention provides a data processing method and device. The method includes: obtaining a first data set to be processed; obtaining the lengths of the data items in the first data set; and comparing those lengths and, according to the comparison result, determining a second data set from the first data set, where a correspondence exists between the data items in the second data set. Embodiments of the present invention make it possible to extract data that have a correspondence accurately and quickly.

Description

Data processing method and device
Technical field
The present invention relates to the field of Internet technologies, and in particular to a data processing method and device.
Background
In practical big-data processing scenarios, it is often necessary to sort out data of the same type from a large volume of data and to filter out the useless data mixed in with it. Consider the collection of translation source material. Translation source material provides translation correspondences between different languages; that is, the data items within the translation source material correspond to one another. If the source data extracted from web pages are mixed with useless data, the collection of translation source material is seriously disturbed, and the data that have a correspondence cannot be extracted from the source data accurately and quickly. How to extract data that have a correspondence accurately and quickly has therefore become a pressing problem in collecting translation source material.
Summary of the invention
Embodiments of the present invention provide a data processing method and device that can extract data that have a correspondence accurately and quickly.
A first aspect of the embodiments of the present invention provides a data processing method, including:
Obtaining a first data set to be processed.
Obtaining the lengths of the data items in the first data set.
Comparing the lengths of the data items in the first data set and, according to the comparison result, determining a second data set from the first data set, where a correspondence exists between the data items in the second data set.
A second aspect of the embodiments of the present invention provides a data processing device, including:
An acquisition module, configured to obtain a first data set to be processed.
The acquisition module is further configured to obtain the lengths of the data items in the first data set.
A processing module, configured to compare the lengths of the data items in the first data set and, according to the comparison result, determine a second data set from the first data set, where a correspondence exists between the data items in the second data set.
In the embodiments of the present invention, the first data set to be processed and the lengths of its data items are obtained, the lengths are compared, and a second data set is determined from the first data set according to the comparison result. A correspondence exists between the data items in the second data set, so data that have a correspondence can be extracted accurately and quickly.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of a data processing method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of web page content provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a data processing device provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of a terminal provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
The terminal described in the embodiments of the present invention may specifically be a desktop computer, a laptop, a smartphone, a tablet computer, or the like.
Referring to Fig. 1, a flow diagram of a data processing method provided by an embodiment of the present invention is shown. The data processing method described in this embodiment includes the following steps:
101. The terminal obtains a first data set to be processed.
In a specific implementation, the terminal may obtain the first data set to be processed by means of web crawler technology: the terminal sends a request for a Hypertext Markup Language (HTML) page to a web server, receives the HTML data returned by the web server, and parses the HTML data into a structured form using a third-party library (such as BeautifulSoup), thereby obtaining the first data set to be processed.
The HTML data may specifically be web page data that provides translation source material. Translation source material provides translation correspondences between different languages and may take the form of paragraphs, short sentences, and so on. The first data set obtained by the terminal contains translation source material and may also contain some useless data. As shown in Fig. 2, the first data set includes four groups of translation source material, numbered 1, 2, 3 and 4, together with the useless data in the rectangular box.
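The parsing part of step 101 can be sketched as follows. This is only an illustration under stated assumptions: the text names BeautifulSoup, whereas the sketch substitutes Python's standard-library `html.parser` so that it runs without extra dependencies; the network request is omitted; and `TextCollector`, `extract_first_data_set` and the sample markup are hypothetical names and data, not taken from the patent.

```python
from html.parser import HTMLParser
from typing import List


class TextCollector(HTMLParser):
    """Collects every non-empty text node of an HTML document, in order."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.items.append(text)


def extract_first_data_set(html: str) -> List[str]:
    """Parse HTML text and return its text nodes as the first data set."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment: one translation pair followed by useless data.
sample_html = (
    "<ul><li>translate</li><li>happyness</li>"
    "<li>Click here for more great offers today</li></ul>"
)
first_data_set = extract_first_data_set(sample_html)
```

At this stage useless text is still part of the result; it is the later length comparison that removes it.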
102. The terminal obtains the lengths of the data items in the first data set.
103. The terminal compares the lengths of the data items in the first data set and, according to the comparison result, determines a second data set from the first data set, where a correspondence exists between the data items in the second data set.
The length of a data item may be the length of the character string it contains.
In a specific implementation, because translation source material provides translation correspondences between different languages, two adjacent data items in the first data set that are translation source material belong to the same type, and their lengths should not differ much. Useless data in the first data set have no translation correspondence with other data and do not belong to the same type, so the length of a useless item can be expected to differ considerably from the lengths of its neighbours. The terminal can therefore compare the lengths of every two adjacent data items in the first data set and, according to the comparison result, obtain the target data whose lengths satisfy a preset condition relative to the adjacent data. The target data form the second data set. A correspondence exists between the data items in the second data set; specifically, each two adjacent data items stand in a translation correspondence between different languages. In this way, when collecting translation source material, the terminal can extract the data that have a correspondence accurately and quickly and effectively filter out useless data.
Further, the terminal may output the data in the second data set to the terminal user according to the correspondence, so that the user can obtain the processing result promptly, verify it, and so on.
In some feasible embodiments, assume that the first data set includes n data items, where n is a positive integer. The terminal may then compare the lengths of adjacent data items in the first data set and obtain the target data whose lengths satisfy the preset condition as follows:
Starting from the 1st data item in the first data set (i = 1), the terminal judges, from the length of the i-th data item and the length of the (i+1)-th data item, whether the i-th and (i+1)-th data items are same-type data, where i is a positive integer and i ≤ n-1. If they are same-type data (that is, a translation correspondence exists between them), the terminal determines the i-th and (i+1)-th data items as target data and then judges, from the lengths of the (i+2)-th and (i+3)-th data items, whether those two items are same-type data, determining target data according to the result. This continues until i = n-1, so that the entire first data set is traversed, all target data in the first data set that have a translation correspondence are determined, and the useless data in the first data set are filtered out.
Further, if the i-th and (i+1)-th data items are not same-type data (that is, no translation correspondence exists between them), the terminal determines that the i-th data item is useless data and does not treat it as target data. It then judges, from the lengths of the (i+1)-th and (i+2)-th data items, whether those two items are same-type data, determining target data according to the result. This continues until i = n-1, so that the entire first data set is traversed, all target data in the first data set that have a translation correspondence are determined, and the useless data in the first data set are filtered out.
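The traversal described above (advance by two items after a match, by one item after a mismatch) can be sketched as follows. This is a minimal sketch, not the patent's own code: `extract_targets` is a hypothetical name, indices are zero-based rather than starting from i = 1, and the `close_enough` predicate is a stand-in placeholder, since the same-type criterion itself is supplied by the caller.

```python
from typing import Callable, List, Sequence


def extract_targets(
    items: Sequence[str],
    same_type: Callable[[int, int], bool],
) -> List[str]:
    """Walk adjacent pairs and keep the pairs judged to be same-type data."""
    targets: List[str] = []
    i = 0  # zero-based counterpart of the text's i = 1
    while i < len(items) - 1:
        if same_type(len(items[i]), len(items[i + 1])):
            # Matching pair: both items are target data; skip past the pair.
            targets.extend([items[i], items[i + 1]])
            i += 2
        else:
            # Item i is treated as useless; retry from its right neighbour.
            i += 1
    return targets


def close_enough(a: int, b: int) -> bool:
    """Placeholder predicate (an assumption): lengths differ by less than 4."""
    return abs(a - b) < 4


sample = ["abcdefg", "higklmno", "x" * 30, "translate", "happyness"]
second_data_set = extract_targets(sample, close_enough)
```

With this sample, the 30-character item matches neither neighbour and is dropped, while the two pairs around it are kept.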
In some feasible embodiments, assume that the length of the i-th data item is L_i and the length of the (i+1)-th data item is L_{i+1}. Then:
When the lengths of the i-th and (i+1)-th data items (the two adjacent data items) are both short, same-type data may be defined as follows:
L_i and L_{i+1} are both less than a preset first value, and the absolute value of the difference between L_i and L_{i+1} is less than a preset second value.
When at least one of the i-th and (i+1)-th data items (the two adjacent data items) is long, same-type data may be defined as follows:
At least one of L_i and L_{i+1} is greater than or equal to the preset first value, and the ratio of the absolute value of the difference between L_i and L_{i+1} to the larger of L_i and L_{i+1} is less than a preset third value.
The preset first value may be 10, the preset second value may be 4, and the preset third value may be 0.22. That is, when the lengths of two adjacent data items are both short (less than 10), the items are considered same-type data if the absolute value of their length difference is less than 4. When at least one of the two lengths is long (greater than or equal to 10), the items are considered same-type data if the ratio of the absolute value of their length difference to the larger of the two lengths is less than 0.22. The preset first, second and third values can be determined through a large number of experiments. Take the preset third value as an example. If it is too small, the judgment of whether two adjacent data items are same-type data is very strict, but data that should be included in the second data set may be rejected. If it is too large, some useless data may be mistaken for target data and included in the second data set. Therefore, to ensure that useless data are filtered out while as much target data as possible is obtained, a suitable value can be chosen as the preset third value through extensive experimentation.
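The two-case definition can be written as a small predicate. The sketch below uses the example values 10, 4 and 0.22 from the text as defaults; the function name `is_same_type` and its parameter names are hypothetical.

```python
def is_same_type(
    len_a: int,
    len_b: int,
    first_value: int = 10,      # boundary between "short" and "long"
    second_value: int = 4,      # absolute-difference limit for short items
    third_value: float = 0.22,  # relative-difference limit for long items
) -> bool:
    """Return True if two adjacent lengths satisfy the same-type definition."""
    diff = abs(len_a - len_b)
    if len_a < first_value and len_b < first_value:
        # Both items are short: compare the absolute difference.
        return diff < second_value
    # At least one item is long, so max(...) >= first_value > 0 and the
    # division is safe for the default thresholds.
    return diff / max(len_a, len_b) < third_value
```

For instance, lengths 7 and 8 pass the short-item test (difference 1), lengths 23 and 24 pass the long-item test (ratio of about 0.04), and lengths 10 and 13 fail it (ratio of about 0.23).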
For example, as shown in Fig. 2, the first data set includes 9 data items, numbered 1 to 9 from top to bottom. The 1st and 2nd data items correspond to translation source material 1, the 3rd and 4th to translation source material 2, the 5th and 6th to translation source material 3, and the 8th and 9th to translation source material 4; the 7th data item is the useless data in the rectangular box. Applying the above definition of same-type data, the terminal determines that the 1st and 2nd data items, the 3rd and 4th data items, and the 5th and 6th data items are same-type data. The 7th and 8th data items differ too much in length to satisfy the definition, so the terminal determines that the 7th data item is useless data. The 8th and 9th data items are then found to be same-type data. The second data set determined in this way includes the four groups of translation source material 1, 2, 3 and 4, and the useless data (the 7th data item) is filtered out.
Further, the resulting second data set (the translation source material) can be applied in the training and learning process that precedes machine translation, so that an accurate translation function can be provided in practical applications.
In some feasible embodiments, the specific implementation process can be described as follows:
Assume that the array list corresponds to the first data set and the array newlist corresponds to the second data set, with list = ['abcdefg', 'higklmno', 'placeholder', 'Hoeing grain under the midday sun, sweat drips onto the soil beneath the grain; who knows that every single grain on the plate is the fruit of hard work', 'The white sun sets behind the mountains, the Yellow River flows into the sea; to see a thousand miles further, ascend another storey.', 'translate', 'happyness', 'I am just a string that came here to make trouble, please weed me out OK', 'modesty helps one to go forward', 'Pride goes before, and shame comes after'];
flag is initialized to 0;
Starting from i = 0, the following procedure is executed repeatedly, with the value of i increased by 1 after each execution, until i reaches the number of elements in the array list:
If i is equal to flag, strlen is set to the data length of list[i], and this pass of the procedure ends;
If i is greater than flag, it is judged whether strlen and list[i] correspond to same-type data. If so, flag is set to i+1 and list[i-1] and list[i] are stored in the array newlist; if not, flag is set to i. This pass of the procedure then ends.
Further, the array newlist may be output after the above procedure has finished.
The specific process of judging whether strlen and list[i] correspond to same-type data can be described as follows:
Determine the maximum of strlen and the data length of list[i], and calculate the ratio of the absolute value of the difference between strlen and the data length of list[i] to that maximum. If the maximum is less than the preset first value (for example, 10) and the absolute value of the difference between strlen and the data length of list[i] is less than the preset second value (for example, 4), or if the ratio is less than the preset third value (for example, 0.22), strlen and list[i] are determined to correspond to same-type data; otherwise, they are determined not to correspond to same-type data.
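The flag-based procedure can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the sample strings are hypothetical stand-ins chosen only for their lengths, and on a mismatch the sketch refreshes `strlen` to the length of the current element. The description does not say so explicitly, but without that refresh the next comparison would still use the length of the rejected element.

```python
from typing import List, Sequence

FIRST, SECOND, THIRD = 10, 4, 0.22  # example thresholds from the text


def same_type(strlen: int, other_len: int) -> bool:
    """Same-type judgment as described for strlen and list[i]."""
    longest = max(strlen, other_len)
    diff = abs(strlen - other_len)
    if longest < FIRST and diff < SECOND:
        return True
    # Guard against dividing by zero when both strings are empty.
    return longest > 0 and diff / longest < THIRD


def filter_pairs(items: Sequence[str]) -> List[str]:
    """Flag-based single pass that keeps only matching adjacent pairs."""
    newlist: List[str] = []
    flag = 0
    strlen = 0
    for i in range(len(items)):
        if i == flag:
            # Start of a candidate pair: remember its length.
            strlen = len(items[i])
        elif i > flag:
            if same_type(strlen, len(items[i])):
                flag = i + 1
                newlist.extend([items[i - 1], items[i]])
            else:
                flag = i
                # Assumption (not stated in the text): restart the
                # comparison from the current element.
                strlen = len(items[i])
    return newlist


# Hypothetical data: two useless items mixed into four translation pairs.
items = [
    "abcdefg", "higklmno",
    "ad",
    "good morning everyone", "bonjour a tous le monde",
    "translate", "happyness",
    "I am a stray string, please remove me",
    "thank you", "gracias",
]
newlist = filter_pairs(items)
```

With this data, the short stray item and the long stray sentence are dropped and the four pairs are kept in their original order.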
The result of the above procedure is that the 3rd data item in list ('placeholder') and the 8th data item ('I am just a string that came here to make trouble, please weed me out OK') are useless data and are therefore not placed in newlist. The output format of newlist may be as follows:
abcdefg and higklmno are the same group
'Hoeing grain under the midday sun, sweat drips onto the soil beneath the grain; who knows that every single grain on the plate is the fruit of hard work' and 'The white sun sets behind the mountains, the Yellow River flows into the sea; to see a thousand miles further, ascend another storey.' are the same group
translate and happyness are the same group
'modesty helps one to go forward' and 'Pride goes before, and shame comes after' are the same group
In the embodiments of the present invention, the terminal obtains the first data set to be processed and the lengths of its data items, compares those lengths, and determines a second data set from the first data set according to the comparison result. A correspondence exists between the data items in the second data set, so that when collecting translation source material the terminal can extract the data that have a correspondence accurately and quickly and effectively filter out useless data.
Referring to Fig. 3, a structural diagram of a data processing device provided by an embodiment of the present invention is shown. The data processing device described in this embodiment includes:
An acquisition module 301, configured to obtain a first data set to be processed.
The acquisition module 301 is further configured to obtain the lengths of the data items in the first data set.
A processing module 302, configured to compare the lengths of the data items in the first data set and, according to the comparison result, determine a second data set from the first data set, where a correspondence exists between the data items in the second data set.
In some feasible embodiments, the acquisition module 301 is specifically configured to:
Parse web page content and extract the first data set to be processed.
In some feasible embodiments, the processing module 302 is specifically configured to:
Compare the lengths of adjacent data items in the first data set and obtain the target data whose lengths satisfy a preset condition relative to the adjacent data, where the target data form the second data set.
In some feasible embodiments, the first data set includes n data items, where n is a positive integer, and the processing module 302 includes:
A judging unit 3020, configured to judge, starting from i = 1 and according to the length of the i-th data item and the length of the (i+1)-th data item, whether the i-th data item and the (i+1)-th data item are same-type data, where i is a positive integer and i ≤ n-1.
A determining unit 3021, configured to determine the i-th data item and the (i+1)-th data item as target data if the judging unit judges that the i-th data item and the (i+1)-th data item are same-type data.
The judging unit 3020 is further configured to judge, according to the length of the (i+2)-th data item and the length of the (i+3)-th data item, whether the (i+2)-th data item and the (i+3)-th data item are same-type data, until i = n-1.
In some feasible embodiments, the judging unit 3020 is further configured to, if it judges that the i-th data item and the (i+1)-th data item are not same-type data, judge, according to the length of the (i+1)-th data item and the length of the (i+2)-th data item, whether the (i+1)-th data item and the (i+2)-th data item are same-type data, until i = n-1.
In some feasible embodiments, the length of the i-th data item is L_i and the length of the (i+1)-th data item is L_{i+1}, where:
The i-th data item and the (i+1)-th data item being same-type data includes:
L_i and L_{i+1} are both less than a preset first value, and the absolute value of the difference between L_i and L_{i+1} is less than a preset second value.
Alternatively,
At least one of L_i and L_{i+1} is greater than or equal to the preset first value, and the ratio of the absolute value of the difference between L_i and L_{i+1} to the larger of L_i and L_{i+1} is less than a preset third value.
It can be understood that the functions of the functional modules and units of the data processing device of this embodiment can be implemented according to the method in the above method embodiment. For the specific implementation process, refer to the related description of the above method embodiment; details are not repeated here.
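The division of work among the modules can be sketched as a small set of classes. This is an illustrative mapping only, assuming the example thresholds 10, 4 and 0.22 from the method embodiment; the class and method names are hypothetical, and the determining unit 3021 is folded into the processing module for brevity.

```python
from typing import List, Sequence


class AcquisitionModule:
    """Illustrative counterpart of acquisition module 301."""

    def get_data_set(self, source: Sequence[str]) -> List[str]:
        return list(source)

    def get_lengths(self, data_set: Sequence[str]) -> List[int]:
        return [len(item) for item in data_set]


class JudgingUnit:
    """Illustrative counterpart of judging unit 3020."""

    def __init__(self, first: int = 10, second: int = 4, third: float = 0.22):
        self.first, self.second, self.third = first, second, third

    def same_type(self, len_a: int, len_b: int) -> bool:
        diff = abs(len_a - len_b)
        if len_a < self.first and len_b < self.first:
            return diff < self.second
        return diff / max(len_a, len_b) < self.third


class ProcessingModule:
    """Illustrative counterpart of processing module 302."""

    def __init__(self) -> None:
        self.judging_unit = JudgingUnit()

    def determine(self, data_set: Sequence[str], lengths: Sequence[int]) -> List[str]:
        targets: List[str] = []
        i = 0
        while i < len(data_set) - 1:
            if self.judging_unit.same_type(lengths[i], lengths[i + 1]):
                targets.extend(data_set[i:i + 2])  # role of determining unit 3021
                i += 2
            else:
                i += 1
        return targets


acquisition = AcquisitionModule()
data_set = acquisition.get_data_set(
    ["abcdefg", "higklmno", "x" * 30, "translate", "happyness"]
)
result = ProcessingModule().determine(data_set, acquisition.get_lengths(data_set))
```

Keeping the judgment in its own unit means the thresholds can be tuned without touching the traversal logic.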
In the embodiments of the present invention, the acquisition module 301 obtains the first data set to be processed and the lengths of its data items, and the processing module 302 compares those lengths and determines a second data set from the first data set according to the comparison result. A correspondence exists between the data items in the second data set, so that when collecting translation source material the data that have a correspondence can be extracted accurately and quickly and useless data can be effectively filtered out.
Referring to Fig. 4, a structural diagram of a terminal provided by an embodiment of the present invention is shown. The terminal described in this embodiment includes a processor 401, a network interface 402 and a memory 403. The processor 401, the network interface 402 and the memory 403 may be connected by a bus or in other ways; the embodiments of the present invention take connection by a bus as an example.
The processor 401 (or central processing unit, CPU) is the computing core and control core of the terminal. It can parse the various instructions in the terminal and process the various data of the terminal. For example, the CPU can parse a power-on or power-off instruction sent by the user to the terminal and control the terminal to power on or off; as another example, the CPU can transfer various kinds of interaction data between the internal structures of the terminal. The network interface 402 may optionally include a standard wired interface and a wireless interface (such as Wi-Fi or a mobile communication interface), and transmits and receives data under the control of the processor 401. The memory 403 is the storage device of the terminal and stores programs and data. It can be understood that the memory 403 here may include the built-in memory of the terminal and may also include extended memory supported by the terminal. The memory 403 provides storage space that stores the operating system of the terminal, which may include but is not limited to a Windows system, an Android system, an iOS system, and so on; the present invention places no limitation on this.
In the embodiments of the present invention, the processor 401 performs the following operations by running the executable program code in the memory 403:
The processor 401 is configured to obtain a first data set to be processed through the network interface 402.
The processor 401 is further configured to obtain the lengths of the data items in the first data set.
The processor 401 is further configured to compare the lengths of the data items in the first data set and, according to the comparison result, determine a second data set from the first data set, where a correspondence exists between the data items in the second data set.
In some feasible embodiments, the processor 401 is specifically configured to:
Parse web page content and extract the first data set to be processed.
In some feasible embodiments, the processor 401 is specifically configured to:
Compare the lengths of adjacent data items in the first data set and obtain the target data whose lengths satisfy a preset condition relative to the adjacent data, where the target data form the second data set.
In some feasible embodiments, the first data set includes n data items, where n is a positive integer, and the processor 401 is specifically configured to:
Starting from i = 1, judge, according to the length of the i-th data item and the length of the (i+1)-th data item, whether the i-th data item and the (i+1)-th data item are same-type data, where i is a positive integer and i ≤ n-1.
If the i-th data item and the (i+1)-th data item are same-type data, determine the i-th data item and the (i+1)-th data item as target data, and judge, according to the length of the (i+2)-th data item and the length of the (i+3)-th data item, whether the (i+2)-th data item and the (i+3)-th data item are same-type data, until i = n-1.
In some feasible embodiments, the processor 401 is further configured to, if the i-th data item and the (i+1)-th data item are not same-type data, judge, according to the length of the (i+1)-th data item and the length of the (i+2)-th data item, whether the (i+1)-th data item and the (i+2)-th data item are same-type data, until i = n-1.
In some feasible embodiments, the length of the i-th data item is L_i and the length of the (i+1)-th data item is L_{i+1}, where:
The i-th data item and the (i+1)-th data item being same-type data includes:
L_i and L_{i+1} are both less than a preset first value, and the absolute value of the difference between L_i and L_{i+1} is less than a preset second value.
Alternatively,
At least one of L_i and L_{i+1} is greater than or equal to the preset first value, and the ratio of the absolute value of the difference between L_i and L_{i+1} to the larger of L_i and L_{i+1} is less than a preset third value.
In a specific implementation, the processor 401, the network interface 402 and the memory 403 described in the embodiments of the present invention can carry out the implementations described in the flow of the data processing method provided by the embodiments of the present invention, and can also carry out the implementations described for the data processing device provided by the embodiments of the present invention; details are not repeated here.
In the embodiments of the present invention, the processor 401 obtains the first data set to be processed through the network interface 402, obtains the lengths of its data items, compares those lengths, and determines a second data set from the first data set according to the comparison result. A correspondence exists between the data items in the second data set, so that when collecting translation source material the data that have a correspondence can be extracted accurately and quickly and useless data can be effectively filtered out.
A person of ordinary skill in the art will understand that all or part of the flow of the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only a preferred embodiment of the present invention and of course cannot be used to limit the scope of rights of the present invention. A person of ordinary skill in the art will understand that implementations of all or part of the flow of the above embodiment, and equivalent variations made according to the claims of the present invention, still fall within the scope covered by the invention.

Claims (12)

1. a kind of data processing method, which is characterized in that including:
Obtain the first pending data acquisition system;
Obtain the length for the data that first data acquisition system includes;
The comparison that the data that first data acquisition system includes are carried out with the length, according to comparison result from first data Determine the second data set in set, there are correspondences between the data that the second data set includes.
2. according to the method described in claim 1, it is characterized in that, described obtain pending the first data acquisition system, including:
Web page contents are parsed, the first pending data acquisition system is extracted.
3. method according to claim 1 or 2, which is characterized in that the data for including to first data acquisition system The comparison for carrying out the length determines the second data set according to comparison result from first data acquisition system, including:
The comparison of length between adjacent data is carried out to first data acquisition system, the length for obtaining length and adjacent data meets The target data of preset condition, the target data form the second data set.
4. The method according to claim 3, characterized in that the first data set includes n data items, n being a positive integer, and comparing the lengths of adjacent data items in the first data set to obtain the target data whose length and the length of an adjacent data item satisfy the preset condition comprises:
starting from i = 1, judging, according to the length of the i-th data item and the length of the (i+1)-th data item, whether the i-th data item and the (i+1)-th data item are same-type data, i being a positive integer and i ≤ n-1; and
if the i-th data item and the (i+1)-th data item are same-type data, determining the i-th data item and the (i+1)-th data item as target data, and judging, according to the length of the (i+2)-th data item and the length of the (i+3)-th data item, whether the (i+2)-th data item and the (i+3)-th data item are same-type data, until i = n-1.
5. The method according to claim 4, characterized in that the method further comprises:
if the i-th data item and the (i+1)-th data item are not same-type data, judging, according to the length of the (i+1)-th data item and the length of the (i+2)-th data item, whether the (i+1)-th data item and the (i+2)-th data item are same-type data, until i = n-1.
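The adjacent-pair scan of claims 4 and 5 can be sketched as follows. This is one reading of the claims, not a definitive implementation: the function name `select_target_data` is hypothetical, the index is zero-based (so the claims' i = 1 corresponds to index 0), and the same-type test is passed in as a parameter because the scan itself does not depend on how that test is defined. The predicate used in the usage lines is a placeholder chosen only for the example.

```python
def select_target_data(items, is_same_type):
    """Scan adjacent pairs; keep both members of every same-type pair.

    items        -- the first data set, in order
    is_same_type -- callable taking two lengths and returning True or False
    """
    targets = []
    i = 0  # zero-based counterpart of the claims' index i
    while i < len(items) - 1:  # the last pair compared is (n-1, n) in claim terms
        if is_same_type(len(items[i]), len(items[i + 1])):
            # Claim 4: both items become target data, then skip past the pair.
            targets.extend([items[i], items[i + 1]])
            i += 2
        else:
            # Claim 5: advance by one and compare the next adjacent pair.
            i += 1
    return targets


# Placeholder predicate (an assumption for this example only).
def close_in_length(a, b):
    return abs(a - b) <= 2


first_data_set = ["aaaa", "bbbbb", "c" * 40, "dd", "eee"]
second_data_set = select_target_data(first_data_set, close_in_length)
```

With this input the pairs ("aaaa", "bbbbb") and ("dd", "eee") are selected, while the 40-character item matches neither neighbour and is dropped.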
6. The method according to claim 4, characterized in that the length of the i-th data item is L_i and the length of the (i+1)-th data item is L_{i+1}, wherein:
the i-th data item and the (i+1)-th data item being same-type data comprises:
L_i and L_{i+1} are both less than a preset first value, and the absolute value of the difference between L_i and L_{i+1} is less than a preset second value;
or,
at least one of L_i and L_{i+1} is greater than or equal to the preset first value, and the ratio of the absolute value of the difference between L_i and L_{i+1} to the larger of L_i and L_{i+1} is less than a preset third value.
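The two-branch same-type condition of claim 6 can be written as a small predicate. The claims do not fix the three preset values, so the defaults below (20, 5 and 0.2) are purely illustrative assumptions, as is the function name `is_same_type`.

```python
def is_same_type(len_a, len_b, first_value=20, second_value=5, third_value=0.2):
    """Return True if two lengths satisfy the same-type condition of claim 6.

    first_value, second_value and third_value stand for the preset first,
    second and third values; their defaults here are hypothetical.
    """
    if first_value <= 0:
        # Assumption: a positive first value, so the ratio below never divides by zero.
        raise ValueError("first_value must be positive")

    difference = abs(len_a - len_b)
    if len_a < first_value and len_b < first_value:
        # Both items are short: compare the absolute difference.
        return difference < second_value
    # At least one item is long: compare the difference relative to the larger length.
    return difference / max(len_a, len_b) < third_value
```

A plausible motivation for the two branches is that an absolute tolerance suits short items, while a relative tolerance scales better for long ones; the claims themselves state only the condition, not the rationale.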
7. A data processing device, characterized by comprising:
an acquisition module, configured to obtain a first data set to be processed;
the acquisition module being further configured to obtain the lengths of the data items included in the first data set; and
a processing module, configured to compare the lengths of the data items included in the first data set and determine a second data set from the first data set according to the comparison result, wherein a correspondence exists between the data items included in the second data set.
8. The device according to claim 7, characterized in that the acquisition module is specifically configured to:
parse web page content and extract the first data set to be processed.
9. The device according to claim 7 or 8, characterized in that the processing module is specifically configured to:
compare the lengths of adjacent data items in the first data set to obtain target data whose length and the length of an adjacent data item satisfy a preset condition, the target data forming the second data set.
10. The device according to claim 9, characterized in that the first data set includes n data items, n being a positive integer, and the processing module comprises:
a judging unit, configured to judge, starting from i = 1 and according to the length of the i-th data item and the length of the (i+1)-th data item, whether the i-th data item and the (i+1)-th data item are same-type data, i being a positive integer and i ≤ n-1;
a determination unit, configured to determine the i-th data item and the (i+1)-th data item as target data if the judging unit judges that the i-th data item and the (i+1)-th data item are same-type data;
the judging unit being further configured to judge, according to the length of the (i+2)-th data item and the length of the (i+3)-th data item, whether the (i+2)-th data item and the (i+3)-th data item are same-type data, until i = n-1.
11. The device according to claim 10, characterized in that
the judging unit is further configured to, if it judges that the i-th data item and the (i+1)-th data item are not same-type data, judge, according to the length of the (i+1)-th data item and the length of the (i+2)-th data item, whether the (i+1)-th data item and the (i+2)-th data item are same-type data, until i = n-1.
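The module and unit decomposition of claims 7 to 11 can be mapped onto objects as in the following sketch. The class names mirror the claim terms, but the method names, the in-memory `source` list standing in for a real data source, and the placeholder predicate are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AcquisitionModule:
    """Obtains the first data set and the lengths of its items."""
    source: List[str]  # hypothetical stand-in for parsed web page content

    def acquire(self) -> List[str]:
        return list(self.source)

    def lengths(self, items: List[str]) -> List[int]:
        return [len(item) for item in items]


@dataclass
class JudgingUnit:
    """Judges whether two adjacent items are same-type data."""
    predicate: Callable[[int, int], bool]

    def judge(self, len_a: int, len_b: int) -> bool:
        return self.predicate(len_a, len_b)


@dataclass
class DeterminationUnit:
    """Records items that have been determined as target data."""
    targets: List[str] = field(default_factory=list)

    def determine(self, first: str, second: str) -> None:
        self.targets.extend([first, second])


@dataclass
class ProcessingModule:
    judging: JudgingUnit
    determination: DeterminationUnit

    def process(self, items: List[str], lengths: List[int]) -> List[str]:
        i = 0  # zero-based counterpart of the claims' index i
        while i < len(items) - 1:
            if self.judging.judge(lengths[i], lengths[i + 1]):
                self.determination.determine(items[i], items[i + 1])
                i += 2  # claim 10: continue with items i+2 and i+3
            else:
                i += 1  # claim 11: continue with items i+1 and i+2
        return self.determination.targets


acquisition = AcquisitionModule(source=["title", "x" * 50, "What is it?", "It is this."])
items = acquisition.acquire()
processing = ProcessingModule(
    judging=JudgingUnit(predicate=lambda a, b: abs(a - b) < 5),  # placeholder rule
    determination=DeterminationUnit(),
)
second_data_set = processing.process(items, acquisition.lengths(items))
```

Passing the predicate into `JudgingUnit` keeps the same-type rule replaceable without touching the scanning logic in `ProcessingModule`.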
12. The device according to claim 10, characterized in that the length of the i-th data item is L_i and the length of the (i+1)-th data item is L_{i+1}, wherein:
the i-th data item and the (i+1)-th data item being same-type data comprises:
L_i and L_{i+1} are both less than a preset first value, and the absolute value of the difference between L_i and L_{i+1} is less than a preset second value;
or,
at least one of L_i and L_{i+1} is greater than or equal to the preset first value, and the ratio of the absolute value of the difference between L_i and L_{i+1} to the larger of L_i and L_{i+1} is less than a preset third value.
CN201710153501.7A 2017-03-15 2017-03-15 Data processing method and device Active CN108628817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710153501.7A CN108628817B (en) 2017-03-15 2017-03-15 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710153501.7A CN108628817B (en) 2017-03-15 2017-03-15 Data processing method and device

Publications (2)

Publication Number Publication Date
CN108628817A true CN108628817A (en) 2018-10-09
CN108628817B CN108628817B (en) 2022-07-26

Family

ID=63686575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710153501.7A Active CN108628817B (en) 2017-03-15 2017-03-15 Data processing method and device

Country Status (1)

Country Link
CN (1) CN108628817B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102810097A (en) * 2011-06-02 2012-12-05 高德软件有限公司 Method and device for extracting webpage text content
CN104572946A (en) * 2014-12-30 2015-04-29 小米科技有限责任公司 Method and device for processing data of yellow pages
CN104573097A (en) * 2015-01-30 2015-04-29 湖南蚁坊软件有限公司 Method for extracting webpage content
WO2015165245A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage data processing method and device
WO2015176689A1 (en) * 2014-05-23 2015-11-26 华为技术有限公司 Data processing method and device
CN105447167A (en) * 2015-12-04 2016-03-30 北京奇虎科技有限公司 Processing method and apparatus for node cache data in distributed system
CN106484730A (en) * 2015-08-31 2017-03-08 北京国双科技有限公司 Character string matching method and device

Also Published As

Publication number Publication date
CN108628817B (en) 2022-07-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant