WO2014071850A1 - Method and apparatus for storing webpage access records - Google Patents


Info

Publication number
WO2014071850A1
WO2014071850A1 PCT/CN2013/086663 CN2013086663W WO2014071850A1 WO 2014071850 A1 WO2014071850 A1 WO 2014071850A1 CN 2013086663 W CN2013086663 W CN 2013086663W WO 2014071850 A1 WO2014071850 A1 WO 2014071850A1
Authority
WO
WIPO (PCT)
Prior art keywords
scanned
client terminals
file
files
numbers
Prior art date
Application number
PCT/CN2013/086663
Other languages
French (fr)
Inventor
Jiwen ZHOU
Yang Yu
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2014071850A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • the present disclosure relates to information security technology, and more particularly to a file recognition method, device and server.
  • The server can only perform a separate analysis and determination for each file. In the determination process, the server does not consider the parent-child relationship or dependency relationship between one file and other files. For example, a new virus parent file A.exe releases two virus progeny files, B.exe and B.dll, in one directory when it runs. B.dll is a virus module with harmful behavior, while B.exe itself has no substantially harmful behavior and is only responsible for running and loading B.dll after the system is started.
  • One example of the present disclosure provides a file recognition method, which can solve the problem that a server's recognition accuracy of files reported by a client terminal is low in the related art.
  • a file recognition method includes: establishing a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; for each of the scanned files, querying GUID of the client terminals that report the scanned file from the database, respectively, according to the checksums of the scanned files; obtaining a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals, according to the queried out GUID; determining an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
  • The device includes: a database establishment unit configured to establish a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of the scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; a query unit configured to, according to the checksums of the scanned files, for each of the scanned files, query the GUIDs of the client terminals that report the scanned file from the database, respectively; an obtaining unit configured to, according to the queried GUIDs, obtain a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals; a determination unit configured to determine an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
  • Still another example of the present disclosure provides a server which includes the above file recognition device.
  • a server when a server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to an attribute of a file with a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
  • Fig. 1 is a flowchart of a file recognition method according to one example of the present disclosure
  • Fig. 2 is a specific flow chart of a step S101 of the file recognition method according to one example of the present disclosure
  • Fig. 3 is a schematic principle diagram of establishing database in the step S101 of the file recognition method according to one example of the present disclosure
  • Fig. 4 is a specific flow chart of a step S103 of the file recognition method according to one example of the present disclosure
  • Fig. 5 is a block diagram of a file recognition device according to one example of the present disclosure.
  • Fig. 6 is a block diagram of a computing device according to one example of the present disclosure.
  • a server when a server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to an attribute of a file with a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
  • Fig. 1 is a flowchart of a file recognition method according to one example of the present disclosure, and details are as follows:
  • Step S101: establishing a database according to scanned results reported by client terminals; the database records a globally unique identifier (GUID) of each client terminal and the checksums of the scanned files reported by each client terminal, extracted from the scanned results, each of the scanned files corresponding to one checksum.
  • When receiving a scanned result reported by a client terminal which performs virus or Trojan killing, the server can extract from the scanned result the GUID of the reporting client terminal and a checksum of each scanned file reported by that client terminal.
  • The GUID, as a unique identifier of the client terminal, can be used to distinguish the client terminal from other client terminals, and can further be used to distinguish the computer equipment on which different client terminals reside.
  • The checksum includes, but is not limited to, the file's Message-Digest Algorithm 5 (MD5) checksum or another hash checksum of the file, which can be used herein as a unique identifier for distinguishing different scanned files.
  • establishment of the database can refer to the flowchart shown in Fig. 2:
  • Step S201: obtaining and storing log information for every file scan performed by each client terminal.
  • a bypass procedure can be deployed in the client terminal, and the bypass procedure can be configured to record log information of each scanning of the client terminal and store the log information in a mass storage device such as a file transfer protocol (FTP) server, etc.
  • The log information includes the GUID of each client terminal and the checksums and file attributes of all scanned files, such as PE structure information of the scanned files, path information of the scanned files in the user environment, attribute information of PE resources of the scanned files or digital signatures of the scanned files, etc.; these are not enumerated here one by one.
  • the log information can also indicate conditions for generating the log information, such as generated through a full scan, or generated through scanning specified location, etc.
  • Step S202: after performing statistics and duplicate removal on the stored log information at a preset time point, extracting from the log information the GUID of each client terminal and a checksum of each scanned file that has been reported by each client terminal, and establishing a database according to the extracted result.
  • Each client terminal may scan files several times within a short period; thus, the log information stored in the mass storage device contains duplicate data.
  • the establishment of the database can be completed.
  • The extracted data can be stored in four key-value (K-V) NoSQL databases.
  • The four databases can include: a checksum information database, a GUID information database, a checksum index database and a GUID index database.
  • The principles for establishing the database do not limit the present disclosure; as one implementation manner of establishing the database, details are given in the schematic principle diagram shown in Fig. 3 and are not repeated here.
  • Step S102: for each of the scanned files, querying the GUIDs of the client terminals that report the scanned file from the database, respectively, according to the checksums of the scanned files.
  • The reporting client terminals of one scanned file are the clients which have reported the scanned file.
  • The GUID of each client terminal which has reported the scanned file can be found, i.e., it can be learned on which computer equipment each scanned file exists.
  • Step S103: according to the queried GUIDs of the reporting client terminals of each scanned file, obtaining a coexistence rate between a first scanned file and each second scanned file, respectively.
  • The first scanned file is the scanned file that currently needs to be recognized. It can be an unknown scanned file or a scanned file with suspicious behavior reported by the client terminal which currently reports a scanned result, and can also be a grey file which has been stored in the file information database of the server and has an unknown attribute.
  • The second scanned files are the other scanned files reported by each of the client terminals.
  • the coexistence rate of the two scanned files can reflect genetic relationship between the two scanned files, i.e., the higher the coexistence rate, the closer the genetic relationship between the two scanned files, and their attributes may be closer; on the contrary, if the coexistence rate is lower, it means the possibility that there is no direct link between the two scanned files is greater.
  • The coexistence rate can be determined according to the number of computer equipment which simultaneously has the two scanned files, and can also be determined according to both the number of computer equipment which simultaneously has the two scanned files and the number of computer equipment which has one of the two scanned files.
  • Fig. 4 shows a specific flow chart of the step S103 of the file recognition method according to one example of the present disclosure, and details are as follows:
  • Step S401: obtaining the count of reporting client terminals of the first scanned file and taking it as a first number.
  • The count of reporting client terminals of the first scanned file can be determined, i.e., a first number of computer equipment which has the first scanned file.
  • Step S402: obtaining the counts of reporting client terminals of each second scanned file and taking them as second numbers.
  • Step S403: according to the queried GUIDs of the reporting client terminals of each scanned file, obtaining the counts of reporting client terminals which simultaneously report the first scanned file and each second scanned file, and taking them as third numbers.
  • The counts of reporting client terminals which simultaneously report the first scanned file and each second scanned file can be determined, i.e., third numbers of computer equipment which simultaneously has the first scanned file and the second scanned file.
  • Step S404: determining the coexistence rate between the first scanned file and each second scanned file according to the first number, the second numbers and the third numbers.
  • The coexistence rate of the first scanned file and each second scanned file coexisting on the same machine can be calculated.
  • The coexistence rate of the first scanned file and each second scanned file can be calculated through a formula (the formula itself is not reproduced in this text), where A represents the coexistence rate of the first scanned file and the second scanned file; the formula also contains a constant, which can be determined by one skilled in the art according to the actual situation and, as an implementation example of the present disclosure, can have a value of 15; a represents the first number, b represents the second number, and d represents the third number.
  • The calculation formula for the coexistence rate includes, but is not limited to, the above form, and this does not limit the present disclosure.
  • Step S104: determining an attribute of the first scanned file according to attributes of the second scanned files each having a coexistence rate higher than a preset threshold.
  • A plurality of second scanned files having the highest coexistence rates with the first scanned file can be determined by ranking the obtained coexistence rates in descending order, i.e., a plurality of second scanned files having the closest genetic relationship with the first scanned file can be determined, and then the attribute of the first scanned file can be determined according to the attributes of the plurality of second scanned files.
  • Determining an attribute of the first scanned file according to attributes of the second scanned files each having a coexistence rate higher than a preset threshold specifically can include: when it cannot be determined whether the first scanned file is a black file or a white file according to contents of the first scanned file or a series of identification logic such as program behavior, the attribute of the first scanned file can be recognized according to the attribute distribution of the second scanned files each having a coexistence rate higher than the preset threshold through a classification algorithm such as the k-Nearest Neighbor (KNN) classification algorithm, etc.
  • determining an attribute of the first scanned file according to attributes of the second scanned files each having a highest coexistence rate specifically can include: when determining whether the first scanned file is black file or white file according to contents of the first scanned file or a series of identification logic such as program behavior, taking the attribute distribution of the second scanned files each having a coexistence rate higher than the preset threshold as one of determination factors for accurately determining an attribute of the first scanned file in combination with the determination results of the identification logic.
  • determining an attribute of the first scanned file according to attributes of the second scanned files can follow the following principles:
  • A high-scope white file certainly attracts white files.
  • A high-scope file is a file for which the number of reporting client terminals is very high, such as system software, commonly used software and other formal white files.
  • The main program file of a widely used application program certainly has the highest coexistence rate with the related component files of that application program on the same machine; thus, a grey file that has its highest coexistence rate with a white file is most likely a white file.
  • Likewise, a grey file that has its highest coexistence rate with a black file is most likely a black file.
  • Some virus files may maliciously promote some normal application software, with the result that the normal application software is recognized as a black file because it has a high coexistence rate with the virus file; thus, in actual application, filtering rules can be further set according to the file's digital signature, thereby making the recognition result more accurate.
  • the server when the server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to an attribute of a file with a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
  • Fig. 5 is a block diagram of a file recognition device according to one example of the present disclosure.
  • the device can run in a server side.
  • the device can be distributed in a cloud server.
  • The device is configured to run the file recognition method described in the example shown in Figs. 1-4. For convenience of description, only portions related to this example are shown.
  • The device includes: a database establishment unit 51 configured to establish a database according to scanned results reported by client terminals; the database recording a globally unique identifier (GUID) of each client terminal and checksums of scanned files reported by each of the client terminals extracted from the scanned results; each of the scanned files corresponding to one checksum; a query unit 52 configured to, according to the checksums of the scanned files, query the GUIDs of the reporting client terminals of each scanned file in the database, respectively; an obtaining unit 53 configured to, according to the queried GUIDs of the reporting client terminals of each scanned file, obtain a coexistence rate between a first scanned file and each second scanned file, respectively; a determination unit 54 configured to determine an attribute of the first scanned file according to attributes of the second scanned files each having a coexistence rate higher than a preset threshold.
  • the checksum includes MD5 checksum or Hash checksum.
  • The obtaining unit 53 includes: a first obtaining subunit configured to obtain the count of reporting client terminals of the first scanned file and take it as a first number; a second obtaining subunit configured to obtain the counts of reporting client terminals of each second scanned file and take them as second numbers; a first determination subunit configured to, according to the queried GUIDs of the reporting client terminals of each scanned file, obtain the counts of reporting client terminals which simultaneously report the first scanned file and the second scanned file and take them as third numbers; a second determination subunit configured to determine the coexistence rate according to the first number, the second numbers and the third numbers.
  • The second determination subunit is specifically configured to determine the coexistence rate according to a formula (the formula itself is not reproduced in this text), where A represents the coexistence rate; the formula also contains a constant, which can be determined by one skilled in the art according to the actual situation and, as an implementation example of the present disclosure, can have a value of 15; a represents the first number, b represents the second number, and d represents the third number.
  • the server when the server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to attributes of files each having a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
  • the computing device (such as a server or other computing device) includes a processor 60 and a memory 70.
  • the processor 60 and the memory 70 are connected with each other via an internal bus.
  • the memory 70 may be a non-transitory computer-readable storage medium, and stores units of machine readable instructions executable by the processor 60, including a database establishment unit 71, a query unit 72, an obtaining unit 73 and a determination unit 74.
  • Functions of the database establishment unit 71, the query unit 72, the obtaining unit 73 and the determination unit 74 are similar to the functions of the database establishment unit 51, the query unit 52, the obtaining unit 53 and the determination unit 54, respectively.
  • The functions may be implemented with the assistance of other modules, and may involve cooperation of multiple modules, e.g., may utilize processing functions of the processor 60, may rely on the internal bus for data transmission, etc.
  • The methods, units, and device described herein may be implemented by hardware, machine-readable instructions or a combination of hardware and machine-readable instructions.
  • Machine-readable instructions used in the examples disclosed herein may be stored in a storage medium readable by multiple processors, such as a hard drive, CD-ROM, DVD, compact disk, floppy disk, magnetic tape drive, ROM or other proper storage device. Alternatively, at least part of the machine-readable instructions may be substituted by specific-purpose hardware, such as custom integrated circuits, gate arrays, FPGAs, PLDs, specific-purpose computers and so on.
  • a machine-readable storage medium is also provided to store instructions to cause a machine to execute a process as described according to examples herein.
  • Also provided is a system or apparatus having a storage medium that stores machine-readable program codes for implementing functions of any of the above examples, and that may cause the system or the apparatus (or a processor such as a CPU or MPU) to read and execute the program codes stored in the storage medium.
  • the program codes read from the storage medium may implement any one of the above examples, thus the program codes and the storage medium storing the program codes are part of the technical scheme.
  • the storage medium for providing the program codes may include floppy disk, hard drive, magneto-optical disk, compact disk (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape drive, Flash card, ROM and so on.
  • the program code may be downloaded from a server computer via a communication network.
  • program codes implemented from a storage medium are written in a storage in an extension board inserted in the computer or in a storage in an extension unit connected to the computer.
  • a CPU in the extension board or the extension unit executes at least part of the operations according to the instructions based on the program codes to implement any of the above examples.
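Steps S202 and S401 to S404, together with the threshold-based determination of step S104 described in the list above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the source's own coexistence formula (including its constant with the example value 15) is not reproduced in this text, so the sketch substitutes a Jaccard-style ratio d / (a + b - d); the simple majority vote is a simplified stand-in for the KNN classification the text mentions; and every function name, the threshold 0.5 and the sample data are hypothetical. File names stand in for checksums.

```python
from collections import Counter, defaultdict


def deduplicate(log_rows):
    """Step S202 (sketch): collapse repeated (guid, checksum) log rows."""
    return set(log_rows)


def coexistence_rates(first, pairs):
    """Steps S401-S404 (sketch) for one first scanned file.

    ASSUMPTION: the rate is d / (a + b - d), a Jaccard-style ratio.
    This is NOT the formula of the source, which is not reproduced there.
    """
    guids_by_checksum = defaultdict(set)
    for guid, checksum in pairs:
        guids_by_checksum[checksum].add(guid)

    first_guids = guids_by_checksum[first]
    a = len(first_guids)                       # S401: first number
    rates = {}
    for second, second_guids in guids_by_checksum.items():
        if second == first:
            continue
        b = len(second_guids)                  # S402: second number
        d = len(first_guids & second_guids)    # S403: third number
        if d:                                  # skip files never seen together
            rates[second] = d / (a + b - d)    # S404 (assumed form)
    return rates


def classify(rates, known_attributes, threshold=0.5):
    """Step S104 (sketch): majority vote among known neighbours above threshold."""
    votes = Counter(
        known_attributes[name]
        for name, rate in rates.items()
        if rate > threshold and name in known_attributes
    )
    if not votes:
        return "grey"  # nothing decisive; the attribute stays unknown
    return votes.most_common(1)[0][0]


# Hypothetical log rows; ("g1", "B.dll") is deliberately duplicated.
rows = [
    ("g1", "A.exe"), ("g1", "B.exe"), ("g1", "B.dll"), ("g1", "B.dll"),
    ("g2", "B.exe"), ("g2", "B.dll"),
    ("g3", "B.exe"), ("g3", "notepad.exe"),
    ("g4", "notepad.exe"), ("g5", "notepad.exe"),
]
pairs = deduplicate(rows)
rates = coexistence_rates("B.exe", pairs)
verdict = classify(rates, {"B.dll": "black", "notepad.exe": "white"})
```

With this sample data, B.exe is reported by three machines and B.dll by two of the same machines, so B.dll is the only neighbour above the assumed threshold and its attribute decides the vote. A digital-signature filter of the kind the text suggests could be applied before the vote.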

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Disclosed is a file recognition method, device and server. The method includes: establishing a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; for each of the scanned files, querying GUID of the client terminals that report the scanned file from the database, respectively, according to the checksums of the scanned files; obtaining a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals, according to the queried out GUID; determining an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.

Description

METHOD AND APPARATUS FOR STORING WEBPAGE
ACCESS RECORDS
This application claims the benefit of priority from Chinese Patent Application, No. 201210440933.3, filed on November 7, 2012, the entire contents of which are hereby incorporated by reference.
Field of the Disclosure
The present disclosure relates to information security technology, and more particularly to a file recognition method, device and server.
Background
In the current cloud killing technology, when a client scans an unknown file or a file with suspicious behavior on a user's machine and the status information of this file is not in a file information database of a server, the client reports this file. A virtual data center (VDC) system of the server determines whether the attribute of this file is black file (virus file) or white file (safe file) according to the contents of this file or a series of identification logic such as program behavior. When the server returns the determination result to the client, the server also records the determination result in the file information database, so that the server can thereafter directly return the attribute of this file to any client which queries it.
However, when the server determines an unknown file or a file with suspicious behavior, the server can only perform a separate analysis and determination for each file. In the determination process, the server does not consider the parent-child relationship or dependency relationship between one file and other files. For example, a new virus parent file A.exe releases two virus progeny files, B.exe and B.dll, in one directory when it runs. B.dll is a virus module with harmful behavior, while B.exe itself has no substantially harmful behavior and is only responsible for running and loading B.dll after the system is started. After the above three virus files are captured by the client and reported by the client to the server, the server cannot learn the relationship among the three virus files; as a result, there is a great possibility that B.exe is identified as a safe white file. This reduces the server's recognition accuracy for files.
Summary
One example of the present disclosure provides a file recognition method, which can solve the problem that a server's recognition accuracy of files reported by a client terminal is low in the related art.
One example of the present disclosure is implemented as follows:
A file recognition method includes: establishing a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; for each of the scanned files, querying the GUIDs of the client terminals that report the scanned file from the database, respectively, according to the checksums of the scanned files; obtaining a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals, according to the queried GUIDs; determining an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
Another example of the present disclosure provides a file recognition device. The device includes: a database establishment unit configured to establish a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of the scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; a query unit configured to, according to the checksums of the scanned files, for each of the scanned files, query the GUIDs of the client terminals that report the scanned file from the database, respectively; an obtaining unit configured to, according to the queried GUIDs, obtain a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals; a determination unit configured to determine an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
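The database and query steps of the method summarized above can be illustrated with two in-memory indexes. This is a minimal sketch, assuming Python dictionaries in place of the server's database; the function names, the report layout and the sample data are hypothetical, and file names stand in for checksums.

```python
from collections import defaultdict


def build_indexes(reports):
    """Build both lookup directions from (guid, checksums) scan reports."""
    guids_by_checksum = defaultdict(set)   # checksum -> GUIDs that reported it
    checksums_by_guid = defaultdict(set)   # GUID -> checksums it reported
    for guid, checksums in reports:
        for checksum in checksums:
            guids_by_checksum[checksum].add(guid)
            checksums_by_guid[guid].add(checksum)
    return guids_by_checksum, checksums_by_guid


def co_reported_files(target, guids_by_checksum, checksums_by_guid):
    """Map each other file seen alongside `target` to the GUIDs that have both."""
    shared = defaultdict(set)
    for guid in guids_by_checksum.get(target, ()):
        for other in checksums_by_guid[guid]:
            if other != target:
                shared[other].add(guid)
    return dict(shared)


# Hypothetical scan reports from three client terminals.
reports = [
    ("guid-1", ["A.exe", "B.exe", "B.dll"]),
    ("guid-2", ["B.exe", "B.dll"]),
    ("guid-3", ["B.exe", "notepad.exe"]),
]
by_checksum, by_guid = build_indexes(reports)
shared = co_reported_files("B.exe", by_checksum, by_guid)
```

Keeping both directions indexed means the query for one file touches only the machines that reported it, rather than every stored report. The sizes of the sets in `shared` are the raw co-occurrence counts from which a coexistence rate could then be derived.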
Still another example of the present disclosure provides a server which includes the above file recognition device.
In one example of the present disclosure, when a server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to an attribute of a file with a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
Brief Description of Drawings
Fig. 1 is a flowchart of a file recognition method according to one example of the present disclosure;
Fig. 2 is a specific flow chart of a step S101 of the file recognition method according to one example of the present disclosure;
Fig. 3 is a schematic principle diagram of establishing database in the step S101 of the file recognition method according to one example of the present disclosure;
Fig. 4 is a specific flow chart of a step S103 of the file recognition method according to one example of the present disclosure;
Fig. 5 is a block diagram of a file recognition device according to one example of the present disclosure;
Fig. 6 is a block diagram of a computing device according to one example of the present disclosure.
Detailed Description
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. Throughout the present disclosure, the terms "a" and "an" are intended to denote at least one of a particular element. As used herein, the term "includes" means includes but not limited to, the term "including" means including but not limited to. The term "based on" means based at least in part on.
In one example of the present disclosure, when a server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to an attribute of a file with a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
Fig. 1 is a flowchart of a file recognition method according to one example of the present disclosure, and details are as follows:
Step S101: establishing a database according to scanned results reported by client terminals; the database recording a globally unique identifier (GUID) of each client terminal and checksums of scanned files reported by each client terminal, extracted from the scanned results, each of the scanned files corresponding to one checksum.
In this example, when receiving a scanned result reported by a client terminal which performs virus or Trojan killing, the server can extract, from the scanned result, the GUID of the client terminal which reports the scanned result and a checksum of each scanned file reported by the client terminal. The GUID, as a unique identifier of the client terminal, can be used to distinguish the client terminal from other client terminals, and can further be used to distinguish the computer equipment on which different client terminals reside. The checksum includes, but is not limited to, a Message-Digest Algorithm 5 (MD5) checksum or a hash checksum of the file, which can be used herein as a unique identifier for distinguishing different scanned files.
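As an illustration of the extraction described above, the following is a minimal Python sketch, not the implementation of the present disclosure. The function names `md5_checksum` and `extract_pairs` and the layout of the scanned result (a dictionary with `guid` and `files` keys) are assumptions made only for this example.

```python
import hashlib


def md5_checksum(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_pairs(scanned_result):
    """Yield (guid, checksum) pairs from one scanned result.

    The dictionary layout used here is hypothetical.
    """
    guid = scanned_result["guid"]
    for entry in scanned_result["files"]:
        yield guid, entry["checksum"]
```

In this sketch the client terminal would compute the checksum of each scanned file, and the server would reduce every scanned result to (GUID, checksum) pairs before storing them.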
In specific implementation, establishment of the database can refer to the flowchart shown in Fig. 2:
Step S201: obtaining and storing log information of each file scan performed by each client terminal.
Specifically, a bypass procedure can be deployed in the client terminal, and the bypass procedure can be configured to record log information of each scan of the client terminal and store the log information in a mass storage device such as a file transfer protocol (FTP) server. The log information includes the GUID of each client terminal, and the checksums and file attributes of all scanned files, such as portable executable (PE) structure information of the scanned files, path information of the scanned files in the user environment, attribute information of PE resources of the scanned files, or digital signatures of the scanned files; these are not enumerated here one by one. Meanwhile, the log information can also indicate the conditions under which the log information was generated, for example through a full scan or through scanning a specified location.
Step S202: after performing statistics and a duplicate removal process on the stored log information at a preset time point, extracting the GUID of each client terminal and a checksum of each scanned file reported by each client terminal from the log information, and establishing a database according to the extracted result.
Since each client terminal may perform file scanning several times in a short period, there is data duplication in the log information stored in the mass storage device. In this example, a fixed time point is set; at that time point, statistics and a duplicate removal process are performed on the stored log information, and then the GUID of each client terminal and the checksum of each scanned file reported by each client terminal are extracted from the log information, which completes the establishment of the database.
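The duplicate removal process of step S202 can be sketched as follows. This is a simplified illustration only: the record layout (a dictionary with `guid` and `checksums` keys) and the function name `deduplicate` are assumptions, and a real deployment would read the records from the mass storage device rather than from an in-memory list.

```python
def deduplicate(log_records):
    """Collapse repeated scans into unique (guid, checksum) pairs.

    Each record is assumed to hold the GUID of one client terminal and the
    checksums seen in one scan; repeated scans add nothing new to the set.
    """
    unique_pairs = set()
    for record in log_records:
        for checksum in record["checksums"]:
            unique_pairs.add((record["guid"], checksum))
    return unique_pairs
```

Using a set means that a file reported many times by the same client terminal is counted once, which is what the later counting steps rely on.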
In order to facilitate subsequent lookups, the extracted data can be stored in four key-value (K-V) NoSQL databases. The four databases can include: a checksum information database, a GUID information database, a checksum index database and a GUID index database. The principles for establishing the database are not intended to limit the present disclosure; as one implementation manner of establishing the database, details can be found in the schematic principle diagram shown in Fig. 3, and will not be repeated here.
Step S102: for each of the scanned files, querying the GUIDs of the client terminals that report the scanned file from the database, according to the checksums of the scanned files.
For ease of description, for each of the scanned files, the client terminals that report the scanned file can also be called reporting client terminals. In this example, the reporting client terminals of one scanned file are the client terminals which have reported the scanned file. Using the database established in the step S101, for each scanned file reported by the client terminals, the GUID of each client terminal which has reported the scanned file can be found, i.e., it can be learned on which computer equipment each scanned file exists.
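The lookup of step S102 can be pictured with two in-memory indexes. The sketch below is an assumption-laden illustration: it models the checksum index and the GUID index as Python dictionaries of sets, whereas the actual layout of the four databases is given by Fig. 3 and is not reproduced here. The names `build_indexes` and `reporting_guids` are hypothetical.

```python
from collections import defaultdict


def build_indexes(pairs):
    """Build checksum -> GUIDs and GUID -> checksums indexes from (guid, checksum) pairs."""
    checksum_index = defaultdict(set)
    guid_index = defaultdict(set)
    for guid, checksum in pairs:
        checksum_index[checksum].add(guid)
        guid_index[guid].add(checksum)
    return checksum_index, guid_index


def reporting_guids(checksum_index, checksum):
    """Return the GUIDs of the client terminals that reported the given checksum."""
    # .get avoids creating an empty entry for an unknown checksum.
    return checksum_index.get(checksum, set())
```

With such an index, the reporting client terminals of any scanned file are found by a single key lookup on its checksum.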
Step S103: according to the queried GUIDs of the reporting client terminals of each scanned file, obtaining a coexistence rate between a first scanned file and each second scanned file, respectively.
The first scanned file is the scanned file which currently needs to be recognized; it can be an unknown scanned file or a scanned file with suspicious behavior reported by the client terminal which currently reports a scanned result, and can also be a grey file which has been stored in the file information database of the server and has an unknown attribute. The second scanned files refer to the other scanned files reported by each of the client terminals. In this example, the coexistence rate of two scanned files can reflect the genetic relationship between the two scanned files, i.e., the higher the coexistence rate, the closer the genetic relationship between the two scanned files, and the closer their attributes may be; conversely, the lower the coexistence rate, the greater the possibility that there is no direct link between the two scanned files.
As one example of the present disclosure, the coexistence rate can be determined according to the number of computer equipment which simultaneously has the two scanned files, and can also be determined according to both the number of computer equipment which simultaneously has the two scanned files and the number of computer equipment which has one of the two scanned files. Preferably, Fig. 4 shows a specific flow chart of the step S103 of the file recognition method according to one example of the present disclosure, and details are as follows:
Step S401: obtaining a count number of reporting client terminals of the first scanned file and determining as a first number;
According to the GUIDs of the reporting client terminals of the first scanned file found in the step S102, the count number of reporting client terminals of the first scanned file can be determined, i.e., a first number of computer equipment which has the first scanned file.
Step S402: obtaining count numbers of reporting client terminals of each second scanned file and determining as second numbers.
According to the GUIDs of the reporting client terminals of each second scanned file found in the step S102, the count numbers of reporting client terminals of each second scanned file can be determined, i.e., second numbers of computer equipment which has each second scanned file.
Step S403: according to the queried GUIDs of the reporting client terminals of each scanned file, obtaining count numbers of reporting client terminals which simultaneously report the first scanned file and each second scanned file, and determining as third numbers.
Since each GUID uniquely identifies one client terminal, the client terminals which simultaneously report the first scanned file and each second scanned file can be identified by comparing the GUIDs of the reporting client terminals of the first scanned file with the GUIDs of the reporting client terminals of each second scanned file found in the step S102. Thus, the count numbers of reporting client terminals which simultaneously report the first scanned file and each second scanned file can be determined, i.e., third numbers of computer equipment which simultaneously has the first scanned file and the second scanned file.
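Steps S401 to S403 amount to two set sizes and one set intersection. The following Python sketch illustrates this under the assumption that the reporting client terminals are available as a mapping from checksum to a set of GUIDs; the function name `count_numbers` and that mapping are hypothetical.

```python
def count_numbers(checksum_index, first, second):
    """Return (a, b, d) for a first and a second scanned file.

    a: number of client terminals reporting the first scanned file
    b: number of client terminals reporting the second scanned file
    d: number of client terminals reporting both files
    """
    first_guids = checksum_index.get(first, set())
    second_guids = checksum_index.get(second, set())
    a = len(first_guids)
    b = len(second_guids)
    # GUIDs are unique per client terminal, so the intersection counts shared machines.
    d = len(first_guids & second_guids)
    return a, b, d
```

For example, if the first file is reported by terminals g1, g2 and g3 and the second file by g2, g3 and g4, the sketch yields a first number of 3, a second number of 3 and a third number of 2.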
Step S404: determining the coexistence rate between the first scanned file and each second scanned file according to the first number, the second numbers and the third numbers.
According to the above three numbers, the coexistence rate of the first scanned file and each second scanned file coexisting on the same machine can be calculated. As an implementation example of the present disclosure, the coexistence rate of the first scanned file and each second scanned file can be calculated through the following formula:
l + / * Vd ^a * (a + b - d)
where A represents the coexistence rate of the first scanned file and the second scanned file; l represents a constant and can be determined by one skilled in the art according to the actual situation; as an implementation example of the present disclosure, a value of l can be 15; a represents the first number, b represents the second number, and d represents the third number. In specific implementation, the calculation formula for the coexistence rate includes, but is not limited to, the above form, and this is not intended to limit the present disclosure.
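Since the calculation is stated not to be limited to one form, a simple alternative can illustrate how the three numbers combine. The sketch below uses the Jaccard index, d / (a + b - d), as a stand-in; it is not the formula of the present disclosure and does not use the constant l. The term a + b - d is the number of computer equipment having at least one of the two files. The function name `jaccard_coexistence` is hypothetical.

```python
def jaccard_coexistence(a, b, d):
    """Stand-in coexistence rate: shared machines divided by machines with either file.

    a, b: counts of client terminals reporting the first and second file
    d: count of client terminals reporting both files
    """
    union = a + b - d
    if union <= 0:
        # Neither file has been reported, so no coexistence can be measured.
        return 0.0
    return d / union
```

This stand-in ranges from 0, when the two files never appear on the same machine, to 1, when they always appear together; unlike a formula with a tunable constant, it gives no special treatment to files reported by only a few client terminals.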
Step S104: determining an attribute of the first scanned file according to attributes of the second scanned files each having a coexistence rate higher than a preset threshold.
In the example of the present disclosure, after obtaining a coexistence rate between the first scanned file and each second scanned file in the step S103, a plurality of second scanned files having the highest coexistence rates with the first scanned file can be determined by ranking the obtained coexistence rates in descending order, i.e., a plurality of second scanned files having the closest genetic relationship with the first scanned file can be determined, and then the attribute of the first scanned file can be determined according to the attributes of the plurality of second scanned files.
In one example of the present disclosure, determining an attribute of the first scanned file according to attributes of the second scanned files each having a coexistence rate higher than a preset threshold can specifically include: when it cannot be determined whether the first scanned file is a black file or a white file according to the contents of the first scanned file or a series of identification logic such as program behavior, the attribute of the first scanned file can be recognized according to the attribute distribution of the second scanned files each having a coexistence rate higher than the preset threshold, through a classification algorithm such as the k-Nearest Neighbor (KNN) classification algorithm. The above approach can help the server perform a grey removal process on files stored in the file information database, i.e., determining unknown grey files in the file information database to be known black files or white files.
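The threshold-and-vote idea can be sketched as a simplified KNN-style majority vote. This is an illustration under assumptions, not the classifier of the present disclosure: the function name `classify_by_neighbors`, the default `k=5`, and the choice to return "grey" on a tie or when no neighbor qualifies are all hypothetical.

```python
from collections import Counter


def classify_by_neighbors(rates, attributes, threshold, k=5):
    """Vote on the attribute of the first scanned file.

    rates: {checksum of second file: coexistence rate with the first file}
    attributes: {checksum of second file: "black" or "white"}
    """
    # Keep only known black/white files whose coexistence rate exceeds the threshold.
    candidates = [
        (rate, checksum)
        for checksum, rate in rates.items()
        if rate > threshold and attributes.get(checksum) in ("black", "white")
    ]
    # Rank in descending order of coexistence rate and keep the k closest files.
    candidates.sort(reverse=True)
    votes = Counter(attributes[checksum] for _, checksum in candidates[:k])
    if not votes:
        return "grey"
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "grey"  # tie: leave the file undetermined
    return ranked[0][0]
```

A grey file whose closest neighbors are mostly black files would be voted black in this sketch, and one with no sufficiently close neighbors would stay grey.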
As another example of the present disclosure, determining an attribute of the first scanned file according to attributes of the second scanned files each having a highest coexistence rate specifically can include: when determining whether the first scanned file is black file or white file according to contents of the first scanned file or a series of identification logic such as program behavior, taking the attribute distribution of the second scanned files each having a coexistence rate higher than the preset threshold as one of determination factors for accurately determining an attribute of the first scanned file in combination with the determination results of the identification logic.
In one example of the present disclosure, determining an attribute of the first scanned file according to attributes of the second scanned files can follow the following principles:
1. A high-scope white file certainly attracts white files. A high-scope file is a file whose number of reporting client terminals is very high, such as files of system software, commonly used software and other formal white files. Generally, the main program file of a widely used application program certainly has the highest coexistence rate with the related component files of that application program on the same machine; thus, a grey file having the highest coexistence rate with a white file is most likely to be a white file.
2. Most black files attract black files, but the probability that one black file attracts white files is small.
Based on the same theory as principle 1, a grey file having the highest coexistence rate with a black file is most likely to be a black file. It should be noted that some virus files may maliciously promote some normal application software, with the result that the normal application software is recognized as a black file due to having a high coexistence rate with the virus file; thus, in actual application, filtering rules can be further set according to the file's digital signature, thereby making the recognition result more accurate.
In one example of the present disclosure, when the server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to an attribute of a file with a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
Fig. 5 is a block diagram of a file recognition device according to one example of the present disclosure. The device can run on the server side. Preferably, the device can be distributed in a cloud server. The device is configured to run the file recognition method described in the example shown in Figs. 1-4. For convenience of description, only portions related to this example are shown.
Referring to Fig. 5, the device includes:
a database establishment unit 51 configured to establish a database according to scanned results reported by client terminals; the database recording a globally unique identifier (GUID) of each client terminal and checksums of scanned files reported by each of the client terminals extracted from the scanned results; each of the scanned files corresponding to one checksum;
a query unit 52 configured to, according to the checksums of the scanned files, query the GUIDs of the reporting client terminals of each scanned file in the database, respectively;
an obtaining unit 53 configured to, according to the queried GUIDs of the reporting client terminals of each scanned file, obtain a coexistence rate between a first scanned file and each second scanned file, respectively;
a determination unit 54 configured to determine an attribute of the first scanned file according to attributes of the second scanned files each having a coexistence rate higher than a preset threshold.
Preferably, the checksum includes an MD5 checksum or a hash checksum.
Preferably, the obtaining unit 53 includes:
a first obtaining subunit configured to obtain a count number of reporting client terminals of the first scanned file and determine as a first number;
a second obtaining subunit configured to obtain count numbers of reporting client terminals of each second scanned file and determine as second numbers;
a first determination subunit configured to, according to the queried GUIDs of the reporting client terminals of each scanned file, obtain count numbers of reporting client terminals which simultaneously report the first scanned file and the second scanned file and determine as third numbers;
a second determination subunit configured to determine the coexistence rate according to the first number, the second numbers and the third numbers.
Preferably, the second determination subunit is specifically configured to determine the coexistence rate according to a formula:
Figure imgf000012_0001
where A represents the coexistence rate; l represents a constant and can be determined by one skilled in the art according to the actual situation; as an implementation example of the present disclosure, a value of l can be 15; a represents the first number, b represents the second number, and d represents the third number.
In one example of the present disclosure, when the server recognizes an unknown scanned file or a scanned file with suspicious behavior reported by a client terminal, by examining a situation that the scanned file and other files coexist on a single machine, the server determines an attribute of the scanned file according to attributes of files each having a highest coexistence rate, thus, the server's recognition accuracy of the scanned file can be further improved and the client terminal's information security can be strengthened.
The above device can run in a computing device shown in Fig. 6. As shown in Fig. 6, the computing device (such as a server or other computing device) includes a processor 60 and a memory 70. The processor 60 and the memory 70 are connected with each other via an internal bus. The memory 70 may be a non-transitory computer-readable storage medium, and stores units of machine-readable instructions executable by the processor 60, including a database establishment unit 71, a query unit 72, an obtaining unit 73 and a determination unit 74. Functions of the database establishment unit 71, the query unit 72, the obtaining unit 73 and the determination unit 74 are similar to the functions of the database establishment unit 51, the query unit 52, the obtaining unit 53 and the determination unit 54, respectively. The functions may be implemented with the assistance of other modules, and may involve cooperation of multiple modules, e.g., may utilize processing functions of the processor 60, may rely on the internal bus for data transmission, etc.
The methods, units, and device described herein may be implemented by hardware, machine-readable instructions or a combination of hardware and machine-readable instructions. Machine-readable instructions used in the examples disclosed herein may be stored in a storage medium readable by multiple processors, such as a hard drive, CD-ROM, DVD, compact disk, floppy disk, magnetic tape drive, ROM or other proper storage device. Alternatively, at least part of the machine-readable instructions may be substituted by specific-purpose hardware, such as custom integrated circuits, gate arrays, FPGAs, PLDs, specific-purpose computers and so on.
A machine-readable storage medium is also provided to store instructions to cause a machine to execute a process as described according to examples herein. Specifically, a system or apparatus may be provided with a storage medium that stores machine-readable program codes for implementing functions of any of the above examples, and the system or apparatus (or a processor such as a CPU or MPU) may read and execute the program codes stored in the storage medium.
In this situation, the program codes read from the storage medium may implement any one of the above examples, thus the program codes and the storage medium storing the program codes are part of the technical scheme.
The storage medium for providing the program codes may include floppy disk, hard drive, magneto-optical disk, compact disk (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape drive, Flash card, ROM and so on. The program code may be downloaded from a server computer via a communication network.
It should be noted that, alternatively to the program codes being executed by a computer, at least part of the operations performed by the program codes may be implemented by an operating system running in a computer following instructions based on the program codes to implement any of the above examples.
In addition, the program codes read from a storage medium may be written into a storage in an extension board inserted in the computer or into a storage in an extension unit connected to the computer. In this example, a CPU in the extension board or the extension unit executes at least part of the operations according to the instructions based on the program codes to implement any of the above examples.
Although described specifically throughout the entirety of the instant disclosure, representative examples of the present disclosure have utility over a wide range of applications, and the above discussion is not intended and should not be construed to be limiting, but is offered as an illustrative discussion of aspects of the disclosure.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims— and their equivalents— in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

Claims

What is Claimed is:
1. A file recognition method comprising establishing a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; for each of the scanned files, querying GUID of the client terminals that report the scanned file from the database, respectively, according to the checksums of the scanned files; obtaining a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals, according to the queried out GUID; determining an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
2. The method of claim 1, wherein the checksum comprises a message-digest algorithm fifth edition (MD5) checksum or a hash checksum.
3. The method of claim 1, wherein the obtaining a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals, according to the queried out GUID, comprises: obtaining a count number of the client terminal reporting the one scanned file, and determining as a first number; obtaining count numbers of the client terminals reporting the each of the other scanned files reported by each of the client terminals, and determining as second numbers; obtaining count numbers of the client terminals which simultaneously report the one scanned file and the each of the other scanned files reported by each of the client terminals, according to the queried out GUID, and determining as third numbers; determining the coexistence rate between the one scanned file and each of the other scanned files reported by each of the client terminals according to the first number, the second numbers and the third numbers.
4. The method of claim 3, wherein the determining the coexistence rate between the one scanned file and each of the other scanned files reported by each of the client terminals according to the first number, the second numbers and the third numbers comprises: determining the coexistence rate according to a formula:
Figure imgf000016_0001
where, A represents the coexistence rate, l represents a constant, a represents the first number, b represents the second number, and d represents the third number.
5. A file recognition device comprising: a database establishment unit configured to establish a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of the scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; a query unit configured to, according to the checksums of the scanned files, for each of the scanned files, query GUID of the client terminals that report the scanned file from the database, respectively; an obtaining unit configured to, according to the queried out GUID, obtain a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals; a determination unit configured to, determine an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
6. The device of claim 5, wherein the checksum comprises a message-digest algorithm fifth edition (MD5) checksum or a hash checksum.
7. The device of claim 5, wherein the obtaining unit comprises: a first obtaining subunit configured to obtain a count number of the client terminal reporting the one scanned file and determine the obtained count number of the client terminal reporting the one scanned file as a first number; a second obtaining subunit configured to obtain count numbers of the client terminals reporting the each of the other scanned files reported by each of the client terminals, and determine the obtained count numbers of the client terminals reporting the each of the other scanned files reported by each of the client terminals as second numbers; a first determination subunit configured to, according to the queried out GUID, obtain count numbers of the client terminals which simultaneously report the one scanned file and the each of the other scanned files reported by each of the client terminals, and determine the obtained count numbers of the client terminals which simultaneously report the one scanned file and the each of the other scanned files reported by each of the client terminals as third numbers; a second determination subunit configured to determine the coexistence rate between the one scanned file and each of the other scanned files reported by each of the client terminals according to the first number, the second numbers and the third numbers.
8. The device of claim 7, wherein the second determination subunit is specifically configured to determine the coexistence rate between the one scanned file and each of the other scanned files reported by each of the client terminals according to a formula:
Figure imgf000017_0001
where, A represents the coexistence rate, l represents a constant, a represents the first number, b represents the second number, d represents the third number.
9. A server comprising a file recognition device; wherein the file recognition device comprises: a database establishment unit configured to establish a database according to scanned results reported by client terminals; wherein the database records a globally unique identifier (GUID) of each of the client terminals and checksums of the scanned files reported by each of the client terminals extracted from the scanned results; wherein each of the scanned files corresponds to one checksum; a query unit configured to, according to the checksums of the scanned files, for each of the scanned files, query GUID of the client terminals that report the scanned file from the database, respectively; an obtaining unit configured to, according to the queried out GUID, obtain a coexistence rate between one scanned file and each of the other scanned files reported by each of the client terminals; a determination unit configured to, determine an attribute of the one scanned file according to attributes of the scanned files each having a coexistence rate higher than a preset threshold.
10. The server of claim 9, wherein the checksum comprises a message-digest algorithm fifth edition (MD5) checksum or a hash checksum.
11. The server of claim 9, wherein the obtaining unit comprises: a first obtaining subunit configured to obtain a count number of the client terminal reporting the one scanned file and determine the obtained count number of the client terminal reporting the one scanned file as a first number; a second obtaining subunit configured to obtain count numbers of the client terminals reporting the each of the other scanned files reported by each of the client terminals, and determine the obtained count numbers of the client terminals reporting the each of the other scanned files reported by each of the client terminals as second numbers; a first determination subunit configured to, according to the queried out GUID, obtain count numbers of the client terminals which simultaneously report the one scanned file and the each of the other scanned files reported by each of the client terminals, and determine the obtained count numbers of the client terminals which simultaneously report the one scanned file and the each of the other scanned files reported by each of the client terminals as third numbers; a second determination subunit configured to determine the coexistence rate between the one scanned file and each of the other scanned files reported by each of the client terminals according to the first number, the second numbers and the third numbers.
12. The server of claim 11, wherein the second determination subunit is specifically configured to determine the coexistence rate between the one scanned file and each of the other scanned files reported by each of the client terminals according to a formula:
Figure imgf000019_0001
where, A represents the coexistence rate, l represents a constant, a represents the first number, b represents the second number, d represents the third number.
13. The server of claim 9, wherein the server comprises a cloud server.
PCT/CN2013/086663 2012-11-07 2013-11-07 Method and apparatus for storing webpage access records WO2014071850A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210440933.3A CN103812825B (en) 2012-11-07 2012-11-07 File identification method, device thereof and server
CN201210440933.3 2012-11-07

Publications (1)

Publication Number Publication Date
WO2014071850A1 true WO2014071850A1 (en) 2014-05-15

Family

ID=50684059

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/086663 WO2014071850A1 (en) 2012-11-07 2013-11-07 Method and apparatus for storing webpage access records

Country Status (2)

Country Link
CN (1) CN103812825B (en)
WO (1) WO2014071850A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040117401A1 (en) * 2002-12-17 2004-06-17 Hitachi, Ltd. Information processing system
CN101908116A (en) * 2010-08-05 2010-12-08 潘燕辉 Computer safeguard system and method
CN102713905A (en) * 2010-01-08 2012-10-03 瑞典爱立信有限公司 A method and apparatus for social tagging of media files

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424266B2 (en) * 2007-10-01 2016-08-23 Microsoft Technology Licensing, Llc Efficient file hash identifier computation

Also Published As

Publication number Publication date
CN103812825B (en) 2017-02-08
CN103812825A (en) 2014-05-21

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 13854001
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 23.09.2015)

122 Ep: pct application non-entry in european phase
Ref document number: 13854001
Country of ref document: EP
Kind code of ref document: A1