WO2018068664A1 - Network information identification method and apparatus - Google Patents

Network information identification method and apparatus

Info

Publication number
WO2018068664A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
similarity
network information
value
identified
Prior art date
Application number
PCT/CN2017/104275
Other languages
English (en)
French (fr)
Inventor
刘杰
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201610895856.9A (CN107741938A)
Priority claimed from CN201610956467.2A (CN107992501B)
Priority claimed from CN201610929276.7A (CN108024148B)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018068664A1
Priority to US16/026,786 (US10805255B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/52 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail for supporting social networking services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present application relates to the field of network applications, and in particular to a network information identification method and apparatus.
  • Some network information is real and contains no bad content, while other network information is false or contains bad content, such as pornographic or horror content.
  • The development of the network amplifies the influence of false information and of information containing inappropriate content. Ordinary users, with limited knowledge and information, often cannot identify such information.
  • The embodiments of the present invention provide a network information identification method and apparatus that can effectively identify specific information in a network.
  • an obtaining unit configured to obtain network information to be identified;
  • a calculating unit configured to calculate a similarity between the to-be-identified network information and trusted network information, recorded as a first similarity, and to calculate a similarity between the to-be-identified network information and non-trusted network information, recorded as a second similarity;
  • a determining unit configured to determine, according to the first similarity and the second similarity, whether the to-be-identified network information is trusted.
  • a network information identification method includes:
  • two adjacent segmented words are treated as a phrase, and the information type of each phrase is determined according to the information in the false information base and the real information base, where the information types include false information, real information, and unbiased information;
  • the information type of the target text is determined according to the statistical result.
  • a network information identification device includes:
  • a word segmentation unit configured to perform word segmentation on the target text to obtain the segmented words of the target text;
  • a first determining unit configured to treat, according to the order in which the segmented words occur in the target text, two adjacent segmented words as a phrase, and to determine the information type of each phrase according to the information in the false information base and the real information base, where the information types include false information, real information, and unbiased information;
  • a second determining unit configured to determine, according to the statistical result, an information type of the target text.
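  • The phrase-based identification above can be sketched as follows. This is only an illustrative sketch: the text is assumed to be already segmented into words, and both the per-phrase rule (membership in exactly one base) and the final statistic (comparing the counts of false and real phrases) are assumptions, since the rules themselves are not given in this passage. The names `adjacent_phrases`, `phrase_type`, and `text_type` are hypothetical.

```python
from collections import Counter


def adjacent_phrases(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each segmented word with its successor, in order of occurrence."""
    return list(zip(tokens, tokens[1:]))


def phrase_type(phrase, false_base: set, real_base: set) -> str:
    """Assumed rule: membership in exactly one base decides; otherwise unbiased."""
    in_false, in_real = phrase in false_base, phrase in real_base
    if in_false and not in_real:
        return "false"
    if in_real and not in_false:
        return "real"
    return "unbiased"


def text_type(tokens, false_base: set, real_base: set) -> str:
    """Assumed statistic: compare the counts of false and real phrases."""
    counts = Counter(
        phrase_type(p, false_base, real_base) for p in adjacent_phrases(tokens)
    )
    if counts["false"] > counts["real"]:
        return "false"
    if counts["real"] > counts["false"]:
        return "real"
    return "unbiased"


# Hypothetical false information base holding one known phrase.
false_base = {("cures", "cancer")}
real_base = set()
result = text_type(["garlic", "cures", "cancer", "fast"], false_base, real_base)
```

  • Keeping the per-phrase rule in its own function makes it easy to swap in a different rule without touching the counting step.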
  • a network information identification method includes:
  • in the process of playing the multimedia file, the portrait feature value and the first willing feature value of the viewing user are obtained, where the portrait feature value is used to identify the user's preference for specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • a network information identification device includes:
  • an obtaining unit configured to acquire, in the process of playing the multimedia file, the portrait feature value and the first willing feature value of the viewing user, where the portrait feature value is used to identify the user's preference for specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • a calculating unit configured to calculate, according to the portrait feature value and the first willing feature value, a probability that the multimedia file includes specific content
  • a detecting unit configured to determine whether the probability exceeds a preset value, and if yes, perform feature detection on the multimedia file
  • a determining unit configured to determine, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content.
  • The background server may automatically obtain the network information to be identified, and determine whether it is trusted according to the similarity between the network information to be identified and the trusted network information and the similarity between the network information to be identified and the non-trusted network information. That is, similarity is used to determine whether the network information to be identified is trusted, so specific network information, such as rumors, can be identified automatically and efficiently.
  • FIG. 1 is a schematic diagram of a scenario of a network information identification method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a network information identification method according to an embodiment of the present invention.
  • FIG. 3 is another schematic flowchart of a network information identification method according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a network information identifying apparatus according to an embodiment of the present invention.
  • FIG. 5 is another schematic structural diagram of a network information identifying apparatus according to an embodiment of the present invention.
  • FIG. 6 is a block diagram showing the hardware structure of a computer terminal that can be used to implement the social network information identification method of the embodiment of the present invention.
  • FIG. 7 is a flowchart of a method for identifying social network information according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of a method for identifying social network information according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of a method for determining a type of information to which a phrase belongs according to an embodiment of the present invention.
  • FIG. 10 is a flowchart of a method for processing social network information disclosed in an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a social network information identifying apparatus according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of a social network information identifying apparatus according to an embodiment of the present invention.
  • FIG. 13 is a schematic diagram of a social network information processing apparatus according to an embodiment of the present invention.
  • FIG. 14 is a structural block diagram of a computer terminal according to an embodiment of the present invention.
  • FIG. 15 is a block diagram showing the hardware structure of a computer terminal that can be used to implement a behavior-based multimedia file identification method according to an embodiment of the present invention.
  • FIG. 16 is a flowchart of a method for identifying a multimedia file based on behavior characteristics according to an embodiment of the present invention.
  • FIG. 17 is a flowchart of a method for identifying a multimedia file based on behavior characteristics according to an embodiment of the present invention.
  • FIG. 18 is a flowchart of a multimedia file processing method according to an embodiment of the present invention.
  • FIG. 19 is a schematic diagram of a behavior-based multimedia file identification apparatus according to an embodiment of the present invention.
  • FIG. 20 is a schematic diagram of a behavior-based multimedia file identification apparatus according to an embodiment of the present invention.
  • FIG. 21 is a schematic diagram of a multimedia file processing apparatus according to an embodiment of the present invention.
  • FIG. 22 is a block diagram showing the structure of a computer terminal according to an embodiment of the present invention.
  • The embodiments of the present invention provide a network information identification method and apparatus that can automatically and effectively identify rumors.
  • The network information identification method provided by the embodiment of the present invention can be implemented in a network information identification apparatus, which can be a background server.
  • A specific implementation scenario of the network information identification method in the embodiment of the present invention may be as shown in FIG. 1.
  • The server obtains network information to be identified, which may be information or comments published by a user on a social network (e.g., Weibo, QQ space).
  • If the network information to be identified is determined to be not trusted, the server may block it to prevent the rumor from continuing to propagate, or mark it as suspicious to alert users. That is, the embodiment of the present invention uses similarity to determine whether the network information to be identified is trusted, so rumors can be identified automatically and effectively.
  • the method of this embodiment includes the following steps:
  • Step 201: Obtain the network information to be identified.
  • The network information to be identified may be information or a message posted by a user on a social network (e.g., Weibo, QQ space).
  • the background server can acquire information or speech posted by the user, that is, obtain the network information to be identified.
  • Step 202: Calculate the similarity between the to-be-identified network information and the trusted network information, recorded as the first similarity, and calculate the similarity between the to-be-identified network information and the non-trusted network information, recorded as the second similarity;
  • the trusted network information and the non-trusted network information may be collected in advance, the trusted database is established according to the collected trusted network information, and the non-trusted database is established according to the collected non-trusted network information.
  • Trusted network information can be extracted from authoritative or trusted websites, such as Baidu Encyclopedia and Wikipedia, so the network information contained in the trusted database can be considered trusted. Non-trusted network information can currently be collected manually, and the network information contained in the non-trusted database can be considered untrustworthy.
  • the cosine theorem algorithm may be used to calculate the similarity between the network information to be identified and each trusted network information in the trusted database, where multiple similarity values may be obtained.
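  • The maximum of the calculated similarity values can be taken as the first similarity; that is, the first similarity is the similarity between the network information to be identified and the trusted network information in the trusted database that is most similar to it.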
  • the cosine theorem algorithm can be used to calculate the similarity between the network information to be identified and each non-trusted network information in the non-trusted database, where multiple similarity values can be obtained.
  • The maximum of the calculated similarity values can be taken as the second similarity; that is, the second similarity is the similarity between the network information to be identified and the non-trusted network information in the non-trusted database that is most similar to it.
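  • A minimal sketch of Step 202, assuming whitespace-separated words and in-memory lists standing in for the trusted and non-trusted databases. The function names `cosine_similarity` and `max_similarity` and the sample entries are hypothetical and are not taken from the application.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-frequency vectors of two texts."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a)
    norm = math.sqrt(sum(c * c for c in freq_a.values())) * math.sqrt(
        sum(c * c for c in freq_b.values())
    )
    return dot / norm if norm else 0.0


def max_similarity(candidate: str, database: list[str]) -> float:
    """Highest similarity between the candidate and any database entry."""
    return max(
        (cosine_similarity(candidate, entry) for entry in database), default=0.0
    )


# Hypothetical stand-ins for the trusted and non-trusted databases.
trusted_db = ["water boils at 100 degrees celsius at sea level"]
untrusted_db = ["drinking boiled water twice causes cancer"]

candidate = "boiled water causes cancer"
first_similarity = max_similarity(candidate, trusted_db)
second_similarity = max_similarity(candidate, untrusted_db)
```

  • For Chinese text, the whitespace split would have to be replaced by a real word segmenter; that step is outside the scope of this sketch.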
  • The cosine theorem algorithm can be used to calculate the similarity of two pieces of information. Of course, other algorithms, such as the edit distance algorithm, can also be used to calculate the similarity of two pieces of information; the specific algorithm is not limited here.
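  • As an illustration of an alternative to the cosine algorithm, the following sketch computes the edit (Levenshtein) distance between two strings and maps it to a similarity in [0, 1]. The normalization by the length of the longer string is an assumption made for this sketch, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

  • Because the result is also a value between 0 and 1, it could be passed to the same maximum and comparison steps as the cosine value.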
  • In the method described above, the first similarity and the second similarity are obtained by calculating, one by one, the similarity between the network information to be identified and each piece of network information in the trusted database and the non-trusted database. In practice, the first similarity and the second similarity may also be obtained in other ways. For example, using a keyword extraction method, the trusted network information having the same keyword as the network information to be identified is extracted from the trusted database, and its similarity to the network information to be identified is calculated and recorded as the first similarity; the non-trusted network information having the same keyword as the network information to be identified is extracted from the non-trusted database, and its similarity to the network information to be identified is calculated and recorded as the second similarity.
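  • The keyword-based shortcut can be sketched as a pre-filter applied to a database before any similarity is calculated. The stop-word list, the naive keyword extraction, and the rule "shares at least one keyword" are all assumptions made for illustration; the function names are hypothetical.

```python
# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def extract_keywords(text: str) -> set[str]:
    """Naive keyword extraction: lower-cased words that are not stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def entries_sharing_keywords(candidate: str, database: list[str]) -> list[str]:
    """Keep only the entries that share at least one keyword with the candidate."""
    keywords = extract_keywords(candidate)
    return [entry for entry in database if keywords & extract_keywords(entry)]


# Hypothetical database contents.
database = ["the moon is made of cheese", "water boils at sea level"]
matches = entries_sharing_keywords("cheese on the moon", database)
```

  • Only the entries in `matches` would then be compared with the network information to be identified, which avoids a one-by-one pass over the whole database.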
  • Step 203: Determine, according to the first similarity and the second similarity, whether the to-be-identified network information is trusted.
  • The first similarity and the second similarity may be compared. When the first similarity is greater than the second similarity, the network information to be identified is more similar to the trusted network information than to the non-trusted network information, so it can be determined that the network information to be identified is trusted; when the second similarity is greater than the first similarity, the network information to be identified is more similar to the non-trusted network information than to the trusted network information, so it can be determined that the network information to be identified is not trusted.
  • The above identification method uses both a trusted database and a non-trusted database. In practice, one of the databases can also be used alone to identify whether the network information is trusted.
  • For example, using only the trusted database, the first similarity is calculated by the cosine theorem algorithm, and it is determined whether the first similarity is greater than a first preset threshold (for example, 0.8). If it is greater, the network information to be identified is considered trusted; otherwise, it is considered not trusted.
  • Alternatively, using only the non-trusted database, the second similarity is calculated by the cosine theorem algorithm, and it is determined whether the second similarity is greater than a second preset threshold (for example, 0.9). If it is greater, the network information to be identified is considered not trusted; otherwise, it is considered trusted.
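  • The decision rules of Step 203 and of the single-database variants can be sketched as three small functions. The thresholds 0.8 and 0.9 are the example values from the text; the function names are hypothetical, and returning `None` when the two similarities are equal is an assumption, since the text does not cover that case.

```python
from typing import Optional


def judge_with_both(first_similarity: float, second_similarity: float) -> Optional[bool]:
    """Both databases: True = trusted, False = not trusted.

    The tie case is not specified in the text; returning None is an assumption.
    """
    if first_similarity > second_similarity:
        return True
    if second_similarity > first_similarity:
        return False
    return None


def judge_with_trusted_only(first_similarity: float, threshold: float = 0.8) -> bool:
    """Trusted database only: trusted when the first similarity exceeds the threshold."""
    return first_similarity > threshold


def judge_with_untrusted_only(second_similarity: float, threshold: float = 0.9) -> bool:
    """Non-trusted database only: trusted unless the second similarity exceeds the threshold."""
    return not second_similarity > threshold
```

  • Note that both threshold checks use a strict "greater than", so a value exactly equal to the threshold falls on the "not greater" side, as in the text.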
  • When it is determined that the network information to be identified is trusted, it may be allowed to be displayed on the social network; when it is determined that the network information to be identified is not trusted, processing measures may be taken to alert other users or prevent the rumor from spreading, for example, marking the network information to be identified as suspicious or blocking it.
  • The background server may automatically obtain the network information to be identified, and determine whether it is trusted according to the similarity between the network information to be identified and the trusted network information and the similarity between the network information to be identified and the non-trusted network information. That is, similarity is used to determine whether the network information to be identified is trusted, so rumors can be identified automatically and effectively.
  • The method described in Embodiment 1 is illustrated in further detail in this embodiment. As shown in FIG. 3, the method in this embodiment includes:
  • Step 301: Collect trusted network information and non-trusted network information.
  • The trusted network information can be extracted from an authoritative or trusted website, for example, from Baidu Encyclopedia and Wikipedia, and the non-trusted network information can be manually collected.
  • Step 302: Establish a trusted database according to the collected trusted network information, and establish a non-trusted database according to the collected non-trusted network information.
  • The trusted database contains multiple pieces of trusted network information, and the network information contained in the trusted database can be considered trusted; the non-trusted database contains multiple pieces of non-trusted network information, and the network information contained in the non-trusted database can be considered untrustworthy.
  • Step 303: Obtain the network information to be identified.
  • The network information to be identified may be information or a message posted by a user on a social network (e.g., Weibo, QQ space).
  • the background server can acquire information or speech posted by the user, that is, obtain the network information to be identified.
  • Step 304: Calculate the similarity between the network information to be identified and each piece of trusted network information in the trusted database, and record the maximum of the calculated similarity values as the first similarity;
  • the cosine theorem algorithm may be used to calculate the similarity between the network information to be identified and each trusted network information in the trusted database, where multiple similarity values may be obtained.
  • The maximum of the calculated similarity values can be taken as the first similarity; that is, the first similarity is the similarity between the network information to be identified and the trusted network information in the trusted database that is most similar to it.
  • Step 305: Calculate the similarity between the network information to be identified and each piece of non-trusted network information in the non-trusted database, and record the maximum of the calculated similarity values as the second similarity;
  • the cosine theorem algorithm can be used to calculate the similarity between the network information to be identified and each non-trusted network information in the non-trusted database, where multiple similarity values can be obtained.
  • The maximum of the calculated similarity values can be taken as the second similarity; that is, the second similarity is the similarity between the network information to be identified and the non-trusted network information in the non-trusted database that is most similar to it.
  • Information 1: Zhang San is a singer and an actor.
  • Information 2: Zhang San is not an actor, but a singer.
  • Step 1: Perform word segmentation;
  • Step 2: List all the words identified;
  • Step 3: Calculate the word frequency (here, the number of times a word appears in a piece of information);
  • Step 4: Construct the word frequency vectors;
  • The similarity of two vectors can be expressed by the size of the angle θ between them, specifically by the cosine of that angle. The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are. This is the "cosine similarity".
  • Step 5: Calculate the cosine of the angle between the two vectors:
  • cos θ = (1*1+2*2+0*1+2*2+1*1+1*1+0*0+1*1)/(sqrt(1^2+2^2+0^2+2^2+1^2+1^2+0^2+1^2)*sqrt(1^2+2^2+1^2+2^2+1^2+1^2+0^2+1^2)) = 12/(sqrt(12)*sqrt(13)) ≈ 0.961;
  • The similarity of the two pieces of information is 0.961; the value is close to 1, so the similarity is high.
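  • The arithmetic of this example can be checked with a short script. The two word-frequency vectors below are read off the products in the numerator of the formula; which word each component stands for is not stated in the text, so the vectors are used as plain numbers. The helper name `cosine` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vector_1 = [1, 2, 0, 2, 1, 1, 0, 1]  # word frequencies of the first sentence
vector_2 = [1, 2, 1, 2, 1, 1, 0, 1]  # word frequencies of the second sentence
similarity = cosine(vector_1, vector_2)
print(round(similarity, 3))  # 0.961
```

  • The dot product is 12 and the squared norms are 12 and 13, so the result is 12/sqrt(156), which rounds to 0.961.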
  • Steps 304 and 305 may be performed in any order.
  • The cosine theorem algorithm can be used to calculate the similarity of two pieces of information. Of course, other algorithms, such as the edit distance algorithm, can also be used to calculate the similarity of two pieces of information; the specific algorithm is not limited here.
  • In the method described in step 304 and step 305, the first similarity and the second similarity are obtained by calculating, one by one, the similarity between the network information to be identified and each piece of network information in the trusted database and the non-trusted database. In practice, the first similarity and the second similarity may also be obtained in other ways.
  • For example, using a keyword extraction method, the trusted network information having the same keyword as the network information to be identified is extracted from the trusted database, and its similarity to the network information to be identified is calculated and recorded as the first similarity;
  • the non-trusted network information having the same keyword as the network information to be identified is extracted from the non-trusted database, and its similarity to the network information to be identified is calculated and recorded as the second similarity.
  • Step 306: Determine whether the first similarity is greater than the second similarity. If the first similarity is greater than the second similarity, perform step 307; if the first similarity is less than the second similarity, perform step 308;
  • The first similarity and the second similarity may be compared. When the first similarity is greater than the second similarity, the network information to be identified is more similar to the trusted network information than to the non-trusted network information, so it can be determined that the network information to be identified is trusted; when the second similarity is greater than the first similarity, the network information to be identified is more similar to the non-trusted network information than to the trusted network information, so it can be determined that the network information to be identified is not trusted.
  • Step 307: Determine that the network information to be identified is trusted.
  • Step 308: Determine that the network information to be identified is not trusted.
  • When it is determined that the network information to be identified is trusted, it may be allowed to be displayed on the social network; when it is determined that the network information to be identified is not trusted, processing measures may be taken to alert other users or prevent the rumor from spreading, for example, marking the network information to be identified as suspicious or blocking it.
  • The above identification method uses both a trusted database and a non-trusted database. In practice, one of the databases can also be used alone to identify whether the network information is trusted. For example, using only the trusted database, the first similarity is calculated by the cosine theorem algorithm, and it is determined whether the first similarity is greater than a first preset threshold (for example, 0.8). If it is greater, the network information to be identified is considered trusted; otherwise, it is considered not trusted. Alternatively, using only the non-trusted database, the second similarity is calculated by the cosine theorem algorithm, and it is determined whether the second similarity is greater than a second preset threshold (for example, 0.9). If it is greater, the network information to be identified is considered not trusted; otherwise, it is considered trusted.
  • The background server may automatically obtain the network information to be identified, and determine whether it is trusted according to the similarity between the network information to be identified and the trusted network information and the similarity between the network information to be identified and the non-trusted network information. That is, similarity is used to determine whether the network information to be identified is trusted, so rumors can be identified automatically and effectively.
  • the embodiment of the present invention further provides a network information identifying apparatus.
  • the apparatus of this embodiment includes: an obtaining unit 401, a calculating unit 402, and a determining unit 403, as follows:
  • the obtaining unit 401 is configured to obtain network information to be identified.
  • The network information to be identified may be information or a message posted by a user on a social network (e.g., Weibo, QQ space).
  • the obtaining unit 401 can acquire information or a message posted by the user, that is, acquire the network information to be identified.
  • The calculating unit 402 is configured to calculate a similarity between the to-be-identified network information and the trusted network information, recorded as the first similarity, and to calculate a similarity between the to-be-identified network information and the non-trusted network information, recorded as the second similarity;
  • The network information identifying apparatus in this embodiment may further include a collecting unit and an establishing unit, where:
  • the collecting unit may collect trusted network information and non-trusted network information in advance, and the establishing unit may establish a trusted database according to the collected trusted network information, and establish an untrusted database according to the collected non-trusted network information.
  • Trusted network information can be extracted from authoritative or trusted websites, such as Baidu Encyclopedia and Wikipedia, so the network information contained in the trusted database can be considered trusted. Non-trusted network information can currently be collected manually, and the network information contained in the non-trusted database can be considered untrustworthy.
  • The calculating unit 402 can include a first calculating subunit and a second calculating subunit, where:
  • The first calculating subunit may use the cosine theorem algorithm to calculate the similarity between the network information to be identified and each piece of trusted network information in the trusted database, obtaining multiple similarity values. The larger the similarity value, the more similar the two pieces of information are. The first calculating subunit can take the maximum of the calculated similarity values as the first similarity; that is, the first similarity is the similarity between the network information to be identified and the trusted network information in the trusted database that is most similar to it.
  • The second calculating subunit may likewise use the cosine theorem algorithm to calculate the similarity between the network information to be identified and each piece of non-trusted network information in the non-trusted database, obtaining multiple similarity values. The larger the similarity value, the more similar the two pieces of information are. The second calculating subunit can take the maximum of the calculated similarity values as the second similarity; that is, the second similarity is the similarity between the network information to be identified and the non-trusted network information in the non-trusted database that is most similar to it.
  • The first calculating subunit and the second calculating subunit can use the cosine theorem algorithm to calculate the similarity between two pieces of information. Of course, other algorithms, such as the edit distance algorithm, can also be used to calculate the similarity of two pieces of information; the specific algorithm is not limited here.
  • In the method described above, the first similarity and the second similarity are obtained by calculating, one by one, the similarity between the network information to be identified and each piece of network information in the trusted database and the non-trusted database. In practice, the first similarity and the second similarity may also be obtained in other ways. For example, using a keyword extraction method, the trusted network information having the same keyword as the network information to be identified is extracted from the trusted database, and its similarity to the network information to be identified is calculated and recorded as the first similarity; the non-trusted network information having the same keyword as the network information to be identified is extracted from the non-trusted database, and its similarity to the network information to be identified is calculated and recorded as the second similarity.
  • the determining unit 403 is configured to determine, according to the first similarity and the second similarity, whether the to-be-identified network information is trusted.
  • the determining unit 403 may include a comparison subunit, a first determining subunit, and a second determining subunit, where:
  • the comparison subunit may compare the first similarity with the second similarity.
  • when the first similarity is greater than the second similarity, the network information to be identified is more similar to the trusted network information than to the non-trusted network information, so the first determining subunit may determine that the network information to be identified is trusted.
  • when the second similarity is greater than the first similarity, the network information to be identified is more similar to the non-trusted network information than to the trusted network information, so the second determining subunit may determine that the network information to be identified is not trusted.
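  • the comparison performed by the determining unit can be sketched in a few lines of Python. The text does not say what happens when the two similarities are equal, so returning `None` in that case is an assumption of this sketch, as is the function name.

```python
from typing import Optional


def is_trusted(first_similarity: float, second_similarity: float) -> Optional[bool]:
    """Decide trust by comparing similarity to trusted vs. non-trusted information."""
    if first_similarity > second_similarity:
        return True   # closer to the trusted database
    if second_similarity > first_similarity:
        return False  # closer to the non-trusted database
    return None       # tie: not covered by the source, left undecided here
```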
  • the above identification method uses both a trusted database and a non-trusted database.
  • one of the databases can also be used to identify whether the network information is trusted.
  • for example, when only the trusted database is used, the first similarity is calculated by the cosine theorem algorithm, and it is determined whether the first similarity is greater than a first preset threshold (for example, 0.8). If it is greater, the network information to be identified is considered trusted; if it is not greater, the network information to be identified is considered not trusted.
  • when only the non-trusted database is used, the second similarity is calculated by the cosine theorem algorithm, and it is determined whether the second similarity is greater than a second preset threshold (for example, 0.9). If it is greater, the network information to be identified is considered not trusted; if it is not greater, the network information to be identified is considered trusted.
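  • the two single-database variants reduce to threshold tests. The sketch below uses the example thresholds 0.8 and 0.9 from the text as defaults; the function names are hypothetical.

```python
def trusted_by_trusted_database(first_similarity: float,
                                threshold: float = 0.8) -> bool:
    """Trusted only if the best match in the trusted database exceeds the threshold."""
    return first_similarity > threshold


def trusted_by_untrusted_database(second_similarity: float,
                                  threshold: float = 0.9) -> bool:
    """Not trusted if the best match in the non-trusted database exceeds the threshold."""
    return not (second_similarity > threshold)
```

  • note that both tests use a strict "greater than", so a similarity exactly equal to the threshold is handled by the "not greater" branch.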
  • the network information identifying apparatus of this embodiment may further include a processing unit. When it is determined that the network information to be identified is trusted, the processing unit may allow the network information to be identified to be displayed on the social network; when it is determined that the network information to be identified is not trusted, the processing unit may take measures to alert other users or prevent the rumor from spreading. For example, the processing unit may mark the network information to be identified as suspicious or block it.
  • when the network information identifying apparatus provided by the foregoing embodiment performs network information identification, the division into the functional modules described above is only an example. In an actual application, the foregoing functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the network information identifying apparatus and the network information identifying method provided by the foregoing embodiments are in the same concept, and the specific implementation process is described in detail in the method embodiment, and details are not described herein again.
  • the acquiring unit may automatically obtain the network information to be identified; the calculating unit calculates the similarity between the network information to be identified and the trusted network information, and the similarity between the network information to be identified and the non-trusted network information; and the determining unit determines, according to the calculated similarities, whether the network information to be identified is trusted.
  • the similarity is used to determine whether the network information to be identified is trusted, and thus the rumor can be automatically and effectively identified.
  • An embodiment of the present invention further provides a network information identifying apparatus, as shown in FIG. 5, which shows a schematic structural diagram of an apparatus according to an embodiment of the present invention, specifically:
  • the apparatus may include a processor 501 with one or more processing cores, a memory 502 with one or more computer readable storage media, a radio frequency (RF) circuit 503, a power source 504, an input unit 505, a display unit 506, and other components.
  • Processor 501 is the control center of the device. It connects the various parts of the entire device using various interfaces and lines, and performs the various functions of the device and processes data by running or executing software programs and/or modules stored in memory 502 and recalling data stored in memory 502, thereby monitoring the device as a whole.
  • the processor 501 can include one or more processing cores; the processor 501 can integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, applications, etc., and the modem processor mainly handles wireless communications. It can be understood that the above modem processor may also not be integrated into the processor 501.
  • the memory 502 can be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by running software programs and modules stored in the memory 502.
  • the memory 502 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the device, etc.
  • memory 502 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 502 can also include a memory controller to provide processor 501 access to memory 502.
  • the RF circuit 503 can be used for receiving and transmitting signals during the process of transmitting and receiving information. Specifically, after receiving the downlink information of the base station, it is processed by one or more processors 501; in addition, the data related to the uplink is sent to the base station.
  • the RF circuit 503 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, etc.
  • RF circuit 503 can also communicate with the network and other devices via wireless communication.
  • the wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Message Service (SMS), etc.
  • the device also includes a power source 504 (such as a battery) that supplies power to the various components.
  • the power source 504 can be logically coupled to the processor 501 through a power management system to manage functions such as charging, discharging, and power management through the power management system.
  • the power supply 504 may also include any one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
  • the apparatus can also include an input unit 505 that can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function controls.
  • input unit 505 can include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also known as a touch screen or trackpad, collects touch operations by the user on or near it (such as operations performed by the user on or near the touch-sensitive surface with a finger, stylus, or any other suitable object or accessory), and drives the corresponding connected device according to a preset program.
  • the touch-sensitive surface can include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 501, and can receive commands from the processor 501 and execute them.
  • touch-sensitive surfaces can be implemented in a variety of types, including resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 505 can also include other input devices. Specifically, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
  • the apparatus can also include a display unit 506 that can be used to display information entered by the user or information provided to the user, as well as the various graphical user interfaces of the device, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display unit 506 can include a display panel, and the display panel can be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
  • the touch-sensitive surface may cover the display panel. When the touch-sensitive surface detects a touch operation on or near it, the operation is transmitted to the processor 501 to determine the type of the touch event, and the processor 501 then provides a corresponding visual output on the display panel according to the type of the touch event.
  • although the touch-sensitive surface and the display panel are implemented here as two separate components to provide the input and output functions, in some embodiments the touch-sensitive surface can be integrated with the display panel to implement the input and output functions.
  • the device may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
  • the processor 501 in the device loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, thereby implementing various functions, as follows:
  • the processor 501 may use a cosine theorem algorithm to calculate the similarity between the network information to be identified and the trusted network information, recorded as the first similarity, and use a cosine theorem algorithm to calculate the similarity between the network information to be identified and the non-trusted network information, recorded as the second similarity.
  • processor 501 is further configured to:
  • a trusted database is established according to the collected trusted network information, and an untrusted database is established according to the collected non-trusted network information.
  • the processor 501 may calculate the similarity between the network information to be identified and each piece of trusted network information in the trusted database, and record the maximum of the calculated similarities as the first similarity;
  • the processor 501 may calculate the similarity between the network information to be identified and each piece of non-trusted network information in the non-trusted database, and record the maximum of the calculated similarities as the second similarity.
  • the processor 501 can determine whether the network information to be identified is trusted according to the following manner:
  • the processor 501 may further mark the to-be-identified network information as suspicious or block the to-be-identified network information.
  • the device in this embodiment can automatically obtain the network information to be identified, calculate the similarity between the network information to be identified and the trusted network information, calculate the similarity between the network information to be identified and the non-trusted network information, and determine, according to the calculated similarities, whether the network information to be identified is trusted.
  • the apparatus of the embodiment can determine whether the network information to be identified is trusted by using the similarity, and thus can automatically and effectively identify the rumor.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division; in an actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, device, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes media that can store program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • this embodiment provides a social network information identification method. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, such as one running a set of computer executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than the one described herein.
  • FIG. 6 is a social network that can be used to implement an embodiment of the present invention.
  • computer terminal 600 can include one or more processors 602 (only one is shown; processor 602 can include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 604 for storing data, and a transmission device 606 for communication functions.
  • computer terminal 600 may also include more or fewer components than shown in FIG. 6, or have a different configuration than that shown in FIG. 6.
  • the memory 604 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the social network information identification method in the embodiment of the present invention. The processor 602 runs the software programs and modules stored in the memory 604 to perform various functional applications and data processing, that is, to implement the above-described social network information identification method.
  • Memory 604 can include high speed random access memory and can also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 604 can further include memory remotely located relative to processor 602, which can be connected to computer terminal 600 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Transmission device 606 is for receiving or transmitting data via a network.
  • specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 600.
  • the transmission device 606 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 606 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the present application provides a social network information identification method as shown in FIG. 7.
  • the method can be applied to a smart terminal device, and is executed by a processor in the smart terminal device, and the smart terminal device can be a smart phone, a tablet computer, or the like.
  • At least one application is installed in the smart terminal device.
  • the embodiment of the present invention does not limit the type of the application, and may be a system-based application or a software-based application.
  • FIG. 7 is a flowchart of a method for identifying a social network information according to Embodiment 1 of the present invention. As shown in FIG. 7, one solution of the method includes the following steps:
  • Step S701 performing word segmentation on the target text to obtain a word segmentation of the target text
  • Step S702 according to the appearance order of each participle in the target text, the adjacent two participles are taken as a phrase, and the information type of each phrase is determined according to the information in the false information base and the real information base, the information Types include false information, real information, and unbiased information;
  • Step S703 performing statistics on the information types of all phrases in the target text, and obtaining statistical results
  • Step S704 determining an information type of the target text according to the statistical result.
  • in step S702, determining the information type of each phrase according to the information in the false information base and the real information base includes:
  • C(W1) indicates the frequency at which the first participle in the phrase appears in the target text
  • C(W2) indicates the frequency at which the second participle in the phrase appears in the target text
  • C (W12) indicates the frequency at which the first participle and the second participle appear consecutively in the target text, and the first participle appears in the target text in a sequence earlier than the second participle;
  • Step 2: extract the associated value of the corresponding two participles in the false information base as the first associated value, and extract the associated value of the corresponding two participles in the real information base as the second associated value; determine the information type of the phrase according to the proximity of the associated value to the first associated value and to the second associated value. Specifically: calculate the difference between the associated value and the first associated value to obtain a first difference; calculate the difference between the associated value and the second associated value to obtain a second difference; and compare the absolute value of the first difference with the absolute value of the second difference. If the absolute value of the first difference is greater than the absolute value of the second difference, the information type of the phrase is determined to be real information.
  • if the absolute value of the first difference is less than the absolute value of the second difference, the information type of the phrase is determined to be false information; if the absolute value of the first difference is equal to the absolute value of the second difference, the information type of the phrase is determined to be unbiased information.
  • the embodiment of the present invention establishes a false information base and a real information base, analyzes the false information and the corresponding real information, and calculates the relevance of adjacent keywords in the false information and the relevance of adjacent keywords in the real information. The information type of adjacent keywords in the target text is determined by how close the relevance of those adjacent keywords in the target text is to each of the two, and the information type of the target text is then obtained by counting the information types of all adjacent keywords in the target text. False information on the network can thus be identified quickly with a simple algorithm, which can provide an important basis for network managers to respond quickly.
  • FIG. 8 is a flowchart of a method for identifying a social network information according to an embodiment of the present invention.
  • One solution of the method includes the following steps:
  • Step 1 Processing the false information samples in the false information base and the real information samples in the real information database.
  • False information samples in the false information base can be obtained by manual collection.
  • the real information samples in the real information database can be extracted from known knowledge bases (such as various encyclopedic knowledge).
  • the false information samples and the real information samples correspond to each other: when a false information sample is collected, the corresponding real information sample is looked up; the false information sample is stored in the false information base, and the real information sample is stored in the real information base.
  • the process of processing the information sample includes: segmenting the false information samples in the false information database, obtaining the word segmentation of the false information samples, and calculating the association between the adjacent two word segments according to the order of occurrence of each word segment in the false information sample. Value; the word segmentation processing is performed on the real information sample in the real information database, and the word segmentation of the real information sample is obtained, and the correlation value of the adjacent two word segments is calculated according to the order of occurrence of each word segment in the real information sample.
  • since the preprocessing process for the false information samples is the same as the preprocessing process for the real information samples, the following describes the preprocessing process by taking the false information samples as an example.
  • the preprocessing process for the fake information samples includes:
  • the word segmentation module is used to process the word segmentation of the false information sample to obtain the word segmentation result of the false information sample.
  • the false information samples are preprocessed to remove the stop words in them.
  • the stop words are collected manually and mainly include punctuation, pronouns, modal particles, auxiliary words, conjunctions, etc. These stop words generally have no special meaning on their own and are often combined with other words to form words or phrases.
  • the word segmentation can use the forward maximum matching algorithm, the inverse maximum matching algorithm, or the bidirectional maximum matching algorithm.
  • the forward maximum matching algorithm and the inverse maximum matching algorithm are common word segmentation methods, and their specific steps are not repeated here.
  • the bidirectional maximum matching algorithm is specifically: the false information sample is segmented with the forward maximum matching algorithm and with the inverse maximum matching algorithm respectively, and the word segmentation result of each is obtained.
  • the word segmentation result is input into the correlation calculation module, and the correlation between the adjacent two word segments is calculated according to the order of occurrence of each word segment in the false information sample, and the correlation values of the adjacent two segment words are obtained.
  • here, X(W) represents the associated value of two adjacent participles; C(W01) represents the frequency at which the first of the two participles appears in the false information sample; C(W02) represents the frequency at which the second of the two participles appears in the false information sample, where the first participle appears earlier than the second participle; and C(W) represents the frequency at which the first participle and the second participle appear consecutively in the false information sample.
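  • the three frequencies that feed the associated value can be counted directly from a word segmentation result. The sketch below only gathers the counts corresponding to C(W01), C(W02), and C(W); the formula that combines them into X(W) is not reproduced in this text, so it is deliberately left out. The function name is hypothetical.

```python
from collections import Counter


def adjacent_pair_counts(tokens: list[str]) -> tuple[Counter, Counter]:
    """Count single participles and ordered pairs of consecutive participles."""
    single_counts = Counter(tokens)                 # C(W01), C(W02)
    pair_counts = Counter(zip(tokens, tokens[1:]))  # C(W) for (first, second)
    return single_counts, pair_counts


tokens = ["mutton", "mung bean", "mutton", "mung bean", "poison"]
single_counts, pair_counts = adjacent_pair_counts(tokens)
```

  • the pair keys are ordered, so ("mutton", "mung bean") and ("mung bean", "mutton") are counted separately, which matches the requirement that the first participle appears earlier than the second.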
  • Step 2 Perform word segmentation on the target text to obtain the word segmentation of the target text.
  • Perform word segmentation on the target text to obtain the word segmentation of the target text including:
  • the target text is obtained;
  • the target text can be obtained from social application software. For example, microblog information is extracted from a microblog and used as the target text, or an official account article or a WeChat friend circle message is extracted from WeChat and used as the target text.
  • the target text is preprocessed to remove the stop words in the target text.
  • Stop words are collected manually and mainly include punctuation, pronouns, modal particles, auxiliary words, conjunctions, etc. These stop words generally have no special meaning on their own and are often combined with other words to form words or phrases; terms generally do not include stop words. Examples of stop words: "ah", "oh", "and", "of", "almost", "what", "I", "it", "we", etc.
  • word segmentation is performed on the target text by using a dictionary segmentation method to obtain a word segmentation of the target text.
  • the target text, with stop words removed, is segmented by the dictionary segmentation method.
  • the word segmentation can use the forward maximum matching algorithm, the inverse maximum matching algorithm, or the bidirectional maximum matching algorithm.
  • the forward maximum matching algorithm and the inverse maximum matching algorithm are commonly used word segmentation methods, and their specific steps are not repeated here.
  • the bidirectional maximum matching algorithm is specifically: the target text is segmented by the forward maximum matching algorithm and by the inverse maximum matching algorithm respectively, yielding two word segmentation results. If the numbers of words in the two results differ, the result with fewer words is taken as the final result; if the numbers of words are the same, either result is taken as the final result.
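  • the bidirectional maximum matching rule described above can be sketched as follows. The dictionary, the fallback to single characters for unknown text, and the choice of the forward result on a tie are assumptions of this sketch; the toy Latin-letter input is used only to keep the example easy to trace.

```python
def forward_max_match(text: str, dictionary: set, max_len: int) -> list:
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single characters always match
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text: str, dictionary: set, max_len: int) -> list:
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # restore reading order
    return tokens


def bidirectional_max_match(text: str, dictionary: set) -> list:
    """Run both scans and keep the result with fewer words."""
    max_len = max((len(word) for word in dictionary), default=1)
    forward = forward_max_match(text, dictionary, max_len)
    backward = reverse_max_match(text, dictionary, max_len)
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    return forward  # same count: either may be taken; forward is chosen here


result = bidirectional_max_match("abcde", {"ab", "abc", "cde"})
```

  • in this toy input the forward scan yields three pieces ("abc", "d", "e") while the reverse scan yields two ("ab", "cde"), so the reverse result is kept.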
  • the frequency of occurrence of each participle in the target text is counted, and the order of each participle in the text is forwardly sorted, and the frequency of occurrence of each participle in the false information sample is recorded, and a word segmentation result represented by the matrix is obtained.
  • Step 3 According to the order of appearance of each participle in the target text, the adjacent two participles are taken as a phrase, and the information type of each phrase is determined according to the information in the false information base and the real information base, and the information type includes false Information, real information, and unbiased information.
  • FIG. 9 is a flowchart of a method for determining a type of information to which a phrase belongs according to an embodiment of the present invention.
  • a method for determining a type of information to which a phrase belongs includes:
  • S902 Extract the associated value of the corresponding two participles in the fake information base as the first associated value; and extract the associated value of the corresponding two participles in the real information database as the second associated value.
  • S903 Determine, according to the proximity of the associated value to the first associated value and the second associated value, the information type of the phrase.
  • for example, suppose the associated value of the two adjacent words "mutton" and "mung bean" in the target text is 4, the associated value of the corresponding two words "mutton" and "mung bean" in the false information base is 1, and the associated value of the corresponding two words in the real information base is 3. Then 1 is used as the first associated value and 3 as the second associated value. The absolute value of the first difference is |4 - 1| = 3 and the absolute value of the second difference is |4 - 3| = 1; since the former is greater, the information type of the phrase ("mutton" and "mung bean") is determined to be real information.
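  • the proximity rule and the worked example above can be expressed as a short Python function. The function name and the string labels are illustrative assumptions; the comparison logic follows the rule as stated in the text.

```python
def classify_phrase(assoc: float, first_assoc: float, second_assoc: float) -> str:
    """Classify a phrase by which library's associated value it is closer to.

    first_assoc comes from the false information base,
    second_assoc from the real information base.
    """
    first_diff = abs(assoc - first_assoc)
    second_diff = abs(assoc - second_assoc)
    if first_diff > second_diff:
        return "real"      # farther from the false library, closer to the real one
    if first_diff < second_diff:
        return "false"     # closer to the false library
    return "unbiased"      # equally close to both


phrase_type = classify_phrase(4, 1, 3)  # the "mutton" / "mung bean" example
```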
  • Step 4 Statistics are performed on the information types of all phrases in the target text to obtain statistical results.
  • the step includes: obtaining information types of all phrases in the target text; counting the frequency of occurrence of each information type, and obtaining statistical results.
  • Step 5 Determine the information type of the target text according to the statistical result.
  • the information type of the target text including:
  • the type is unbiased information.
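  • Steps 4 and 5 can be sketched as a frequency count followed by a decision. The counting follows the text; the decision rule, however, is not fully given here, so the majority rule below (more false phrases than real ones means false information, the reverse means real information, and a tie means unbiased information) is only an assumed placeholder, as are the function names and labels.

```python
from collections import Counter


def count_information_types(phrase_types: list) -> Counter:
    """Step 4: count how often each information type occurs among the phrases."""
    return Counter(phrase_types)


def decide_text_type(stats: Counter) -> str:
    """Step 5 (assumed rule): compare the counts of false and real phrases."""
    false_count = stats.get("false", 0)
    real_count = stats.get("real", 0)
    if false_count > real_count:
        return "false"
    if real_count > false_count:
        return "real"
    return "unbiased"


stats = count_information_types(["false", "false", "real", "unbiased"])
text_type = decide_text_type(stats)
```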
  • the method according to the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the embodiments of the present invention may, in essence, be embodied in the form of a software product. The software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or a CD-ROM) and includes a number of instructions to enable a terminal device (which can be a mobile phone, computer, server, or network device, etc.) to execute the program
  • a terminal device which can be a mobile phone, computer, server, or network device, etc.
  • FIG. 10 is a flowchart of a method for processing social network information according to an embodiment of the present invention.
  • One solution of the method includes the following steps:
  • S1001 Perform word segmentation on the target text to obtain a word segmentation of the target text
  • S1003 Perform statistics on the information types of all phrases in the target text to obtain statistical results
  • S1004 Determine, according to the statistical result, an information type of the target text.
  • S1005 Process the target text according to the information type of the target text.
  • the processing the target text according to the information type of the target text comprises: deleting the target text in the social network if the information type of the target text is false information.
  • The target text can be obtained from social application software. For example, microblog information can be extracted from a microblog and used as the target text, or an official account article or a WeChat Moments message can be extracted from WeChat and used as the target text.
  • When the information type of the target text is false information, the corresponding target text in the social network is deleted. For example, if the target text is a WeChat Moments message and it is determined to be false information, the network administrator may be notified to process the message manually, or the Moments message may be deleted automatically.
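The handling in S1005 can be sketched as a small dispatch function. This is only an illustration: the callbacks `delete` and `notify_admin`, the `auto_delete` flag, and the returned status strings are hypothetical names introduced here, standing in for whatever deletion and notification mechanisms the social network provides.

```python
from typing import Callable

def process_target_text(text_id: str, info_type: str,
                        delete: Callable[[str], None],
                        notify_admin: Callable[[str], None],
                        auto_delete: bool = True) -> str:
    """Act on a target text once its information type is known."""
    if info_type != "false":
        return "kept"            # real or unbiased text is left in place
    if auto_delete:
        delete(text_id)          # remove the false message automatically
        return "deleted"
    notify_admin(text_id)        # otherwise hand it to a network administrator
    return "notified"
```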
  • This embodiment implements a simple algorithm to quickly identify false information on the network, which can provide an important basis for network administrators to respond quickly. It is convenient for network administrators to timely process network false information and reduce or avoid the adverse effects caused by false information transmission.
  • the apparatus includes a word segmentation unit 1110, a first determination unit 1120, a statistics unit 1130, and a second determination unit 1140.
  • a word segmentation unit 1110 configured to perform word segmentation on the target text to obtain a word segmentation of the target text
  • the first determining unit 1120 is configured to use the adjacent two participles as a phrase according to the order of appearance of each participle in the target text, and determine the information type of each phrase according to the information in the false information base and the real information base.
  • the types of information include false information, real information, and unbiased information;
  • the statistic unit 1130 is configured to perform statistics on information types of all phrases in the target text to obtain statistical results
  • the second determining unit 1140 is configured to determine an information type of the target text according to the statistical result.
  • the word segmentation unit 1110 is configured to perform step S701 in the embodiment of the present invention
  • the first determining unit 1120 is configured to perform step S702 in the embodiment of the present invention
  • the statistics unit 1130 is configured to perform step S703 in the embodiment of the present invention
  • the second determining unit 1140 is configured to perform step S704 in the embodiment of the present invention.
  • the word segmentation unit 1210 includes a first acquisition subunit 12101, a processing subunit 12102, and a word segmentation subunit 12103.
  • a first obtaining subunit 12101 configured to acquire target text
  • the processing sub-unit 12102 is configured to preprocess the target text to remove the stop words in the target text
  • the word segmentation sub-unit 12103 is configured to perform word segmentation processing on the target text processed by the processing sub-unit by using a dictionary segmentation method to obtain a word segmentation of the target text.
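The preprocessing and segmentation performed by subunits 12101 to 12103 can be sketched as below. The text only says that a dictionary segmentation method is used; forward maximum matching is chosen here as one common dictionary-based method, which is an assumption. The function name `segment`, the `max_len` parameter, and the sample dictionary and stop-word set are illustrative only.

```python
def segment(text: str, dictionary: set, stop_words: set, max_len: int = 4) -> list:
    """Remove stop words, then split the text by forward maximum matching."""
    # Preprocessing: strip stop words from the target text.
    for word in stop_words:
        text = text.replace(word, "")
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Illustrative data: 羊肉 (mutton), 绿豆 (mung bean), 和 (and, treated as a stop word).
words = segment("羊肉和绿豆", {"羊肉", "绿豆"}, {"和"})
```

Note that deleting stop words by plain string replacement can join neighbouring characters into an unintended dictionary match; a production system would likely handle this more carefully.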
  • the first determining unit 1220 includes a calculating subunit 12201, an extracting subunit 12202, and a determining subunit 12203.
  • a calculating subunit 12201 configured to calculate an associated value of two participles in each phrase
  • the extracting sub-unit 12202 is configured to extract the associated value of the corresponding two participles in the false information base as the first associated value, and to extract the associated value of the corresponding two participles in the real information base as the second associated value;
  • the determining subunit 12203 is configured to determine an information type of the phrase according to the proximity of the associated value to the first associated value and the second associated value, respectively.
  • the determining subunit 12203 includes a calculating module 122031 and a determining module 122032.
  • a calculation module 122031 configured to calculate a difference between the associated value and the first associated value, to obtain a first difference, and calculate a difference between the associated value and the second associated value to obtain a second difference;
  • the determining module 122032 is configured to compare the absolute value of the first difference with the absolute value of the second difference; if the absolute value of the first difference is greater than the absolute value of the second difference, the information type of the phrase is determined to be real information; if the absolute value of the first difference is less than the absolute value of the second difference, the information type of the phrase is determined to be false information; and if the absolute value of the first difference is equal to the absolute value of the second difference, the information type of the phrase is determined to be unbiased information.
  • the statistics unit 1230 includes:
  • a second obtaining subunit 12301 configured to acquire information types of all phrases in the target text
  • the statistics subunit 12302 is configured to count the frequency of occurrence of each information type, and obtain a statistical result
  • the second determining unit 1240 is specifically configured to compare the frequency of occurrence of the false information and the real information, and determine the information type with the higher frequency of occurrence as the information type of the target text, if the frequency of occurrence of the false information and the appearance of the real information If the frequency is the same, it is determined that the information type of the target text is unbiased information.
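The frequency comparison carried out by the statistics unit and the second determining unit can be sketched as follows. The function name `classify_text` and the string labels are assumptions for the example; the decision rule (the more frequent of false and real wins, a tie gives unbiased) is the one stated above.

```python
from collections import Counter

def classify_text(phrase_types: list) -> str:
    """Decide the information type of a text from the types of its phrases."""
    counts = Counter(phrase_types)          # frequency of each information type
    false_count = counts["false"]
    real_count = counts["real"]
    if false_count > real_count:
        return "false"
    if real_count > false_count:
        return "real"
    return "unbiased"                       # equal frequencies
```

Phrases labeled unbiased are counted but do not affect the outcome, since only the false and real frequencies are compared.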
  • the apparatus further includes a pre-processing unit and a storage unit.
  • the pre-processing unit is configured to perform word segmentation processing on the false information samples in the false information base to obtain the participles of the false information samples, and to calculate the associated value of each two adjacent participles according to the order of appearance of the participles in the false information sample; it is also configured to perform word segmentation processing on the real information samples in the real information base to obtain the participles of the real information samples, and to calculate the associated value of each two adjacent participles according to the order of appearance of the participles in the real information sample;
  • the storage unit includes a first storage module and a second storage module, where the first storage module is configured to store the associated values obtained by preprocessing the false information samples together with the corresponding participles, and the second storage module is configured to store the associated values obtained by preprocessing the real information samples together with the corresponding participles.
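The preprocessing and storage described above can be sketched as building one lookup table per information base. This section does not define how the associated value is computed, so the sketch assumes, purely for illustration, that it is the number of times two participles appear next to each other in the samples. The function name `build_association_store` and the sample data are hypothetical.

```python
from collections import Counter

def build_association_store(segmented_samples: list) -> Counter:
    """Map each pair of adjacent participles to an associated value.

    Assumption: the associated value is the adjacency count across all samples.
    """
    store = Counter()
    for tokens in segmented_samples:
        # Pair each participle with the one that follows it, in order of appearance.
        for left, right in zip(tokens, tokens[1:]):
            store[(left, right)] += 1
    return store

# One store per information base, as with the first and second storage modules.
false_store = build_association_store([["mutton", "mung bean", "poison"]])
real_store = build_association_store([["mutton", "mung bean"], ["mutton", "mung bean", "soup"]])
```

A pair that never occurs in a base simply looks up as 0 in the `Counter`.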
  • The associated value of the two participles in each phrase is calculated and compared with the associated values of the corresponding two participles in the false information base and the real information base. The information type of each phrase in the target text is determined according to the proximity of the associated values, and the information type of the target text is then determined by counting the information types of all phrases in the target text. A relatively simple algorithm is thus used to identify false information on the network, which can provide an important basis for network administrators to respond quickly, so that they can deal with false information on the network in time and reduce the adverse effects caused by its spread.
  • the apparatus includes a word segmentation unit 1310, a first determination unit 1320, a statistics unit 1330, a second determination unit 1340, and a processing unit 1350.
  • a word segmentation unit 1310 configured to perform word segmentation on the target text to obtain a word segmentation of the target text
  • the first determining unit 1320 is configured to take each two adjacent participles as a phrase according to the order of appearance of the participles in the target text, and to determine the information type of each phrase according to the information in the false information base and the real information base, where the information types include false information, real information, and unbiased information;
  • the statistic unit 1330 is configured to perform statistics on information types of all phrases in the target text to obtain statistical results.
  • a second determining unit 1340 configured to determine, according to the statistical result, an information type of the target text
  • the processing unit 1350 is configured to process the target text according to the information type of the target text.
  • the word segmentation unit 1310 is configured to perform step S1001 in the embodiment of the present invention
  • the first determining unit 1320 is configured to perform step S1002 in the embodiment of the present invention
  • the statistics unit 1330 is configured to perform step S1003 in the embodiment of the present invention
  • the second determining unit 1340 is configured to perform step S1004 in the embodiment of the present invention
  • the processing unit 1350 is configured to perform step S1005 in the embodiment of the present invention.
  • the processing unit 1350 is specifically configured to: when the second determining unit determines that the information type of the target text is false information, delete the target text in the social network.
  • Embodiments of the present invention also provide a storage medium.
  • the storage medium may be used to save the program code executed by the social network information identification method of the above embodiment.
  • the storage medium may be located in at least one of a plurality of network devices of the computer network.
  • the storage medium is arranged to store program code for performing the following steps:
  • the target text is processed by word segmentation to obtain the word segmentation of the target text.
  • the adjacent two participles are taken as a phrase, and the information type of each phrase is determined according to the information in the false information base and the real information base, where the information types include false information, real information, and unbiased information.
  • the third step is to count the information types of all phrases in the target text to obtain statistical results.
  • the information type of the target text is determined according to the statistical result.
  • the storage medium is further configured to store program code for performing the following steps: acquiring the target text; preprocessing the target text to remove stop words in the target text; and performing word segmentation on the target text by a dictionary segmentation method to obtain the participles of the target text.
  • the storage medium is further configured to store program code for performing the following steps: calculating the associated value of the two participles in each phrase; extracting the associated value of the corresponding two participles in the false information base as the first associated value; extracting the associated value of the corresponding two participles in the real information base as the second associated value; and determining the information type of the phrase according to the proximity of the associated value to the first associated value and the second associated value, respectively.
  • the storage medium is further configured to store program code for performing the following steps: calculating the difference between the associated value and the first associated value to obtain a first difference; calculating the difference between the associated value and the second associated value to obtain a second difference; and comparing the absolute value of the first difference with the absolute value of the second difference, where the information type of the phrase is determined to be real information if the absolute value of the first difference is greater than the absolute value of the second difference, false information if the absolute value of the first difference is less than the absolute value of the second difference, and unbiased information if the two absolute values are equal.
  • the storage medium is also arranged to store program code for performing the following steps: obtaining the type of information of all phrases in the target text; counting the frequency of occurrence of each type of information, and obtaining statistical results.
  • the storage medium is further configured to store program code for performing the following steps: comparing the frequency of occurrence of the false information with that of the real information, and determining the information type with the higher frequency as the information type of the target text; if the frequency of occurrence of the false information and the frequency of occurrence of the real information are the same, the information type of the target text is determined to be unbiased information.
  • the storage medium is further configured to store program code for performing the following steps: performing word segmentation processing on the false information samples in the fake information base, obtaining word segmentation of the false information samples, and calculating according to the order of occurrence of each participle in the false information sample.
  • The foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media that can store program code.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be used to save program code executed by a social network information processing method of the above embodiment.
  • the storage medium may be located in at least one of a plurality of network devices of the computer network.
  • the storage medium is arranged to store program code for performing the following steps:
  • the first step is to perform word segmentation on the target text to obtain the word segmentation of the target text
  • the adjacent two participles are taken as a phrase, and the information type of each phrase is determined according to the information in the false information base and the real information base, where the information types include false information, real information, and unbiased information;
  • the third step is to count the information types of all phrases in the target text to obtain statistical results
  • the fourth step is to determine the type of information of the target text according to the statistical result
  • the target text is processed according to the information type of the target text.
  • the storage medium is also arranged to store program code for performing the step of deleting the target text in the social network when the information type of the target text is false information.
  • An embodiment of the present invention further provides a computer terminal, which may be any computer terminal device in a computer terminal group.
  • the computer terminal may be replaced with a terminal device such as a mobile terminal.
  • the computer terminal may be located in at least one of a plurality of network devices of the computer network.
  • FIG. 14 is a block diagram showing the structure of a computer terminal according to an embodiment of the present invention.
  • the computer terminal A may include one or more (only one shown in the figure) processor 1401, memory 1403, and transmission device 1405.
  • the memory 1403 can be used to store software programs and modules, such as the social network information identification method and the program instructions/modules corresponding to the device in the embodiment of the present invention.
  • the processor 1401 runs the software programs and modules stored in the memory 1403, thereby performing various functional applications and data processing, that is, implementing the above social network information identification.
  • Memory 1403 can include high speed random access memory, and can also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 1403 can further include memory remotely located relative to processor 1401, which can be connected to computer terminal A via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the above transmission device 1405 is for receiving or transmitting data via a network.
  • Specific examples of the above network may include a wired network and a wireless network.
  • transmission device 1405 includes a network adapter that can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network.
  • transmission device 1405 is a radio frequency module that is used to communicate wirelessly with the Internet.
  • the memory 1403 is configured to store preset action conditions and information of the preset rights user, and an application.
  • the processor 1401 can call the information and the application stored in the memory 1403 through the transmission device to perform the following steps:
  • the target text is processed by word segmentation to obtain the word segmentation of the target text.
  • the adjacent two participles are taken as a phrase, and the information type of each phrase is determined according to the information in the false information base and the real information base, where the information types include false information, real information, and unbiased information.
  • the third step is to count the information types of all phrases in the target text to obtain statistical results.
  • the information type of the target text is determined according to the statistical result.
  • This embodiment provides a multimedia file recognition method based on behavior characteristics. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one described herein.
  • FIG. 15 is a hardware structural block diagram of a computer terminal that can be used to implement the behavior feature-based multimedia file identification method of the embodiment of the present invention.
  • computer terminal 1500 can include one or more (only one shown) processors 1502 (processor 1502 can include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1504 for storing data, and a transmission device 1506 for communication functions.
  • computer terminal 1500 may also include more or fewer components than shown in FIG. 15, or have a configuration different from that shown in FIG. 15.
  • the memory 1504 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the behavior-feature-based multimedia file recognition method in the embodiment of the present invention. The processor 1502 runs the software programs and modules stored in the memory 1504, thereby performing various functional applications and data processing, that is, implementing the above behavior-feature-based multimedia file recognition method.
  • Memory 1504 can include high speed random access memory, and can also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 1504 can further include memory remotely located relative to processor 1502, which can be connected to computer terminal 1500 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Transmission device 1506 is for receiving or transmitting data via a network.
  • Specific examples of the above network may include a wireless network provided by the communication provider of computer terminal 1500.
  • the transmission device 1506 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 1506 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the embodiment of the present application provides a multimedia file recognition method based on behavior characteristics as shown in FIG. 16.
  • the method can be applied to a smart terminal device, and is executed by a processor in the smart terminal device, and the smart terminal device can be a smart phone, a tablet computer, or the like.
  • At least one application is installed in the smart terminal device.
  • the embodiment of the present invention does not limit the type of the application, and may be a system-based application or a software-based application.
  • FIG. 16 is a flowchart of a method for identifying a multimedia file based on behavior characteristics according to a first embodiment of the present invention. As shown in FIG. 16, one solution of the method includes the following steps:
  • step S1601 during the multimedia file playing process, the portrait feature value and the first willing feature value of the viewer user are obtained, and the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user.
  • Step S1602 Calculate, according to the portrait feature value and the first willing feature value, a probability that the multimedia file includes specific content
  • Step S1603 determining whether the probability exceeds a preset value, and if yes, performing feature detection on the multimedia file;
  • Step S1604 Determine, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content.
  • the calculating, according to the portrait feature value and the first willing feature value, the probability that the multimedia file includes the specific content includes:
  • The embodiment of the present invention analyzes the association between a user's online behavior and the viewing of specific content. During playback of a multimedia file, it obtains the portrait feature value of each viewer user and the first willing feature value, which indicates the user's willingness to view the specific content within a preset time. The probability that the multimedia file includes the specific content is then calculated from the portrait feature value and the first willing feature value of each user, and the probability is compared with the preset value to determine whether the multimedia file needs further detection.
  • User behavior characteristics are thus used to help screen out the multimedia files to be analyzed, and specific-content detection is performed only on the screened multimedia files, which improves the efficiency and accuracy of recognizing multimedia files of the specific content.
  • the embodiment of the present invention can be used for detecting the bad content such as pornography and terror of multimedia files, which can greatly improve the detection efficiency and reliability, and facilitate the control of multimedia files.
  • FIG. 17 is a flowchart of a method for identifying a multimedia file based on behavior characteristics according to an embodiment of the present invention.
  • One solution of the method includes the following steps:
  • Step 1701 Analyze the user's behavior data, and determine the user's portrait feature value and the first willing feature value.
  • A user's online behavior can reflect the user's preferences.
  • From this behavior the user's portrait can be determined, for example that the user likes pornographic videos. Accordingly, the user's portrait can also help in judging the user's current or future online behavior; for example, a user who likes pornographic videos is more likely to be watching pornographic videos now or in the future.
  • A user portrait often reflects many different preferences of the user, so relying on the portrait alone to judge the user's current or future behavior is not accurate enough. Because a user's online behavior is often continuous, searching for or browsing a certain kind of content often lasts for a period of time. For example, if the user paid attention to pornographic content in the preceding few minutes, the chance that the user continues to browse pornographic content now or in the near future is greater. On this basis, the user's behavior characteristics before the current time can be used to assist in judging the user's current or future behavior.
  • The user's behavior data is analyzed to obtain the portrait feature value, which identifies the user's preference for the specific content, and the first willing feature value, which identifies the user's willingness to pay attention to the specific content during a period of time before the current time.
  • Analyzing the user's behavior data and determining the user's portrait feature value includes: acquiring the user's behavior data, where the behavior data includes first behavior data of browsing text related to the specific content, second behavior data of browsing images related to the specific content, third behavior data of accessing forums related to the specific content, and fourth behavior data of chatting in chat groups related to the specific content; and determining separately whether the first behavior data, the second behavior data, the third behavior data, and the fourth behavior data are empty, recording 0 if the data is empty and 1 if it is not, to obtain correspondingly the first judgment result R1, the second judgment result R2, the third judgment result R3, and the fourth judgment result R4;
  • the weight W4 integrates and integrates the first determination result, the second determination result, the third determination result, and the fourth determination result to obtain a behavior characteristic value of the user.
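The portrait feature value can be sketched as below. The text does not fully spell out how the four judgment results are combined with their weights, so this sketch assumes a plain weighted sum of R1 to R4; the function name `portrait_feature_value` and the sample behavior data are hypothetical, and the weights 0.4, 0.3, 0.6, and 0.5 are the example weights given elsewhere in this description.

```python
def portrait_feature_value(behavior_data: list, weights: list) -> float:
    """Combine four binary judgment results R1..R4 with weights W1..W4.

    Assumption: the combination is a weighted sum.
    """
    # Each judgment result is 0 if the behavior data is empty, 1 otherwise.
    results = [0 if not data else 1 for data in behavior_data]
    return sum(w * r for w, r in zip(weights, results))

# Text browsed and forum visited; no images viewed and no group chat.
value = portrait_feature_value([["novel"], [], ["forum visit"], []],
                               [0.4, 0.3, 0.6, 0.5])
```

Here only R1 and R3 are 1, so the value is 0.4 + 0.6.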
  • Analyzing the user's behavior data to determine the user's first willing feature value can be achieved in two ways: (1) obtaining the content displayed on the user's screen through software running on the user terminal, such as a PC manager or other similar software, and judging from that content; or (2) capturing the user's traffic on the network, for example by packet capture on a router, to analyze the operations the user is performing.
  • The specific steps include: obtaining behavior data of the user in the most recent period of time, the behavior data including a first time spent browsing text related to the specific content, a second time spent browsing images related to the specific content, a third time spent visiting forums related to the specific content, and a fourth time spent chatting in chat groups related to the specific content; and assigning the first weight W1 to the first time, the second weight W2 to the second time, the third weight W3 to the third time, and the fourth weight W4 to the fourth time, and computing the weighted average of the first time, the second time, the third time, and the fourth time to obtain the first willing feature value of the user.
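The weighted average of the four times can be sketched as follows. The text does not state how the average is normalized; this sketch assumes the weighted sum is divided by the sum of the weights. The function name `first_willing_feature_value` and the choice of minutes as the time unit are assumptions made for the example.

```python
def first_willing_feature_value(times: list, weights: list) -> float:
    """Weighted average of the four recent activity times.

    Assumption: normalization is by the sum of the weights W1..W4.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * t for w, t in zip(weights, times)) / total_weight

# Minutes spent on text, images, forums, and group chat in the recent period.
value = first_willing_feature_value([4, 0, 8, 0], [1, 1, 1, 1])
```

With equal weights this reduces to the ordinary mean, so the example evaluates to 3.0.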
  • The portrait feature value indicates the user's preference for pornographic content, and the first willing feature value indicates the user's willingness to watch pornographic video during a period of time before the current moment. The user's online behavior is analyzed, mainly including whether the user has recently viewed porn-related texts or images, visited porn-related forums, or spoken in pornographic chat groups.
  • Viewing pornographic novels, porn-related short pieces, or Weibo posts can be regarded as browsing porn-related text; browsing images marked as pornographic, images on pornographic websites, and various pictures of beauties on normal websites can be regarded as viewing porn-related images.
  • The user's portrait feature value is then calculated based on the weights of these behavior features. For example, browsing porn-related text has a weight of 0.4, browsing porn-related images has a weight of 0.3, accessing porn forums has a weight of 0.6, and chatting in pornographic chat groups has a weight of 0.5, if the user is in the most recent time.
  • Step 1702 Obtain the portrait feature value and the first willing feature value of the viewer user during the multimedia file playing process.
  • the multimedia file includes text, pictures, video and audio files, and the solution can be used to identify whether the files contain specific content, and the specific content can be horror and/or pornographic content, for example, using the solution of the embodiment of the present invention. Identifies whether the text is pornographic, whether the image is pornographic, and whether the video is pornographic.
  • the video may be an on-demand video or a live video
  • the live video includes video played in a live streaming room.
  • the portrait feature value and the first willing feature value of the viewer user are obtained, and the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user in the preset.
  • the preset time period generally refers to a period of time counted backward from the current time, such as the 40 minutes before the current time.
  • Step 1703 Calculate, according to the portrait feature value and the first willing feature value, a probability that the multimedia file contains specific content.
  • Calculating, according to the portrait feature value and the first willing feature value, the probability that the multimedia file contains specific content comprises: determining a second willing feature value of each user according to the portrait feature value and the first willing feature value; and calculating the probability that the multimedia file contains specific content according to the second willing feature values of all users.
  • Combining the portrait feature value with the first willing feature value can improve the accuracy of determining whether the multimedia file contains specific content.
  • The second willing feature value may be obtained by summing the portrait feature value and the first willing feature value. The second willing feature value of each user is then compared with a preset threshold, and the ratio of the number of users whose second willing feature value exceeds the threshold to the total number of users is calculated; this ratio is taken as the probability that the multimedia file contains specific content.
  • Alternatively, the portrait feature value and the first willing feature value may be weighted and averaged, according to weights set in advance for the portrait feature value and the first willing feature value, to obtain the second willing feature value. The second willing feature value of each user is then compared with a preset threshold, and the ratio of the number of users whose second willing feature value exceeds the threshold to the total number of users is calculated to obtain the probability that the multimedia file contains specific content.
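  • The two ways of deriving the second willing feature value, and the ratio used as the probability, can be sketched as follows. This is a minimal illustration only; the function names, the tuple layout, and the sample numbers are assumptions and do not come from the embodiment.

```python
from typing import Optional, Sequence, Tuple


def second_willing_value(
    portrait: float,
    first_willing: float,
    weights: Optional[Tuple[float, float]] = None,
) -> float:
    """Combine the two per-user values into the second willing feature value."""
    if weights is None:
        # First option in the text: a plain sum of the two values.
        return portrait + first_willing
    # Second option: weighted average with weights set in advance.
    w_portrait, w_willing = weights
    return (w_portrait * portrait + w_willing * first_willing) / (w_portrait + w_willing)


def specific_content_probability(
    users: Sequence[Tuple[float, float]],
    threshold: float,
    weights: Optional[Tuple[float, float]] = None,
) -> float:
    """Share of users whose second willing feature value exceeds the threshold."""
    if not users:
        return 0.0  # assumption: with no viewers there is no evidence
    exceeding = sum(
        1
        for portrait, willing in users
        if second_willing_value(portrait, willing, weights) > threshold
    )
    return exceeding / len(users)


# Hypothetical (portrait, first willing) pairs for four viewers.
viewers = [(0.9, 0.8), (0.1, 0.2), (0.6, 0.7), (0.2, 0.1)]
print(specific_content_probability(viewers, threshold=1.0))
```

  • With the summing option, two of the four hypothetical viewers exceed the threshold of 1.0, so the sketch yields a probability of 0.5.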
  • Step 1704 Determine whether the probability exceeds a preset value. If yes, perform step 1705 to perform feature detection on the multimedia file. Otherwise, perform step 1708 to play the multimedia file normally.
  • The preset value can be set manually, and it can be adjusted according to the results of judging whether multimedia files are specific-content files, so as to improve the accuracy of the final judgment. If the probability does not exceed the preset value, the multimedia file being played is unlikely to contain specific content; to improve detection efficiency and accuracy, further detection of such a file may be skipped and no processing is performed. If the probability exceeds the preset value, the multimedia file is more likely to contain specific content, and the content of the multimedia file needs to be detected further.
  • A character feature library may be established in advance to store feature characters extracted from specific content files (such as pornography, erotic images, etc.). The feature characters in the library are then matched against the text content; when the matching result exceeds a preset matching threshold, the text file contains many feature characters and can be determined to be text of the specific content.
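  • The character-matching idea can be sketched as below. The embodiment does not say how the "matching result" is measured, so counting the distinct library entries found in the text is an assumption made here for illustration, as are the function names and the placeholder terms.

```python
from typing import Iterable


def count_feature_hits(text: str, feature_library: Iterable[str]) -> int:
    """Count how many distinct library entries occur in the text (assumed metric)."""
    lowered = text.lower()
    return sum(1 for term in set(feature_library) if term.lower() in lowered)


def is_specific_content_text(
    text: str, feature_library: Iterable[str], match_threshold: int
) -> bool:
    """Flag the text when the matching result exceeds the preset matching threshold."""
    return count_feature_hits(text, feature_library) > match_threshold


# Placeholder feature characters; a real library would be built from labeled files.
library = {"alpha", "beta", "gamma"}
print(is_specific_content_text("Alpha and beta appear here", library, match_threshold=1))
```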
  • further detection includes character detection, sensitive part detection, skin color pixel detection, blood color pixel detection, and the like.
  • Character detection matches feature characters using a character feature library.
  • Sensitive part detection matches sensitive parts using a sensitive part feature library.
  • For blood color pixel detection and skin color pixel detection, a blood color model and a skin color model can first be established, and blood color pixel detection and skin color pixel detection are then performed on the picture according to the blood color model and the skin color model.
  • the construction method of the blood color model and the skin color model is prior art, and details are not described herein again.
  • an audio detection model can be trained, and the audio file to be detected is input into the audio detection model to obtain a detection result indicating whether the specific content is included.
  • the construction method of the audio detection model is prior art, and details are not described herein again.
  • further detection includes audio detection and image detection; the audio detection can be performed using an audio detection model, and the image detection includes extracting images from the video and performing feature detection on the images.
  • Extracting images from the video and performing feature detection on them includes: extracting a preset number of images from the video at equal time intervals, for example by taking a screenshot of the video every 10 s; and then performing feature detection on each image to determine whether the image contains a specific feature. The feature detection includes motion detection, character detection, sensitive part detection, skin color pixel detection, blood color pixel detection, and the like.
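  • Equal-interval extraction can be illustrated by computing the timestamps at which screenshots would be taken. This is a sketch under assumptions: the function name, the optional cap on the number of images, and the choice to start at 0 s are hypothetical, and decoding the frames themselves is left to whatever video library is in use.

```python
import math
from typing import List, Optional


def sample_timestamps(
    duration_s: float, interval_s: float = 10.0, max_images: Optional[int] = None
) -> List[float]:
    """Return screenshot times (in seconds) spaced at equal intervals."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Number of sample points that fall strictly before the end of the video.
    count = max(0, math.ceil(duration_s / interval_s))
    if max_images is not None:
        count = min(count, max_images)  # respect the preset number of images
    # Multiply instead of accumulating to avoid floating-point drift.
    return [i * interval_s for i in range(count)]


print(sample_timestamps(35, 10))
```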
  • Step 1706 Determine, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content.
  • the multimedia file is a video file.
  • When the number of images determined to contain the specific feature is greater than a preset threshold P, the video is determined to be a specific content video; otherwise the video is determined to be a normal video.
  • Alternatively, the number of images containing the specific feature may be counted to determine the ratio of that number to the total number of images extracted from the video for detection. When the determined ratio is greater than a threshold Q, the video is determined to be a specific content video, and step 1707 is performed to process the multimedia file; otherwise, the video is determined to be a normal video, and step 1708 is performed to play the multimedia file normally.
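  • The two decision rules, an absolute count compared with P or a ratio compared with Q, can be sketched as follows. The function name and the convention of passing exactly one of the two thresholds are assumptions for illustration.

```python
from typing import Optional, Sequence


def is_specific_content_video(
    image_flags: Sequence[bool],
    count_threshold_p: Optional[int] = None,
    ratio_threshold_q: Optional[float] = None,
) -> bool:
    """Decide from per-image detection flags using either threshold P or threshold Q."""
    if (count_threshold_p is None) == (ratio_threshold_q is None):
        raise ValueError("pass exactly one of count_threshold_p or ratio_threshold_q")
    flagged = sum(1 for flag in image_flags if flag)
    if count_threshold_p is not None:
        # Rule 1: number of images containing the specific feature exceeds P.
        return flagged > count_threshold_p
    if not image_flags:
        return False  # assumption: no extracted images means nothing to flag
    # Rule 2: share of flagged images among all extracted images exceeds Q.
    return flagged / len(image_flags) > ratio_threshold_q


flags = [True, False, True, True, False]
print(is_specific_content_video(flags, count_threshold_p=2))
print(is_specific_content_video(flags, ratio_threshold_q=0.5))
```

  • The ratio rule is less sensitive to video length than the count rule, since a long video yields more extracted images.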
  • further processing may be performed, such as classifying, grading, or exiting the video.
  • the above method can be used to identify pornographic video, wherein feature detection of the video image includes sensitive portion detection and skin color pixel detection.
  • An achievable method for detecting sensitive parts includes:
  • Step 1 Searching for feature data corresponding to the image of the human sensitive part matching the image to be identified in the pre-stored index of the human sensitive part.
  • the index of the sensitive part of the human body can organize and store the characteristic data of the image of the sensitive part of the human body in an orderly manner, which is convenient for searching. Images of sensitive parts of the human body can be obtained by marking the sensitive parts of the human body in erotic images and generating pictures.
  • the feature data may be a vector feature, which may be any feature used in existing image recognition methods, such as a texture descriptor, HOG (Histogram of Oriented Gradients), LBP (Local Binary Patterns), and so on.
  • the feature data of the image to be identified may be extracted, and the distance between the feature data of the image to be identified and the feature data of the image of the sensitive part of the human body may be calculated, thereby determining whether the image to be recognized matches the image of the sensitive part of the human body according to the distance.
  • The Euclidean distance can be used as the distance. If the Euclidean distance between the feature data of the image to be identified and the feature data of one of the human sensitive part images is the shortest, and that distance is less than a Euclidean distance threshold, the image to be identified matches that human sensitive part image. Other similarity measures, such as a correlation coefficient, can also be used to determine whether there is a match; they are not enumerated here.
  • Step 2 Calculate a confidence level corresponding to the image to be identified according to the matched feature data.
  • Confidence is a function used to measure the degree of match between a judgment and an actual observation. The higher the confidence, the higher the degree of matching between the image to be recognized and the image of the sensitive part of the human body.
  • Step 3 determining whether the image to be identified is an erotic image according to a confidence level corresponding to the image to be identified.
  • If the confidence level is higher than the first confidence threshold, the degree of matching between the image to be identified and the matched human sensitive part image is high, and the image to be identified is an erotic image.
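  • Steps 1 to 3 can be sketched as a nearest-neighbor search followed by a confidence check. The embodiment does not define the confidence function, so the linear mapping from distance to confidence used here is purely an assumption, as are the function names, the dictionary-based index, and the toy two-dimensional feature vectors.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_match(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    distance_threshold: float,
) -> Optional[Tuple[str, float]]:
    """Step 1: return the closest indexed entry if it is within the threshold."""
    if not index:
        return None
    key, dist = min(
        ((k, euclidean(query, v)) for k, v in index.items()), key=lambda kv: kv[1]
    )
    return (key, dist) if dist < distance_threshold else None


def confidence(distance: float, distance_threshold: float) -> float:
    """Step 2 (assumed form): 1.0 at distance 0, falling to 0.0 at the threshold."""
    return max(0.0, 1.0 - distance / distance_threshold)


def is_flagged(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    distance_threshold: float,
    confidence_threshold: float,
) -> bool:
    """Step 3: flag the image when the confidence exceeds the confidence threshold."""
    match = find_match(query, index, distance_threshold)
    if match is None:
        return False
    return confidence(match[1], distance_threshold) > confidence_threshold


# Toy index with hypothetical feature vectors.
toy_index = {"a": [0.0, 0.0], "b": [3.0, 4.0]}
print(find_match([0.0, 1.0], toy_index, distance_threshold=2.0))
```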
  • An achievable method for skin color pixel detection includes:
  • Step 1 Detect human body region pixels and human head region pixels in the video image.
  • An Adaboost (an iterative algorithm) human body detection algorithm based on edge histogram features is used to determine whether a human body is present in the image. First, the integral image of the video image is calculated and the edge histogram features are extracted; then, according to the trained classifier feature library, the human body region is searched for in the image using a cascade method.
  • The training method of the classifier feature library comprises: calculating an integral image of each sample image and extracting the class-like features of the sample image; screening effective features according to the Adaboost algorithm to form weak classifiers; combining a plurality of weak classifiers to form a strong classifier; and cascading a plurality of strong classifiers to form a classifier feature library for human body detection.
  • When the presence of a human body is detected, the video image is further examined to determine whether a human head is present.
  • Head detection uses an Adaboost head detection algorithm based on rectangle-like features to determine whether a human head is present in the image.
  • First, the integral image of the image is calculated and the edge histogram features are extracted; then, according to the trained classifier feature library, the head region is searched for in the image using a cascade method.
  • The training method of the classifier feature library comprises: calculating an integral image of each sample image and extracting the class-like features of the sample image; screening effective features according to the Adaboost algorithm to form weak classifiers; combining a plurality of weak classifiers to form a strong classifier; and cascading a plurality of strong classifiers to form a classifier feature library for human head detection.
  • Step 2 In each video image, count the ratio of skin color pixels to image pixels, the ratio of skin color pixels to human body region pixels, and the ratio of human head region pixels to skin color pixels.
  • Step 3 Determine whether the video image is erotic according to a preset first ratio threshold for skin color pixels to image pixels, a second ratio threshold for skin color pixels to human body region pixels, a third ratio threshold for human head region pixels to skin color pixels, and a preset judgment strategy.
  • The judgment strategy may be that at least two of the following conditions are satisfied: the ratio of skin color pixels to image pixels is greater than the first ratio threshold; the ratio of skin color pixels to human body region pixels is greater than the second ratio threshold; and the ratio of human head region pixels to skin color pixels is greater than the third ratio threshold.
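  • The "at least two of three conditions" strategy can be sketched as below. Pixel counting itself (skin color model, body and head detection) is assumed to have been done already; the class name, field names, and example thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PixelCounts:
    image: int  # total pixels in the video image
    skin: int   # pixels classified as skin color
    body: int   # pixels inside the detected human body region
    head: int   # pixels inside the detected human head region


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division; a zero denominator is treated as a ratio of 0 (assumption)."""
    return numerator / denominator if denominator else 0.0


def skin_ratios(c: PixelCounts) -> Tuple[float, float, float]:
    """Step 2: skin/image, skin/body, and head/skin ratios."""
    return (_ratio(c.skin, c.image), _ratio(c.skin, c.body), _ratio(c.head, c.skin))


def is_flagged_by_skin(
    c: PixelCounts, t1: float, t2: float, t3: float, min_conditions: int = 2
) -> bool:
    """Step 3: flag the image when at least min_conditions ratios exceed their thresholds."""
    satisfied = sum(
        1 for ratio, threshold in zip(skin_ratios(c), (t1, t2, t3)) if ratio > threshold
    )
    return satisfied >= min_conditions


counts = PixelCounts(image=10000, skin=4000, body=5000, head=500)
print(is_flagged_by_skin(counts, t1=0.3, t2=0.6, t3=0.2))
```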
  • The method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • The technical solution of the embodiments of the present invention may, in essence, be embodied in the form of a software product that is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present invention.
  • FIG. 18 is a flowchart of a multimedia file processing method according to an embodiment of the present invention.
  • One solution of the method includes the following steps:
  • S1801 Obtain a portrait feature value and a first willing feature value of the audience user during the playing of the multimedia file, where the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user's willingness to view the specific content within a preset time period;
  • S1802 Calculate, according to the portrait feature value and the first willing feature value, a probability that the multimedia file includes specific content;
  • S1803 Determine whether the probability exceeds a preset value, and if yes, perform feature detection on the multimedia file;
  • S1804 Determine, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content;
  • S1805 Process the multimedia file according to the determination result.
  • The multimedia file is an on-demand video or a live video, and the specific content is pornographic content. Processing the multimedia file according to the determination result comprises: if the multimedia file is an on-demand porn video, exiting the playback of the on-demand video; if the multimedia file is a live porn video, closing the live broadcast room in which the video is played.
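  • The control flow of the gating, detection, and processing steps can be sketched as follows. The enum, the returned action strings, and the `detect` callback are hypothetical stand-ins; the sketch only shows that the costly feature detection runs when the behavior-based probability exceeds the preset value.

```python
from enum import Enum
from typing import Callable


class Kind(Enum):
    ON_DEMAND = "on_demand"
    LIVE = "live"


def process_multimedia(
    probability: float,
    preset_value: float,
    detect: Callable[[], bool],
    kind: Kind,
) -> str:
    """Return the action to take for a video that is currently being played."""
    # Gate: skip feature detection when the probability is not above the preset value.
    if probability <= preset_value:
        return "play_normally"
    # Feature detection decides whether the file really is specific content.
    if not detect():
        return "play_normally"
    # Processing depends on the kind of video.
    if kind is Kind.ON_DEMAND:
        return "exit_playback"
    return "close_live_room"


print(process_multimedia(0.8, 0.5, detect=lambda: True, kind=Kind.LIVE))
```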
  • Multimedia files are first screened by user behavior features, and content detection is then performed only on the files screened out, which improves the efficiency and accuracy of recognizing specific content.
  • the use of the embodiment of the present invention for pornographic video detection can greatly improve detection efficiency and reliability, and facilitate management and control of multimedia video.
  • the embodiment provides a multimedia file identification device based on behavior characteristics.
  • the apparatus includes an acquisition unit 1920, a calculation unit 1930, a detection unit 1940, and a determination unit 1950.
  • the obtaining unit 1920 is configured to acquire a portrait feature value and a first willing feature value of the audience user during the multimedia file playing process, where the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • the calculating unit 1930 is configured to calculate, according to the portrait feature value and the first willing feature value, a probability that the multimedia file includes specific content;
  • the detecting unit 1940 is configured to determine whether the probability exceeds a preset value, and if yes, perform feature detection on the multimedia file;
  • the determining unit 1950 is configured to determine, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content.
  • the obtaining unit 1920 is configured to perform step S1601 in the embodiment of the present invention
  • the calculating unit 1930 is configured to perform step S1602 in the embodiment of the present invention
  • the detecting unit 1940 is configured to perform step S1603 in the embodiment of the present invention
  • the determining unit 1950 is configured to perform step S1604 in the embodiment of the present invention.
  • the calculating unit 2030 includes:
  • a first calculating subunit 20301 configured to determine a second willing feature value of each user according to the portrait feature value and the first willing feature value
  • the second calculating subunit 20302 is configured to calculate a probability that the multimedia file includes specific content according to the second willing feature value of all users.
  • the first calculating subunit 20301 includes:
  • a first calculation module 201011, configured to sum the portrait feature value and the first willing feature value to obtain the second willing feature value
  • the second calculating module 203012 is configured to perform weighted averaging on the portrait feature value and the first willing feature value, according to the weights set in advance for the portrait feature value and the first willing feature value, to obtain the second willing feature value.
  • the second calculating subunit 20302 includes:
  • the comparison module 203021 is configured to compare the second willing feature values of the respective users with the preset thresholds respectively;
  • the probability calculation module 203022 is configured to calculate the ratio of the number of users whose second willing feature value exceeds the threshold to the total number of users, to obtain the probability that the multimedia file contains specific content.
  • the apparatus further includes a pre-processing unit 2010 for analyzing behavior data of the user, and determining a portrait feature value and a first willing feature value of the user.
  • the pre-processing unit 2010 includes a first pre-processing sub-unit 20101 and a second pre-processing sub-unit 20102.
  • the first pre-processing sub-unit 20101 is configured to: acquire behavior data of the user, where the behavior data includes first behavior data of browsing text related to the specific content, second behavior data of browsing images related to the specific content, third behavior data of accessing forums related to the specific content, and fourth behavior data of chatting in chat groups related to the specific content; determine whether the first behavior data, the second behavior data, the third behavior data, and the fourth behavior data are each empty, correspondingly obtaining a first determination result, a second determination result, a third determination result, and a fourth determination result; and, according to a preset first weight of the first determination result, second weight of the second determination result, third weight of the third determination result, and fourth weight of the fourth determination result, weight and combine the four determination results to obtain the behavior feature value of the user;
  • the second pre-processing sub-unit 20102 is configured to: obtain behavior data of the user in a most recent period of time, where the behavior data includes a first time of browsing text related to the specific content, a second time of browsing images related to the specific content, a third time of accessing forums related to the specific content, and a fourth time of chatting in chat groups related to the specific content; assign the first weight to the first time, the second weight to the second time, the third weight to the third time, and the fourth weight to the fourth time; and weight and average the first time, the second time, the third time, and the fourth time to obtain the user's willing feature value.
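  • The two pre-processing sub-units can be sketched as below, reusing the example weights 0.4, 0.3, 0.6, and 0.5 mentioned earlier in the description. How the four determination results are combined is not fully specified, so summing the weights of the non-empty behavior categories is an assumption; the dictionary keys and function names are likewise hypothetical.

```python
from typing import Mapping, Sequence

# Example weights from the description: text, image, forum, chat.
WEIGHTS = {"text": 0.4, "image": 0.3, "forum": 0.6, "chat": 0.5}


def behavior_feature_value(
    behavior: Mapping[str, Sequence[object]],
    weights: Mapping[str, float] = WEIGHTS,
) -> float:
    """Add the weight of every behavior category whose data is not empty (assumed rule)."""
    return sum(w for name, w in weights.items() if behavior.get(name))


def willing_feature_value(
    times: Mapping[str, float],
    weights: Mapping[str, float] = WEIGHTS,
) -> float:
    """Weighted average of the recent times spent on each behavior category."""
    total_weight = sum(weights.values())
    return sum(w * times.get(name, 0.0) for name, w in weights.items()) / total_weight


history = {"text": ["page-1"], "image": [], "forum": ["thread-7"], "chat": []}
recent_minutes = {"text": 10.0, "image": 0.0, "forum": 20.0, "chat": 0.0}
print(behavior_feature_value(history), willing_feature_value(recent_minutes))
```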
  • the multimedia file is a video
  • the detecting unit 2040 includes:
  • an extracting subunit 20401, configured to extract a preset number of images from the video at equal time intervals;
  • the detecting subunit 20402 is configured to perform feature detection on each image to determine whether the image includes a specific feature, and the feature detection includes sensitive part detection and skin color pixel detection.
  • the determining unit 2050 includes:
  • the first determining subunit 20501 is configured to determine that the video is a specific content video when it is determined that the number of images including the specific feature is greater than a preset threshold P, and otherwise determine that the video is a normal video; or
  • the second determining subunit 20502 is configured to determine a ratio of the determined number of images including the specific feature to the total number of images extracted for the video detection, and determine that the video is a specific content video when the determined ratio is greater than the threshold Q. Otherwise, the video is judged to be a normal video.
  • the specific content is pornographic content
  • the video is an on-demand video or a live video.
  • the apparatus includes an acquisition unit 2120, a calculation unit 2130, a detection unit 2140, a determination unit 2150, and a processing unit 2160.
  • the obtaining unit 2120 is configured to acquire a portrait feature value and a first willing feature value of the audience user during the multimedia file playing process, where the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • the calculating unit 2130 is configured to calculate, according to the portrait feature value and the first willing feature value, a probability that the multimedia file includes specific content;
  • the detecting unit 2140 is configured to determine whether the probability exceeds a preset value, and if yes, perform feature detection on the multimedia file;
  • a determining unit 2150 configured to determine, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content
  • the processing unit 2160 is configured to process the multimedia file according to the determination result.
  • the obtaining unit 2120 is configured to perform step S1801 in the embodiment of the present invention
  • the calculating unit 2130 is configured to perform step S1802 in the embodiment of the present invention
  • the detecting unit 2140 is configured to perform step S1803 in the embodiment of the present invention
  • the determining unit 2150 is configured to perform step S1804 in the embodiment of the present invention
  • the processing unit 2160 is configured to perform step S1805 in the embodiment of the present invention.
  • the multimedia file is an on-demand video or a live video, and the specific content is pornographic content.
  • the processing unit 2160 is specifically configured to: when determining that the multimedia file is an on-demand porn video, exit the playback of the on-demand video; when determining that the multimedia file is a live porn video, close the video live broadcast room of the video.
  • Embodiments of the present invention also provide a storage medium.
  • the storage medium may be used to save the program code executed by the behavior feature-based multimedia file identification method of the above embodiment.
  • the storage medium may be located in at least one of a plurality of network devices of the computer network.
  • the storage medium is arranged to store program code for performing the following steps:
  • the first step: the portrait feature value and the first willing feature value of the viewer user are obtained, where the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • the second step: the probability that the multimedia file contains specific content is calculated according to the portrait feature value and the first willing feature value;
  • the third step: it is determined whether the probability exceeds a preset value, and if yes, feature detection is performed on the multimedia file;
  • the fourth step: it is determined, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content.
  • the storage medium is further configured to store program code for performing the following steps: determining a second willing feature value of each user based on the portrait feature value and the first willing feature value; and calculating, according to the second willing feature values of all users, the probability that the multimedia file contains specific content.
  • the storage medium is further configured to store program code for performing the following steps: summing the portrait feature value and the first willing feature value to obtain the second willing feature value; or performing weighted averaging on the portrait feature value and the first willing feature value, according to weights set in advance for the portrait feature value and the first willing feature value, to obtain the second willing feature value.
  • the storage medium is further configured to store program code for performing the following steps: comparing the second willing feature value of each user with a preset threshold; and calculating the ratio of the number of users whose second willing feature value exceeds the threshold to the total number of users, to obtain the probability that the multimedia file contains specific content.
  • the storage medium is further configured to store program code for performing the following steps: analyzing the user's behavior data, and determining the user's portrait feature value and first willing feature value.
  • the storage medium is further configured to store program code for performing the following steps: when the multimedia file is a video, extracting a preset number of images from the video at equal time intervals; and performing feature detection on each image to determine whether the image includes a specific feature, where the feature detection includes sensitive part detection and skin color pixel detection.
  • the storage medium is further configured to store program code for performing the following steps: determining that the video is a specific content video when the number of images determined to include the specific feature is greater than a preset threshold P, and otherwise determining that the video is a normal video; or determining the ratio of the number of images determined to include the specific feature to the total number of images extracted from the video for detection, determining that the video is a specific content video when the determined ratio is greater than the threshold Q, and otherwise determining that the video is a normal video.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any of a variety of other media that can store program code.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be used to save the program code executed by the video processing method of the above embodiment.
  • the storage medium may be located in at least one of a plurality of network devices of the computer network.
  • the storage medium is arranged to store program code for performing the following steps:
  • the first step: the portrait feature value and the first willing feature value of the viewer user are obtained, where the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • the second step: the probability that the multimedia file contains specific content is calculated according to the portrait feature value and the first willing feature value;
  • the third step: it is determined whether the probability exceeds a preset value, and if yes, feature detection is performed on the multimedia file;
  • the fourth step: it is determined, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content;
  • the fifth step: the multimedia file is processed according to the determination result.
  • the storage medium is further configured to store program code for performing the following steps: when the multimedia file is an on-demand porn video, exiting the playback of the on-demand video; when the multimedia file is a live porn video, closing the live broadcast room of the video.
  • An embodiment of the present invention further provides a computer terminal, which may be any computer terminal device in a computer terminal group.
  • the computer terminal may be replaced with a terminal device such as a mobile terminal.
  • the computer terminal may be located in at least one of a plurality of network devices of the computer network.
  • FIG. 22 is a block diagram showing the structure of a computer terminal according to an embodiment of the present invention.
  • the computer terminal A may include one or more (only one shown in the figure) processor 2201, memory 2203, and transmission device 2205.
  • the memory 2203 can be used to store software programs and modules, such as the program instructions/modules corresponding to the behavior-based multimedia file identification method and device in the embodiment of the present invention. The processor 2201 runs the software programs and modules stored in the memory 2203, thereby performing various functional applications and data processing, that is, implementing the above-described multimedia file identification method.
  • the memory 2203 can include a high-speed random access memory, and can also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • memory 2203 can further include memory remotely located relative to processor 2201, which can be connected to computer terminal A via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the above described transmission device 2205 is for receiving or transmitting data via a network.
  • Specific examples of the above network may include a wired network and a wireless network.
  • transmission device 2205 includes a network adapter that can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network.
  • transmission device 2205 is a radio frequency module that is used to communicate wirelessly with the Internet.
  • the memory 2203 is configured to store preset action conditions and information of the preset rights user, and an application.
  • the processor 2201 can call the information and the application stored by the memory 2203 through the transmission device to perform the following steps:
  • the first step: the portrait feature value and the first willing feature value of the viewer user are obtained, where the portrait feature value is used to identify the user's preference for the specific content, and the first willing feature value is used to identify the user's willingness to view specific content within a preset time period;
  • the second step: the probability that the multimedia file contains specific content is calculated according to the portrait feature value and the first willing feature value;
  • the third step: it is determined whether the probability exceeds a preset value, and if yes, feature detection is performed on the multimedia file;
  • the fourth step: it is determined, according to the feature detection result, whether the multimedia file is a multimedia file of a specific content.
  • An embodiment of the present application provides a network information identification method, where the method includes the following steps:
  • Step 1 Obtain the network information to be identified.
  • the network information to be identified may include target text.
  • Step 2 Perform word segmentation on the network information to obtain word segmentation of the network information.
  • the target text can be processed by word segmentation to obtain the word segmentation of the target text.
  • Step 3 Determine, according to the pre-stored trusted network information and the non-trusted network information, the type of information to which each participle of the network information belongs.
  • The trusted network information may be information in the real information base, and the non-trusted network information may be information in the fake information base.
  • The associated value of the two participles in each phrase may be calculated; the associated value of the corresponding two participles in the fake information base is extracted as a first associated value, and the associated value of the corresponding two participles in the real information base is extracted as a second associated value.
  • The difference between the associated value and the first associated value is calculated to obtain a first difference, and the difference between the associated value and the second associated value is calculated to obtain a second difference. The absolute value of the first difference is then compared with the absolute value of the second difference: if the absolute value of the first difference is greater than that of the second difference, the information type of the phrase is determined to be true information; if it is less, the information type of the phrase is determined to be false information; and if the two absolute values are equal, the information type of the phrase is determined to be unbiased information.
  • Step 4 Perform statistics according to the type of information to which each participle belongs, and determine the type of information to which the network information belongs.
  • The disclosed system, apparatus, and method may be implemented in other manners.
  • The apparatus embodiments described above are merely illustrative.
  • The division into units is only a division by logical function.
  • In an actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, apparatus, or unit, and may be electrical, mechanical, or of another form.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • The functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • The software product includes a number of instructions that cause a computer device (which may be a personal computer, an apparatus, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

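An embodiment of the present application discloses a network information identification method and apparatus. The method includes: obtaining network information to be identified; calculating the similarity between the network information to be identified and trusted network information, recorded as a first similarity, and calculating the similarity between the network information to be identified and untrusted network information, recorded as a second similarity; and determining, according to the first similarity and the second similarity, whether the network information to be identified is trustworthy. The embodiments of the present application can effectively identify specific information on a network.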

Description

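Network Information Identification Method and Apparatus
This application claims priority to Chinese Patent Application No. 201610895856.9, entitled "Network Information Identification Method and Apparatus", filed with the Chinese Patent Office on October 13, 2016; to Chinese Patent Application No. 201610956467.2, entitled "Social Network Information Identification Method, Processing Method, and Apparatus", filed with the Chinese Patent Office on October 27, 2016; and to Chinese Patent Application No. 201610929276.7, entitled "Behavior-Feature-Based Multimedia File Identification Method, Processing Method, and Apparatus", filed with the Chinese Patent Office on October 31, 2016, all of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of network applications, and in particular to a network information identification method and apparatus.
Background
With the development of network technology, more and more information can be spread over networks. Some of it is true and free of objectionable content, while some is false or contains objectionable content such as pornography or terrorism. The growth of networks amplifies the influence of false or objectionable information, and ordinary users, with limited knowledge and information, cannot identify it.
Summary
In view of this, embodiments of the present invention provide a network information identification method and apparatus that can effectively identify specific information on a network.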
本发明实施例提供的网络信息识别方法,包括:
获取待识别网络信息;
计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
本发明实施例提供的网络信息识别装置,包括:
获取单元,用于获取待识别网络信息;
计算单元,用于计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
确定单元,用于根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
一种网络信息识别方法,包括:
对目标文本进行分词处理,得到目标文本的分词;
按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息;
对目标文本中所有词组的信息类型进行统计,得到统计结果;
根据统计结果确定所述目标文本的信息类型。
一种网络信息识别装置,包括:
分词单元,用于对目标文本进行分词处理,得到目标文本的分词;
第一确定单元,用于按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息;
统计单元,用于对目标文本中所有词组的信息类型进行统计,得到统计结果;
第二确定单元,用于根据统计结果确定所述目标文本的信息类型。
一种网络信息识别方法,包括:
在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
一种网络信息识别装置,包括:
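an obtaining unit, configured to obtain, while a multimedia file is being played, the portrait feature value and the first willingness feature value of each viewer, where the portrait feature value identifies the user's preference for specific content and the first willingness feature value identifies the user's willingness to view specific content within a preset time period;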
计算单元,用于根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
检测单元,用于判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
确定单元,用于根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
本发明实施例中,后台可以自动获取待识别网络信息,根据待识别网络信息与可信网络信息的相似度,以及待识别网络信息与非可信网络信息的相似度,确定待识别网络信息是否可信,即利用相似度确定待识别网络信息是否可信,因而能够自动、有效地识别特定网络信息,例如谣言。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本发明实施例所提供的网络信息识别方法的一个场景示意图;
图2是本发明实施例所提供的网络信息识别方法的一个流程示意图;
图3是本发明实施例所提供的网络信息识别方法的另一流程示意图;
图4是本发明实施例所提供的网络信息识别装置的一个结构示意图;
图5是本发明实施例所提供的网络信息识别装置的另一结构示意图;
图6是可用于实施本发明实施例的社交网络信息识别方法的计算机终端的硬件结构框图;
图7是本发明实施例揭示的社交网络信息识别方法的流程图;
图8是本发明实施例揭示的社交网络信息识别方法的流程图;
图9是本发明实施例揭示的确定词组所属信息类型的方法的流程图;
图10是本发明实施例揭示的社交网络信息处理方法的流程图;
图11是本发明实施例揭示的社交网络信息识别装置的示意图;
图12是本发明实施例揭示的社交网络信息识别装置的示意图;
图13是本发明实施例揭示的社交网络信息处理装置的示意图;
图14是根据本发明实施例的计算机终端的结构框图;
图15是可用于实施本发明实施例的基于行为特征的多媒体文件识别方法的计算机终端的硬件结构框图;
图16是本发明实施例揭示的基于行为特征的多媒体文件识别方法的流程图;
图17是本发明实施例揭示的基于行为特征的多媒体文件识别方法的流程图;
图18是本发明实施例揭示的多媒体文件处理方法的流程图;
图19是本发明实施例揭示的基于行为特征的多媒体文件识别装置的示意图;
图20是本发明实施例揭示的基于行为特征的多媒体文件识别装置的示意图;
图21是本发明实施例揭示的多媒体文件处理装置的示意图;
图22是根据本发明实施例的计算机终端的结构框图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
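Because the prior art lacks a mechanism for automatically identifying information, users can only rely on their own limited knowledge to judge whether network information is trustworthy, and in many cases cannot effectively identify rumors. The embodiments of the present invention therefore provide a network information identification method and apparatus that can identify rumors automatically and effectively. The method may be implemented in a network information identification apparatus, which may be a backend server. A specific implementation scenario is shown in FIG. 1: the server obtains the network information to be identified, which may be information or remarks posted by a user on a social network (for example, Weibo or QQ Zone); it then calculates the similarity between the network information to be identified and trusted network information (network information in a trusted database), recorded as the first similarity, and the similarity between the network information to be identified and untrusted network information (network information in an untrusted database), recorded as the second similarity; it determines, according to the first similarity and the second similarity, whether the network information to be identified is trustworthy, and then outputs the identification result. When the network information to be identified is determined to be untrustworthy, the server may block it to keep the rumor from spreading further, or mark it as suspicious to alert users. In other words, the embodiments of the present invention use similarity to determine whether the network information to be identified is trustworthy, and can therefore identify rumors automatically and effectively.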
以下分别进行详细说明,需说明的是,以下实施例的序号不作为对实施例顺序的限定。
实施例一
如图2所示,本实施例的方法包括以下步骤:
步骤201、获取待识别网络信息;
具体实现中,待识别网络信息可以是用户在社交网络(例如微博、QQ空间)上发布的信息或言论。当用户使用终端(例如手机、平板电脑、个人计算机等)在社交网络上发布信息或言论时,后台服务器可以获取用户发布的信息或言论,即获取待识别网络信息。
步骤202、计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
具体实现中,可以预先收集可信网络信息及非可信网络信息,根据收集的可信网络信息建立可信数据库,以及根据收集的非可信网络信息建立非可信数据库。
可信网络信息可以从权威或可信的网站中提取,例如从百度百科、维基百科提取,因此,可信数据库中包含的网络信息可以认为是可信的。非可信网络信息目前可采用人工收集,非可信数据库中包含的网络信息可以认为是不可信的。
具体地,可以采用余弦定理算法计算待识别网络信息与可信数据库中的各个可信网络信息的相似度,此处可以得到多个相似度值。所计算得到的相似度值越大,说明两条信息的相似度越高,此步骤中,可以取计算所得的相似度的最大值记为第一相似度,即第一相似度为可信数据库中与待识别网络信息相似度最高的可信网络信息与待识别网络信息的相似度。
同样地,可以采用余弦定理算法计算待识别网络信息与非可信数据库中的各个非可信网络信息的相似度,此处可以得到多个相似度值。所计算得到的相似度值越大,说明两条信息的相似度越高,此步骤中,可以取计算所得的相似度的最大值记为第二相似度,即第二相似度为非可信数据库中与待识别网络信息相似度最高的非可信网络信息与待识别网络信息的相似度。
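The method described above has to compute similarities among a large amount of information, and practice has shown that the cosine similarity algorithm computes faster than other algorithms. This embodiment may therefore use the cosine similarity algorithm to calculate the similarity of two pieces of information. Other algorithms, such as the edit distance algorithm, may of course also be used; the specific algorithm is not limited here.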
另外,上面描述的方法,第一相似度与第二相似度是通过逐条计算待识别网络信息与可信数据库及非可信数据库中的各条网络信息的相似度得到的,实际中,还可以采用其他方式得到第一相似度及第二相似度。例如,采用关键字提取法,提取可信数据库中具有与待识别网络信息具有相同关键字的可信网络信息,计算该可信网络信息与待识别网络信息的相似度,记为第一相似度;提取非可信数据库中具有与待识别网络信息具有相同关键字的非可信网络信息,计算该非可信网络信息与待识别网络信息的相似度,记为第二相似度。
步骤203、根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
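Specifically, the first similarity and the second similarity may be compared. When the first similarity is greater than the second similarity, the network information to be identified is more similar to trusted network information than to untrusted network information, so it can be determined to be trustworthy. When the second similarity is greater than the first similarity, the network information to be identified is more similar to untrusted network information than to trusted network information, so it can be determined to be untrustworthy.
The identification method above uses both the trusted database and the untrusted database. In practice, either database may also be used alone. For example, using only the trusted database, the first similarity is calculated with the cosine similarity algorithm and compared with a first preset threshold (for example, 0.8): if it is greater, the network information to be identified is considered trustworthy; otherwise it is considered untrustworthy. Alternatively, using only the untrusted database, the second similarity is calculated with the cosine similarity algorithm and compared with a second preset threshold (for example, 0.9): if it is greater, the network information to be identified is considered untrustworthy; otherwise it is considered trustworthy.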
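The decision rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the similarity scores are taken as already computed, and the `"undetermined"` result for a tie is an assumption, because the text does not say what happens when the two similarities are equal.

```python
def judge_with_both(trusted_scores, untrusted_scores):
    """Compare the first similarity (max over the trusted database)
    with the second similarity (max over the untrusted database)."""
    first = max(trusted_scores, default=0.0)
    second = max(untrusted_scores, default=0.0)
    if first > second:
        return "trusted"
    if second > first:
        return "untrusted"
    return "undetermined"  # assumption: the tie case is not specified in the text


def judge_trusted_only(trusted_scores, first_threshold=0.8):
    """Single-database variant: trusted only if the first similarity exceeds the threshold."""
    first = max(trusted_scores, default=0.0)
    return "trusted" if first > first_threshold else "untrusted"


def judge_untrusted_only(untrusted_scores, second_threshold=0.9):
    """Single-database variant: untrusted only if the second similarity exceeds the threshold."""
    second = max(untrusted_scores, default=0.0)
    return "untrusted" if second > second_threshold else "trusted"
```

The default thresholds 0.8 and 0.9 are the example values given in the text; a real deployment would presumably tune them.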
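When the network information to be identified is determined to be trustworthy, it may be allowed to be displayed on the social network. When it is determined to be untrustworthy, handling measures may be taken to alert other users or keep the rumor from spreading; for example, the network information to be identified may be marked as suspicious or blocked.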
本实施例中,后台服务器可以自动获取待识别网络信息,根据待识别网络信息与可信网络信息的相似度,以及待识别网络信息与非可信网络信息的相似度,确定待识别网络信息是否可信,即利用相似度确定待识别网络信息是否可信,因而能够自动、有效地识别谣言。
实施例二
实施例一所描述的方法,本实施例将举例作进一步详细说明,如图3所示,本实施例的方法包括:
步骤301、采集可信网络信息及非可信网络信息;
具体地,可信网络信息可以从权威或可信的网站中提取,例如从百度百科、维基百科提取,非可信网络信息目前可采用人工收集。
步骤302、根据采集的可信网络信息建立可信数据库,以及根据采集的非可信网络信息建立非可信数据库;
可信数据库中包含多个可信网络信息,可信数据库中包含的网络信息可以认为是可信的;非可信数据库中包含多个非可信网络信息,非可信数据库中包含的网络信息可以认为是非可信的。
步骤303、获取待识别网络信息;
具体实现中,待识别网络信息可以是用户在社交网络(例如微博、QQ空间)上发布的信息或言论。当用户使用终端(例如手机、平板电脑、个人计算机等)在社交网络上发布信息或言论时,后台服务器可以获取用户发布的信息或言论,即获取待识别网络信息。
步骤304、计算所述待识别网络信息与所述可信数据库中的各个可信网络信息的相似度,取计算所得的相似度的最大值记为第一相似度;
具体地,可以采用余弦定理算法计算待识别网络信息与可信数据库中的各个可信网络信息的相似度,此处可以得到多个相似度值。所计算得到的相似度值越大,说明两条信息的相似度越高,此步骤中,可以取计算所得的相似度的最大值记为第一相似度,即第一相似度为可信数据库中与待识别网络信息相似度最高的可信网络信息与待识别网络信息的相似度。
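Step 305: Calculate the similarity between the network information to be identified and each piece of untrusted network information in the untrusted database, and record the maximum of the calculated similarities as the second similarity;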
同样地,可以采用余弦定理算法计算待识别网络信息与非可信数据库中的各个非可信网络信息的相似度,此处可以得到多个相似度值。所计算得到的相似度值越大,说明两条信息的相似度越高,此步骤中,可以取计算所得的相似度的最大值记为第二相似度,即第二相似度为非可信数据库中与待识别网络信息相似度最高的非可信网络信息与待识别网络信息的相似度。
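The following example illustrates how the cosine similarity algorithm calculates the similarity of two pieces of information:
Information 1: 张三是一个歌手,也是一个演员。 (Zhang San is a singer and also an actor.)
Information 2: 张三不是一个演员,但是是一个歌手。 (Zhang San is not an actor, but is a singer.)
Step 1: Word segmentation;
Information 1: 张三/是/一个/歌手,也/是/一个/演员。
Information 2: 张三/不/是/一个/演员,但是/是/一个/歌手。
Step 2: Remove duplicates and list all recognized words;
张三, 是, 不, 一个, 演员, 歌手, 但是, 也
Step 3: Calculate the word frequencies (here, the number of times a word occurs in one piece of information);
Information 1: 张三 1, 是 2, 不 0, 一个 2, 演员 1, 歌手 1, 但是 0, 也 1;
Information 2: 张三 1, 是 2, 不 1, 一个 2, 演员 1, 歌手 1, 但是 1, 也 0;
Step 4: Construct the word-frequency vectors;
Information 1: [1,2,0,2,1,1,0,1]
Information 2: [1,2,1,2,1,1,1,0]
The above are two multi-dimensional vectors in which the value of each dimension is a word frequency. Once the two vectors are constructed, calculating the similarity of the two pieces of information becomes calculating the similarity of the two vectors. The similarity of two vectors can be expressed through the angle θ between them, specifically through the cosine of that angle: the closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. This is "cosine similarity".
Step 5: Calculate the cosine of the angle between the two vectors;
Cosθ=(1*1+2*2+0*1+2*2+1*1+1*1+0*1+1*0)/(sqrt(1^2+2^2+0^2+2^2+1^2+1^2+0^2+1^2)*sqrt(1^2+2^2+1^2+2^2+1^2+1^2+1^2+0^2))=11/(sqrt(12)*sqrt(13));
The final result is Cosθ≈0.881.
That is, the similarity of the two pieces of information is 0.881, which is fairly close to 1, so the similarity is relatively high.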
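The five steps of this example can be reproduced with a short Python sketch. It is a minimal illustration, not the applicant's implementation: the function name `cosine_similarity` is hypothetical, and the token lists are copied from the segmentation in step one with punctuation dropped. Counting word frequencies directly from those tokens gives a dot product of 11 and vector norms of √12 and √13, so the cosine is 11/√156 ≈ 0.881.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)          # step 2: de-duplicated word list
    dot = sum(freq_a[w] * freq_b[w] for w in vocabulary)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                                   # convention chosen here for empty input
    return dot / (norm_a * norm_b)


# Token lists from the segmentation step of the example.
info1 = ["张三", "是", "一个", "歌手", "也", "是", "一个", "演员"]
info2 = ["张三", "不", "是", "一个", "演员", "但是", "是", "一个", "歌手"]
similarity = cosine_similarity(info1, info2)
```

Using `Counter` avoids building the explicit vectors, since words missing from one text simply contribute a frequency of zero.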
需要说明的是,实际中,步骤304与步骤305的执行顺序也可以不分先后。
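The method described above has to compute similarities among a large amount of information, and practice has shown that the cosine similarity algorithm computes faster than other algorithms. This embodiment may therefore use the cosine similarity algorithm to calculate the similarity of two pieces of information. Other algorithms, such as the edit distance algorithm, may of course also be used; the specific algorithm is not limited here.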
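For comparison, the edit distance alternative mentioned here could look like the sketch below. The text names the algorithm but gives no details, so both the Levenshtein formulation and the normalization into a 0 to 1 similarity score are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute if different
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 means identical, 0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

The dynamic program costs time proportional to the product of the two lengths for every pair compared, whereas the cosine approach works on word counts.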
另外,步骤304、步骤305描述的方法,第一相似度与第二相似度是通过逐条计算待识别网络信息与可信数据库及非可信数据库中的各条网络信息的相似度得到的,实际中,还可以采用其他方式得到第一相似度及第二相似度。例如采用关键字提取法,提取可信数据库中具有与待识别网络信息具有相同关键字的可信网络信息,计算该可信网络信息与待识别网络信息的相似度,记为第一相似度;提取非可信数据库中具有与待识别网络信息具有相同关键字的非可信网络信息,计算该非可信网络信息与待识别网络信息的相似度,记为第二相似度。
步骤306、判断所述第一相似度是否大于所述第二相似度,若所述第一相似度大于所述第二相似度,则执行步骤307,若所述第一相似度小于所述第二相似度,则执行步骤308;
具体地,可以比较所述第一相似度与所述第二相似度的大小;当所述第一相似度大于所述第二相似度时,说明待识别网络信息与可信网络信息的相似度高于待识别网络信息与非可信网络信息的相似度,因此可以确定所述待识别网络信息可信;当所述第二相似度大于所述第一相似度时,说明待识别网络信息与非可信网络信息的相似度高于待识别网络信息与可信网络信息的相似度,因此可以确定所述待识别网络信息不可信。
步骤307、确定所述待识别网络信息可信;
步骤308、确定所述待识别网络信息不可信。
当确定待识别网络信息可信时,可以允许待识别网络信息显示在社交网络上;当确定待识别网络信息不可信时,可以采用一些处理措施,以提示其他用户或避免谣言传播,例如可以将所述待识别网络信息标记为可疑,或者屏蔽所述待识别网络信息。
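The identification method above uses both the trusted database and the untrusted database. In practice, either database may also be used alone. For example, using only the trusted database, the first similarity is calculated with the cosine similarity algorithm and compared with a first preset threshold (for example, 0.8): if it is greater, the network information to be identified is considered trustworthy; otherwise it is considered untrustworthy. Alternatively, using only the untrusted database, the second similarity is calculated with the cosine similarity algorithm and compared with a second preset threshold (for example, 0.9): if it is greater, the network information to be identified is considered untrustworthy; otherwise it is considered trustworthy.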
本实施例中,后台服务器可以自动获取待识别网络信息,根据待识别网络信息与可信网络信息的相似度,以及待识别网络信息与非可信网络信息的相似度,确定待识别网络信息是否可信,即利用相似度确定待识别网络信息是否可信,因而能够自动、有效地识别谣言。
实施例三
为了更好地实施以上方法,本发明实施例还提供一种网络信息识别装置,如图4所示,本实施例的装置包括:获取单元401,计算单元402及确定单元403,如下:
(1)获取单元401;
获取单元401,用于获取待识别网络信息。
具体实现中,待识别网络信息可以是用户在社交网络(例如微博、QQ空间)上发布的信息或言论。当用户使用终端(例如手机、平板电脑、个人计算机等)在社交网络上发布信息或言论时,获取单元401可以获取用户发布的信息或言论,即获取待识别网络信息。
(2)计算单元402;
计算单元402,用于计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
具体实现中,本实施例的网络信息识别装置还可以包括采集单元及建立单元,其中:
采集单元可以预先收集可信网络信息及非可信网络信息,建立单元可以根据收集的可信网络信息建立可信数据库,以及根据收集的非可信网络信息建立非可信数据库。
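Trusted network information may be extracted from authoritative or trusted websites, for example Baidu Baike or Wikipedia, so the network information contained in the trusted database can be regarded as trustworthy. Untrusted network information is currently collected manually, and the network information contained in the untrusted database can be regarded as untrustworthy.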
具体地,计算单元402可以包括第一计算子单元及第二计算子单元,其中:
第一计算子单元可以采用余弦定理算法计算待识别网络信息与可信数据库中的各个可信网络信息的相似度,此处可以得到多个相似度值。所计算得到的相似度值越大,说明两条信息的相似度越高,此步骤中,第一计算子单元可以取计算所得的相似度的最大值记为第一相似度,即第一相似度为可信数据库中与待识别网络信息相似度最高的可信网络信息与待识别网络信息的相似度。
同样地,第二计算子单元也可以采用余弦定理算法计算待识别网络信息与非可信数据库中的各个非可信网络信息的相似度,此处可以得到多个相似度值。所计算得到的相似度值越大,说明两条信息的相似度越高,此步骤中,第二计算子单元可以取计算所得的相似度的最大值记为第二相似度,即第二相似度为非可信数据库中与待识别网络信息相似度最高的非可信网络信息与待识别网络信息的相似度。
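The method described above has to compute similarities among a large amount of information, and practice has shown that the cosine similarity algorithm computes faster than other algorithms. In this embodiment, the first calculation subunit and the second calculation subunit may therefore use the cosine similarity algorithm to calculate the similarity of two pieces of information. Other algorithms, such as the edit distance algorithm, may of course also be used; the specific algorithm is not limited here.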
另外,上面描述的方法,第一相似度与第二相似度是通过逐条计算待识别网络信息与可信数据库及非可信数据库中的各条网络信息的相似度得到的,实际中,还可以采用其他方式得到第一相似度及第二相似度。例如采用关键字提取法,提取可信数据库中具有与待识别网络信息具有相同关键字的可信网络信息,计算该可信网络信息与待识别网络信息的相似度,记为第一相似度;提取非可信数据库中具有与待识别网络信息具有相同关键字的非可信网络信息,计算该非可信网络信息与待识别网络信息的相似度,记为第二相似度。
(3)确定单元403;
确定单元403,用于根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
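Specifically, the determining unit 403 may include a comparison subunit, a first determining subunit, and a second determining subunit, where: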
比较子单元可以比较所述第一相似度与所述第二相似度的大小,当所述第一相似度大于所述第二相似度时,说明待识别网络信息与可信网络信息的相似度高于待识别网络信息与非可信网络信息的相似度,因此第一确定子单元可以确定所述待识别网络信息可信;当所述第二相似度大于所述第一相似度时,说明待识别网络信息与非可信网络信息的相似度高于待识别网络信息与可信网络信息的相似度,因此第二确定子单元可以确定所述待识别网络信息不可信。
以上识别方法同时使用到了可信数据库及非可信数据库,实际中,还可以单独采用其中一个数据库来识别网络信息是否可信。例如,仅采用可信数据库,通过余弦定理算法计算得到第一相似度,判断第一相似度是否大于第一预设阈值(例如0.8),若大于,则认为待识别网络信息可信,若不大于,则认为待识别网络信息不可信;或者,仅采用非可信数据库,通过余弦定理算法计算得到第二相似度,判断第二相似度是否大于第二预设阈值(例如0.9),若大于,则认为待识别网络信息不可信,若不大于,则认为待识别网络信息可信。
另外,本实施例的网络信息识别装置还可以包括处理单元,当确定待识别网络信息可信时,处理单元可以允许待识别网络信息显示在社交网络上;当确定待识别网络信息不可信时,处理单元可以采用一些处理措施,以提示其他用户或避免谣言传播,例如处理单元可以将所述待识别网络信息标记为可疑,或者屏蔽所述待识别网络信息。
需要说明的是,上述实施例提供的网络信息识别装置在实现网络信息识别时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的网络信息识别装置与网络信息识别方法属于同一构思,其具体实现过程详见方法实施例,此处不再赘述。
本实施例中,获取单元可以自动获取待识别网络信息,计算单元计算待识别网络信息与可信网络信息的相似度,以及计算待识别网络信息与非可信网络信息的相似度,确定单元根据所计算的相似度确定待识别网络信息是否可信,即本实施例中,利用相似度确定待识别网络信息是否可信,因而能够自动、有效地识别谣言。
实施例四
本发明实施例还提供了一种网络信息识别装置,如图5所示,其示出了本发明实施例所涉及的装置的结构示意图,具体来讲:
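The apparatus may include components such as a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, a radio frequency (RF) circuit 503, a power supply 504, an input unit 505, and a display unit 506. A person skilled in the art will understand that the structure shown in FIG. 5 does not limit the apparatus, which may include more or fewer components than shown, combine certain components, or use a different component arrangement. Specifically: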
处理器501是该装置的控制中心,利用各种接口和线路连接整个装置的各个部分,通过运行或执行存储在存储器502内的软件程序和/或模块,以及调用存储在存储器502内的数据,执行装置的各种功能和处理数据,从而对装置进行整体监控。处理器501可包括一个或多个处理核心;处理器501可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器501中。
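The memory 502 may be used to store software programs and modules; the processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the apparatus. In addition, the memory 502 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.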
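The RF circuit 503 may be used to receive and send signals while information is being transmitted and received. In particular, after receiving downlink information from a base station, it hands the information to one or more processors 501 for processing; it also sends uplink data to the base station. Typically, the RF circuit 503 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), and a duplexer. In addition, the RF circuit 503 may communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, and Short Messaging Service (SMS).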
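The apparatus also includes a power supply 504 (such as a battery) that powers the components. The power supply 504 may be logically connected to the processor 501 through a power management system, so that charging, discharging, and power consumption management are handled through the power management system. The power supply 504 may further include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.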
该装置还可包括输入单元505,该输入单元505可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。具体地,在一个具体的实施例中,输入单元505可包括触敏表面以及其他输入设备。触敏表面,也称为触摸显示屏或者触控板,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触敏表面上或在触敏表面附近的操作),并根据预先设定的程式驱动相应的连接装置。触敏表面可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器501,并能接收处理器501发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触敏表面。除了触敏表面,输入单元505还可以包括其他输入设备。具体地,其他输入设备可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
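The apparatus may further include a display unit 506, which may be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the apparatus; these graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The display unit 506 may include a display panel, which may be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch-sensitive surface may cover the display panel. When the touch-sensitive surface detects a touch operation on or near it, it passes the operation to the processor 501 to determine the type of touch event, and the processor 501 then provides the corresponding visual output on the display panel according to the type of touch event. Although in FIG. 5 the touch-sensitive surface and the display panel are implemented as two separate components to provide the input and output functions, in some embodiments the touch-sensitive surface and the display panel may be integrated to provide the input and output functions.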
尽管未示出,装置还可以包括摄像头、蓝牙模块等,在此不再赘述。具体在本实施例中,装置中的处理器501会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器502中,并由处理器501来运行存储在存储器502中的应用程序,从而实现各种功能,如下:
获取待识别网络信息;
计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
具体地,处理器501可以采用余弦定理算法计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及采用余弦定理算法计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度。
进一步地,处理器501还用于,
在获取待识别网络信息之前,采集可信网络信息及非可信网络信息;
根据采集的可信网络信息建立可信数据库,以及根据采集的非可信网络信息建立非可信数据库。
具体地,处理器501可以计算所述待识别网络信息与所述可信数据库中的各个可信网络信息的相似度,取计算所得的相似度的最大值记为第一相似度;
计算所述待识别网络信息与所述非可信数据库中的各个非可信网络信息的相似度,取计算所得的相似度的最大值记为第二相似度。
具体地,处理器501可按照如下方式确定待识别网络信息是否可信:
比较所述第一相似度与所述第二相似度的大小;
当所述第一相似度大于所述第二相似度时,确定所述待识别网络信息可信;
当所述第二相似度大于所述第一相似度时,确定所述待识别网络信息不可信。
进一步地,在确定所述待识别网络信息不可信时,处理器501还可以将所述待识别网络信息标记为可疑,或者屏蔽所述待识别网络信息。
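As can be seen from the above, the apparatus of this embodiment can automatically obtain the network information to be identified, then calculate its similarity to trusted network information and its similarity to untrusted network information, and finally determine from the calculated similarities whether the network information to be identified is trustworthy. In other words, the apparatus of this embodiment uses similarity to determine whether the network information to be identified is trustworthy, and can therefore identify rumors automatically and effectively.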
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,装置,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
本实施例提供一种社交网络信息识别方法的实施例,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。
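The method embodiments provided in this application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 6 is a hardware block diagram of a computer terminal that can be used to implement the social network information identification method of the embodiments of the present invention. As shown in FIG. 6, the computer terminal 600 may include one or more processors 602 (only one is shown in the figure; the processor 602 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 604 for storing data, and a transmission device 606 for communication functions. A person of ordinary skill in the art will understand that the structure shown in FIG. 6 is only illustrative and does not limit the structure of the electronic device. For example, the computer terminal 600 may include more or fewer components than shown in FIG. 6, or have a configuration different from that shown in FIG. 6.
The memory 604 may be used to store the software programs and modules of application software, such as the program instructions/modules corresponding to the social network information identification method in the embodiments of the present invention. The processor 602 executes various functional applications and data processing, that is, implements the social network information identification method described above, by running the software programs and modules stored in the memory 604. The memory 604 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 604 may further include memory located remotely from the processor 602, and such remote memory may be connected to the computer terminal 600 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.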
传输装置606用于经由一个网络接收或者发送数据。上述的网络具体实例可包括计算机终端600的通信供应商提供的无线网络。在一个实例中,传输装置606包括一个网络适配器(Network Interface Controller,简称为NIC),其可通过基站与其他网络设备相连从而可与互联网进行通讯。在一个实例中,传输装置606可以为射频(Radio Frequency,简称为RF)模块,其用于通过无线方式与互联网进行通讯。
在上述运行环境下,本申请提供了如图7所示的一种社交网络信息识别方法。该方法可以应用于智能终端设备中,由智能终端设备中的处理器执行,智能终端设备可以是智能手机、平板电脑等。智能终端设备中安装有至少一个应用程序,本发明实施例并不限定应用程序的种类,可以为系统类应用程序,也可以为软件类应用程序。
图7是本发明实施例一揭示的社交网络信息识别方法的流程图。如图7所示,该方法的一种方案包括如下步骤:
步骤S701,对目标文本进行分词处理,得到目标文本的分词;
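Step S702: in the order in which the word segments appear in the target text, take every two adjacent word segments as a phrase, and determine the information type of each phrase according to the information in the fake information base and the real information base, where the information types include false information, true information, and unbiased information;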
步骤S703,对目标文本中所有词组的信息类型进行统计,得到统计结果;
步骤S704,根据统计结果确定所述目标文本的信息类型。
作为步骤S702的一种实施方式,所述根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,包括:
步骤1,根据公式X(W12)=C(W2)*C(W12)/C(W1)计算得到每个词组中两个分词的关联值;其中,X(W12)表示所述词组中两个分词的关联值,C(W1)表示所述词组中的第一个分词在目标文本中出现的频次,C(W2)表示所述词组中的第二个分词在目标文本中出现的频次,C(W12)表示第一个分词和第二个分词在目标文本中有顺序的同时连续出现的频次,所述第一个分词在目标文本中的出现顺序早于第二个分词;
步骤2,提取虚假信息库中对应的所述两个分词的关联值,作为第一关联值;提取真实信息库中对应的所述两个分词的关联值,作为第二关联值;根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型;具体包括:计算所述关联值与第一关联值的差值,得到第一差值;计算所述关联值与第二关联值的差值,得到第二差值;比较所述第一差值的绝对值和第二差值的绝对值的大小,若第一差值的绝对值大于第二差值的绝对值,则确定该词组的信息类型为真实信息,若第一差值的绝对值小于第二差值的绝对值,则确定该词组的信息类型为虚假信息,若第一差值的绝对值与第二差值的绝对值相等,则确定该词组的信息类型为无偏向信息。
本发明实施例通过建立虚假信息库和真实信息库,对虚假信息和对应的真实信息进行分析,计算得到虚假信息中相邻关键词的相关度和真实信息中相邻关键词的相关度,通过判断目标文本中相邻关键词的相关度与二者的接近程度,来确定目标文本中相邻关键词的信息类型,并进一步通过统计目标文本中所有相邻关键词的信息类型得到目标文本的信息类型,实现了通过较为简单的算法快速识别网络虚假信息,可以为网络管理者快速反应提供重要的依据。
本实施例提供一种社交网络信息识别方法。在如实施例的运行环境下,本申请实施例提供了如图8所示的社交网络信息识别方法。如图8所示,图8是根据本发明实施例的社交网络信息识别方法的流程图,该方法的一种方案包括如下步骤:
步骤一:对虚假信息库中的虚假信息样本及真实信息库中的真实信息样本进行处理。
虚假信息库中的虚假信息样本可以通过人工收集获得,真实信息库中的真实信息样本可以从已知的知识库(如各种百科知识)里提取得到。较优的,虚假信息样本和真实信息样本一一对应收录,当收集到一个错误的虚假信息样本,则对应的查找一个正确的真实信息样本,将虚假信息样本存入虚假信息库,将该真实信息样本存入真实信息库。
对信息样本的处理过程包括:对虚假信息库中的虚假信息样本进行分词处理,得到虚假信息样本的分词,按照各分词在该虚假信息样本中的出现顺序,计算得到相邻两个分词的关联值;对真实信息库中的真实信息样本进行分词处理,得到真实信息样本的分词,按照各分词在该真实信息样本中的出现顺序,计算得到相邻两个分词的关联值。
由于对虚假信息样本的预处理过程和对真实信息样本的预处理过程相同,下面就以虚假信息样本为例对预处理过程展开说明。
参见图8,对虚假信息样本的预处理过程包括:
第一,从虚假信息库中提取虚假信息样本,将虚假信息样本输入分词模块。
第二,利用分词模块对虚假信息样本进行分词处理,得到虚假信息样本的分词结果。
具体包括:
首先对虚假信息样本进行预处理,去除虚假信息样本中的停用词,停用词是人工收集得到的,主要包含标点符号、代词、语气词、助词、连词等,这些停用词一般没有特殊的意义,经常搭配别的词构成词或短语。
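The fake information sample with stop words removed is then segmented by dictionary-based segmentation, using the forward maximum matching algorithm, the reverse maximum matching algorithm, or the bidirectional maximum matching algorithm. Forward and reverse maximum matching are common segmentation methods whose steps are not repeated here. Bidirectional maximum matching works as follows: the text to be segmented is segmented with both forward maximum matching and reverse maximum matching; when the two results contain different numbers of words, the result with fewer words is taken as the final result; when the two results contain the same number of words, either one is taken as the final result.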
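A compact sketch of bidirectional maximum matching is shown below. It is an illustration under stated assumptions: the function names, the tiny dictionary, and the Latin-letter sample text are all hypothetical, unmatched characters are emitted as single-character words, and on a tie the forward result is returned (the text allows either).

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:   # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word available."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()                                # restore reading order
    return words


def bidirectional_max_match(text, dictionary):
    """Keep whichever direction yields fewer words; forward wins a tie."""
    max_len = max((len(word) for word in dictionary), default=1)
    forward = forward_max_match(text, dictionary, max_len)
    backward = backward_max_match(text, dictionary, max_len)
    return forward if len(forward) <= len(backward) else backward


toy_dictionary = {"ab", "abc", "cde"}              # hypothetical dictionary
result = bidirectional_max_match("abcde", toy_dictionary)
```

With this toy dictionary the forward pass yields `abc / d / e` (three words) and the backward pass yields `ab / cde` (two words), so the backward result is kept.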
之后,统计各个分词在虚假信息样本中的出现频次,按照各分词在虚假信息样本中的出现顺序进行正向排序,并对应记录各分词在虚假信息样本中的出现频次。例如文本Q:温、热性的狗、羊肉就不能与寒、凉性的绿豆、西瓜同食。对文本Q进行分词处理后,可以得到一个矩阵样式的分词结果,如表一所示。
Table 1: [table image PCTCN2017104275-appb-000001, the segmentation result of text Q, is not reproduced in this text]
第三,将分词结果输入相关性计算模块,按照各分词在虚假信息样本中的出现顺序,计算相邻两个分词的相关性,得到相邻两个分词的关联值。
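Specifically, in the order in which the word segments appear in the fake information sample, the associated value of two adjacent word segments may be calculated with the formula X(W)=C(W02)*C(W)/C(W01);
where X(W) is the associated value of the two adjacent word segments, C(W01) is the number of times the first of the two word segments occurs in the fake information sample, C(W02) is the number of times the second word segment occurs in the fake information sample, the first word segment appears before the second, and C(W) is the number of times the first and second word segments occur consecutively, in that order, in the fake information sample.
Fourth, store each pair of adjacent word segments together with its associated value.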
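The associated-value computation and the storage step can be sketched as follows. The function name `association_values` and the token list are hypothetical; the token list is made-up sample data rather than the segmentation of any text in this document.

```python
from collections import Counter


def association_values(tokens):
    """Map each ordered pair of adjacent word segments (w1, w2) to
    X = C(w2) * C(w1 followed by w2) / C(w1)."""
    word_count = Counter(tokens)
    pair_count = Counter(zip(tokens, tokens[1:]))   # consecutive, order-sensitive pairs
    return {
        (w1, w2): word_count[w2] * count / word_count[w1]
        for (w1, w2), count in pair_count.items()
    }


# Hypothetical segmented sample; stop words are assumed to be removed already.
sample_tokens = ["羊肉", "绿豆", "同食", "羊肉", "绿豆"]
values = association_values(sample_tokens)
```

Here 羊肉 and 绿豆 each occur twice and appear consecutively twice, so their associated value is 2 * 2 / 2 = 2.0. The returned dictionary plays the role of the stored pair-to-value mapping; one such mapping would be built for the fake information base and one for the real information base.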
步骤二:对目标文本进行分词处理,得到目标文本的分词。
对目标文本进行分词处理,得到目标文本的分词,具体包括:
第一,获取目标文本;目标文本可以从社交应用软件中获取得到,例如从微博中提取微博信息,将微博信息作为目标文本,从微信提取公众号文章或微信朋友圈消息,将该文章或者朋友圈消息作为目标文本。
第二,对所述目标文本进行预处理,去除目标文本中的停用词。
停用词是人工收集得到的,主要包含标点符号、代词、语气词、助词、连词等,这些停用词一般没有特殊的意义,经常搭配别的词构成词或短语,术语一般不包含停用词。停用词示例:“啊”、“哦”、“呃”、“以及”、“的”、“得”、“几乎”、“什么”、“我”、“它”、“我们”等。
第三,采用字典分词法对所述目标文本进行分词处理,得到目标文本的分词。
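The target text with stop words removed is segmented by dictionary-based segmentation, using the forward maximum matching algorithm, the reverse maximum matching algorithm, or the bidirectional maximum matching algorithm. Forward and reverse maximum matching are common segmentation methods whose steps are not repeated here. Bidirectional maximum matching works as follows: the text to be segmented is segmented with both forward maximum matching and reverse maximum matching; when the two results contain different numbers of words, the result with fewer words is taken as the final result; when the two results contain the same number of words, either one is taken as the final result. Then the occurrence frequency of each word segment in the target text is counted, the word segments are sorted in the order in which they appear in the text, and the occurrence frequency of each word segment in the target text is recorded alongside it, giving a segmentation result expressed as a matrix.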
步骤三:按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息。
图9是本发明实施例揭示的确定词组所属信息类型的方法的流程图。参见图9,确定词组所属信息类型的方法包括:
S901:计算每个词组中两个分词的关联值。
具体地,可以根据公式X(W12)=C(W2)*C(W12)/C(W1)计算得到词组中两个分词的关联值;其中,X(W12)表示所述词组中两个分词的关联值,C(W1)表示所述词组中的第一个分词在目标文本中出现的频次,C(W2)表示所述词组中的第二个分词在目标文本中出现的频次,C(W12)表示第一个分词和第二个分词在目标文本中有顺序的同时连续出现的频次,所述第一个分词在目标文本中的出现顺序早于第二个分词。
S902:提取虚假信息库中对应的所述两个分词的关联值,作为第一关联值;提取真实信息库中对应的所述两个分词的关联值,作为第二关联值。
S903:根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型。
所述根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型,包括:
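calculating the difference between the associated value and the first associated value to obtain a first difference, and calculating the difference between the associated value and the second associated value to obtain a second difference;
comparing the absolute value of the first difference with the absolute value of the second difference: if the absolute value of the first difference is greater than that of the second difference, the information type of the phrase is determined to be true information; if it is smaller, the information type of the phrase is determined to be false information; if the two are equal, the information type of the phrase is determined to be unbiased information.
For example, suppose the associated value of the two adjacent word segments "羊肉" (mutton) and "绿豆" (mung bean) in the target text is 4, the associated value of the corresponding two words in the fake information base is 1, and the associated value of the corresponding two words in the real information base is 3. Then 1 is taken as the first associated value and 3 as the second associated value; the absolute value of the first difference is 3 and the absolute value of the second difference is 1, so the information type of the phrase ("羊肉" and "绿豆") can be determined to be true information.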
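The comparison rule and the worked example translate directly into code. The function name and the string labels below are illustrative choices, not names from the source.

```python
def phrase_type(value, fake_value, real_value):
    """Classify a phrase by whether its associated value lies closer to the
    fake-base value (first associated value) or the real-base value (second)."""
    first_difference = abs(value - fake_value)
    second_difference = abs(value - real_value)
    if first_difference > second_difference:
        return "real"       # farther from the fake base, so closer to the real base
    if first_difference < second_difference:
        return "fake"
    return "neutral"        # equal distances: unbiased information


# Worked example from the text: value 4, fake-base value 1, real-base value 3.
example_type = phrase_type(4, 1, 3)
```

In the example the first difference is |4 - 1| = 3 and the second is |4 - 3| = 1, so the phrase is classified as real.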
步骤四:对目标文本中所有词组的信息类型进行统计,得到统计结果。
该步骤包括:获取目标文本中所有词组的信息类型;统计各个信息类型的出现频次,得到统计结果。
步骤五:根据统计结果确定所述目标文本的信息类型。
所述根据统计结果确定所述目标文本的信息类型,包括:
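comparing the occurrence frequencies of false information and true information, and taking the information type with the higher occurrence frequency as the information type of the target text; if the occurrence frequency of false information equals that of true information, the information type of the target text is determined to be unbiased information.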
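Steps four and five amount to a count followed by a comparison. A minimal sketch, assuming each phrase has already been labelled `"fake"`, `"real"`, or `"neutral"` (hypothetical labels):

```python
from collections import Counter


def text_type(phrase_types):
    """Decide the type of the whole text from the per-phrase labels.
    Only the fake and real counts are compared; neutral phrases do not vote."""
    counts = Counter(phrase_types)
    if counts["fake"] > counts["real"]:
        return "fake"
    if counts["real"] > counts["fake"]:
        return "real"
    return "neutral"
```

For instance, `text_type(["fake", "real", "fake", "neutral"])` returns `"fake"` because false information occurs twice and true information once.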
需要说明的是,对于前述的方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本发明实施例并不受所描述的动作顺序的限制,因为依据本发明实施例,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例所涉及的动作和模块并不一定是本发明实施例所必须的。
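From the description of the above implementations, a person skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the embodiments of the present invention in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.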
本实施例提供一种社交网络信息处理方法。在如实施例的运行环境下,本申请提供了如图10所示的社交网络信息处理方法。如图10所示,图10是根据本发明实施例的社交网络信息处理方法的流程图,该方法的一种方案包括如下步骤:
S1001:对目标文本进行分词处理,得到目标文本的分词;
S1002:按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息;
S1003:对目标文本中所有词组的信息类型进行统计,得到统计结果;
S1004:根据统计结果确定所述目标文本的信息类型;
S1005:根据目标文本的信息类型对所述目标文本进行处理。
所述根据目标文本的信息类型对所述目标文本进行处理,包括:若所述目标文本的信息类型为虚假信息,则删除社交网络中的所述目标文本。
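The target text may be obtained from social application software; for example, a Weibo post may be extracted from Weibo and used as the target text, or an official-account article or a Moments message may be extracted from WeChat and used as the target text. When the information type of the target text is determined to be false information, the corresponding target text on the social network is deleted. For example, if the target text is a WeChat Moments message and it is determined to be false information, the network administrator may be notified to handle the message manually, or the Moments message may be deleted automatically.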
本实施例实现了通过较为简单的算法快速识别网络虚假信息,可以为网络管理者快速反应提供重要的依据,便于网络管理者及时处理网络虚假信息,降低或避免虚假信息传播造成的不良影响。
本实施例提供一种社交网络信息识别装置。如图11所示,该装置包括分词单元1110、第一确定单元1120、统计单元1130和第二确定单元1140。
分词单元1110,用于对目标文本进行分词处理,得到目标文本的分词;
第一确定单元1120,用于按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息;
统计单元1130,用于对目标文本中所有词组的信息类型进行统计,得到统计结果;
第二确定单元1140,用于根据统计结果确定所述目标文本的信息类型。
该实施例的社交网络信息识别装置中,分词单元1110用于执行本发明实施例中的步骤S701,第一确定单元1120用于执行本发明实施例中的步骤S702,统计单元1130用于执行本发明实施例中的步骤S703,第二确定单元1140用于执行本发明实施例中的步骤S704。
参见图12,作为一种实施方式,所述分词单元1210包括第一获取子单元12101、处理子单元12102和分词子单元12103。
第一获取子单元12101,用于获取目标文本;
处理子单元12102,用于对所述目标文本进行预处理,去除目标文本中的停用词;
分词子单元12103,用于采用字典分词法对经过处理子单元处理后的目标文本进行分词处理,得到目标文本的分词。
作为一种实施方式,所述第一确定单元1220包括计算子单元12201、提取子单元12202和确定子单元12203。
计算子单元12201,用于计算每个词组中两个分词的关联值;
提取子单元12202,用于提取虚假信息库中对应的所述两个分词的关联值,作为第一关联值,提取真实信息库中对应的所述两个分词的关联值,作为第二关联值;
确定子单元12203,用于根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型。
Further, the determining subunit 12203 includes a calculation module 122031 and a determination module 122032.
计算模块122031,用于计算所述关联值与第一关联值的差值,得到第一差值;计算所述关联值与第二关联值的差值,得到第二差值;
确定模块122032,用于比较所述第一差值的绝对值和第二差值的绝对值的大小,若第一差值的绝对值大于第二差值的绝对值,则确定该词组的信息类型为真实信息,若第一差值的绝对值小于第二差值的绝对值,则确定该词组的信息类型为虚假信息,若第一差值的绝对值与第二差值的绝对值相等,则确定该词组的信息类型为无偏向信息。
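The calculation subunit 12201 is specifically configured to calculate the associated value of the two word segments in a phrase with the formula X(W12)=C(W2)*C(W12)/C(W1), where X(W12) is the associated value of the two word segments in the phrase, C(W1) is the number of times the first word segment of the phrase occurs in the target text, C(W2) is the number of times the second word segment of the phrase occurs in the target text, C(W12) is the number of times the first and second word segments occur consecutively, in that order, in the target text, and the first word segment appears in the target text before the second.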
作为一种实施方式,所述统计单元1230包括:
第二获取子单元12301,用于获取目标文本中所有词组的信息类型,
统计子单元12302,用于统计各个信息类型的出现频次,得到统计结果;
所述第二确定单元1240,具体用于比较虚假信息和真实信息的出现频次,将出现频次较高的信息类型确定为所述目标文本的信息类型,如果虚假信息的出现频次和真实信息的出现频次相同,则确定所述目标文本的信息类型为无偏向信息。
进一步地,所述装置还包括预处理单元和存储单元。
所述预处理单元,用于对虚假信息库中的虚假信息样本进行分词处理,得到虚假信息样本的分词,按照各分词在该虚假信息样本中的出现顺序,计算得到相邻两个分词的关联值;还用于对真实信息库中的真实信息样本进行分词处理,得到真实信息样本的分词,按照各分词在该真实信息样本中的出现顺序,计算得到相邻两个分词的关联值;
所述存储单元包括第一存储模块和第二存储模块,所述第一存储模块用于存储对虚假信息样本进行预处理得到的关联值及对应的分词,所述第二存储模块用于存储对真实信息样本进行预处理得到的关联值及对应的分词。
本发明实施例通过对目标文本进行分词,将相邻两个分词作为一个词组,计算每个词组中两个分词的关联值,将其与虚假信息库和真实信息库中对应的两个词的关联值进行比对,根据关联值接近程度来确定目标文本中每个词组的信息类型,进而通过统计目标文本中所有词组的信息类型来确定目标文本的信息类型,实现了通过较为简单的算法快速识别网络虚假信息,可以为网络管理者快速反应提供重要的依据,便于网络管理者及时处理网络虚假信息,降低虚假信息传播造成的不良影响。
本实施例提供一种社交网络信息处理装置。如图13所示,该装置包括分词单元1310、第一确定单元1320、统计单元1330、第二确定单元1340和处理单元1350。
分词单元1310,用于对目标文本进行分词处理,得到目标文本的分词;
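a first determining unit 1320, configured to take, in the order in which the word segments appear in the target text, every two adjacent word segments as a phrase, and to determine the information type of each phrase according to the information in the fake information base and the real information base, where the information types include false information, true information, and unbiased information;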
统计单元1330,用于对目标文本中所有词组的信息类型进行统计,得到统计结果;
第二确定单元1340,用于根据统计结果确定所述目标文本的信息类型;
处理单元1350,用于根据目标文本的信息类型对所述目标文本进行处理。
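In the social network information processing apparatus of this embodiment, the word segmentation unit 1310 performs step S1001 of the embodiments of the present invention, the first determining unit 1320 performs step S1002, the statistics unit 1330 performs step S1003, the second determining unit 1340 performs step S1004, and the processing unit 1350 performs step S1005.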
所述处理单元1350,具体用于当第二确定单元确定所述目标文本的信息类型为虚假信息时,删除社交网络中的所述目标文本。
本发明的实施例还提供了一种存储介质。在本实施例中,上述存储介质可以用于保存上述实施例的一种社交网络信息识别方法所执行的程序代码。
在本实施例中,上述存储介质可以位于计算机网络的多个网络设备中的至少一个网络设备。
在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
第一步,对目标文本进行分词处理,得到目标文本的分词。
第二步,按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息。
第三步,对目标文本中所有词组的信息类型进行统计,得到统计结果。
第四步,根据统计结果确定所述目标文本的信息类型。
存储介质还被设置为存储用于执行以下步骤的程序代码:获取目标文本;对所述目标文本进行预处理,去除目标文本中的停用词;采用字典分词法对所述目标文本进行分词处理,得到目标文本的分词。
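The storage medium is further configured to store program code for performing the following steps: calculating the associated value of the two word segments in each phrase; extracting the associated value of the corresponding two word segments in the fake information base as the first associated value; extracting the associated value of the corresponding two word segments in the real information base as the second associated value; and determining the information type of the phrase according to how close the associated value is to the first associated value and to the second associated value.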
存储介质还被设置为存储用于执行以下步骤的程序代码:计算所述关联值与第一关联值的差值,得到第一差值;计算所述关联值与第二关联值的差值,得到第二差值;比较所述第一差值的绝对值和第二差值的绝对值的大小,若第一差值的绝对值大于第二差值的绝对值,则确定该词组的信息类型为真实信息,若第一差值的绝对值小于第二差值的绝对值,则确定该词组的信息类型为虚假信息,若第一差值的绝对值与第二差值的绝对值相等,则确定该词组的信息类型为无偏向信息。
存储介质还被设置为存储用于执行以下步骤的程序代码:获取目标文本中所有词组的信息类型;统计各个信息类型的出现频次,得到统计结果。
存储介质还被设置为存储用于执行以下步骤的程序代码:比较虚假信息和真实信息的出现频次,将出现频次较高的信息类型确定为所述目标文本的信息类型,如果虚假信息的出现频次和真实信息的出现频次相同,则确定所述目标文本的信息类型为无偏向信息。
存储介质还被设置为存储用于执行以下步骤的程序代码:对虚假信息库中的虚假信息样本进行分词处理,得到虚假信息样本的分词,按照各分词在该虚假信息样本中的出现顺序,计算得到相邻两个分词的关联值;对真实信息库中的真实信息样本进行分词处理,得到真实信息样本的分词,按照各分词在该真实信息样本中的出现顺序,计算得到相邻两个分词的关联值。
在本实施例中,上述存储介质可以包括但不限于:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本发明的实施例还提供了一种存储介质。在本实施例中,上述存储介质可以用于保存上述实施例的一种社交网络信息处理方法所执行的程序代码。
在本实施例中,上述存储介质可以位于计算机网络的多个网络设备中的至少一个网络设备。
在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
第一步,对目标文本进行分词处理,得到目标文本的分词;
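Step 2: in the order in which the word segments appear in the target text, take every two adjacent word segments as a phrase, and determine the information type of each phrase according to the information in the fake information base and the real information base, where the information types include false information, true information, and unbiased information;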
第三步,对目标文本中所有词组的信息类型进行统计,得到统计结果;
第四步,根据统计结果确定所述目标文本的信息类型;
第五步,根据目标文本的信息类型对所述目标文本进行处理。
存储介质还被设置为存储用于执行以下步骤的程序代码:当所述目标文本的信息类型为虚假信息时,删除社交网络中的所述目标文本。
本发明的实施例还提供一种计算机终端,该计算机终端可以是计算机终端群中的任意一个计算机终端设备。在本实施例中,上述计算机终端也可以替换为移动终端等终端设备。
在本实施例中,上述计算机终端可以位于计算机网络的多个网络设备中的至少一个网络设备。
图14是根据本发明实施例的计算机终端的结构框图。如图14所示,该计算机终端A可以包括:一个或多个(图中仅示出一个)处理器1401、存储器1403、以及传输装置1405。
其中,存储器1403可用于存储软件程序以及模块,如本发明实施例中的社交网络信息识别方法和装置对应的程序指令/模块,处理器1401通过运行存储在存储器1403内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的社交网络信息识别。存储器1403可包括高速随机存储器,还可以包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器1403可进一步包括相对于处理器1401远程设置的存储器,这些远程存储器可以通过网络连接至计算机终端A。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
上述的传输装置1405用于经由一个网络接收或者发送数据。上述的网络具体实例可包括有线网络及无线网络。在一个实例中,传输装置1405包括一个网络适配器,其可通过网线与其他网络设备与路由器相连从而可与互联网或局域网进行通讯。在一个实例中,传输装置1405为射频模块,其用于通过无线方式与互联网进行通讯。
其中,具体地,存储器1403用于存储预设动作条件和预设权限用户的信息、以及应用程序。
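The processor 1401 may call, through the transmission device, the information and application programs stored in the memory 1403 to perform the following steps: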
第一步,对目标文本进行分词处理,得到目标文本的分词。
第二步,按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息。
第三步,对目标文本中所有词组的信息类型进行统计,得到统计结果。
第四步,根据统计结果确定所述目标文本的信息类型。
本实施例提供一种基于行为特征的多媒体文件识别方法的实施例,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。
本申请所提供的方法实施例可以在移动终端、计算机终端或者类似的运算装置中执行。以运行在计算机终端上为例,图15是可用于实施本发明实施例的基于行为特征的多媒体文件识别方法的计算机终端的硬件结构框图。如图15所示,计算机终端1500可以包括一个或多个(图中仅示出一个)处理器1502(处理器1502可以包括但不限于微处理器MCU或可编程逻辑器件FPGA等的处理装置)、用于存储数据的存储器1504、以及用于通信功能的传输装置1506。本领域普通技术人员可以理解,图15所示的结构仅为示意,其并不对上述电子装置的结构造成限定。例如,计算机终端1500还可包括比图15中所示更多或者更少的组件,或者具有与图15所示不同的配置。
存储器1504可用于存储应用软件的软件程序以及模块,如本发明实施例中的基于行为特征的多媒体文件识别方法对应的程序指令/模块,处理器1502通过运行存储在存储器1504内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的基于行为特征的多媒体文件识别方法。存储器1504可包括高速随机存储器,还可包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器1504可进一步包括相对于处理器1502远程设置的存储器,这些远程存储器可以通过网络连接至计算机终端1500。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
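The transmission device 1506 is used to receive or send data via a network. A specific example of such a network is the wireless network provided by the communication provider of the computer terminal 1500. In one example, the transmission device 1506 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In another example, the transmission device 1506 may be a radio frequency (RF) module used to communicate with the Internet wirelessly.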
在上述运行环境下,本申请实施例提供了如图16所示的一种基于行为特征的多媒体文件识别方法。该方法可以应用于智能终端设备中,由智能终端设备中的处理器执行,智能终端设备可以是智能手机、平板电脑等。智能终端设备中安装有至少一个应用程序,本发明实施例并不限定应用程序的种类,可以为系统类应用程序,也可以为软件类应用程序。
图16是本发明实施例一揭示的基于行为特征的多媒体文件识别方法的流程图。如图16所示,该方法的一种方案包括如下步骤:
步骤S1601,在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
步骤S1602,根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
步骤S1603,判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
步骤S1604,根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
作为步骤S1602的一种实施方式,所述根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率,包括:
根据所述画像特征值和第一意愿特征值确定每个用户的第二意愿特征值;
根据所有用户的第二意愿特征值计算所述多媒体文件包含特定内容的概率。
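The embodiments of the present invention analyze the association between users' online behavior and the viewing of specific content. While a multimedia file is being played, the portrait feature value of each viewer and the first willingness feature value, which indicates the user's willingness to view specific content within a preset time, are obtained; the probability that the multimedia file contains specific content is then calculated from the portrait feature values and first willingness feature values of the users, and the probability is compared with a preset value to decide whether the multimedia file needs further detection. User behavior features thus help to screen out the multimedia files to be analyzed, and specific-content detection is performed only on the files screened out, which improves the efficiency and accuracy of identifying multimedia files with specific content. Applying the embodiments of the present invention to the detection of objectionable content such as pornography and terrorism in multimedia files can greatly improve detection efficiency and reliability and facilitate the management and control of multimedia files.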
本实施例提供一种基于行为特征的视频内容识别方法。在如实施例的运行环境下,本申请实施例提供了如图17所示的基于行为特征的多媒体文件识别方法。如图17所示,图17是根据本发明实施例的基于行为特征的多媒体文件识别方法的流程图,该方法的一种方案包括如下步骤:
步骤1701:分析用户的行为数据,确定用户的画像特征值和第一意愿特征值。
用户的上网行为可以反映用户的喜好,通过分析用户的搜索、浏览、点击推荐信息等行为可以确定用户画像,例如用户画像为喜好色情视频,相应地,用户画像也可以辅助判断用户当前或未来的上网行为,例如喜好色情视频的用户当下或未来观看色情视频的几率较不喜好色情视频的用户更大。用户画像往往可以反映用户的多种喜好,因而仅依靠用户画像来判断用户当下或未来的行为还不够准确,由于用户的上网行为往往具有连续性,对某一内容的搜索或浏览往往会持续一段时间,例如用户在前几分钟关注了色情内容,在当前或未来一段时间继续浏览色情相关内容的几率就较大,基于此,可以参考当前时间之前的一时间段内用户的行为特点来辅助判断用户当下或未来的行为。
分析用户行为数据,可以用画像特征值标识用户对特定内容的喜好,用第一意愿特征值标识用户在当前时间之前的一段时间内关注特定内容的意愿。
其中,分析用户的行为数据,确定用户的画像特征值,包括:获取用户的行为数据,所述行为数据包括浏览特定内容相关文本的第一行为数据、浏览特定内容相关图片的第二行为数据、访问特定内容相关论坛的第三行为数据和在特定内容相关的聊天群里聊天的第四行为数据;分别判断所述第一行为数据、第二行为数据、第三行为数据和第四行为数据是否为空,若为空则记为0,若不为空则记为1,对应得到第一判断结果R1、第二判断结果R2、第三判断结果R3和第四判断结果R4;
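According to a preset first weight W1 for the first judgment result, second weight W2 for the second judgment result, third weight W3 for the third judgment result, and fourth weight W4 for the fourth judgment result, the four judgment results are combined to obtain the user's behavior feature value, which serves as the portrait feature value. In one approach, the behavior feature value is B=W1*R1+W2*R2+W3*R3+W4*R4; in another approach, B=(W1*R1+W2*R2+W3*R3+W4*R4)/4.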
分析用户的行为数据,确定用户的第一意愿特征值,可以通过两种方式实现:(1)通过用户终端上运行的电脑管家等类似软件,获取用户的屏幕显示内容来判断;(2)可以在网络上捕获用户的流量,比如路由器上抓包,从而分析出用户正在进行的操作。具体步骤包括:获取用户在最近一段时间内的行为数据,所述行为数据包括浏览特定内容相关文本的第一时间、浏览特定内容相关图片的第二时间、访问特定内容相关论坛的第三时间和在特定内容相关的聊天群里聊天的第四时间;为所述第一时间赋予所述第一权重W1、所述第二时间赋予所述第二权重W2、所述第三时间赋予所述第三权重W3、所述第四时间赋予所述第四权重W4,对所述第一时间、第二时间、第三时间和第四时间进行加权平均,得到用户的第一意愿特征值。
举例说明,假设特定内容为色情内容,画像特征值表示用户对色情内容的喜好程度,第一意愿特征值表示用户在此刻之前一段时间内希望观看色情视频的意愿,分析该用户的上网行为,主要包括用户在最近一段时间是否浏览过色情相关的文字、图片以及是否访问色情相关的论坛、是否在色情聊天群里发言,其中,浏览色情小说、色情相关的段子或微博等可以视为浏览过色情相关的文字,浏览被标记为色情的图片、色情网站上的图片以及正常网站上的各种美女图片可以视为浏览过色情相关的图片;然后根据这些行为特征的权重,计算用户的画像特征值,如浏览色情相关的文字对应的权重为0.4,浏览色情相关的图片对应的权重为0.3,访问色情论坛的权重为0.6,在色情聊天群里发言的权重是0.5,如果用户在最近一段时间内浏览过色情相关的图片、访问了色情相关的论坛并且还在色情聊天群里发言,则该用户的行为特征值B=0.4*0+0.6*1+0.3*1+0.5*1=1.4,依照历史数据分析,大于1说明用户较多关注色情内容,可以标记用户为色情用户。若用户在当前时刻之前的40分钟内花费10分钟看色情小说、10分钟看色情图片、20分钟访问色情论坛,则第一意愿特征值为(0.4*10+0.3*10+0.6*20)/40=0.475。
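The worked example above can be reproduced with a minimal Python sketch. The function names and the dictionary keys (`text`, `image`, `forum`, `chat`) are hypothetical; only the weights and durations come from the example. Note that the example divides the weighted time by the length of the observation window (40 minutes), and the sketch follows that reading of "weighted average".

```python
# Weights from the worked example; the key names are illustrative assumptions.
WEIGHTS = {"text": 0.4, "image": 0.3, "forum": 0.6, "chat": 0.5}


def portrait_feature_value(behavior_data, weights=WEIGHTS, average=False):
    """B = sum(W_i * R_i), where R_i is 1 if the i-th behavior record is non-empty, else 0.

    With average=True the sum is divided by the number of behavior types,
    matching the second variant B = (W1*R1 + ... + W4*R4) / 4.
    """
    total = sum(w * (1 if behavior_data.get(k) else 0) for k, w in weights.items())
    return total / len(weights) if average else total


def first_willingness_value(minutes, window_minutes, weights=WEIGHTS):
    """Weighted browsing time divided by the length of the observation window."""
    return sum(weights[k] * minutes.get(k, 0) for k in weights) / window_minutes


# The user viewed related pictures, visited a forum and chatted, but read no related text.
behavior = {"text": [], "image": ["img"], "forum": ["visit"], "chat": ["msg"]}
b = portrait_feature_value(behavior)

# 10 min of text, 10 min of pictures, 20 min of forum within the last 40 minutes.
w1 = first_willingness_value({"text": 10, "image": 10, "forum": 20}, 40)
print(round(b, 3), round(w1, 3))
```

Under these assumptions the sketch yields the same values as the text: B = 1.4 and a first willingness feature value of 0.475.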
步骤1702:在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值。
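Multimedia files include text, picture, video, and audio files. This scheme can be used to identify whether such files contain specific content, which may be horror and/or pornographic content; for example, the scheme of this embodiment can identify whether a text is pornographic text, whether a picture is a pornographic picture, or whether a video is a pornographic video.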
当多媒体文件为视频文件时,所述视频可以是点播视频或直播视频,所述直播视频包括直播间播放的视频。在视频播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好程度,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿,预设时间段一般是指当前时间往前推移的一时间段,比如当前时间之前的40分钟。
步骤1703:根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率。
所述根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率具体包括:根据所述画像特征值和第一意愿特征值确定每个用户的第二意愿特征值;根据所有用户的第二意愿特征值计算所述多媒体文件包含特定内容的概率。综合画像特征值和第一意愿特征值,可以提高判断所述多媒体文件是否包含特定内容的准确性。
在一个实施例中,可以通过对所述画像特征值和第一意愿特征值进行求和,得到所述第二意愿特征值;通过分别将各个用户的第二意愿特征值与预设的阈值进行比对,计算所述第二意愿特征值超过阈值的用户数量与用户总数量的比值,得到所述多媒体文件包含特定内容的概率。
在另一个实施例中,还可以根据预先为画像特征值和第一意愿特征值设定的权重,对所述画像特征值和第一意愿特征值进行加权平均,得到第二意愿特征值;通过分别将各个用户的第二意愿特征值与预设的阈值进行比对,计算所述第二意愿特征值超过阈值的用户数量与用户总数量的比值,得到所述多媒体文件包含特定内容的概率。
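The two variants of step 1703 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the function names are hypothetical, and normalizing the weighted average by the sum of the two weights is one possible reading of "weighted average".

```python
def second_willingness(portrait, first, weights=None):
    """Combine the portrait value and the first willingness value for one user.

    weights=None     -> plain sum (first embodiment)
    weights=(wp, wf) -> weighted average (second embodiment); dividing by
                        wp + wf is an assumption of this sketch.
    """
    if weights is None:
        return portrait + first
    wp, wf = weights
    return (wp * portrait + wf * first) / (wp + wf)


def specific_content_probability(users, threshold, weights=None):
    """Share of viewers whose second willingness value exceeds the threshold."""
    if not users:
        return 0.0
    over = sum(1 for p, f in users if second_willingness(p, f, weights) > threshold)
    return over / len(users)


# Each tuple is (portrait feature value, first willingness feature value); values are made up.
viewers = [(1.4, 0.475), (0.3, 0.0), (1.1, 0.2), (0.0, 0.05)]
prob_sum = specific_content_probability(viewers, threshold=1.0)
prob_avg = specific_content_probability(viewers, threshold=0.9, weights=(0.5, 0.5))
print(prob_sum, prob_avg)
```

The resulting probability is then compared with the preset value in step 1704 to decide whether feature detection is worth running.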
步骤1704:判断所述概率是否超过预设值,若是,则执行步骤1705,对所述多媒体文件进行特征检测,否则执行步骤1708,正常播放该多媒体文件。
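The preset value can be set manually and can be adjusted according to the results of judging whether multimedia files are files with specific content, so as to improve the accuracy of the final judgment. If the probability does not exceed the preset value, the multimedia file being played is less likely to contain specific content; to improve detection efficiency and accuracy, further detection of such files can be skipped and no processing is applied to them. If the probability exceeds the preset value, the multimedia file is more likely to contain specific content, and its content needs to be detected further.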
对于文本文件,进一步检测包括对文本内容进行字符检测。可以预先建立字符特征库,用于存放从特定内容文件(例如色情小说、色情图片等)中提取的特征字符,然后利用字符特征库中的特征字符与文本内容进行匹配,当匹配结果超过预设的匹配阈值后,说明文本文件包含的特征字符较多,可以确定其为特定内容的文本。
对于图片文件,进一步检测包括对图片进行字符检测、敏感部位检测、肤色像素检测、血色像素检测等。字符检测利用字符特征库进行特征字符匹配来进行检测,敏感部位检测利用敏感部位特征库进行敏感部位匹配来进行检测,血色像素检测和肤色像素检测可以首先建立血色模型和肤色模型,再根据血色模型和肤色模型对图片进行血色像素检测和肤色像素检测。血色模型和肤色模型的构建方法为现有技术,在此不再赘述。
对于音频文件,进一步检测时,可以训练一个音频检测模型,将待检测的音频文件输入音频检测模型,来获取是否包含特定内容的检测结果。音频检测模型的构建方法为现有技术,在此不再赘述。
对于视频文件,进一步检测包括音频检测和图像检测;其中,音频检测可以采用音频检测模型进行检测;图像检测包括提取所述视频的图像,对所述图像进行特征检测。具体的,所述提取所述视频的图像,对所述图像进行特征检测,包括:对所述视频等时间间隔提取预设数目的图像,例如通过对视频间断10s截屏来提取图像;然后对每张图像进行特征检测,判断图像是否包含特定特征,特征检测包括运动检测、字符检测、敏感部位检测、肤色像素检测、血色像素检测等。
步骤1706:根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
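When the multimedia file is a video file, in one embodiment the number of images containing specific features is counted and compared with a preset threshold P: when the number is greater than P, the video is determined to be a specific-content video; otherwise it is determined to be a normal video. In another embodiment, the ratio of the number of images containing specific features to the total number of images extracted from the video for detection is determined; when the ratio is greater than a threshold Q, the video is determined to be a specific-content video and step 1707 is executed to process the multimedia file; otherwise step 1708 is executed, in which the video is determined to be a normal video and the multimedia file is played normally.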
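The frame sampling and the two decision rules (count threshold P, ratio threshold Q) can be sketched as below. The function names are hypothetical, and the per-frame detector is assumed to exist elsewhere; here its results are simply passed in as a list of booleans.

```python
def sample_timestamps(duration_s, interval_s=10.0):
    """Timestamps (seconds) at which frames are captured, at equal intervals."""
    count = int(duration_s // interval_s)
    return [i * interval_s for i in range(count + 1)]


def is_specific_video(frame_flags, count_threshold=None, ratio_threshold=None):
    """Decide from per-frame detection results whether a video has specific content.

    frame_flags     -- one boolean per extracted frame (True = specific feature found)
    count_threshold -- threshold P on the number of flagged frames
    ratio_threshold -- threshold Q on flagged frames / extracted frames
    """
    flagged = sum(1 for f in frame_flags if f)
    if count_threshold is not None:
        return flagged > count_threshold
    if ratio_threshold is not None:
        return bool(frame_flags) and flagged / len(frame_flags) > ratio_threshold
    raise ValueError("either count_threshold or ratio_threshold is required")


stamps = sample_timestamps(60)                      # one capture every 10 s
flags = [True, False, True, True, False, False]     # made-up detector output
by_count = is_specific_video(flags, count_threshold=2)
by_ratio = is_specific_video(flags, ratio_threshold=0.5)
print(len(stamps), by_count, by_ratio)
```

The ratio rule is less sensitive to video length than the count rule, since a long video yields more extracted frames and therefore more chances of exceeding a fixed P.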
对于判断为特定内容的视频,可以进行进一步处理,例如对视频进行分类、分级或退出播放。
上述方法可用于识别色情视频,其中,对视频图像的特征检测包括敏感部位检测和肤色像素检测。
敏感部位检测的一种可实现方法包括:
步骤一,查找预先存储的人体敏感部位索引中与待识别图像匹配的人体敏感部位图片所对应的特征数据。人体敏感部位索引可以将人体敏感部位图片的特征数据按一定方式有序地组织、存储起来,方便查找。人体敏感部位图片可以通过在色情图片中标注出人体敏感部位并生成图片而获得。特征数据可以是向量特征,该向量特征可以是现有图像识别方法中的任意特征,比如描述纹理、HOG(Histogram of Oriented Gradient,图像梯度方向直方图)或LBP(Local Binary Patterns,局部二值模式)等等。可以通过提取待识别图像的特征数据,并计算待识别图像的特征数据与人体敏感部位图片的特征数据的距离,从而根据距离判断待识别图片与人体敏感部位图片是否匹配。比如,可以使用欧氏距离来表示距离,如果待识别图像的特征数据与其中一个人体敏感部位图片的特征数据的欧氏距离最短,且该欧式距离小于欧式距离阈值,则待识别图像与该人体敏感部位图片是匹配的。可以理解的是,还可以通过其他的相似性度量来判断是否匹配,比如相关系数等,这里不一一列举。
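Step 2: calculate the confidence of the image to be identified from the matched feature data. Confidence is a function that measures how well a judgment matches the actual observation. The higher the confidence, the better the image to be identified matches the human sensitive-part picture. In one embodiment, the confidence is negatively correlated with the Euclidean distance between the feature data of the image to be identified and the matched feature data, and a negatively correlated function can express this relationship, for example c=e^(-x), where x is the Euclidean distance between the feature data of the image to be identified and the matched feature data, and c is the confidence.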
步骤三,根据待识别图像对应的置信度判断待识别图像是否为色情图像。当置信度高于第一置信度阈值时,说明待识别图像与匹配的人体敏感部位图片的匹配程度很高,待识别图像是色情图像。
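The three steps above (nearest-neighbor matching by Euclidean distance, confidence c = e^(-x), threshold on the confidence) can be sketched as follows. Feature extraction (texture, HOG, LBP) is outside the sketch; the feature vectors, function names, and threshold values are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_confidence(query, index, distance_threshold):
    """Return c = exp(-x) for the nearest indexed vector, or None if nothing matches.

    A match requires the smallest distance x to be below distance_threshold.
    """
    if not index:
        return None
    nearest = min(euclidean(query, features) for features in index)
    if nearest >= distance_threshold:
        return None
    return math.exp(-nearest)


def is_flagged(query, index, distance_threshold, confidence_threshold):
    """Step 3: flag the image when a match exists and its confidence is high enough."""
    c = match_confidence(query, index, distance_threshold)
    return c is not None and c > confidence_threshold


index = [[0.0, 0.0], [3.0, 4.0]]   # made-up feature vectors of indexed pictures
conf = match_confidence([0.3, 0.4], index, distance_threshold=1.0)
print(conf, is_flagged([0.3, 0.4], index, 1.0, 0.5), is_flagged([10.0, 10.0], index, 1.0, 0.5))
```

Because c = e^(-x) maps a distance of 0 to a confidence of 1 and decays toward 0, a single confidence threshold is equivalent to a second, tighter distance threshold; other similarity measures such as a correlation coefficient would need their own mapping.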
肤色像素检测的一种可实现方法包括:
步骤一,检测视频图像中人体区域像素和人头区域像素。
人体检测一般采用Adaboost(一种迭代算法)人体检测算法(当然,也可以采用其他算法),通过基于边缘直方图特征的Adaboost人体检测算法判断图像中是否有人体存在,首先计算视频图像的积分图,提取边缘直方图特征,根据已设定好的分类器特征库,运行级联的方法在图像中搜索人体区域。其中分类器特征库训练方法包括:计算样本图像的积分图,提取样本图像的类矩形特征;根据Adaboost算法筛选有效的特征,构成弱分类器;通过组合多个弱分类器,构成强分类器;级联多个强分类器,形成人体检测的分类器特征库。在人体检测单元检测出存在人体时,再对视频图像进行检测,并判断是否存在人头。
人头检测采用Adaboost人头检测算法,通过基于类矩形特征的Adaboost人头检测算法判断图像中是否有人头存在,首先计算图像的积分图,提取边缘直方图特征,根据已训练好的分类器特征库,运行cascade级联的方法在图像中搜索人头区域。其中分类器特征库训练方法包括:计算样本图像的积分图,提取样本图像的类矩形特征;根据Adaboost算法筛选有效的特征,构成弱分类器;通过组合多个弱分类器,构成强分类器;级联多个强分类器,形成人头检测的分类器特征库。
步骤二,统计每张视频图像中肤色像素与图像像素的比例、肤色像素和人体区域像素的比例以及人头区域像素与肤色像素的比例。
步骤三,根据预先设定的肤色像素与图像像素的第一比例阈值,肤色像素和人体区域像素的第二比例阈值、人头区域像素与肤色像素的第三比例阈值和预设的判断策略判断视频图像是否为色情图像。
首先判断所述肤色像素与图像像素的比例是否大于第一比例阈值、所述肤色像素和人体区域像素的比例是否大于第二比例阈值、所述人头区域像素与肤色像素的比例是否大于第三比例阈值,分别得到第一结果、第二结果和第三结果;然后判断第一结果、第二结果和第三结果是否满足判断策略,若满足,说明视频图像的肤色像素符合色情图像特点,确定该视频图像是色情图像。判断策略可以是满足肤色像素与图像像素的比例大于第一比例阈值、肤色像素和人体区域像素的比例大于第二比例阈值、人头区域像素与肤色像素的比例大于第三比例阈值中的至少两个条件。
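The ratio tests and the "at least two of three conditions" policy can be sketched as below. Pixel counts are assumed to come from the skin-color model and the body/head detectors described above; the function names and the threshold values are illustrative assumptions.

```python
def skin_ratio_flags(skin_px, image_px, body_px, head_px, t1, t2, t3):
    """Compare the three ratios with their thresholds and return three booleans.

    r1 = skin pixels / image pixels       (first ratio threshold t1)
    r2 = skin pixels / body-region pixels (second ratio threshold t2)
    r3 = head-region pixels / skin pixels (third ratio threshold t3)
    A zero denominator is treated as a ratio of 0 (an assumption of this sketch).
    """
    r1 = skin_px / image_px if image_px else 0.0
    r2 = skin_px / body_px if body_px else 0.0
    r3 = head_px / skin_px if skin_px else 0.0
    return (r1 > t1, r2 > t2, r3 > t3)


def meets_policy(flags, min_conditions=2):
    """Judgment policy from the text: at least two of the three conditions hold."""
    return sum(flags) >= min_conditions


flags_a = skin_ratio_flags(40000, 100000, 50000, 5000, t1=0.3, t2=0.6, t3=0.2)
flags_b = skin_ratio_flags(10000, 100000, 50000, 5000, t1=0.3, t2=0.6, t3=0.2)
print(flags_a, meets_policy(flags_a))
print(flags_b, meets_policy(flags_b))
```

Keeping the three comparisons separate from the policy makes it easy to try a stricter policy (all three conditions) without touching the ratio computation.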
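It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiments of the invention are not limited by the described order of actions, because according to the embodiments some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are examples only, and that the actions and modules involved are not necessarily required by the embodiments of the invention.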
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本发明各个实施例所述的方法。
本实施例提供一种多媒体文件处理方法。在如实施例的运行环境下,本申请提供了如图18所示的多媒体文件处理方法。如图18所示,图18是根据本发明实施例的多媒体文件处理方法的流程图,该方法的一种方案包括如下步骤:
S1801:在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
S1802:根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
S1803:判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
S1804:根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件;
S1805:根据所述判断结果对所述多媒体文件进行处理。
所述多媒体文件为点播视频或直播视频,所述特定内容为色情内容;所述根据所述判断结果对所述多媒体文件进行处理包括:若多媒体文件是点播的色情视频,则退出点播视频的播放;若多媒体文件是直播的色情视频,则关闭播放该视频的视频直播间。
本实施例实现了通过用户行为特征对多媒体文件进行初步筛选,再对筛选出来的多媒体文件进行特定内容检测,提高了对特定内容的识别效率和准确性。将本发明实施例用于色情视频检测可以极大的提高检测效率和可靠性,方便多媒体视频的管控。
本实施例提供一种基于行为特征的多媒体文件识别装置。如图19所示,该装置包括获取单元1920、计算单元1930、检测单元1940和确定单元1950。
获取单元1920,用于在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
计算单元1930,用于根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
检测单元1940,用于判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
确定单元1950,用于根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
该实施例的基于行为特征的多媒体文件识别装置中,获取单元1920用于执行本发明实施例中的步骤S1601,计算单元1930用于执行本发明实施例中的步骤S1602,检测单元1940用于执行本发明实施例中的步骤S1603,确定单元1950用于执行本发明实施例中的步骤S1604。
参见图20,作为一种实施方式,所述计算单元2030包括:
第一计算子单元20301,用于根据所述画像特征值和第一意愿特征值确定每个用户的第二意愿特征值;
第二计算子单元20302,用于根据所有用户的第二意愿特征值计算所述多媒体文件包含特定内容的概率。
作为一种实施方式,所述第一计算子单元20301包括:
第一计算模块203011,用于对所述画像特征值和第一意愿特征值进行求和,得到所述第二意愿特征值;
第二计算模块203012,用于根据预先为画像特征值和第一意愿特征值设定的权重,对所述画像特征值和第一意愿特征值进行加权平均,得到第二意愿特征值。
进一步地,所述第二计算子单元20302包括:
比对模块203021,用于分别将各个用户的第二意愿特征值与预设的阈值进行比对;
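Probability calculation module 203022, configured to calculate the ratio of the number of users whose second willingness feature value exceeds the threshold to the total number of users, to obtain the probability that the multimedia file contains specific content.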
进一步地,所述装置还包括预处理单元2010,预处理单元2010用于分析用户的行为数据,确定用户的画像特征值和第一意愿特征值。所述预处理单元2010包括第一预处理子单元20101和第二预处理子单元20102。
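First preprocessing subunit 20101, configured to: obtain behavior data of the user, the behavior data including first behavior data of browsing text related to the specific content, second behavior data of browsing pictures related to the specific content, third behavior data of visiting forums related to the specific content, and fourth behavior data of chatting in chat groups related to the specific content; judge whether each of the first, second, third, and fourth behavior data is empty, correspondingly obtaining a first, second, third, and fourth judgment result; and, according to a preset first weight of the first judgment result, second weight of the second judgment result, third weight of the third judgment result, and fourth weight of the fourth judgment result, weight and combine the four judgment results to obtain the user's behavior feature value;
Second preprocessing subunit 20102, configured to: obtain behavior data of the user within a recent period, the behavior data including a first time spent browsing text related to the specific content, a second time spent browsing pictures related to the specific content, a third time spent visiting forums related to the specific content, and a fourth time spent chatting in chat groups related to the specific content; and assign the first weight to the first time, the second weight to the second time, the third weight to the third time, and the fourth weight to the fourth time, and compute the weighted average of the four times to obtain the user's first willingness feature value.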
作为一种实施方式,所述多媒体文件为视频,所述检测单元2040包括:
提取子单元20401,用于对所述视频等时间间隔提取预设数目的图像;
检测子单元20402,用于对每张图像进行特征检测,判断所述图像是否包含特定特征,所述特征检测包括敏感部位检测和肤色像素检测。
In one implementation, the determining unit 2050 includes:
第一确定子单元20501,用于在判断出包含特定特征的图像数目大于预先设定的阈值P时,判断所述视频为特定内容视频,否则判断所述视频为正常视频;或
第二确定子单元20502,用于确定判断出的包含特定特征的图像数目与针对该视频检测提取得到的图像总数目的比值,在确定的比值大于阈值Q时,判断所述视频为特定内容视频,否则判断所述视频为正常视频。
作为本实施例的方式,所述特定内容为色情内容,所述视频为点播视频或直播视频。
本实施例提供一种多媒体文件处理装置。如图21所示,该装置包括获取单元2120、计算单元2130、检测单元2140、确定单元2150和处理单元2160。
获取单元2120,用于在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
计算单元2130,用于根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
检测单元2140,用于判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
确定单元2150,用于根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件;
处理单元2160,用于根据所述判断结果对所述多媒体文件进行处理。
该实施例的多媒体文件处理装置中,获取单元2120用于执行本发明实施例中的步骤S1801,计算单元2130用于执行本发明实施例中的步骤S1802,检测单元2140用于执行本发明实施例中的步骤S1803,确定单元2150用于执行本发明实施例中的步骤S1804,处理单元2160用于执行本发明实施例中的步骤S1805。
所述多媒体文件为点播视频或直播视频,所述特定内容为色情内容。所述处理单元2160具体用于:在确定多媒体文件是点播的色情视频时,退出点播视频的播放;在确定多媒体文件是直播的色情视频时,关闭播放该视频的视频直播间。
本发明的实施例还提供了一种存储介质。在本实施例中,上述存储介质可以用于保存上述实施例的一种基于行为特征的多媒体文件识别方法所执行的程序代码。
在本实施例中,上述存储介质可以位于计算机网络的多个网络设备中的至少一个网络设备。
在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
第一步,在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
第二步,根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
第三步,判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
第四步,根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
存储介质还被设置为存储用于执行以下步骤的程序代码:根据所述画像特征值和第一意愿特征值确定每个用户的第二意愿特征值;根据所有用户的第二意愿特征值计算所述多媒体文件包含特定内容的概率。
存储介质还被设置为存储用于执行以下步骤的程序代码:对所述画像特征值和第一意愿特征值进行求和,得到所述第二意愿特征值,或者,根据预先为画像特征值和第一意愿特征值设定的权重,对所述画像特征值和第一意愿特征值进行加权平均,得到第二意愿特征值。
存储介质还被设置为存储用于执行以下步骤的程序代码:分别将各个用户的第二意愿特征值与预设的阈值进行比对;计算所述第二意愿特征值超过阈值的用户数量与用户总数量的比值,得到所述多媒体文件包含特定内容的概率。
存储介质还被设置为存储用于执行以下步骤的程序代码:分析用户的行为数据,确定用户的画像特征值和第一意愿特征值。
存储介质还被设置为存储用于执行以下步骤的程序代码:所述多媒体文件为视频时,对视频等时间间隔提取预设数目的图像;对每张图像进行特征检测,判断所述图像是否包含特定特征,所述特征检测包括敏感部位检测和肤色像素检测。
存储介质还被设置为存储用于执行以下步骤的程序代码:当判断出包含特定特征的图像数目大于预先设定的阈值P时,判断所述视频为特定内容视频,否则判断所述视频为正常视频;或,确定判断出的包含特定特征的图像数目与针对该视频检测提取得到的图像总数目的比值,在确定的比值大于阈值Q时,判断所述视频为特定内容视频,否则判断所述视频为正常视频。
在本实施例中,上述存储介质可以包括但不限于:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
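An embodiment of the invention further provides a storage medium. In this embodiment, the storage medium may be used to store program code executed by the multimedia file processing method of the foregoing embodiment.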
在本实施例中,上述存储介质可以位于计算机网络的多个网络设备中的至少一个网络设备。
在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
第一步,在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
第二步,根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
第三步,判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
第四步,根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件;
第五步,根据所述判断结果对所述多媒体文件进行处理。
存储介质还被设置为存储用于执行以下步骤的程序代码:当多媒体文件是点播的色情视频时,退出点播视频的播放;当多媒体文件是直播的色情视频时,关闭播放该视频的视频直播间。
本发明的实施例还提供一种计算机终端,该计算机终端可以是计算机终端群中的任意一个计算机终端设备。在本实施例中,上述计算机终端也可以替换为移动终端等终端设备。
在本实施例中,上述计算机终端可以位于计算机网络的多个网络设备中的至少一个网络设备。
图22是根据本发明实施例的计算机终端的结构框图。如图22所示,该计算机终端A可以包括:一个或多个(图中仅示出一个)处理器2201、存储器2203、以及传输装置2205。
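The memory 2203 may be used to store software programs and modules, such as the program instructions/modules corresponding to the behavior-feature-based multimedia file identification method and apparatus in the embodiments of the invention. By running the software programs and modules stored in the memory 2203, the processor 2201 executes various functional applications and data processing, that is, implements the multimedia file identification method described above. The memory 2203 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 2203 may further include memory located remotely from the processor 2201, and such remote memory may be connected to computer terminal A through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.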
上述的传输装置2205用于经由一个网络接收或者发送数据。上述的网络具体实例可包括有线网络及无线网络。在一个实例中,传输装置2205包括一个网络适配器,其可通过网线与其他网络设备与路由器相连从而可与互联网或局域网进行通讯。在一个实例中,传输装置2205为射频模块,其用于通过无线方式与互联网进行通讯。
其中,具体地,存储器2203用于存储预设动作条件和预设权限用户的信息、以及应用程序。
处理器2201可以通过传输装置调用存储器2203存储的信息及应用程序,以执行下述步骤:
第一步,在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
第二步,根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
第三步,判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
第四步,根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
An embodiment of this application provides a network information identification method, which includes the following steps:
步骤一、获取待识别网络信息。
在本步骤中,待识别网络信息可以包括目标文本。
步骤二、对网络信息进行分词处理,得到网络信息的分词。
在本步骤中,可以对目标文本进行分词处理,得到该目标文本的分词。
步骤三、根据预存的可信网络信息和非可信网络信息确定该网络信息的各分词所属的信息的类型。
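In this step, the trusted network information may be information in a true information library, and the untrusted network information may be information in a false information library. Determining, according to the pre-stored trusted and untrusted network information, the type of information to which each segmented word of the network information belongs may include: taking every two adjacent segmented words as a phrase, in the order in which the segmented words appear in the network information, and determining the information type of each phrase according to the information in the false information library and the true information library.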
在本步骤中,可以计算每个词组中两个分词的关联值,提取虚假信息库中对应的所述两个分词的关联值,作为第一关联值;提取真实信息库中对应的所述两个分词的关联值,作为第二关联值,计算所述关联值与第一关联值的差值,得到第一差值;计算所述关联值与第二关联值的差值,得到第二差值,比较所述第一差值的绝对值和第二差值的绝对值的大小,若第一差值的绝对值大于第二差值的绝对值,则确定该词组的信息类型为真实信息,若第一差值的绝对值小于第二差值的绝对值,则确定该词组的信息类型为虚假信息,若第一差值的绝对值与第二差值的绝对值相等,则确定该词组的信息类型为无偏向信息。
步骤四、根据各分词所属的信息的类型进行统计,确定该网络信息所属的信息类型。
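Steps 3 and 4 can be sketched as follows. The association value uses the formula given in claim 15, X(W12)=C(W2)*C(W12)/C(W1), and the final vote follows claim 16. The function names, the English tokens, the library values, and the default of 0.0 for a phrase missing from a library are all assumptions made for illustration; word segmentation itself is assumed to be done beforehand.

```python
from collections import Counter


def association_values(tokens):
    """X(W12) = C(W2) * C(W12) / C(W1) for every adjacent pair of tokens."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): unigram[w2] * n / unigram[w1] for (w1, w2), n in bigram.items()}


def classify_pair(x, false_value, true_value):
    """Label a phrase by whichever library value its association value is closer to."""
    d1 = abs(x - false_value)   # first difference: distance to the false library
    d2 = abs(x - true_value)    # second difference: distance to the true library
    if d1 > d2:
        return "true"
    if d1 < d2:
        return "false"
    return "neutral"


def classify_text(tokens, false_lib, true_lib):
    """Count phrase labels and return the more frequent of 'true' and 'false'."""
    values = association_values(tokens)
    votes = Counter(
        # Missing library entries default to 0.0 (an assumption of this sketch).
        classify_pair(values[p], false_lib.get(p, 0.0), true_lib.get(p, 0.0))
        for p in zip(tokens, tokens[1:])
    )
    if votes["true"] > votes["false"]:
        return "true"
    if votes["false"] > votes["true"]:
        return "false"
    return "neutral"


tokens = ["free", "gift", "click", "link"]
false_lib = {("free", "gift"): 1.0, ("gift", "click"): 0.9, ("click", "link"): 5.0}
true_lib = {("free", "gift"): 3.0, ("gift", "click"): 2.0, ("click", "link"): 1.0}
label = classify_text(tokens, false_lib, true_lib)
print(label)
```

In this made-up example two of the three phrases lie closer to the false information library, so the text as a whole is labeled as false information.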
For specific examples of this embodiment, refer to the examples described in the foregoing embodiments; they are not repeated here.
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
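In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, an apparatus, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.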
以上所述,以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。

Claims (35)

  1. 一种网络信息识别方法,其特征在于,包括:
    获取待识别网络信息;
    计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
    根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
  2. 根据权利要求1所述的方法,其特征在于,所述计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度包括:
    采用余弦定理算法计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及采用余弦定理算法计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度。
  3. 根据权利要求1或2所述的方法,其特征在于,在获取待识别网络信息之前,所述方法还包括:
    采集可信网络信息及非可信网络信息;
    根据采集的可信网络信息建立可信数据库,以及根据采集的非可信网络信息建立非可信数据库。
  4. 根据权利要求3所述的方法,其特征在于,所述计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度包括:
    计算所述待识别网络信息与所述可信数据库中的各个可信网络信息的相似度,取计算所得的相似度的最大值记为第一相似度;
    计算所述待识别网络信息与所述非可信数据库中的各个非可信网络信息的相似度,取计算所得的相似度的最大值记为第二相似度。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信包括:
    比较所述第一相似度与所述第二相似度的大小;
    当所述第一相似度大于所述第二相似度时,确定所述待识别网络信息可信;
    当所述第二相似度大于所述第一相似度时,确定所述待识别网络信息不可信。
  6. 根据权利要求5所述的方法,其特征在于,在确定所述待识别网络信息不可信时,所述方法还包括:
    将所述待识别网络信息标记为可疑,或者屏蔽所述待识别网络信息。
  7. 一种网络信息识别装置,其特征在于,包括:
    获取单元,用于获取待识别网络信息;
    计算单元,用于计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度;
    确定单元,用于根据所述第一相似度及所述第二相似度确定所述待识别网络信息是否可信。
  8. 根据权利要求7所述的装置,其特征在于,所述计算单元具体用于,
    采用余弦定理算法计算所述待识别网络信息与可信网络信息的相似度,记为第一相似度,以及采用余弦定理算法计算所述待识别网络信息与非可信网络信息的相似度,记为第二相似度。
  9. 根据权利要求7所述的装置,其特征在于,所述计算单元包括:
    第一计算子单元,用于计算所述待识别网络信息与所述可信数据库中的各个可信网络信息的相似度,取计算所得的相似度的最大值记为第一相似度;
    第二计算子单元,用于计算所述待识别网络信息与所述非可信数据库中的各个非可信网络信息的相似度,取计算所得的相似度的最大值记为第二相似度;所述确定单元包括:
    比较子单元,用于比较所述第一相似度与所述第二相似度的大小;
    第一确定子单元,用于当所述第一相似度大于所述第二相似度时,确定所述待识别网络信息可信;
    第二确定子单元,用于当所述第二相似度大于所述第一相似度时,确定所述待识别网络信息不可信。
  10. 根据权利要求9所述的装置,其特征在于,所述装置还包括:
    处理单元,用于在所述第二确定子单元确定所述待识别网络信息不可信时,将所述待识别网络信息标记为可疑,或者屏蔽所述待识别网络信息。
  11. 一种网络信息识别方法,其特征在于,包括:
    对目标文本进行分词处理,得到目标文本的分词;
    按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息;
    对目标文本中所有词组的信息类型进行统计,得到统计结果;
    根据统计结果确定所述目标文本的信息类型。
  12. 根据权利要求11所述的方法,其特征在于,所述对目标文本进行分词处理,得到目标文本的分词,包括:
    获取目标文本;
    对所述目标文本进行预处理,去除目标文本中的停用词;
    采用字典分词法对所述目标文本进行分词处理,得到目标文本的分词。
  13. 根据权利要求11所述的方法,其特征在于,所述根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,包括:
    计算每个词组中两个分词的关联值;
    提取虚假信息库中对应的所述两个分词的关联值,作为第一关联值;提取真实信息库中对应的所述两个分词的关联值,作为第二关联值;
    根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型。
  14. 根据权利要求13所述的方法,其特征在于,所述根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型,包括:
    计算所述关联值与第一关联值的差值,得到第一差值;计算所述关联值与第二关联值的差值,得到第二差值;
    比较所述第一差值的绝对值和第二差值的绝对值的大小,若第一差值的绝对值大于第二差值的绝对值,则确定该词组的信息类型为真实信息,若第一差值的绝对值小于第二差值的绝对值,则确定该词组的信息类型为虚假信息,若第一差值的绝对值与第二差值的绝对值相等,则确定该词组的信息类型为无偏向信息。
  15. 根据权利要求13所述的方法,其特征在于,所述计算每个词组中两个分词的关联值,包括:
    根据公式X(W12)=C(W2)*C(W12)/C(W1)计算得到词组中两个分词的关联值;
    其中,X(W12)表示所述词组中两个分词的关联值,C(W1)表示所述词组中的第一个分词在目标文本中出现的频次,C(W2)表示所述词组中的第二个分词在目标文本中出现的频次,C(W12)表示第一个分词和第二个分词在目标文本中有顺序的同时连续出现的频次,所述第一个分词在目标文本中的出现顺序早于第二个分词。
  16. 根据权利要求11所述的方法,其特征在于,所述对目标文本中所有词组的信息类型进行统计,得到统计结果,包括:
    获取目标文本中所有词组的信息类型;所述根据统计结果确定所述目标文本的信息类型,包括:
    比较虚假信息和真实信息的出现频次,将出现频次较高的信息类型确定为所述目标文本的信息类型,如果虚假信息的出现频次和真实信息的出现频次相同,则确定所述目标文本的信息类型为无偏向信息。
  17. 根据权利要求11所述的方法,其特征在于,所述对目标文本进行分词处理,得到目标文本的分词之前,还包括:
    对虚假信息库中的虚假信息样本进行分词处理,得到虚假信息样本的分词,按照各分词在该虚假信息样本中的出现顺序,计算得到相邻两个分词的关联值;
    对真实信息库中的真实信息样本进行分词处理,得到真实信息样本的分词,按照各分词在该真实信息样本中的出现顺序,计算得到相邻两个分词的关联值。
  18. 根据权利要求11所述的方法,其特征在于,进一步包括:
    若所述目标文本的信息类型为虚假信息,则删除网络中的所述目标文本。
  19. 一种网络信息识别装置,其特征在于,包括:
    分词单元,用于对目标文本进行分词处理,得到目标文本的分词;
    第一确定单元,用于按照各分词在目标文本中的出现顺序,将相邻两个分词作为一个词组,根据虚假信息库和真实信息库中的信息,确定每个词组的信息类型,所述信息类型包括虚假信息、真实信息和无偏向信息;
    统计单元,用于对目标文本中所有词组的信息类型进行统计,得到统计结果;
    第二确定单元,用于根据统计结果确定所述目标文本的信息类型。
  20. 根据权利要求19所述的装置,其特征在于,所述分词单元包括:
    第一获取子单元,用于获取目标文本;
    处理子单元,用于对所述目标文本进行预处理,去除目标文本中的停用词;
    分词子单元,用于采用字典分词法对经过处理子单元处理后的目标文本进行分词处理,得到目标文本的分词。
  21. 根据权利要求19所述的装置,其特征在于,所述第一确定单元包括:
    计算子单元,用于计算每个词组中两个分词的关联值;
    提取子单元,用于提取虚假信息库中对应的所述两个分词的关联值,作为第一关联值,提取真实信息库中对应的所述两个分词的关联值,作为第二关联值;
    确定子单元,用于根据所述关联值分别与第一关联值和第二关联值的接近程度,确定所述词组的信息类型。
  22. 一种网络信息识别方法,其特征在于,包括:
    在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
    根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
    判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
    根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
  23. 根据权利要求22所述的方法,其特征在于,所述根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率,包括:
    根据所述画像特征值和第一意愿特征值确定每个用户的第二意愿特征值;
    根据所有用户的第二意愿特征值计算所述多媒体文件包含特定内容的概率。
  24. 根据权利要求23所述的方法,其特征在于,所述根据所述画像特征值和第一意愿特征值确定每个用户的第二意愿特征值,包括:
    对所述画像特征值和第一意愿特征值进行求和,得到所述第二意愿特征值,或者,
    根据预先为画像特征值和第一意愿特征值设定的权重,对所述画像特征值和第一意愿特征值进行加权平均,得到第二意愿特征值。
  25. 根据权利要求23所述的方法,其特征在于,所述根据所有用户的第二意愿特征值计算所述多媒体文件包含特定内容的概率,包括:
    分别将各个用户的第二意愿特征值与预设的阈值进行比对;
    计算所述第二意愿特征值超过阈值的用户数量与用户总数量的比值,得到所述多媒体文件包含特定内容的概率。
  26. 根据权利要求22所述的方法,其特征在于,还包括:
    获取用户的行为数据,所述行为数据包括浏览特定内容相关文本的第一行为数据、浏览特定内容相关图片的第二行为数据、访问特定内容相关论坛的第三行为数据和在特定内容相关的聊天群里聊天的第四行为数据;
    分别判断所述第一行为数据、第二行为数据、第三行为数据和第四行为数据是否为空,对应得到第一判断结果、第二判断结果、第三判断结果和第四判断结果;
    根据预先设定的所述第一判断结果的第一权重、所述第二判断结果的第二权重、所述第三判断结果的第三权重和所述第四判断结果的第四权重,对所述第一判断结果、第二判断结果、第三判断结果和第四判断结果进行分配整合,得到所述用户的行为特征值。
  27. 根据权利要求26所述的方法,其特征在于,所述分析用户的行为数据,确定用户的第一意愿特征值,包括:
    获取用户在最近一段时间内的行为数据,所述行为数据包括浏览特定内容相关文本的第一时间、浏览特定内容相关图片的第二时间、访问特定内容相关论坛的第三时间和在特定内容相关的聊天群里聊天的第四时间;
    为所述第一时间赋予所述第一权重、所述第二时间赋予所述第二权重、所述第三时间赋予所述第三权重、所述第四时间赋予所述第四权重,对所述第一时间、第二时间、第三时间和第四时间进行加权平均,得到用户的意愿特征值。
  28. 根据权利要求22所述的方法,其特征在于,所述多媒体文件为视频;
    所述对所述多媒体文件进行特征检测,包括:
    对视频等时间间隔提取预设数目的图像;
    对每张图像进行特征检测,判断所述图像是否包含特定特征,所述特征检测包括敏感部位检测和肤色像素检测。
  29. 根据权利要求28所述的方法,其特征在于,所述根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件,包括:
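    when the number of images determined to contain specific features is greater than a preset threshold P, determining that the video is a specific-content video, and otherwise determining that the video is a normal video; or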
    确定判断出的包含特定特征的图像数目与针对该视频检测提取得到的图像总数目的比值,在确定的比值大于阈值Q时,判断所述视频为特定内容视频,否则判断所述视频为正常视频。
  30. 根据权利要求22所述的方法,其特征在于,所述特定内容为色情内容,所述视频为点播视频或直播视频。
  31. 根据权利要求22所述的方法,其特征在于,所述多媒体文件为点播视频或直播视频,所述特定内容为色情内容;
    该方法进一步包括:若多媒体文件是点播的色情视频,则退出点播视频的播放;若多媒体文件是直播的色情视频,则关闭播放该视频的视频直播间。
  32. 一种网络信息识别装置,其特征在于,包括:
    获取单元,用于在多媒体文件播放过程中,获取观众用户的画像特征值和第一意愿特征值,所述画像特征值用于标识用户对特定内容的喜好,所述第一意愿特征值用于标识用户在预设时间段内希望观看特定内容的意愿;
    计算单元,用于根据所述画像特征值和第一意愿特征值计算所述多媒体文件包含特定内容的概率;
    检测单元,用于判断所述概率是否超过预设值,若是,则对所述多媒体文件进行特征检测;
    确定单元,用于根据特征检测结果判断所述多媒体文件是否为特定内容的多媒体文件。
  33. 根据权利要求32所述的装置,其特征在于,所述多媒体文件为视频;
    所述检测单元包括:
    提取子单元,用于对视频等时间间隔提取预设数目的图像;
    检测子单元,用于对每张图像进行特征检测,判断所述图像是否包含特定特征,所述特征检测包括敏感部位检测和肤色像素检测。
  34. 根据权利要求33所述的装置,其特征在于,所述确定单元包括:
    第一确定子单元,用于在判断出包含特定特征的图像数目大于预先设定的阈值P时,判断所述视频为特定内容视频,否则判断所述视频为正常视频;
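    a second determining subunit, configured to determine the ratio of the number of images determined to contain specific features to the total number of images extracted from the video for detection, and, when the ratio is greater than a threshold Q, to determine that the video is a specific-content video, and otherwise to determine that the video is a normal video.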
  35. 一种非易失性存储介质,用于存储机器可读指令,当所述机器可读指令被执行时,执行所述权利要求1至6,11至18、22至31任一项所述的方法。
PCT/CN2017/104275 2016-10-13 2017-09-29 网络信息识别方法和装置 WO2018068664A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/026,786 US10805255B2 (en) 2016-10-13 2018-07-03 Network information identification method and apparatus

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201610895856.9 2016-10-13
CN201610895856.9A CN107741938A (zh) 2016-10-13 2016-10-13 一种网络信息识别方法及装置
CN201610956467.2A CN107992501B (zh) 2016-10-27 2016-10-27 社交网络信息识别方法、处理方法及装置
CN201610956467.2 2016-10-27
CN201610929276.7A CN108024148B (zh) 2016-10-31 2016-10-31 基于行为特征的多媒体文件识别方法、处理方法及装置
CN201610929276.7 2016-10-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/026,786 Continuation US10805255B2 (en) 2016-10-13 2018-07-03 Network information identification method and apparatus

Publications (1)

Publication Number Publication Date
WO2018068664A1 true WO2018068664A1 (zh) 2018-04-19

Family

ID=61905158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/104275 WO2018068664A1 (zh) 2016-10-13 2017-09-29 网络信息识别方法和装置

Country Status (2)

Country Link
US (1) US10805255B2 (zh)
WO (1) WO2018068664A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140258322A1 (en) * 2013-03-06 2014-09-11 Electronics And Telecommunications Research Institute Semantic-based search system and search method thereof
CN105335422A (zh) * 2014-08-06 2016-02-17 阿里巴巴集团控股有限公司 舆情信息的告警方法及装置
CN105354307A (zh) * 2015-11-06 2016-02-24 腾讯科技(深圳)有限公司 一种图像内容识别方法及装置

Also Published As

Publication number Publication date
US20190014071A1 (en) 2019-01-10
US10805255B2 (en) 2020-10-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17860244

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17860244

Country of ref document: EP

Kind code of ref document: A1