CN108090082B - Information processing method and information processing apparatus - Google Patents

Publication number
CN108090082B
CN108090082B CN201611036969.XA CN201611036969A CN108090082B CN 108090082 B CN108090082 B CN 108090082B CN 201611036969 A CN201611036969 A CN 201611036969A CN 108090082 B CN108090082 B CN 108090082B
Authority
CN
China
Prior art keywords
property
information
type
attribute
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611036969.XA
Other languages
Chinese (zh)
Other versions
CN108090082A (en)
Inventor
董鹏
赵亚楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd
Priority to CN201611036969.XA
Publication of CN108090082A
Application granted
Publication of CN108090082B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 - Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information processing method and an information processing device. The method comprises the following steps: acquiring an information set composed of a plurality of pieces of information; extracting, from each piece of information, attribute values of a plurality of attributes of the object described by that piece of information; encoding the attribute values of the object for the attributes to obtain digital feature values of the object for the attributes; combining the digital feature values of the object for the attributes to form the digital feature of the object; determining the similarity between the digital features of the objects described by the information, and identifying objects whose digital feature similarity meets a similarity condition as the same object; and identifying repeated information corresponding to the same object in the information set. By implementing the method and the device, repeated information corresponding to the same object can be efficiently and accurately identified from the information set.

Description

Information processing method and information processing apparatus
Technical Field
The present invention relates to computer technologies, and in particular, to an information processing method and an information processing apparatus.
Background
At present, the Internet is widely used, and various information aggregation platforms display many kinds of information on the pages of their websites for users to browse, which involves acquiring, organizing, and storing that information.
For example, an idle item (e.g., idle books) information platform interfaces with various online and offline information sources that publish idle items, obtains the publishing information of idle items from these sources, and displays it on website pages by region, item category, and the like, so that visiting users can browse and select the items they want.
For another example, a property information platform obtains information about properties for sale from online and offline property information sources (such as different property brokerages), classifies the information by dimensions such as region and price range, and displays it on website pages so that users can quickly locate properties of interest.
Information aggregation platforms of this kind have problems, as the following examples show:
1) For example, when a user publishes the same idle item on two idle item websites with differing descriptions, the information aggregation platform publishes the information for that idle item repeatedly.
2) For another example, because a user provides different information about the same property to different brokerages, the information aggregation platform publishes the property information received from multiple brokerages for that one property as if it described different properties, which interferes with the audience.
As the above examples show, multiple pieces of information obtained from different information sources may describe the same object repeatedly, and those pieces of information differ from one another, so it is impossible to accurately determine which pieces are duplicates. On the one hand, this increases the cost of organizing and storing information on the information aggregation platform; on the other hand, publishing information repeatedly for the same object causes interference and reduces the accuracy of the information.
Disclosure of Invention
Embodiments of the present invention provide an information processing method and an information processing apparatus, which can efficiently and accurately identify information corresponding to a same object from an information set.
The technical scheme of the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides an information processing method, including:
acquiring an information set formed by information corresponding to a plurality of objects;
extracting attribute values of a plurality of attributes of the corresponding objects from information corresponding to the objects;
encoding the attribute values of the attributes of the object to form digital characteristic values of the object in each dimension;
combining the digital characteristic values of the object corresponding to all dimensions to form the digital characteristic of the object;
comparing the digital characteristics of the objects, and identifying the objects with the digital characteristic similarity meeting the preset condition as the same object;
and identifying repeated information corresponding to the same object in the information set.
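The steps of the first aspect can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the record layout, the code tables, the placeholder code "0000", the function names `encode`, `similarity`, and `find_duplicates`, the similarity measure (fraction of attributes whose digital feature values match), and the 0.8 threshold are all hypothetical choices made for the example.

```python
from itertools import combinations

# Hypothetical coding rules: one lookup table per attribute (dimension).
CODING_RULES = {
    "city": {"Beijing": "0001", "Shanghai": "0002"},
    "house_type": {"one bedroom": "0001", "two bedrooms": "0002"},
    "orientation": {"east": "0001", "south": "0002"},
}
ATTRIBUTES = list(CODING_RULES)


def encode(record):
    """Encode each attribute value and combine the codes into one digital feature.

    "0000" is an assumed placeholder for values missing from a coding rule.
    """
    return tuple(
        CODING_RULES[name].get(record.get(name), "0000") for name in ATTRIBUTES
    )


def similarity(a, b):
    """Assumed measure: fraction of dimensions whose digital feature values match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_duplicates(information_set, threshold=0.8):
    """Return index pairs of records whose digital features meet the condition."""
    features = [encode(record) for record in information_set]
    return [
        (i, j)
        for i, j in combinations(range(len(features)), 2)
        if similarity(features[i], features[j]) >= threshold
    ]


information_set = [
    {"city": "Beijing", "house_type": "one bedroom", "orientation": "east"},
    {"city": "Beijing", "house_type": "one bedroom", "orientation": "east"},
    {"city": "Shanghai", "house_type": "two bedrooms", "orientation": "south"},
]
duplicates = find_duplicates(information_set)
```

Comparing every pair is quadratic in the size of the information set; it is used here only to keep the sketch short.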
In a second aspect, an embodiment of the present invention provides an information processing apparatus, including:
an acquisition unit configured to acquire an information set composed of a plurality of pieces of information;
the extracting unit is used for extracting attribute values of a plurality of attributes corresponding to the object described by the corresponding information from each piece of information;
the encoding unit is used for encoding the attribute values of the objects corresponding to the attributes to obtain the digital characteristic values of the objects corresponding to the attributes;
the combination unit is used for combining the digital characteristic values of the objects corresponding to the attributes to form the digital characteristics of the objects;
the comparison unit is used for determining the similarity between the digital characteristics of the objects described by the information and identifying the objects of which the similarity meets the similarity condition as the same object;
and the identification unit is used for identifying the repeated information corresponding to the same object in the information set.
In a third aspect, an embodiment of the present invention provides an information processing apparatus, including a processor and a memory; the memory has stored therein executable instructions for causing the processor to perform the following operations:
acquiring an information set formed by information corresponding to a plurality of objects;
extracting attribute values of a plurality of attributes of the corresponding objects from information corresponding to the objects;
encoding the attribute values of the attributes of the object to form digital characteristic values of the object in each dimension;
combining the digital characteristic values of the object corresponding to all dimensions to form the digital characteristic of the object;
comparing the digital characteristics of the objects, and identifying the objects with the digital characteristic similarity meeting the preset condition as the same object;
and identifying repeated information corresponding to the same object in the information set.
In a fourth aspect, an embodiment of the present invention provides a storage medium, which stores executable instructions for executing the information processing method provided by the embodiment of the present invention.
The embodiment of the invention has the following beneficial effects:
on one hand, because different attributes of the object are quantized into digital features, whether the objects described by each information in the information set are the same object can be efficiently and accurately judged based on the similarity of the digital features;
on the other hand, based on the identified same object, the repeated information for that object in the information set can be deduplicated, which saves the resources consumed by maintaining repeated information in the information set, prevents multiple pieces of information describing the same object from being perceived by the audience as multiple objects, avoids interference to the audience, and ensures the precision of the information set.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an alternative information processing method according to an embodiment of the present invention;
fig. 2-1 is an optional schematic diagram of sorting and dividing second-class attribute values of each piece of information in an information set into value spaces according to an embodiment of the present invention;
fig. 2-2 is an optional schematic diagram of sorting and dividing second-class attribute values of each piece of information in an information set into value spaces according to an embodiment of the present invention;
fig. 3-1 is a schematic diagram of an alternative application scenario when an information processing apparatus provided by an embodiment of the present invention is deployed in a network-side server;
fig. 3-2 is a schematic diagram of an optional application scenario when the information processing apparatus provided by the embodiment of the present invention is deployed in a user-side terminal;
fig. 4 is a schematic diagram of an alternative software and hardware structure of the information processing apparatus 10 according to the embodiment of the present invention;
fig. 5 is a schematic diagram of an alternative structure of an information processing apparatus according to an embodiment of the present invention;
FIG. 6-1 is a schematic diagram of a process for identifying multiple property information of the same property in the property information of an information set according to an embodiment of the present invention;
FIG. 6-2 is a schematic diagram showing the classification of the attributes constituting the DNA of a property provided in the embodiment of the present invention;
FIG. 6-3 is a schematic diagram illustrating an alternative encoding rule for encoding attribute values of a class A attribute of a property according to an embodiment of the present invention;
FIG. 6-4 is a schematic diagram of an alternative numerical characteristic of the attribute values of the class A attributes of property information in an information set according to an embodiment of the present invention;
FIG. 6-5 is a schematic diagram illustrating an alternative process for generating numerical characteristic values of attribute values of class B attributes of property information of property sets in an information set according to an embodiment of the present invention;
FIG. 6-6 is a schematic diagram of an alternative sorting result for sorting areas of properties provided by an embodiment of the present invention;
FIG. 6-7 is an alternative schematic diagram of the numerical characteristic values assigned to each group after the areas of the property are grouped according to the embodiment of the present invention;
FIG. 6-8 is a diagram of alternative numerical characteristic values of class B property values of a property provided by an embodiment of the present invention;
FIG. 6-9 is a schematic diagram of an alternative flow chart of numerical characteristic values of attribute values of class C attributes of property information of each property in an information set;
FIG. 6-10 is a schematic diagram of an alternative DNA of a property provided by an embodiment of the present invention;
FIG. 6-11 is a schematic diagram of an alternative process for calculating DNA of a property based on property information and identifying the same property based on DNA similarity according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the examples provided herein are merely illustrative of the present invention and are not intended to limit the present invention. In addition, the following embodiments are some, not all, of the embodiments for implementing the invention; embodiments that those skilled in the art obtain without creative effort by recombining the technical solutions of the following embodiments, as well as other embodiments based on implementing the invention, all fall within the protection scope of the invention.
It should be noted that, in the embodiments of the present invention, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other related elements (e.g., steps in a method or units in an apparatus) in the method or apparatus that comprises the element.
For example, the information processing method provided by the embodiment of the present invention includes a series of steps, but the information processing method provided by the embodiment of the present invention is not limited to the described steps, and similarly, the information processing apparatus provided by the embodiment of the present invention includes a series of units, but the information processing apparatus provided by the embodiment of the present invention is not limited to include the explicitly described units, and may further include units that are required to acquire related information or perform processing based on the information.
Before further detailed description of the present invention, terms and expressions referred to in the embodiments of the present invention are described, and the terms and expressions referred to in the embodiments of the present invention are applicable to the following explanations.
1) An information set is a collection of pieces of information, each of which describes an object. A piece of information describes the object through one or more types of attributes, which qualitatively or quantitatively describe the characteristics of the object in a certain dimension.
2) An attribute includes an attribute name and a corresponding attribute value. The attribute value may be qualitative or quantitative; for example, for the floor attribute of a property (that is, the attribute name is "floor"), the value may be qualitative, such as "high floor", or quantitative, such as "18th floor".
3) Encoding means that attribute values of various attributes in information are uniformly expressed by digitized feature values (digital feature values), and digital features of the information are formed by combining the digital feature values of various attributes in the information, so that various attributes carried in the information can be processed and analyzed by using a computer.
Note that the digital feature value and the digital feature differ from a hash sequence calculated for a file by a hash algorithm: a hash sequence can only represent the uniqueness of the file and cannot carry the attributes in the information.
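The contrast with a hash sequence can be illustrated with a short sketch. The code tables, the field names, and the helper names `digital_feature` and `file_hash` are assumptions for this example only; SHA-256 stands in for "a hash algorithm", which the text does not name.

```python
import hashlib

# Hypothetical coding rules; "18" and "high" are assumed to share one value space.
CITY_CODES = {"Beijing": "0001"}
FLOOR_CODES = {"18": "0002", "high": "0002"}
ORIENTATION_CODES = {"east": "0001"}


def digital_feature(info):
    """One code per attribute, so each position still carries an attribute."""
    return (
        CITY_CODES[info["city"]],
        FLOOR_CODES[info["floor"]],
        ORIENTATION_CODES[info["orientation"]],
    )


def file_hash(info):
    """Hash of the serialized information; any textual difference changes it."""
    text = "; ".join(f"{key}: {value}" for key, value in sorted(info.items()))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


info_a = {"city": "Beijing", "floor": "18", "orientation": "east"}
info_b = {"city": "Beijing", "floor": "high", "orientation": "east"}

same_hash = file_hash(info_a) == file_hash(info_b)
same_feature = digital_feature(info_a) == digital_feature(info_b)
```

The two descriptions hash differently even though they plausibly describe the same property, while their digital features agree attribute by attribute.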
In the related art, there is a case where information acquired from a plurality of information sources repeatedly describes the same object, that is, there are a plurality of pieces of information in an information set that repeatedly describes the same object, and the plurality of pieces of information describing the same object are not completely described in the same manner, resulting in difficulty in distinguishing whether or not they are repeated information for the same object.
For example, information describing the same property may include the following attributes: city (#); cell, i.e. residential community (#); floor (20th floor/high floor); decoration level (ordinary/luxury); price (200,000/300,000). For privacy protection, the property information acquired from different agents does not include building numbers or house numbers, so it is impossible to tell whether the information set contains repeated information corresponding to the same property.
For property information about the same cell acquired from different intermediaries, the number of pieces of information may even exceed the total number of properties in the cell. For an information aggregation platform that aggregates property information, this redundant information consumes excessive background resources for data storage and organization and also interferes with the audience of the property information.
In view of the above problem, embodiments of the present invention provide an information processing method and an information processing apparatus implementing the method. Referring to fig. 1, an optional flowchart of the information processing method provided by the embodiment of the present invention, the method includes the following steps:
Step 101, acquiring an information set comprising a plurality of pieces of information.
In one embodiment, information is acquired from a plurality of information sources in a periodic or aperiodic manner, and information sets are organized on the acquired information.
Taking an information set of property information as an example, property information is acquired from property information websites in the network, and may also be acquired by interfacing with the databases of intermediaries to obtain the property information they make public; the property information is then organized into an information set by source, release time, and the like.
In practical applications, the information set is updated frequently; for example, an information set of property information may be updated many times within one hour, and each update involves a large amount of information. To avoid the excessive load caused by frequently acquiring information to form the information set, a service dedicated to acquiring information from different information sources may be deployed on the network side. The service collects information from each information source to form the information set, and the information processing apparatus obtains the information set directly from the service. This saves the computing and communication resources of the information processing apparatus and makes the apparatus easier to deploy, especially on a user-side terminal.
Note that, since each piece of information in the information set is used to describe one object, and it is not known whether or not a plurality of pieces of information repeatedly describe the same object when the information set is acquired, there are the following 2 cases:
Case 1) the objects described by the pieces of information in the information set are all different objects.
Case 2) some of the information in the information set describes the same object, that is, some of the plurality of objects are the same object.
Step 102, extracting, from each piece of information in the information set, attribute values of a plurality of attributes of the object described by that piece of information.
In one embodiment, considering that the information set is composed of a series of information, each piece of information describes one object based on included attributes (including an attribute name and a corresponding attribute value), the attribute names of a plurality of preset attributes are used as keywords, and the attribute values of a plurality of attributes corresponding to the corresponding object are inquired in the information of each object of the information set.
Taking property information as an example, each piece of information in the information set is searched using the attribute names of the following attributes as keywords: city, county, business district, cell name (or alias), house type, and orientation of the property. The query yields the attribute values of the property for these attributes, for example: Beijing, Chaoyang, Anzhen, Ziyu Huafu, Lianglun, and Dong (east).
Step 103, encoding the attribute values of the attributes of the object described by each piece of information in the information set to obtain the digital feature values of the object for those attributes.
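The keyword query of step 102 can be sketched as below. The "name: value;" text layout of a piece of information and the function name `extract_attributes` are assumptions for the illustration; the source does not specify how the information is formatted.

```python
import re

# Preset attribute names used as keywords (taken from the example above).
ATTRIBUTE_NAMES = [
    "city",
    "county",
    "business district",
    "cell name",
    "house type",
    "orientation",
]


def extract_attributes(text, names=ATTRIBUTE_NAMES):
    """Search for each attribute name and capture the value that follows it.

    Assumes each attribute appears as "name: value" and values end at ";".
    """
    values = {}
    for name in names:
        match = re.search(rf"\b{re.escape(name)}\s*:\s*([^;]+)", text)
        if match:
            values[name] = match.group(1).strip()
    return values


info = (
    "city: Beijing; county: Chaoyang; business district: Anzhen; "
    "cell name: Ziyu Huafu; orientation: east"
)
attributes = extract_attributes(info)
```

Attributes that a piece of information does not mention (here, the house type) are simply absent from the result, which a later encoding step has to handle.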
In one embodiment, the attribute of each dimension corresponds to one coding rule, the coding rule includes the corresponding relationship between different attribute values of the corresponding attribute and the corresponding digital feature value, and the attribute value of each attribute of the object is used as an index to query the corresponding relationship in the coding rule of the corresponding attribute, so as to obtain the digital feature value corresponding to the attribute value of the corresponding attribute.
The following property information is taken as an example to describe the encoding of the attribute values of different attributes: city: Beijing; county: Chaoyang; cell name: Ziyu Huafu; house type: one bedroom and one living room; orientation: east.
Specifically, when the attribute name is the city where the property is located, the correspondence between the different cities (attribute values) and the corresponding digital feature values is as shown in fig. 6-3; when the city of the property in a piece of property information in the information set is "Beijing", the digital feature value "0001" is found based on the correspondence shown in fig. 6-3;
when the county of the property is "Chaoyang", the digital feature value "0001" is found based on the correspondence shown in fig. 6-3;
when the cell of the property is "Ziyu Huafu", the digital feature value "0002" is found based on the correspondence shown in fig. 6-3;
when the house type of the property is "one bedroom and one living room", the digital feature value "0001" is found based on the correspondence shown in fig. 6-3;
when the orientation of the property is "east", the digital feature value "0001" is found based on the correspondence shown in fig. 6-3.
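The per-attribute lookup described above amounts to indexing a table with the attribute value. In the sketch below the codes are the ones quoted in the text; the attribute values are written in transliterated form, and the dictionary layout and the names `encode_attribute` and `encode_object` are assumptions.

```python
# One coding rule per attribute: attribute value -> digital feature value.
CODING_RULES = {
    "city": {"Beijing": "0001"},
    "county": {"Chaoyang": "0001"},
    "cell name": {"Ziyu Huafu": "0002"},
    "house type": {"one bedroom and one living room": "0001"},
    "orientation": {"east": "0001"},
}


def encode_attribute(name, value):
    """Use the attribute value as an index into the coding rule of its attribute."""
    return CODING_RULES[name][value]


def encode_object(attributes):
    """Encode every attribute of one object, keeping the attribute names."""
    return {name: encode_attribute(name, value) for name, value in attributes.items()}


property_info = {
    "city": "Beijing",
    "county": "Chaoyang",
    "cell name": "Ziyu Huafu",
    "house type": "one bedroom and one living room",
    "orientation": "east",
}
codes = encode_object(property_info)
```

A value missing from a coding rule raises `KeyError` here; how unseen values receive codes is not covered by this passage.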
In another embodiment, different attributes may be divided into multiple types, and the attribute values are accordingly encoded into digital feature values in a manner specific to the type of the attribute to which each attribute value belongs. The encoding of the attribute values of different types of attributes into digital feature values is described below.
Illustratively, the different attributes are divided into the following optional types:
1) First-class attributes: for the same object, the attribute value of a first-class attribute is unique across the information in the information set; that is, across multiple pieces of information about the same object from different information sources, the value of a first-class attribute is stable and does not change.
Taking property information as an example, the first-class attributes may include: city, county, business district, cell name (or alias), house type, orientation, and the like. Whether the property information is provided by the owner or by an intermediary, the attribute values of the first-class attributes do not differ across multiple pieces of property information about the same property.
Because the attribute values of the first-class attributes are unique across multiple pieces of information about the same object in the information set, the coding rule of any first-class attribute can hold a one-to-one correspondence between each attribute value of that attribute and a digital feature value, so that the digital feature values formed from first-class attribute values (attribute values of first-class attributes) can accurately identify the same object at the level of the first-class attributes.
Illustratively, the attribute values of the first-class attributes are encoded into the corresponding digital feature values as follows: using the attribute value of each first-class attribute of the object as an index, the correspondence between attribute values and digital feature values in the coding rule of the corresponding first-class attribute is queried to obtain the digital feature value corresponding to that attribute value.
Still taking property information as an example, consider the following property information: city: Beijing; county: Chaoyang; cell name: Ziyu Huafu; house type: one bedroom and one living room; orientation: east. Since all of these attributes are first-class attributes, the digital feature values corresponding to the attribute values can be queried based on the coding rule of the first-class attributes shown in fig. 6-3, which are, in turn: "0001" for "Beijing"; "0002" for "Ziyu Huafu"; "0001" for "one bedroom and one living room"; and "0001" for "east".
2) Second-class attributes: for the same object, the attribute values of a second-class attribute in the information set lie in a continuous value space. Across multiple pieces of information about the same object from different information sources, the attribute value of a second-class attribute is unstable, and its possible values form a continuous value space.
Taking property information as an example, the second-class attributes may include: floor, area, floor type, age, price, and the like. For various reasons (such as differences in the information provided by the owner, or intermediaries blurring the property information for their own benefit), the attribute value of the floor attribute may differ across property information from multiple intermediaries about the same property, for example "18th floor" or "high floor", while remaining within the value space (18, 30).
Because the attribute values of a second-class attribute differ across multiple pieces of information about the same object in the information set, yet remain relatively stable within one value space, the choice of coding rule matters. If the coding rule of a second-class attribute used the one-to-one correspondence between attribute values and digital feature values that is used for first-class attributes, the digital feature values of the second-class attribute values (attribute values of second-class attributes) of the same property would differ, which hinders identifying the same object based on those digital feature values. If instead the coding rule of the second-class attribute uses a correspondence between value spaces and digital feature values, so that the same digital feature value is assigned to all attribute values in the same value space, the same object can be accurately identified based on the digital feature value, and the same object is not mistakenly identified as different objects merely because the second-class attribute values differ across multiple pieces of information about it.
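Because first-class codes are stable for one object, they can be used to narrow the search at the first-class level. The sketch below groups already-encoded records by the concatenation of their first-class codes; the concatenation, the fixed attribute order, and the names `first_class_key` and `group_by_first_class` are assumptions, and records in one group are only candidates for being the same object.

```python
from collections import defaultdict

# Assumed fixed order in which first-class codes are concatenated.
FIRST_CLASS_ORDER = ("city", "cell name", "house type", "orientation")


def first_class_key(codes, order=FIRST_CLASS_ORDER):
    """Concatenate the first-class digital feature values in a fixed order."""
    return "".join(codes[name] for name in order)


def group_by_first_class(encoded_records):
    """Map each first-class key to the indices of the records that share it."""
    groups = defaultdict(list)
    for index, codes in enumerate(encoded_records):
        groups[first_class_key(codes)].append(index)
    return dict(groups)


encoded_records = [
    {"city": "0001", "cell name": "0002", "house type": "0001", "orientation": "0001"},
    {"city": "0001", "cell name": "0002", "house type": "0001", "orientation": "0001"},
    {"city": "0001", "cell name": "0002", "house type": "0002", "orientation": "0001"},
]
groups = group_by_first_class(encoded_records)
```

Records 0 and 1 share a key and would be compared further on their other attributes; record 2 differs in house type and falls into its own group.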
……。
2.1) As one coding scheme, the coding rule of a second-class attribute includes a correspondence between preset value spaces of the attribute value of the second-class attribute and digital feature values. One optional structure of the coding rule is as follows:
Second-class attribute      Coding
Preset value space 1        Coding result 1
Preset value space 2        Coding result 2
Correspondingly, the attribute value of each second type attribute of the object is used as an index, the coding rule of the corresponding second type attribute is inquired, and the digital characteristic value corresponding to the attribute value of the corresponding second type attribute is obtained, and the following mode can be adopted:
and inquiring the corresponding relation between the value space and the digital characteristic value in the coding rule corresponding to the second type attribute, determining the value space where the second type attribute value is located, and taking the digital characteristic value corresponding to the value space in the coding rule as the coding result of the corresponding second bit attribute value.
Taking the second type of attribute "floor" in the property information as an example, an alternative example of the encoding rule is as follows:
Floor                           Coding
Floors 1-10                     0001
Floors 11-30, "high floor"      0002
As mentioned above, the floor attribute values in different pieces of property information about the same property may differ, such as "18 floors" or "10 floors or more", but they fall in the same value space and therefore receive the same code, "0002". This avoids identifying the same property as different properties merely because its second-type attribute values differ.
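The lookup described in scheme 2.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `FLOOR_RULE` and `encode_floor` are hypothetical, the rule table simply mirrors the two rows of the floor example, and open-ended phrases such as "10 floors or more" are not handled.

```python
import re

# Hypothetical rule table mirroring the floor example above. Each entry is
# (lowest floor, highest floor, textual aliases, code).
FLOOR_RULE = [
    (1, 10, set(), "0001"),
    (11, 30, {"high floor"}, "0002"),
]


def encode_floor(value):
    """Return the code of the preset value space containing `value`, or None."""
    text = value.strip().lower()
    # Textual values such as "high floor" are matched against the aliases.
    for _low, _high, aliases, code in FLOOR_RULE:
        if text in aliases:
            return code
    # Otherwise use the first integer in the text as the floor number.
    match = re.search(r"\d+", text)
    if match is None:
        return None
    floor = int(match.group())
    for low, high, _aliases, code in FLOOR_RULE:
        if low <= floor <= high:
            return code
    return None
```

With this table, "18 floors" and "high floor" receive the same code, which is the point of encoding by value space rather than by exact value.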
2.2) In coding scheme 2.1) above, preset value spaces are used in the encoding rule of the second type attribute. However, the distribution of the second-type attribute values in the information of an information set is random and follows no fixed pattern, so the values cannot be expected to be evenly distributed. The second-type attribute values in the information set may be concentrated in a single value space, in which case most objects in the information set receive the same digital characteristic value for that second type attribute, and the digital characteristic value of the second type attribute can no longer distinguish the objects effectively.
For example, if the floors in the property information of an information set are concentrated in floors 1-5, the coding rule above yields the same digital characteristic value "0001" for all of them, so it is difficult to distinguish different properties based on the digital characteristic value of this second type attribute.
To address this problem, coding scheme 2.2) divides the value spaces dynamically in the encoding rule of the second type attribute and encodes accordingly. Two exemplary ways of dividing the value spaces and encoding are given below:
2.2.1) As one example of dividing the value spaces and encoding, the attribute values of the second type attribute in the pieces of information of the information set are sorted, and the value range corresponding to the sorting result is divided into value spaces that satisfy a predetermined condition.
Fig. 2-1 is an optional schematic diagram, according to an embodiment of the present invention, in which the second-type attribute values of the pieces of information in the information set are sorted and the resulting value range is divided into 3 value spaces.
With reference to fig. 2-1, exemplarily, the divided value spaces satisfy one of the following value space division conditions:
Condition 1) The distance between the value spaces exceeds a distance threshold, so that attribute values that are close to one another fall into the same value space. Because all attribute values in the same value space are encoded the same way (they are assigned the same digital characteristic value), the coding results of the second-type attribute values in multiple pieces of information describing the same object agree as closely as possible, and whether two pieces of information describe the same object can be identified accurately from how much their second-type coding results differ.
Condition 2) The number of value spaces is at least 2. Of course, when the second-type attribute values are spread over a wider range, the number of value spaces may be increased accordingly; in general, the span of the distribution of the second-type attribute values is linearly and positively correlated with the number of value spaces.
The encoding rule of a second type attribute includes value spaces divided dynamically according to the value range of the second-type attribute values of the pieces of information in the information set (each piece of information describes an object). Based on the dynamically divided value spaces, querying the encoding rule of the corresponding second type attribute with the attribute value of each second type attribute of the object as an index, to obtain the corresponding digital characteristic value, may also be done as follows:
With the value space in which the attribute value of each second type attribute of the object lies as an index, query the correspondence between value spaces and digital characteristic values in the encoding rule of the corresponding second type attribute, to obtain the digital characteristic value corresponding to the attribute value of each second type attribute of the object.
For example, for the value spaces into which the value range of the second-type attribute values is divided in fig. 2-1, an optional structure of the corresponding encoding rule is as follows:
[Table image BDA0001159254580000121: encoding rule assigning a coding result to each value space of fig. 2-1]
The distribution range of the second-type attribute values of the information in the information set is dynamically divided into 3 value spaces, so the second-type attribute values of all the information have 3 possible coding results. The coding results therefore differ from one another rather than all being the same, and objects can be distinguished based on the coding results of their second-type attribute values.
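The dynamic division of 2.2.1) can be sketched as follows: sort the values, start a new value space wherever the gap between neighbouring values exceeds the distance threshold (condition 1), then number the spaces. The function names, the four-digit code format, and the sample areas are assumptions made for illustration; the sketch does not enforce condition 2 (at least two value spaces).

```python
def divide_value_spaces(values, distance_threshold):
    """Sort the values and split them wherever the gap between neighbours
    exceeds distance_threshold. Returns a list of (low, high) value spaces."""
    ordered = sorted(values)
    if not ordered:
        return []
    spaces = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > distance_threshold:
            spaces.append([current])      # gap is large: open a new value space
        else:
            spaces[-1].append(current)    # gap is small: same value space
    return [(space[0], space[-1]) for space in spaces]


def build_rule(spaces):
    """Assign a four-digit code to each value space, in ascending order."""
    return {space: f"{index:04d}" for index, space in enumerate(spaces, start=1)}


def encode(value, rule):
    """Return the code of the value space containing value, or None."""
    for (low, high), code in rule.items():
        if low <= value <= high:
            return code
    return None
```

For instance, with areas `[88, 89, 90, 120, 121, 150]` and a threshold of 10, three value spaces result, and `encode(89, rule)` and `encode(90, rule)` return the same code.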
2.2.2) As another example of dividing the value spaces and encoding, the information set is divided into groups that share the same digital characteristic values of the first type attributes, that is, the digital characteristic values of the first type attributes are the same for all information within a group. The value spaces are then divided for each group: the attribute values of the second type attribute in the information of the group are sorted, and the value range corresponding to the sorting result is divided into at least two value spaces that satisfy a value space division condition.
Fig. 2-2 is an optional schematic diagram, according to an embodiment of the present invention, in which the second-type attribute values within each information group of the information set are sorted and divided into value spaces.
With reference to fig. 2-2, exemplarily, the divided value spaces satisfy one of the following value space division conditions:
Condition 1) The distance between the value spaces (distances 1 to 4) exceeds a distance threshold, so that attribute values that are close to one another fall into the same value space (value spaces 1 to 3 are formed from the value range of the second-type attribute values of group 1, and value spaces 4 to 6 from that of group 2). Because all attribute values in the same value space are encoded the same way (they are assigned the same digital characteristic value), the second-type attribute values in multiple pieces of information about the same object receive the same coding result, and the same object can be identified accurately based on the coding results of the second-type attribute values.
Condition 2) The number of value spaces is at least 2. Of course, when the span of the value range of the second-type attribute values is large, the number of value spaces may be increased accordingly; in general, the span of the value range is linearly and positively correlated with the number of value spaces.
The encoding rule of a second type attribute includes value spaces divided dynamically according to the value range of the second-type attribute values of the pieces of information in the information set (each piece of information describes an object). Based on the dynamically divided value spaces, querying the encoding rule of the corresponding second type attribute with the attribute value of each second type attribute of the object as an index, to obtain the corresponding digital characteristic value, may also be done as follows:
With the value space in which the attribute value of each second type attribute of the object lies as an index, query the correspondence between value spaces and digital characteristic values in the encoding rule of the corresponding second type attribute, to obtain the digital characteristic value corresponding to the attribute value of each second type attribute of the object.
Because the first-type attribute values in the information of the same object are unique (that is, identical), the value spaces are divided within each group of information that shares the same first-type attribute values, according to the distribution of the second-type attribute values in that group. This places the second-type attribute values in the information of the same object in the same value space as far as possible, so that they can be assigned the same digital characteristic value, and the similarity between the digital characteristic values of second-type attribute values then accurately reflects whether the information describes the same object or different objects.
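The per-group division of 2.2.2) can be sketched as follows. The record layout (a pair of first-type code and second-type value), the function names, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict


def split_on_gaps(values, distance_threshold):
    """Sort a non-empty list of values and split it wherever the gap between
    neighbours exceeds distance_threshold. Returns (low, high) value spaces."""
    ordered = sorted(values)
    spaces = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > distance_threshold:
            spaces.append([current])
        else:
            spaces[-1].append(current)
    return [(space[0], space[-1]) for space in spaces]


def divide_per_group(records, distance_threshold):
    """records: iterable of (first_type_code, second_type_value) pairs.
    Groups the records by first-type code and divides each group's values
    into value spaces independently of the other groups."""
    grouped = defaultdict(list)
    for first_code, value in records:
        grouped[first_code].append(value)
    return {
        first_code: split_on_gaps(values, distance_threshold)
        for first_code, values in grouped.items()
    }
```

Because each group is divided separately, a dense cluster of values in one group does not influence the boundaries chosen for another group.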
3) Third type attribute: the attribute values of the third type attribute corresponding to the same object in the information of the information set have a discrete value space.
Taking property information as an example, the third type of attribute may include: decoration, "full five" (the owner has held the property for at least 5 years), and "sole residence" (the property is the only residence of the owner's family). For various reasons (such as differences in the information provided by the owner, or an intermediary deliberately blurring the listing details to its own advantage), the property information that multiple intermediaries publish for the same property may give different attribute values for the decoration attribute, such as "fine decoration" or "luxury decoration", but the values lie in a discrete value space (unfinished, general decoration, fine decoration, luxury decoration).
The encoding rule of a third type attribute includes a correspondence between the attribute values of the third type attribute and digital characteristic values. An optional example is:
Held for five years or more     Coding
Yes                             0001
No                              0002
Accordingly, the attribute value of the third type attribute is encoded in the following manner:
With the attribute value of each third type attribute of the object as an index, query the correspondence between attribute values and digital characteristic values in the encoding rule of the corresponding third type attribute, to obtain the digital characteristic value corresponding to the attribute value of each third type attribute of the object.
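Since the value space of a third type attribute is discrete, the lookup reduces to a dictionary query. The sketch below is illustrative only; `FIVE_YEAR_RULE` and `encode_third_type` are hypothetical names, and the two codes are taken from the example table above.

```python
# Hypothetical rule for the "held for five years or more" attribute.
FIVE_YEAR_RULE = {"yes": "0001", "no": "0002"}


def encode_third_type(value, rule):
    """Return the code assigned to a discrete attribute value, or None if
    the value is not listed in the rule."""
    return rule.get(value.strip().lower())
```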
Step 104, combining the digital characteristic values of the object corresponding to all the attributes to form the digital feature of the object.
Illustratively, the digital feature is formed by concatenating the digital characteristic values of the first type attributes, the digital characteristic values of the second type attributes, and the digital characteristic values of the third type attributes, in that order; of course, any other form of combination may be adopted.
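One possible way to perform the combination of step 104 is sketched below. The tuple representation, the function names, and the "-" separator are assumptions; the text only requires that the codes be combined in some agreed order.

```python
def combine_features(first_codes, second_codes, third_codes):
    """Concatenate the codes in a fixed order: first-type attributes, then
    second-type attributes, then third-type attributes."""
    return tuple(first_codes) + tuple(second_codes) + tuple(third_codes)


def as_string(feature, separator="-"):
    """Render a digital feature as a single string for storage or display."""
    return separator.join(feature)
```

Keeping the codes as a tuple, rather than one fused string, makes a later position-by-position comparison straightforward.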
Step 105, determining the similarity between the digital features of the objects, and identifying objects whose digital-feature similarity satisfies a preset condition as the same object.
In one embodiment, the digital features of the objects described by any two pieces of information in the information set are compared, and the objects described by two pieces of information whose digital-feature similarity is higher than a similarity threshold (e.g. 99%) are identified as the same object.
In particular, the first-type attribute values in the information of the same object in the information set are the same, so the corresponding digital characteristic values are necessarily the same as well. The digital features of the objects are therefore first compared to obtain candidate objects whose first-type digital characteristic values are identical, which rules out identifying objects with different first-type attribute values as the same object. Then, among the candidates, those whose second-type digital characteristic values have a similarity satisfying the predetermined condition, and whose third-type digital characteristic values have a similarity satisfying the predetermined condition, are identified as the same object. This ensures the accuracy of identifying the same object to the greatest extent.
In another embodiment, the digital features of the objects described by any two pieces of information in the information set are compared, and a predetermined number of objects with the highest digital-feature similarity are identified as the same object, where the predetermined number is determined according to the proportion of information that repeatedly described the same object in past information sets.
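The two-stage comparison (identical first-type codes, then threshold checks on the second-type and third-type codes) can be sketched as follows. The similarity measure used here, the fraction of positions at which two code tuples agree, is an assumption; the text does not specify how similarity is computed. The function names and data layout are likewise hypothetical.

```python
from itertools import combinations


def similarity(codes_a, codes_b):
    """Fraction of positions at which two equal-length code tuples agree."""
    if not codes_a:
        return 1.0
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def find_same_objects(features, threshold):
    """features: dict mapping an information id to a triple
    (first_codes, second_codes, third_codes).
    Returns the id pairs judged to describe the same object."""
    same = []
    for (id_a, feat_a), (id_b, feat_b) in combinations(features.items(), 2):
        first_a, second_a, third_a = feat_a
        first_b, second_b, third_b = feat_b
        if first_a != first_b:
            continue  # different first-type codes: never the same object
        if (similarity(second_a, second_b) > threshold
                and similarity(third_a, third_b) > threshold):
            same.append((id_a, id_b))
    return same
```

Comparing every pair is quadratic in the size of the information set; grouping by first-type codes before comparing would reduce the work, at the cost of a slightly longer sketch.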
Step 106, identifying the repeated information corresponding to the same object in the information set.
In one embodiment, information that describes the same object and satisfies a deletion condition is deleted from the information set. For example, among the pieces of information about the same object, those that are not the most recent, or whose source does not have the highest reliability, are deleted, which ensures the reliability and timeliness of the information in the information set.
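A deletion condition of this kind can be sketched as follows. The record fields (`object_id`, `published`, `source_reliability`) and the choice to rank by recency first and source reliability second are assumptions made for illustration; the text leaves the exact priority open.

```python
from datetime import date


def rank(record):
    """Ranking key: more recent first, then more reliable source."""
    return (record["published"], record["source_reliability"])


def keep_best(records):
    """Keep one record per object, deleting every record that is not the
    highest-ranked one for its object."""
    best = {}
    for record in records:
        key = record["object_id"]
        if key not in best or rank(record) > rank(best[key]):
            best[key] = record
    return list(best.values())
```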
In another embodiment, the different attribute values of the same attribute in the pieces of information describing the same object in the information set are integrated, and repeated attribute values (such as the city value "Beijing" appearing in each piece) are deleted. On the one hand this avoids information redundancy; on the other hand it keeps the information comprehensive and avoids information loss.
As one example of integration, the different attribute values of the same attribute in the information describing the same object are joined in parallel to form a new attribute value. For example, for 2 pieces of property information about the same property, the floor attribute values "top floor" and "18th floor, with attic" are integrated into "18-top floor/with attic".
As another example of integration, the attribute value carrying the largest amount of information among the different attribute values of the same attribute is retained. For example, for 2 pieces of property information about the same property, the floor attribute value "18th floor", which carries the most information, is taken as the floor attribute value of the integrated property information.
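The first integration approach (delete repeated values, join the remaining distinct values in parallel) can be sketched as follows. The function name and the "/" separator are assumptions, and the sketch does not attempt the smarter merging seen in the "18-top floor/with attic" example, which would require interpreting the values.

```python
def merge_parallel(values, separator="/"):
    """Drop repeated attribute values, then join the distinct ones in
    first-seen order to form the integrated attribute value."""
    distinct = list(dict.fromkeys(value.strip() for value in values))
    return separator.join(distinct)
```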
According to an embodiment of the present invention, there is provided an information processing apparatus to which the above-described information processing method is applied, and the information processing apparatus can be implemented in various ways, which will be exemplarily described below.
In one embodiment, the information processing apparatus is implemented based on the computing resources (e.g., a processor, a memory storing executable instructions for execution by the processor, etc.) and communication resources (e.g., an integrated circuit chip for implementing wireless network communication and cellular communication, etc.) of a network-side server.
Fig. 3-1 is a schematic diagram of an optional application scenario in which the information processing apparatus provided in the embodiment of the present invention is deployed in a network-side server. The server acquires information about multiple objects from different information sources to form an information set, identifies the same object and removes duplicate information about the same object according to the information processing method described above, and publishes the processed information set to a front-end page for users to access, thereby implementing an information aggregation service on the network-side server.
Generally, the server classifies or sorts the information in the information set (from which duplicate information about the same object has been removed) in a specific way, so that users can quickly locate the information they are interested in.
Taking an information set of property information as an example, the property information is sorted by publication time, and the most recent property information is placed at the top of the page or at another prominent position in the page, so that the user does not overlook the latest property information.
Of course, the server may also infer user preferences from the user's browsing records or subscriptions and push the information in the information set that matches those preferences to the user, or display property information matching the user preferences at a prominent position of the front-end page when the user accesses it.
For example, according to the user's preferences for residential community, house type, price range, etc., inferred from what the user recently browsed or subscribed to, property information with the corresponding attributes in the information set is pushed to the user-side device periodically or from time to time. Because duplicates have been removed, the user is not disturbed by repeated pushes of the same property; on the one hand the information is pushed accurately, and on the other hand its timeliness is guaranteed.
In another embodiment, the information processing apparatus may be implemented based on computing resources (e.g., a processor, a memory storing executable instructions for execution by the processor, etc.) and communication resources (e.g., an integrated circuit chip for implementing wireless network communication and cellular communication, etc.) of the user-side terminal.
Fig. 3-2 is a schematic diagram of an optional application scenario when the information processing apparatus provided in the embodiment of the present invention is deployed in a user side terminal, where the terminal acquires information of multiple objects from different information sources to form an information set, identifies the same object and removes duplicate information of the same object according to the information processing method described above, and prompts the user to view the processed information set at the user side terminal, thereby implementing an information aggregation service of the user side terminal.
In general, the terminal classifies or sorts the information in the information set (from which duplicate information about the same object has been removed) in a specific way, so that the user can quickly locate the information of interest. Of course, the terminal can also infer user preferences from the user's past browsing records or subscriptions, and present the information in the information set that matches those preferences to the user in various ways.
As described above, the information processing apparatus is deployed in a network-side server or in a user-side terminal, and in a hardware implementation manner, hardware resources for implementing the information processing apparatus include computing resources such as a processor and a memory, and may further include communication resources such as an integrated circuit chip for performing communication in various manners (such as wireless local area network communication and cellular communication);
in a software implementation, the information processing apparatus may be implemented as executable instructions (including computer-executable instructions such as programs, modules) stored in a storage medium that are executable using one thread or multiple parallel threads at the processor.
As described above, when the information processing apparatus is implemented based on the computing resources and communication resources of the user-side terminal, referring to the optional software and hardware configuration diagram of the information processing apparatus 10 shown in fig. 4, the information processing apparatus 10 includes a hardware layer, a driver layer, an operating system layer, and an application layer. However, it should be understood by those skilled in the art that the structure of the information processing apparatus 10 shown in fig. 4 is merely an example, and does not constitute a limitation on the structure of the information processing apparatus 10. For example, the information processing apparatus 10 may be provided with more components than those shown in fig. 4, or some components may be omitted, according to implementation needs.
The hardware layer of the information processing apparatus 10 includes a processor 11, an input/output interface 13, a storage medium 14, and a network interface 12, and these components communicate with one another via a system bus.
The processor 11 may be implemented by a Central Processing Unit (CPU), a Microcontroller Unit (MCU), an Application Specific Integrated Circuit (ASIC), or a Field-Programmable Gate Array (FPGA).
The input/output interface 13 may be implemented using input/output devices such as a display screen, a touch screen, a speaker, etc.
The storage medium 14 may be implemented by a nonvolatile storage medium such as a flash memory, a hard disk, and an optical disk, or may be implemented by a volatile storage medium such as a Double Data Rate (DDR) dynamic cache, in which an executable instruction for executing the information processing method is stored.
For example, the storage medium 14 may be located together with the other components of the information processing apparatus 10, or may be distributed relative to them. The network interface 12 gives the processor 11 access to external data, such as a storage medium 14 set at a remote location. For example, the network interface 12 may perform short-range communication based on Near Field Communication (NFC), Bluetooth, or ZigBee technology; it may also implement cellular communication based on communication systems such as Code Division Multiple Access (CDMA) and Wideband Code Division Multiple Access (WCDMA) and their evolved systems; and it may communicate with the network side through a wireless Access Point (AP) using WiFi.
The driver layer includes middleware 15 for the operating system 16 to recognize and communicate with the components of the hardware layer, such as a set of drivers for the components of the hardware layer.
The operating system 16 is used for providing a graphical interface facing a user, and exemplarily comprises a plug-in icon, a desktop background and an application icon, and the operating system 16 supports the user to control the device via the graphical interface, and the embodiment of the present invention does not limit the software environment of the device, such as the type and the version of the operating system, and may be, for example, a Linux operating system, a UNIX operating system or other operating systems.
The application layer includes a client run by the user-side terminal, such as an information aggregation application 17 and an application plug-in that provide an aggregation service of various information.
Referring to the optional structural diagram of the information processing apparatus 20 shown in fig. 5, the functional structure of the information processing apparatus is described next. It includes an acquisition unit 21, an extraction unit 22, an encoding unit 23, a combining unit 24, a comparison unit 25, and an identification unit 26, each of which is explained below.
An acquisition unit 21 configured to acquire an information set composed of a plurality of pieces of information.
An extraction unit 22, configured to extract, from each piece of information, the attribute values of a plurality of attributes of the object described by that piece of information.
In one embodiment, the preset attribute names of the plurality of attributes are used as keywords, and each piece of information in the information set is searched for the attribute values of those attributes of the object that the piece of information describes.
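Keyword-based extraction can be sketched as follows, assuming (purely for illustration) that each piece of information is free text containing "name: value" pairs separated by semicolons or line breaks. The attribute names, the text format, and the function name are hypothetical; real listings would need source-specific parsing.

```python
import re

# Hypothetical preset attribute names used as search keywords.
ATTRIBUTE_NAMES = ["city", "floor", "area", "decoration"]


def extract_attributes(text, names=ATTRIBUTE_NAMES):
    """Search the text for each attribute name followed by a colon and
    return the value that runs up to the next ';' or line break."""
    found = {}
    for name in names:
        pattern = rf"{re.escape(name)}\s*[:：]\s*([^;\n]+)"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found[name] = match.group(1).strip()
    return found
```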
An encoding unit 23, configured to encode the attribute values of the object corresponding to the attributes to obtain the digital characteristic values of the object corresponding to the attributes.
In an embodiment, the encoding unit 23 is configured to query the encoding rule of the corresponding attribute by using the attribute value of each attribute corresponding to the object as an index, and obtain a digital feature value corresponding to the attribute value of the corresponding attribute.
In another embodiment, the different attributes may be divided into the following optional types:
1) First type attribute: the attribute values of the first type attribute corresponding to the same object in the information of the information set are unique. 2) Second type attribute: the attribute values of the second type attribute corresponding to the same object in the information of the information set have a continuous value space. 3) Third type attribute: the attribute values of the third type attribute corresponding to the same object in the information of the information set have a discrete value space.
For the first-class attributes, the encoding unit 23 is further configured to query, with the attribute value of each first-class attribute corresponding to the object as an index, a corresponding relationship between the attribute value and the digital feature value in the encoding rule corresponding to the first-class attribute, and obtain the digital feature value corresponding to the attribute value of the corresponding first-class attribute, where the attribute values of the first classes corresponding to the same object in the information of the information set have uniqueness.
For the second type attributes, the encoding unit 23 is further configured to use the value space in which the attribute value of each second type attribute of the object lies as an index, and to query the correspondence between value spaces and digital characteristic values in the encoding rule of the corresponding second type attribute, to obtain the digital characteristic value corresponding to the attribute value of each second type attribute of the object; the attribute values of the second type attributes corresponding to the same object in the information of the information set have a continuous value space.
The encoding unit 23 may determine the value spaces as follows: sort the attribute values of the second type attribute of the pieces of information in the information set, and divide the value range corresponding to the sorting result into at least two value spaces whose distances satisfy the value space division condition.
The encoding unit 23 may also determine the value spaces as follows: divide the information set into groups that share the same digital characteristic values of the first type attributes, sort the attribute values of the second type attribute in the information of each group, and divide the value range corresponding to the sorting result into at least two value spaces that satisfy the value space division condition.
For the third type attributes, the encoding unit 23 is further configured to query, by using the attribute value of each third type attribute corresponding to the object as an index, a corresponding relationship between the attribute value and the digital feature value in the encoding rule corresponding to the third type attribute, and obtain the digital feature value corresponding to the attribute value of the corresponding third type attribute.
And a combining unit 24, configured to combine the digital feature values of the object corresponding to the respective attributes to form a digital feature of the object.
And a comparison unit 25, configured to determine the similarity between the digital features of the objects described by the pieces of information, and to identify objects whose similarity satisfies the similarity condition as the same object.
The identification unit 26 is used for identifying the repeated information corresponding to the same object in the information set.
The identifying unit 26 is further configured to compare the digital features of the objects described by the respective information in the information sets, and identify, as the same object, an object whose digital feature similarity is higher than a similarity threshold, or a predetermined number of objects whose digital feature similarity is the highest.
In an embodiment, the identifying unit 26 is further configured to compare the digital features of the objects described by the pieces of information in the information sets to obtain candidate objects with the same digital feature values of the first type of attributes, and identify the candidate objects with the similarity of the digital feature values of the second type of attributes and the similarity of the digital feature values of the third type of attributes exceeding a similarity threshold as the same object;
the attribute values of the first class corresponding to the same object in the information of the information set have uniqueness; the attribute values of the second type attributes corresponding to the same object in the information of the information set have a continuous value space; the attribute values of the third type attributes corresponding to the same object in the information of the information set have discrete value space.
In one embodiment, the identifying unit 26 is further configured to delete information satisfying the deletion condition from the information describing the same object in the information set.
In one embodiment, the identifying unit 26 is further configured to join, in parallel, the different attribute values of the same attribute in the information describing the same object in the information set to form a new attribute value, or to delete, from the different attribute values of the same attribute, the attribute values that do not carry the largest amount of information; and to delete duplicate attribute values of the same attribute.
It can be understood that the objects are determined by what the information set describes in the actual application scenario. For example, the object described by property information obtained from different real-estate intermediaries is a property, and the object described by information obtained from different second-hand (idle) goods trading platforms is an idle article to be sold (or exchanged). For different kinds of objects, those skilled in the art can readily divide the attributes into the above three types, form a digital feature from the coding results of the attribute values of the attributes, and then identify, based on similarity, whether the properties (or idle articles) described by the pieces of information are the same property (or the same idle article).
The following takes as an example an information set formed from property information acquired from different information sources (for example, different brokers, various websites that publish property information, etc.). Because the property information does not include the specific building number or house number, the property information cannot be deduplicated directly (that is, the multiple pieces of property information corresponding to the same property cannot be found in the information set directly). The process of identifying, in the information set, multiple pieces of property information about the same property, where those pieces give different attribute values for the same property, is described below.
First, terms involved in the processing are explained as follows.
1) Property DNA and property features: specific attributes that can effectively distinguish different properties are selected from the basic attributes of a property; the attribute values of these attributes are encoded to form digital characteristic values, and the digital characteristic values are combined to form the property features (the property DNA).
The properties of the property are classified into A, B and C, each of which includes several specific properties.
1.1) the class A attributes (first class attributes) include: the city, county, business district, cell name (or alias), house type and orientation of the house are 6 items.
The attribute values of the class-A attributes of the properties in the property information have uniqueness, that is, the attribute values of the class-A attributes do not change for property information of different sources of the same property.
1.2) the class B attributes (second class attributes) include: the floor range, area range, building type, age and price range of the house are 5 items.
The attribute values of the type B attributes of the properties in the property information have continuous value space, that is, for the property information of different sources of the same property, the attribute values of the type B attributes belong to a continuous value space.
For the price, the house owner may give many different offers, for example any value in the continuous value space (2 million, 3 million).
For the floors, the house owner can report wrong floors to different intermediaries when protecting privacy, or give an approximate range of the floors, such as high floors, more than 10 floors and the like, and accordingly, the value space of the floors can be determined according to the actual floor number of the buildings in different areas.
For the building type, because different building types are identified by numbers in advance, the attribute values of the building types can also be regarded as belonging to a continuous value space.
1.3) the class C attributes (third class attributes) include: the decoration degree, full five (meaning that the owner has owned the house for a full five years) and unique (meaning that the house is the owner's only home), 3 items in total.
The attribute value of the type C attribute of a property in the property information has a discrete value space, that is, for property information of different sources of the same property, the attribute value of the type C attribute belongs to a discrete value space.
For the decoration degree, the value space can be: general decoration; fine decoration; luxurious decoration.
For full five and unique, the value space can be: yes; no.
2) Property attribute coding: refers to the process of digitally encoding the specific attributes of a given class, such as class A. In the coding process, the digital characteristic value of each attribute is represented by 4 decimal digits (other lengths and other number bases can of course be used), and the coded values corresponding to the attributes are arranged together in a specific order to form the DNA (digital feature) of the property.
3) The coding standard comprises a series of coding rules for coding attribute values of different attributes, for example, a set of coding rules corresponding to each attribute of the property, and the coding rules comprise the corresponding relation between the attribute values of the property and the digital characteristic values of the 4-digit numerical values.
For the attribute values of A, C types of properties of a property, the encoding rule includes the corresponding relationship between the attribute value and the digital characteristic value, and therefore is a static encoding mode, and once the attribute value of the property of a property is determined, the digital characteristic value corresponding to the attribute value is determined.
For an attribute value of the type B attribute of the property, the encoding rule comprises the corresponding relation between the value space where the attribute value is located and the digital characteristic value, and the value space where the attribute value is located is obtained by sequencing and dividing the value space of the corresponding attribute values of all properties in the information set, so that the value space where the attribute value of the property is located has dynamic randomness, and the digital characteristic value of the attribute value is determined in a dynamic mode.
The dynamic coding mode places attribute values of class B attributes that are close in value in the same value space, which avoids close attribute values receiving different digital characteristic values because they were divided into different value spaces, so that the digital characteristic values of the class B attribute values reflect the similarity between properties.
4) Property attribute aggregation: refers to a method of merging and classifying attributes so that fuzzy property information is effectively classified into the above class A, B and C attributes; in particular, it refers to the dedicated grouping of class B attributes of a property such as the floor range, the area range and the price range.
5) The house property similarity formula: the method is a calculation method for evaluating whether any two properties are similar properties, and the similarity between the properties is obtained by calculating the DNA of the properties, comparing the DNA of the properties and calculating the similarity based on the difference obtained by the comparison.
Referring to fig. 6-1, which is a schematic diagram of a process of identifying multiple pieces of property information of the same property from the property information of the information set according to an embodiment of the present invention: in fig. 6-1, the DNA of the corresponding property is extracted from each piece of property information in the information set, the codes of the class A, B and C attributes of each property are extracted from the DNA, the codes of the 3 classes of attributes of the properties are compared, the similarity between the properties is calculated from the similarity of the codes, and the pieces of property information that correspond to the same property are identified (property deduplication).
Referring to fig. 6-2, which is a schematic diagram of the classification of the attributes constituting the property DNA according to an embodiment of the present invention, the property attributes include 3 classes in total, A, B and C, as shown in fig. 6-2. The class A attributes include 6 items: the city, county, business circle, cell name (or alias), house type and orientation of the property; the class B attributes include 5 items: floor range, area range, building type, age and price range; the class C attributes include 3 items: decoration degree, full five and unique.
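The classification of fig. 6-2 can be written down as a small lookup structure, for instance as below. The identifier names (`ATTRIBUTE_CLASSES`, `class_of`, `dna_length`) and the English attribute keys are illustrative assumptions; only the grouping into 6, 5 and 3 attributes and the 4-digit code width come from the description.

```python
# Attribute names per class, in the order listed in the description.
ATTRIBUTE_CLASSES = {
    "A": ["city", "county", "business_circle", "cell_name", "house_type", "orientation"],
    "B": ["floor", "area", "building_type", "age", "price"],
    "C": ["decoration", "full_five", "unique"],
}

GROUP_WIDTH = 4  # each attribute value is coded as 4 decimal digits


def class_of(attribute: str) -> str:
    """Return the class (A, B or C) that an attribute belongs to."""
    for name, attributes in ATTRIBUTE_CLASSES.items():
        if attribute in attributes:
            return name
    raise KeyError(attribute)


def dna_length() -> int:
    """Total number of digits in a DNA that covers every attribute."""
    return GROUP_WIDTH * sum(len(a) for a in ATTRIBUTE_CLASSES.values())
```

With 14 attributes of 4 digits each, a full DNA under these assumptions has 56 digits.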
Since A, B, C types include different attributes, encoding processing for corresponding attribute values is also different, and the processing for encoding attribute values of A, B, C types of attributes is explained as follows.
1) Class A attribute coding
For the encoding of the attribute values of the class A attributes, refer to fig. 6-3, which is an optional schematic diagram of the encoding rule for the attribute values of the class A attributes of a property provided in the embodiment of the present invention. The encoding rule includes the correspondence between attribute values of the class A attributes and digital characteristic values. For the city attribute, for example, the digital characteristic value corresponding to the attribute value "Beijing" is "0001" and the digital characteristic value corresponding to the attribute value "Shanghai" is "0002"; the other correspondences of the encoding rule can be understood in the same way and are not described one by one.
For each piece of property information in the information set, the attribute values of the six attributes recorded in the property information, namely the city, the county, the business circle, the cell name (or alias), the house type and the orientation of the property, are looked up in the coding rule to obtain the digital characteristic value of each attribute, each represented by 4 decimal digits. The digital characteristic values corresponding to the attribute values of the class A attributes are then arranged and combined in a preset order to form the digital characteristic value of the class A attributes of the property, which serves as the class A identifier of the property.
Referring to fig. 6-4, which is an optional schematic diagram of the digital characteristic values of the class A attributes of the property information in an information set, the digital characteristic value of the class A attributes of each property consists of 6 groups of 4 decimal digits. In conjunction with fig. 6-3, taking the property information numbered "000001" in fig. 6-4 as an example, assume that the property information includes the attribute values: Beijing, Chaoyang, privet, purple imperial dynasty, one room and one living room, and east-facing. Based on fig. 6-3, the digital characteristic value of the class A attributes of the property is 000100010001000200010002.
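The static lookup for class A attributes can be sketched as a table-driven encoder. The codes for "Beijing" and "Shanghai" are those stated above; the remaining entries of `CLASS_A_RULES` are inferred from the worked example (the six 4-digit groups of 000100010001000200010002) and should be read as assumptions, since fig. 6-3 itself is not reproduced here.

```python
# Preset order in which the class A codes are concatenated.
CLASS_A_ORDER = ["city", "county", "business_circle", "cell_name", "house_type", "orientation"]

# Static coding rule: attribute value -> 4-digit code.
CLASS_A_RULES = {
    "city": {"Beijing": "0001", "Shanghai": "0002"},       # stated in the text
    "county": {"Chaoyang": "0001"},                        # inferred from the example
    "business_circle": {"privet": "0001"},                 # inferred from the example
    "cell_name": {"purple imperial dynasty": "0002"},      # inferred from the example
    "house_type": {"one room and one living room": "0001"},  # inferred from the example
    "orientation": {"east": "0002"},                       # inferred from the example
}


def encode_class_a(record: dict) -> str:
    """Look up each class A attribute value and join the codes in preset order."""
    # A KeyError here means the value is missing from the coding rule.
    return "".join(CLASS_A_RULES[name][record[name]] for name in CLASS_A_ORDER)


example = {
    "city": "Beijing",
    "county": "Chaoyang",
    "business_circle": "privet",
    "cell_name": "purple imperial dynasty",
    "house_type": "one room and one living room",
    "orientation": "east",
}
print(encode_class_a(example))  # 000100010001000200010002
```

Because the rule is a fixed table, the same attribute value always yields the same code, which is what makes the class A code usable as an exact-match key.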
2) Class B attribute coding
For the encoding of the attribute value of the class B attribute, after the attribute value of the class a attribute of the property is encoded for the property information of each property in the information set, the attribute value corresponding to the class B attribute (including the floor range, the area range, and the price range) of the property is dynamically encoded (it can be understood that, for the attribute value of the class B attribute, a similar manner to the encoding of the attribute value of the class a attribute may also be adopted).
For the floor, area and price attributes of a property, the attribute values naturally lie in a continuous value space; for example, the attribute value of the floor lies in the continuous value space (0, 18), the attribute value of the area lies in the continuous value space (20, 200), and the price lies in the continuous value space (1 million, 2 million).
For the building type and the age (building age) of the house property, the digital identifier can be pre-allocated to correspond to different building types and different ages, so that the building type and the age of the house property have continuous value space.
The attribute values of the class B attributes (that is, index data) are extracted from the property information of each property in the information set and sorted by size, and the difference Δ between the class B attribute values of adjacent properties in the sorting result is calculated. The two pairs of adjacent class B attribute values with the largest differences are taken as grouping critical values, which divide the value range of the sorting result into three value spaces; this is equivalent to dividing the class B attribute values of the properties into three groups (the groups correspond one-to-one to the value spaces). A preset uniform digital characteristic value is then assigned to the class B attribute values in each group.
The encoding process is illustrated by taking the area as the class B attribute of a property.
The areas are extracted from the property information whose class A digital characteristic values are the same, and the properties are grouped by area. Refer to fig. 6-5, which is an optional flow diagram of determining the digital characteristic values of the class B attribute values of the property information in an information set.
First, the areas of the properties are sorted from small to large. Refer to fig. 6-6, which is a schematic diagram of an optional sorting result for the areas of the properties provided by the embodiment of the present invention.
Secondly, the difference between the areas of each 2 adjacent properties in the sorting result is calculated, and the 2 pairs of adjacent properties with the largest differences are taken as the grouping criticals. The properties are divided into 3 groups according to the grouping criticals.
Thirdly, a uniform digital characteristic value for the area attribute is assigned to each group, and each property receives the digital characteristic value of the group in which its area is located. Refer to fig. 6-7, which is an optional schematic diagram of the digital characteristic values assigned to each group after the areas of the properties are grouped according to the embodiment of the present invention: the digital characteristic value of the area is 0001 when the area of the property is in the 1st group, 0002 when it is in the 2nd group, and 0003 when it is in the 3rd group.
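The three steps above (sort, cut at the largest adjacent differences, assign one code per group) can be sketched as follows. The function name `gap_group_codes` and the sample areas are hypothetical, and skipping zero-width gaps so that equal values are never separated is an added assumption.

```python
def gap_group_codes(values, n_groups=3, width=4):
    """Assign a fixed-width group code to each value using the largest gaps."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    # Step 1: sort, remembering each value's original position.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordered = [values[i] for i in order]
    # Step 2: differences between adjacent sorted values; keep positive gaps only.
    gaps = [(ordered[k + 1] - ordered[k], k) for k in range(len(ordered) - 1)]
    gaps = [g for g in gaps if g[0] > 0]
    # The (n_groups - 1) largest gaps become the grouping criticals.
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[:n_groups - 1]
    cut_after = {k for _, k in largest}
    # Step 3: walk the sorted values and hand out one code per group.
    codes = [None] * len(values)
    group = 1
    for position, original_index in enumerate(order):
        codes[original_index] = f"{group:0{width}d}"
        if position in cut_after:
            group += 1
    return codes


areas = [60, 62, 61, 90, 92, 140, 143]  # hypothetical areas in square metres
print(gap_group_codes(areas))
# ['0001', '0001', '0001', '0002', '0002', '0003', '0003']
```

In this sample the two largest gaps are 62 to 90 and 92 to 140, so the areas fall into three groups and nearby areas such as 60, 61 and 62 share the code 0001.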
The digital characteristic values of the other class B attributes of the property (such as the floor, the price, the building type and the age) can be determined in the same way as for the area. The digital characteristic values of the class B attributes of each property are arranged and combined to form a sequence of 5 groups of 4 digits, as shown in fig. 6-8, which is an optional schematic diagram of the digital characteristic values of the class B attributes of the property provided by the embodiment of the invention.
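Once the value spaces of every class B attribute have been determined, encoding a property reduces to finding the value space each attribute value falls in. The sketch below assumes the boundaries are already known; the numbers in `B_BOUNDARIES` are invented for illustration, and placing a value that equals a boundary in the higher group is an arbitrary choice.

```python
from bisect import bisect_right

# Preset order in which the class B codes are concatenated.
CLASS_B_ORDER = ["floor", "area", "building_type", "age", "price"]

# Hypothetical boundaries between the value spaces of each attribute
# (two boundaries give three value spaces).
B_BOUNDARIES = {
    "floor": [6, 12],
    "area": [76, 116],
    "building_type": [1.5, 2.5],   # building types pre-numbered 1, 2, 3
    "age": [10, 20],
    "price": [1_500_000, 1_800_000],
}


def value_space_code(value, boundaries, width=4):
    """Return the code of the value space that contains the value."""
    # bisect_right counts the boundaries that are <= value.
    index = bisect_right(boundaries, value) + 1
    return f"{index:0{width}d}"


def encode_class_b(record: dict) -> str:
    """Concatenate the value-space codes of the five class B attributes."""
    return "".join(
        value_space_code(record[name], B_BOUNDARIES[name]) for name in CLASS_B_ORDER
    )


listing = {"floor": 10, "area": 90, "building_type": 1, "age": 15, "price": 1_600_000}
print(encode_class_b(listing))  # 00020002000100020002
```

Two listings of the same property that report areas of 89 and 91 receive the same area code under these boundaries, which is the effect the dynamic coding aims for.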
3) Class C attribute value encoding
The 3 attributes of decoration degree, full five and unique are coded in the same way as the class A attribute values. The coding rule of the class C attributes includes a one-to-one correspondence between class C attribute values and digital characteristic values; the class C attribute values are used as indexes to query the correspondence in the coding rule to obtain the digital characteristic values corresponding to the class C attribute values of the property, and the digital characteristic values of all class C attribute values of the property are arranged and combined to form the digital characteristic value of the class C attributes. Refer to fig. 6-9, which is an optional schematic diagram of the digital characteristic values of the class C attribute values of the property information of each property in an information set.
The digital characteristic values of the class A, B and C attributes of a property are arranged and combined to form the DNA of the property. Refer to fig. 6-10, which is an optional schematic diagram of the DNA of a property provided by the embodiment of the present invention.
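Class C coding and the final assembly of the DNA can be sketched together. The concrete codes in `CLASS_C_RULES` are assumptions (fig. 6-9 is not reproduced here), as are the function names; the segment lengths follow from 6, 5 and 3 attributes of 4 digits each.

```python
CLASS_C_ORDER = ["decoration", "full_five", "unique"]

# Hypothetical static coding rule for the class C attributes.
CLASS_C_RULES = {
    "decoration": {"general": "0001", "fine": "0002", "luxurious": "0003"},
    "full_five": {"yes": "0001", "no": "0002"},
    "unique": {"yes": "0001", "no": "0002"},
}

SEGMENT_LENGTHS = {"A": 24, "B": 20, "C": 12}


def encode_class_c(record: dict) -> str:
    """Static lookup, in the same manner as the class A coding."""
    return "".join(CLASS_C_RULES[name][record[name]] for name in CLASS_C_ORDER)


def build_dna(a_code: str, b_code: str, c_code: str) -> str:
    """Concatenate the class A, B and C codes in that order."""
    for name, code in zip("ABC", (a_code, b_code, c_code)):
        if len(code) != SEGMENT_LENGTHS[name] or not code.isdigit():
            raise ValueError(f"bad class {name} code: {code!r}")
    return a_code + b_code + c_code


def split_dna(dna: str):
    """Recover the class A, B and C segments from a DNA string."""
    a_end = SEGMENT_LENGTHS["A"]
    b_end = a_end + SEGMENT_LENGTHS["B"]
    return dna[:a_end], dna[a_end:b_end], dna[b_end:]
```

Keeping the segments at fixed offsets means a comparison step can later examine each class separately without re-encoding the listing.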
House property DNA comparison
The DNA of the properties is compared to calculate the DNA similarity. Illustratively, the similarity of 2 properties is calculated based on the rules shown in Table 1: for example, the similarity is determined to be 99% when the digital characteristic values of the class A, B and C attributes of 2 pieces of property information in the information set are all identical, and 50% when the digital characteristic values of the class A and B attributes are identical and the digital characteristic values of the class C attributes are not. Whether the 2 properties are the same property is then judged by comparing the similarity with the similarity threshold.
TABLE 1 (similarity calculation rules; table image not reproduced)
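The two rules quoted from Table 1 can be expressed as a small function. Only those two rules are taken from the text; returning 0.0 in every other case and using 0.9 as the similarity threshold are assumptions, since the remaining rows of Table 1 are not available here.

```python
A_LEN, B_LEN = 24, 20  # digits in the class A and class B segments


def segments(dna: str):
    """Split a DNA string into its class A, B and C segments."""
    return dna[:A_LEN], dna[A_LEN:A_LEN + B_LEN], dna[A_LEN + B_LEN:]


def dna_similarity(x: str, y: str) -> float:
    """Similarity of two DNA strings under the two rules quoted in the text."""
    ax, bx, cx = segments(x)
    ay, by, cy = segments(y)
    if (ax, bx) == (ay, by):
        # A, B and C identical -> 99%; A and B identical, C different -> 50%.
        return 0.99 if cx == cy else 0.50
    # Assumption: cases not covered by the quoted rules are treated as dissimilar.
    return 0.0


def is_same_property(x: str, y: str, threshold: float = 0.9) -> bool:
    """Compare the similarity with an assumed similarity threshold."""
    return dna_similarity(x, y) > threshold
```

With this threshold, only listings whose three segments all match are reported as the same property; a lower threshold would also accept listings that differ only in class C.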
Referring to fig. 6-11, which is another optional flow chart of calculating the DNA of the properties from the property information and identifying the same property based on DNA similarity according to the embodiment of the present invention, and which differs from the above scheme for identifying the same property: first, the digital characteristic values of the class A attributes of each property are calculated, and the properties with the same class A digital characteristic values are gathered together.
Secondly, the attribute values of the B-type attributes of the summarized properties (the digital characteristic values of the A-type attributes are consistent) are grouped, and the digital characteristic values are dynamically distributed to the B-type attribute values of the properties based on the grouping.
Thirdly, calculating the digital characteristic value of the corresponding C-type attribute based on the properties with the consistent digital characteristic values of the summarized B-type attributes, comparing the similarity of the digital characteristic values of the C-type attributes of the 2 properties as the similarity of DNA of the properties, and comparing the similarity with a similarity threshold value to judge whether the 2 properties are the same property.
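The staged flow of fig. 6-11 can be sketched as nested bucketing: group by class A code, code the class B values dynamically inside each group, then compare class C codes only within a shared class B bucket. Everything concrete here is assumed for illustration: the record fields, the toy coder `two_way_area_codes` (one class B attribute, one cut), the group-matching measure for class C and the threshold 0.6.

```python
from collections import defaultdict
from itertools import combinations


def c_similarity(cx: str, cy: str, width: int = 4) -> float:
    """Fraction of matching 4-digit groups (an assumed measure)."""
    gx = [cx[i:i + width] for i in range(0, len(cx), width)]
    gy = [cy[i:i + width] for i in range(0, len(cy), width)]
    return sum(a == b for a, b in zip(gx, gy)) / max(len(gx), 1)


def two_way_area_codes(group):
    """Toy dynamic coder: split the group's areas at the single largest gap."""
    areas = sorted(r["area"] for r in group)
    gaps = [(areas[i + 1] - areas[i], i) for i in range(len(areas) - 1)]
    if not gaps or max(gaps)[0] == 0:
        return ["0001"] * len(group)
    _, i = max(gaps)
    cut = (areas[i] + areas[i + 1]) / 2
    return ["0001" if r["area"] < cut else "0002" for r in group]


def find_duplicates(records, encode_b_group=two_way_area_codes, threshold=0.6):
    """Return pairs of record ids judged to describe the same property."""
    by_a = defaultdict(list)
    for r in records:                       # stage 1: same class A code
        by_a[r["a_code"]].append(r)
    pairs = []
    for group in by_a.values():
        by_b = defaultdict(list)            # stage 2: dynamic class B codes
        for r, b in zip(group, encode_b_group(group)):
            by_b[b].append(r)
        for bucket in by_b.values():        # stage 3: compare class C codes
            for x, y in combinations(bucket, 2):
                if c_similarity(x["c_code"], y["c_code"]) > threshold:
                    pairs.append((x["id"], y["id"]))
    return pairs
```

Because class B codes are computed per class A group, the value spaces adapt to the listings of one community and house type rather than to the whole information set.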
In summary, the embodiments of the present invention achieve the following beneficial effects:
1) different attributes of objects described by the information in the information set are quantized into digital features, and the information which is repeatedly described corresponding to the same object in the information set can be efficiently and accurately judged based on the similarity of the digital features.
2) Based on the identified same object, duplicate removal processing can be performed on the duplicate information of the same object in the information set, so that resource consumption caused by maintaining the duplicate information in the information set is saved, and interference of the information which is repeatedly described corresponding to the same object in the information set on audiences is eliminated.
3) The method has the advantages that the attributes of the objects are classified, the digital characteristic values of the attributes of all the classes are independently used for calculating the similarity of the objects described by different information, and when the information sources are limited, such as the situation that only the information including part of the attributes of the objects can be obtained, the objects described in the information set can be identified, so that the applicability is high.
4) The similarity between the objects is calculated by combining the digital characteristic values of the attributes of a plurality of categories, so that the condition that the digital characteristic values of some objects in certain category of attributes are mistakenly identified when the digital characteristic values of other category of attributes are greatly different is avoided, and the accurate identification of the same object is realized.
5) The multiple pieces of information describing the same object in the information set are subjected to deduplication processing in a deduplication fusion mode, so that the situation of information duplication is avoided, and the loss of information quantity caused by deleting duplicated information is also avoided.
Those skilled in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Random Access Memory (RAM), a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (9)

1. An information processing method characterized by comprising:
acquiring an information set composed of a plurality of information from different information sources; wherein each piece of information in the information set is used for describing a property;
extracting attribute values of a plurality of attributes corresponding to the property described by the corresponding information from each piece of information;
dividing the attributes into a first type of attributes, a second type of attributes and a third type of attributes;
the attribute values of the first type of attributes have uniqueness, the attribute values of the second type of attributes have continuous value space, and the attribute values of the third type of attributes have discrete value space;
the first type of attribute comprises: the city, the county, the business circle, the cell name, the house type and the orientation of the property; the second type of attribute comprises: the floor, area, building type, age and price of the property; the third type of attribute comprises: decoration degree, full five condition and unique condition;
when an attribute of the property is a first type attribute, taking the attribute value of each first type attribute of the property as an index, querying the correspondence between attribute values and digital characteristic values of a set length in the coding rule of the first type of attributes to obtain the digital characteristic value corresponding to the attribute value of each first type attribute, and combining the digital characteristic values corresponding to the attribute values of the first type attributes in a preset order to obtain the digital characteristic value of the first type of attributes of the property;
when an attribute of the property is a second type attribute, sorting the attribute values of the second type attribute of each piece of information in the information set, and dividing value spaces according to the value range corresponding to the sorting result; wherein the divided value spaces satisfy one of the following value space division conditions: the distance between the value spaces exceeds a distance threshold, or the number of the value spaces is at least two; or,
sorting the attribute values of the second type of attribute of each information in the information set, taking at least one group of adjacent attribute values with the largest difference value of the adjacent attribute values in a sorting result as a grouping critical, and dividing a value range corresponding to the sorting result into at least two value spaces according to the grouping critical;
taking the value space in which the attribute value of each second type attribute of the property is located as an index, querying the correspondence between value spaces and digital characteristic values of a set length in the coding rule of the second type of attributes to obtain the digital characteristic value corresponding to the attribute value of each second type attribute, and combining the digital characteristic values corresponding to the attribute values of the second type attributes in a preset order to obtain the digital characteristic value of the second type of attributes of the property;
when an attribute of the property is a third type attribute, taking the attribute value of each third type attribute of the property as an index, querying the correspondence between attribute values and digital characteristic values of a set length in the coding rule of the third type of attributes to obtain the digital characteristic value corresponding to the attribute value of each third type attribute, and combining the digital characteristic values corresponding to the attribute values of the third type attributes in a preset order to obtain the digital characteristic value of the third type of attributes of the property;
combining the digital characteristic values of the attributes of the property in the order of the digital characteristic value of the first type of attributes, the digital characteristic value of the second type of attributes and the digital characteristic value of the third type of attributes to form the digital feature of the property;
comparing the digital characteristics of the properties described by the information in the information set, and identifying the properties with the digital characteristic similarity higher than a similarity threshold as the same properties, or identifying a preset number of properties with the highest digital characteristic similarity as the same properties; wherein the predetermined number is determined according to the proportion of repeatedly describing the same property by the information in the information set;
forming new attribute values by different attribute values of the same attribute in the information describing the same object of the information set in a parallel mode, or deleting the attribute value which does not have the maximum information amount in the different attribute values of the same attribute;
and deleting the repeated information corresponding to the same real estate in the information set.
2. The method of claim 1, wherein said extracting from each of said information attribute values for a plurality of attributes of the property described by the respective information comprises:
and querying attribute values of the property corresponding to the plurality of attributes of the property described by the corresponding information in each information of the information set by taking the preset attribute names of the plurality of attributes as keywords.
3. The method of claim 1, wherein said ordering attribute values of said second type of attribute for each of said information in said set of information comprises:
and dividing the information set into groups with the same digital characteristic values of the first type of attributes, and sequencing the attribute values of the second type of attributes in the information of each group.
4. The method of claim 1, wherein said comparing the digital characteristics of the properties described by each of said information in said set of information to identify properties with a digital characteristic similarity above a similarity threshold as the same property comprises:
and comparing the digital characteristics of the properties described by the information in the information set to obtain candidate properties with the same digital characteristic value of the first type of attribute, and identifying the candidate properties with the similarity of the digital characteristic values of the second type of attribute and the similarity of the digital characteristic values of the third type of attribute exceeding a similarity threshold as the same properties.
5. The method of claim 1, wherein said deleting duplicate information in said set of information for the same property comprises:
and deleting the information meeting the deletion condition in the information describing the same property in the information set.
6. An information processing apparatus characterized by comprising:
an acquisition unit configured to acquire an information set composed of a plurality of pieces of information from different information sources; wherein each piece of information in the information set is used for describing a property;
the extracting unit is used for extracting attribute values of a plurality of attributes corresponding to the property described by the corresponding information from each piece of information;
the encoding unit is used for dividing the attributes into a first type of attributes, a second type of attributes and a third type of attributes; the attribute values of the first type of attributes have uniqueness, the attribute values of the second type of attributes have continuous value space, and the attribute values of the third type of attributes have discrete value space; the first type of attribute comprises: the city, the county, the business circle, the cell name, the house type and the orientation of the property; the second type of attribute comprises: the floor, area, building type, age and price of the property; the third type of attribute comprises: decoration degree, full five condition and unique condition;
when the property corresponding to the property is the first type of property, the encoding unit is further configured to query a corresponding relationship between an attribute value and a digital feature value of a set length in an encoding rule of the first type of property by using the property value of the property corresponding to each first type of property as an index, obtain the digital feature value corresponding to the attribute value of each property of the first type of property, and combine the digital feature values corresponding to the attribute values of each property in the first type of property in a preset order to obtain the digital feature value of the first type of property of the property;
when an attribute of the property is a second-type attribute, the encoding unit is further configured to sort the attribute values of the second-type attribute across the pieces of information in the information set, and to divide the value range covered by the sorted result into value spaces, wherein the divided value spaces satisfy one of the following division conditions: the distance between the value spaces exceeds a distance threshold, or the number of value spaces is at least two; or to sort the attribute values of the second-type attribute across the pieces of information in the information set, take at least one pair of adjacent attribute values having the largest difference in the sorted result as a grouping boundary, and divide the value range covered by the sorted result into at least two value spaces according to the grouping boundary;
the encoding unit is further configured to query, using the value space in which the attribute value of each second-type attribute of the property falls as an index, the correspondence between value spaces and digital feature values of a set length in an encoding rule for the second-type attributes, so as to obtain the digital feature value corresponding to the attribute value of each second-type attribute, and to combine these digital feature values in a preset order to obtain the digital feature value of the second-type attributes of the property;
when an attribute of the property is a third-type attribute, the encoding unit is further configured to query, using the attribute value of each third-type attribute of the property as an index, the correspondence between attribute values and digital feature values of a set length in an encoding rule for the third-type attributes, so as to obtain the digital feature value corresponding to the attribute value of each third-type attribute, and to combine these digital feature values in a preset order to obtain the digital feature value of the third-type attributes of the property;
a combination unit configured to combine the digital feature values of the attributes of the property, in the order of the digital feature value of the first-type attributes, the digital feature value of the second-type attributes and the digital feature value of the third-type attributes, to form the digital feature of the property;
a comparison unit configured to compare the digital features of the properties described by the pieces of information in the information set, and to identify properties whose digital-feature similarity is higher than a similarity threshold as the same property, or to identify a predetermined number of properties having the highest digital-feature similarity as the same property, wherein the predetermined number is determined according to the proportion of information in the information set that repeatedly describes the same property;
an identification unit configured to combine, side by side, the different attribute values of the same attribute in the pieces of information describing the same property into a new attribute value, or to delete, from the different attribute values of the same attribute, those that do not carry the largest amount of information; and to delete from the information set the duplicate information corresponding to the same property.
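The encoding and combination steps recited in claim 6 can be illustrated with a minimal sketch. It assumes small, hypothetical lookup tables: the attribute names, code strings, code widths and range boundaries below are illustrative assumptions and are not taken from the patent. Dictionary insertion order stands in for the "preset order".

```python
# Hypothetical encoding rules; every code and boundary here is an assumption.
FIRST_TYPE_CODES = {
    "city": {"Beijing": "01", "Shanghai": "02"},
    "orientation": {"south": "1", "north": "2"},
}
SECOND_TYPE_SPACES = {
    # Each entry: ((lower bound inclusive, upper bound exclusive), code)
    "area": [((0, 60), "1"), ((60, 90), "2"), ((90, float("inf")), "3")],
}
THIRD_TYPE_CODES = {
    "decoration": {"rough": "0", "simple": "1", "fine": "2"},
}


def encode_lookup(record, table):
    """Look up each attribute value directly and join the codes in table order."""
    return "".join(codes[record[attr]] for attr, codes in table.items())


def encode_ranges(record, table):
    """Find the value space containing each value and join the codes in table order."""
    parts = []
    for attr, spaces in table.items():
        value = record[attr]
        for (low, high), code in spaces:
            if low <= value < high:
                parts.append(code)
                break
        else:
            raise ValueError(f"no value space covers {attr}={value!r}")
    return "".join(parts)


def digital_feature(record):
    """Concatenate first-type, second-type and third-type codes, in that order."""
    return (
        encode_lookup(record, FIRST_TYPE_CODES)
        + encode_ranges(record, SECOND_TYPE_SPACES)
        + encode_lookup(record, THIRD_TYPE_CODES)
    )
```

With these assumed tables, a listing in Beijing facing south, with an area of 75 and fine decoration, is encoded as the codes "01", "1", "2" and "2" joined in that order.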
7. The apparatus of claim 6,
the encoding unit is further configured to divide the information set into groups whose first-type attributes have the same digital feature value, and to sort the attribute values of the second-type attributes within the information of each group.
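The grouping of claim 7, combined with the largest-gap alternative for dividing value spaces in claim 6, might be sketched as follows. This is only one possible reading: the input shape (pairs of a first-type feature string and an attribute dictionary), the function names, and the choice of a single split point are assumptions made for illustration.

```python
from collections import defaultdict


def split_at_largest_gap(values):
    """Sort the values and split them at the largest adjacent difference."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return [ordered]
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps))  # the first largest gap is the grouping boundary
    return [ordered[: cut + 1], ordered[cut + 1 :]]


def value_spaces_per_group(records, attr):
    """Group records by first-type feature, then split each group's values.

    records: iterable of (first_type_feature, attributes_dict) pairs.
    """
    groups = defaultdict(list)
    for first_type_feature, attrs in records:
        groups[first_type_feature].append(attrs[attr])
    return {key: split_at_largest_gap(vals) for key, vals in groups.items()}
```

Grouping first means that value spaces are derived only from listings that already agree on the first-type attributes, so a gap in area or price is measured among comparable listings rather than across the whole information set.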
8. An information processing apparatus characterized by comprising a processor and a memory; the memory stores executable instructions for causing the processor to execute the information processing method according to any one of claims 1 to 5.
9. A computer-readable storage medium having stored thereon executable instructions that, when executed, implement the information processing method of any one of claims 1 to 5.
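The comparison and identification steps recited in the apparatus claims can be sketched in the same spirit. The claims do not specify a similarity measure, so the position-wise match ratio used here is an assumption, as are the threshold value, the "/" separator used to place differing values side by side, and the greedy merge order.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length feature strings agree."""
    if len(a) != len(b):
        raise ValueError("features must have the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_records(first, second):
    """Keep one record; place differing values of the same attribute side by side."""
    merged = dict(first)
    for attr, value in second.items():
        if attr not in merged:
            merged[attr] = value
        elif merged[attr] != value:
            merged[attr] = f"{merged[attr]}/{value}"
    return merged


def deduplicate(items, threshold=0.8):
    """Merge each (feature, record) pair into the first kept pair it resembles.

    Two items count as the same property when their similarity is strictly
    higher than the threshold (an assumed value, not from the patent).
    """
    kept = []
    for feature, record in items:
        for i, (kept_feature, kept_record) in enumerate(kept):
            if similarity(feature, kept_feature) > threshold:
                kept[i] = (kept_feature, merge_records(kept_record, record))
                break
        else:
            kept.append((feature, dict(record)))
    return kept
```

The sketch covers only the threshold alternative and the side-by-side merge; the alternatives of selecting a predetermined number of most similar properties, or of keeping only the attribute value with the largest amount of information, are not shown.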
CN201611036969.XA 2016-11-22 2016-11-22 Information processing method and information processing apparatus Active CN108090082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611036969.XA CN108090082B (en) 2016-11-22 2016-11-22 Information processing method and information processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611036969.XA CN108090082B (en) 2016-11-22 2016-11-22 Information processing method and information processing apparatus

Publications (2)

Publication Number Publication Date
CN108090082A CN108090082A (en) 2018-05-29
CN108090082B true CN108090082B (en) 2021-06-11

Family

ID=62168638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611036969.XA Active CN108090082B (en) 2016-11-22 2016-11-22 Information processing method and information processing apparatus

Country Status (1)

Country Link
CN (1) CN108090082B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991177B (en) * 2018-09-18 2021-05-04 北京国双科技有限公司 Material weight removing method and device
CN109840080B (en) * 2018-12-28 2022-08-26 东软集团股份有限公司 Character attribute comparison method and device, storage medium and electronic equipment
CN110012150B (en) * 2019-02-20 2021-07-30 维沃移动通信有限公司 Message display method and terminal equipment
CN109920016B (en) * 2019-03-18 2021-06-25 北京市商汤科技开发有限公司 Image generation method and device, electronic equipment and storage medium
CN110244886B (en) * 2019-05-20 2022-05-27 北京百度网讯科技有限公司 Information display method and device, computer equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
EP1679625B1 (en) * 2005-01-10 2012-09-12 Xerox Corporation Method and apparatus for structuring documents based on layout, content and collection
CN104182517A (en) * 2014-08-22 2014-12-03 北京羽乐创新科技有限公司 Data processing method and data processing device
CN105139134A (en) * 2015-08-31 2015-12-09 丁澄天 Registration directory management system for on-line real-estate integrated information management system of real-estate services
CN105740380A (en) * 2016-01-27 2016-07-06 北京邮电大学 Data fusion method and system
CN106033510A (en) * 2015-03-13 2016-10-19 阿里巴巴集团控股有限公司 Method and system for identifying user equipment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US7941442B2 (en) * 2007-04-18 2011-05-10 Microsoft Corporation Object similarity search in high-dimensional vector spaces
CN104281525B (en) * 2014-10-28 2016-12-07 中国人民解放军装甲兵工程学院 A kind of defect data analysis method and the method utilizing its reduction Software Testing Project
CN105279277A (en) * 2015-11-12 2016-01-27 百度在线网络技术(北京)有限公司 Knowledge data processing method and device
CN105488176A (en) * 2015-11-30 2016-04-13 华为软件技术有限公司 Data processing method and device

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
EP1679625B1 (en) * 2005-01-10 2012-09-12 Xerox Corporation Method and apparatus for structuring documents based on layout, content and collection
CN104182517A (en) * 2014-08-22 2014-12-03 北京羽乐创新科技有限公司 Data processing method and data processing device
CN106033510A (en) * 2015-03-13 2016-10-19 阿里巴巴集团控股有限公司 Method and system for identifying user equipment
CN105139134A (en) * 2015-08-31 2015-12-09 丁澄天 Registration directory management system for on-line real-estate integrated information management system of real-estate services
CN105740380A (en) * 2016-01-27 2016-07-06 北京邮电大学 Data fusion method and system

Also Published As

Publication number Publication date
CN108090082A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108090082B (en) Information processing method and information processing apparatus
CN104794242B (en) Searching method
CA3059929C (en) Text searching method, apparatus, and non-transitory computer-readable storage medium
CN106951527B (en) Song recommendation method and device
CN104077407A (en) System and method for intelligent data searching
CN107015987B (en) Method and equipment for updating and searching database
CN109002499B (en) Discipline correlation knowledge point base construction method and system
CN106815265B (en) Method and device for searching referee document
CN111782686A (en) User data query method and device, electronic equipment and storage medium
CN108241646B (en) Search matching method and device and recommendation method and device
US8788497B2 (en) Automated criterion-based grouping and presenting
US10169464B2 (en) System and method for a bidirectional search engine and its applications
CN111858922A (en) Service side information query method and device, electronic equipment and storage medium
CN115145871A (en) File query method and device and electronic equipment
WO2019055385A8 (en) Systems and methods for automated harmonized system (hs) code assignment
CN110909266A (en) Deep paging method and device and server
JP2012234343A (en) Similar character code group search supporting method, similar candidate extracting method, similar candidate extracting program, and similar candidate extracting apparatus
CN112860850B (en) Man-machine interaction method, device, equipment and storage medium
CN108874813B (en) Information processing method, device and storage medium
CN107291951B (en) Data processing method, device, storage medium and processor
US20180067938A1 (en) Method and system for determining a measure of overlap between data entries
CN106959960B (en) Data acquisition method and device
WO2014152892A1 (en) In-database connectivity components analysis of data
CN110895590A (en) Candidate object acquisition method and device, electronic equipment and storage medium
US20200311761A1 (en) System and method for analyzing the effectiveness and influence of digital online content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant