FR3108195A1

FR3108195A1 - Process for classifying real estate listings using machine learning

Info

Publication number: FR3108195A1
Application number: FR2002579A
Authority: FR
Inventors: Noel Tchidjo Moyo
Original assignee: Surfyn
Current assignee: Surfyn
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2021-09-17

Abstract

The invention defines a method, and a device implementing the method, for learning and then recognizing real estate rental and purchase ads that offer the same physical property. The method has two phases. During its training phase, the method extracts the characteristics of the ads, groups the ads into pairs, then classifies the pairs according to a list of satisfied criteria and the indications provided by the user on the similarity of the ads. During its processing phase, the method receives several ads, groups them into pairs, then identifies the criteria on which the two ads of each pair are identical. The system then calculates, for each pair, the probability that its two ads are identical and the probability that they are not, given that the previously identified criteria are satisfied. The greater of the two probabilities determines the conclusion. Finally, the method merges the identical pairs that represent the same physical property. This method has been implemented on a computer within an intelligent application that assists in searching for real estate.

Description

Method of classifying real estate ads using machine learning

Field of the invention

The invention lies in the field of ad classification, more precisely the classification of real estate ads found on the web, by means of a Bayesian learning algorithm. The invention defines a method for learning and then recognizing real estate ads (for rental or purchase) that offer the same physical property.

Technical problem

In the field of real estate search, several ads may represent the same property even though their descriptions differ. For example, ad 1 gives the area as "29.6 m²" and ad 2 as "approximately 30 m²", although ads 1 and 2 describe the same physical property. Similarly, ad 3 may give a distance from transport of "8 min", ad 4 "about 10 minutes", and ad 5 "near transport", yet all three concern the same physical property. As a further difference, ad 6 may show only a photo of the facade, while ad 7 shows photos of the living room and a bedroom, and ad 8 shows no photo at all.

These differences in description arise because the same property has been entered by different people (different real estate agencies, private individuals) on the various real estate ad platforms.

This is a real headache for anyone looking for a property: the searcher wastes a great deal of time going from one ad to another, no longer knowing whether they concern the same property, and ultimately not knowing what the characteristics of the property are or whether it matches the search criteria (this is also because some characteristics of the same property appear in one ad but not in another).

In summary, the technical problem is as follows:

Given several real estate ads, regardless of where they come from (different real estate ad websites, the same website, or real estate agency websites), how can we automatically determine, using a computer, whether a group of ads represents the same physical property?

State of the art

Learning with the Bayesian model has long been used for text classification [1] [2]. Classification generally consists in determining, from a text, the category to which the text belongs. Applying the naive Bayesian model therefore amounts to finding the probability that the text belongs to a category given certain characteristics of the text.

The corresponding formula is obtained by starting from Bayes' theorem and by assuming independence between the different characteristics (characteristic₁, …, characteristicₙ).
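In their standard textbook form, the relations this passage refers to can be written as below. The notation is an assumption (C for the category, c_1, …, c_n for the characteristics), and the exact form printed in the original formulas may differ.

```latex
% Bayes' theorem for a category C and characteristics c_1, ..., c_n
P(C \mid c_1, \dots, c_n) = \frac{P(c_1, \dots, c_n \mid C)\, P(C)}{P(c_1, \dots, c_n)}

% Naive assumption: the characteristics are independent given the category C
P(C \mid c_1, \dots, c_n) = \frac{P(C) \prod_{i=1}^{n} P(c_i \mid C)}{P(c_1, \dots, c_n)}
```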

One of the most common use cases today is spam detection in emails. Given the subject and body of an email, the technique calculates the probability that the email is spam, then the probability that it is not spam; the greater of the two probabilities determines the conclusion [3].

In the context of real estate ads, the assumption of independence between the characteristics of ads describing the same property is not acceptable. This can be verified: under this assumption, calculating the probabilities with the formula presented above yields values far greater than 1, which is impossible.
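A small numerical sketch of how such values can arise. It assumes, as one possible reading of the formula referred to above, that the denominator is also factorized into a product of marginal probabilities. The function name and all figures are invented for illustration.

```python
from math import prod

def naive_estimate(prior, likelihoods, marginals):
    """Compute P(C) * prod P(c_i | C) / prod P(c_i), treating all characteristics as independent."""
    return prior * prod(likelihoods) / prod(marginals)

# Hypothetical figures for two strongly correlated criteria
# (for example equal price and equal area between two ads).
estimate = naive_estimate(prior=0.3, likelihoods=[0.9, 0.9], marginals=[0.4, 0.4])
print(estimate)  # greater than 1, so not a valid probability
```

When the criteria are correlated, the product of marginals underestimates the true joint probability in the denominator, which is what lets the ratio exceed 1.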

Moreover, the real estate ad classification we want to perform consists in finding similar ads; in other words, we must find a way to take several ads as input at the same time. Taken individually, an ad has no value with respect to our need: it only makes sense in comparison with others. Nor can we predict the number n of ads that a group of similar ads will contain. This differs from the model used for email classification, where each input is a single email to classify and the prediction relies on its content.

As regards patents on real estate ad classification, [4] deals with the placement of real estate ads according to their availability. This clearly does not correspond to our problem, since we seek to identify identical ads within a list of ads. [5] concerns the application of a set of rules to classify paper ads by domain.

Today, to recognize identical ads, the user manually compares ads based on their text or photos. This is error-prone and time-consuming, and has become difficult with the multiplication of ad sources (ads from the same website, from different websites, from real estate agency websites).

There is also a need to identify similar ads for other applications, for example:
- identifying, for a real estate agency, ads entered in duplicate in its database, in order to avoid duplicates and free up memory space,
- listing all the properties for sale on the market for statistical purposes, without bias,
- declaring centrally the sale of a property to the several companies or websites that listed it, so as to withdraw the property from sale on all of these sites in a single action,
- deleting, on a real estate classified ad site, all ads related to a sold property, in order to update the site and free up memory space on the server,
- reducing computation time, for example for a real estate classified ad site displaying a rate of return per ad (for instance for a rental investment where a monthly loan payment is calculated), by avoiding calculating the rate of return several times for identical ads,
- identifying without duplicates the properties for rent, for a municipality, in order to collect tourist taxes.

Ultimately, the technical problem that the invention seeks to solve is the following: given a list of real estate ads, how can an automatic system group together the ads describing the same property?

The invention answers this question with a computer-implemented method for classifying real estate ads. The method according to the invention comprises at least:
- a training phase comprising:
* the retrieval of ads by robot software, their identification by a unique number, the extraction of their characteristics, and the storage of the ads and their characteristics,
* the identification, by a user, of similar ads,
* the generation of a table giving, for each characteristic, the list of pairs of similar ads having equal values and the list of pairs of non-similar ads having equal values,
* the calculation of the total number of pairs of similar ads, then of pairs of non-similar ads,
* the calculation of the probability of obtaining a pair of similar ads and of the probability of obtaining a pair of non-similar ads,
* the calculation of the probabilities of having n equal criteria between the two ads of a pair,
* the calculation of the probabilities of having n equal criteria between the two ads of a pair given that the two ads are similar,
* the calculation of the probabilities of having n equal criteria between the two ads of a pair given that the two ads are not similar,
- a processing phase comprising:
* the retrieval of a set of M ads to be processed,
* the grouping of the M ads into M2 pairs of ads, then for each pair of ads:
@ the determination of the criteria common to the two ads,
@ the determination of a probability that the two ads are similar, from the criteria common to the two ads and the probabilities calculated during the training phase,
@ the determination of a probability that the two ads are different, from the criteria common to the two ads and the probabilities calculated during the training phase,
@ a decision as to whether the two ads designate an identical property, made by comparing the probability that the two ads of the pair are similar with the probability that they are different,
* the generation of a list of identical properties from the decisions made on the pairs of ads.
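The pairing and common-criteria steps of the processing phase could be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes ads are dictionaries keyed by hypothetical characteristic names, that a criterion is satisfied when both ads carry the same non-missing value, and that unordered pairs are enumerated with `itertools.combinations`.

```python
from itertools import combinations

def common_criteria(ad_a, ad_b, characteristics):
    """Return the characteristics on which both ads have the same known value."""
    return frozenset(
        name for name in characteristics
        if ad_a.get(name) is not None and ad_a.get(name) == ad_b.get(name)
    )

def pair_criteria(ads, characteristics):
    """Map each unordered pair of ad numbers to its set of common criteria."""
    return {
        (id_a, id_b): common_criteria(ads[id_a], ads[id_b], characteristics)
        for id_a, id_b in combinations(sorted(ads), 2)
    }

# Hypothetical ads; the field names are illustrative, not taken from the patent.
ads = {
    1: {"price": 250000, "area": 30, "rooms": 2},
    2: {"price": 250000, "area": 30, "rooms": None},
    3: {"price": 410000, "area": 30, "rooms": 2},
}
pairs = pair_criteria(ads, ["price", "area", "rooms"])
```

Exact equality is the simplest possible matching rule; values such as "29.6 m²" and "approximately 30 m²" would need a normalization step before this comparison.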

According to one embodiment, the step of determining a probability that the two ads are similar comprises calculating the probability

$$P(\mathrm{sim} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n) = \frac{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim})\, P(\mathrm{sim})}{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}$$

with crit1, crit2, ..., critn a set of criteria common to the two ads of the pair, and this calculation uses:
* the probability of having this set of common criteria given that the ads are similar, calculated during the training phase as $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim}) = N_{\mathrm{sim}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n) / N_{\mathrm{sim}}$,
* the probability of having this set of common criteria, $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$, calculated during the training phase,
* the probability of obtaining a pair of similar ads, calculated during the training phase as $P(\mathrm{sim}) = N_{\mathrm{sim}} / N_{\mathrm{total}}$,
and in which:
- $N_{\mathrm{sim}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is calculated by summing the pairs of similar ads verifying crit1 and crit2 … and critn,
- $N_{\mathrm{sim}}$ is the sum of all pairs of similar ads,
- $N_{\mathrm{total}}$ is the sum of all pairs of ads.

According to one embodiment, the step of determining a probability that the two ads are different comprises calculating the probability

$$P(\overline{\mathrm{sim}} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n) = \frac{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \overline{\mathrm{sim}})\, P(\overline{\mathrm{sim}})}{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}$$

with crit1, crit2, ..., critn a set of criteria common to the two ads of the pair, and this calculation uses:
* the probability of having this set of common criteria given that the ads are different, calculated during the training phase as $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \overline{\mathrm{sim}}) = N_{\overline{\mathrm{sim}}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n) / N_{\overline{\mathrm{sim}}}$,
* the probability of obtaining a pair of non-similar ads, calculated during the training phase as $P(\overline{\mathrm{sim}}) = N_{\overline{\mathrm{sim}}} / N_{\mathrm{total}}$, where $N_{\mathrm{total}}$ is the sum of all pairs of ads,
and in which:
- $N_{\overline{\mathrm{sim}}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is calculated by summing the pairs of non-similar ads verifying crit1 and crit2 … and critn,
- $N_{\overline{\mathrm{sim}}}$ is the sum of all pairs of non-similar ads.

According to one embodiment, it is decided that two ads designate an identical property when the probability that they are similar is greater than the probability that they are different, and that they designate different properties otherwise.

According to one embodiment, during the training phase the data are stored in a database or in files in JSON format.

According to one embodiment, during the training phase the similar ads are indicated by the user in a file in CSV format.

Device configured to implement at least one of a training phase and a processing phase of a method for classifying real estate ads as described above.

Figure 1 describes the method implemented on two computers. The robot software retrieves ads from several real estate ad sites; these ads are then supplied to the training phase, implemented on computer 1. The processing phase is implemented on computer 2. Computer 2 thus receives a group of ads and returns pairs corresponding to similar ads.

Figure 2 is an example of a JSON file after automatic retrieval from real estate ad sites.

Figure 3 is an example of a CSV file provided by a user during the training phase.

Specification of the invention

The method for classifying real estate ads according to the invention first comprises a training phase.

During this phase the ads are retrieved by robot software. The method assigns a unique number to each ad, and the characteristics of each ad are extracted. For each ad, these are:

- the price of the property
- the area of the property
- the year of construction
- the distance from public transport
- the number of bedrooms
- the city where the property is located
- the type of property
- the number of rooms
- the type of heating
- the number of cellars
- the floor on which the property is located
- the number of parking spaces
- the area of the land
- the presence of a balcony
- the presence of a terrace
- the type of kitchen

The list is not exhaustive; other characteristics may be added.

This information is stored in a database, or as files in JSON format (Figure 2).
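As an illustration of this storage step, the sketch below serializes one ad to JSON with Python's standard `json` module. The class name, field names and values are hypothetical; the actual layout is the one shown in Figure 2, which is not reproduced here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Ad:
    """One retrieved ad; the field names are illustrative assumptions."""
    ad_id: int                      # unique number assigned by the method
    price: Optional[float] = None
    area: Optional[float] = None    # in square metres
    rooms: Optional[int] = None
    city: Optional[str] = None
    property_type: Optional[str] = None

def to_json(ad: Ad) -> str:
    """Serialize an ad and its characteristics as a JSON document."""
    return json.dumps(asdict(ad), ensure_ascii=False, sort_keys=True)

def from_json(text: str) -> Ad:
    """Rebuild an ad from its JSON representation."""
    return Ad(**json.loads(text))

ad = Ad(ad_id=1, price=250000.0, area=29.6, rooms=2, city="Paris", property_type="apartment")
restored = from_json(to_json(ad))
```

Characteristics missing from an ad are kept as `None`, so that they can be skipped rather than counted as equal when two ads are compared.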

A user identifies the similar ads. The similar ads entered by the user during this phase are passed to the method as a file in CSV format (Figure 3).
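Reading such a file could look like the sketch below. The layout is an assumption made for illustration (one pair of ad numbers per row, separated by a semicolon); the real layout is the one shown in Figure 3.

```python
import csv
import io

def read_similar_pairs(csv_text, delimiter=";"):
    """Parse rows of two ad numbers into a set of unordered pairs (smaller number first)."""
    pairs = set()
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        first, second = int(row[0]), int(row[1])
        pairs.add((min(first, second), max(first, second)))
    return pairs

# Hypothetical file content.
sample = "1;2\n10;12\n12;10\n"
similar_pairs = read_similar_pairs(sample)
```

Storing each pair with the smaller number first means that (10;12) and (12;10) count as the same pair.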

The method uses this information to calculate intermediate probabilities characterizing two similar ads and two non-similar ads. To do so, it first generates a table of ad pairs from all the ads retrieved during the training phase. An example of a generated table is as follows:

| Criteria | Similar ads | Non-similar ads |
|---|---|---|
| Equal prices | (1;2) | (3;4) |
| Equal areas | (1;2) (10;12) | (4;5) |
| Equal number of rooms | (1;2) (10;12) | (3;5) |
| ... | ... | ... |
| Total | 300 | 700 |

The first column corresponds to the characteristics extracted from the training set of ads, the second column to the pairs of similar ads, and the third column to the pairs of non-similar ads. For example, ads 1 and 2 are similar and have equal prices, equal areas, an equal number of rooms, etc. By contrast, ads 3 and 4 are not similar although they have equal prices. The last row of the table gives the total number of pairs of similar ads and of pairs of non-similar ads.
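Such a table could be generated along the following lines. This is a sketch under stated assumptions: ads are dictionaries with hypothetical field names, the user-labelled pairs are given as a set of tuples with the smaller ad number first, and every unlabelled pair is treated as non-similar.

```python
from collections import defaultdict
from itertools import combinations

def build_pair_table(ads, similar_pairs, characteristics):
    """For each characteristic, list the similar and non-similar pairs having equal values."""
    table = defaultdict(lambda: {"similar": [], "non_similar": []})
    totals = {"similar": 0, "non_similar": 0}
    for pair in combinations(sorted(ads), 2):
        label = "similar" if pair in similar_pairs else "non_similar"
        totals[label] += 1
        first, second = ads[pair[0]], ads[pair[1]]
        for name in characteristics:
            if first.get(name) is not None and first.get(name) == second.get(name):
                table[name][label].append(pair)
    return dict(table), totals

# Hypothetical training ads: 1 and 2 are labelled similar by the user.
ads = {
    1: {"price": 250000, "area": 30},
    2: {"price": 250000, "area": 30},
    3: {"price": 410000, "area": 55},
    4: {"price": 410000, "area": 70},
}
table, totals = build_pair_table(ads, similar_pairs={(1, 2)}, characteristics=["price", "area"])
```

Here ads 3 and 4 end up in the non-similar list of the price criterion, mirroring the (3;4) entry of the example table.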

Thus the probability of obtaining similar ads is calculated as follows:

$$P(\mathrm{sim}) = \frac{N_{\mathrm{sim}}}{N_{\mathrm{total}}}$$

where $N_{\mathrm{sim}}$ is the sum of all pairs of similar ads and $N_{\mathrm{total}}$ is the sum of all pairs of ads.

Similarly, the probability of obtaining non-similar ads is calculated as follows:

$$P(\overline{\mathrm{sim}}) = \frac{N_{\overline{\mathrm{sim}}}}{N_{\mathrm{total}}}$$

where $N_{\overline{\mathrm{sim}}}$ is the sum of all pairs of non-similar ads.
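Applied to the totals of the example table (300 pairs of similar ads and 700 pairs of non-similar ads), the two probabilities can be computed as in this short sketch. It assumes that every pair is either similar or non-similar, so that the sum of all pairs is the sum of the two totals. The function name is illustrative.

```python
def class_priors(n_similar, n_non_similar):
    """Return (P(similar), P(non-similar)) from the two pair totals."""
    n_total = n_similar + n_non_similar
    return n_similar / n_total, n_non_similar / n_total

# Totals taken from the last row of the example table.
p_similar, p_non_similar = class_priors(300, 700)
```

With these totals the probability of a similar pair is 300/1000 and that of a non-similar pair is 700/1000.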

The probabilities of having n equal criteria between the two ads of a pair, $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$, are calculated from the numbers of pairs verifying the criteria, in which $N_{\mathrm{sim}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is calculated by summing the pairs of similar ads verifying crit1 and crit2 … and critn.

The probabilities of having n equal criteria between the two ads of a pair, given that the two ads are similar, are calculated as follows:

$$P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim}) = \frac{N_{\mathrm{sim}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}{N_{\mathrm{sim}}}$$

where $N_{\mathrm{sim}}$ is the sum of all pairs of similar ads.

The probabilities of having n equal criteria between the two ads of a pair, given that the two ads are not similar, are calculated as follows:

$$P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \overline{\mathrm{sim}}) = \frac{N_{\overline{\mathrm{sim}}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}{N_{\overline{\mathrm{sim}}}}$$

in which:

- $N_{\overline{\mathrm{sim}}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is calculated by summing the pairs of non-similar ads verifying crit1 and crit2 … and critn,

- $N_{\overline{\mathrm{sim}}}$ is the sum of all pairs of non-similar ads.
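These counts and ratios could be computed from the pair table as sketched below. Two points are assumptions rather than statements of the source: a pair "verifies crit1 and … critn" when it appears in the list of every criterion, and the unconditional probability of the criteria is taken as the share of all pairs (similar or not) verifying them, which makes the two posterior probabilities sum to 1. Names and figures are illustrative.

```python
def count_verifying(table, label, criteria):
    """Number of pairs with the given label that appear in the list of every criterion."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    pair_sets = [set(table[name][label]) for name in criteria]
    return len(set.intersection(*pair_sets))

def criteria_probabilities(table, totals, criteria):
    """Return the probabilities of the criteria: given similar, given non-similar, overall."""
    n_sim = count_verifying(table, "similar", criteria)
    n_non = count_verifying(table, "non_similar", criteria)
    n_total = totals["similar"] + totals["non_similar"]
    return {
        "given_similar": n_sim / totals["similar"],
        "given_non_similar": n_non / totals["non_similar"],
        # Assumption: all pairs verifying the criteria, whatever their label.
        "criteria": (n_sim + n_non) / n_total,
    }

# Toy table in the layout of the example table, with small hypothetical totals.
table = {
    "price": {"similar": [(1, 2)], "non_similar": [(3, 4)]},
    "area": {"similar": [(1, 2), (10, 12)], "non_similar": [(4, 5)]},
}
totals = {"similar": 4, "non_similar": 6}
probs = criteria_probabilities(table, totals, ["price", "area"])
```

In this toy table only pair (1;2) has both equal prices and equal areas, so the criteria are observed in one similar pair out of four and in no non-similar pair.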

The method for classifying real estate advertisements according to the invention then comprises a processing phase.

The processing phase operates as follows: for a list comprising M advertisements, the system first groups them into pairs of advertisements, then determines, for each pair, the probability that the two advertisements are similar. To do this, the system determines which similarity criteria are satisfied between the two ads, then calculates the probability that the pair of ads is similar as follows:

$$P(\mathrm{sim} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n) = \frac{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim}) \, P(\mathrm{sim})}{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}$$

with $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim})$, $P(\mathrm{sim})$ and $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ determined from the calculations made during the training phase, where sim denotes the event that the two ads are similar.

Similarly, the system calculates the probability that the pair of ads is not similar, knowing that $\mathrm{crit}_1, \mathrm{crit}_2, \mathrm{crit}_3, \dots, \mathrm{crit}_n$ are satisfied, as follows:

$$P(\mathrm{dis} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n) = \frac{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{dis}) \, P(\mathrm{dis})}{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}$$

with $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{dis})$, $P(\mathrm{dis})$ and $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ determined from the calculations made during the training phase, where dis denotes the event that the two ads are not similar.

The system declares a pair of ads similar if $P(\mathrm{sim} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is greater than $P(\mathrm{dis} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n)$; otherwise, the system declares that the pair of ads is not similar.
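The decision rule of the processing phase can be illustrated with a short sketch. It assumes the probabilities estimated during training are already available as plain numbers; the function and parameter names are hypothetical, and returning a zero posterior for a criteria combination never observed in training is a choice made for this example, not something the description specifies.

```python
def posterior(p_crit_given_class: float, p_class: float, p_crit: float) -> float:
    """Bayes' rule: P(class | criteria) = P(criteria | class) * P(class) / P(criteria)."""
    if p_crit == 0.0:
        # Assumption (not from the source): an unseen combination yields 0.
        return 0.0
    return p_crit_given_class * p_class / p_crit


def is_similar(
    p_crit_given_sim: float,
    p_sim: float,
    p_crit_given_dis: float,
    p_dis: float,
    p_crit: float,
) -> bool:
    """Declare the pair similar when the 'similar' posterior is strictly greater."""
    p_sim_post = posterior(p_crit_given_sim, p_sim, p_crit)
    p_dis_post = posterior(p_crit_given_dis, p_dis, p_crit)
    return p_sim_post > p_dis_post


# Toy values: the criteria were only ever seen on similar pairs.
decision = is_similar(
    p_crit_given_sim=0.5, p_sim=0.5, p_crit_given_dis=0.0, p_dis=0.5, p_crit=0.25
)
```

Because both posteriors share the same denominator, the comparison depends only on the two numerators; the division is kept here so that the sketch follows the two probabilities as described.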

Once each pair of ads has been identified as similar or not, the system merges the similar pairs that represent the same property. For example, for the following list of similar ad pairs: (1; 2) (2; 3) (4; 10) (11; 12) (12; 16), the system will return (1; 2; 3), (4; 10), (11; 12; 16), indicating that ads 1, 2 and 3 designate the same property, ads 4 and 10 designate the same property, and ads 11, 12 and 16 designate the same property.
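One way to perform this merging is to treat the ads as nodes and the similar pairs as edges, then collect the connected groups. The sketch below uses a union-find structure for that purpose; this is a technique chosen for illustration, since the description does not state which algorithm is used, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_similar_pairs(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group ad identifiers that are linked, directly or transitively, by similar pairs."""
    parent: Dict[int, int] = {}

    def find(ad: int) -> int:
        # Register unseen ads as their own group, then walk up to the root.
        parent.setdefault(ad, ad)
        while parent[ad] != ad:
            parent[ad] = parent[parent[ad]]  # path halving keeps the trees shallow
            ad = parent[ad]
        return ad

    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    groups: Dict[int, List[int]] = {}
    for ad in list(parent):
        groups.setdefault(find(ad), []).append(ad)
    return sorted(sorted(group) for group in groups.values())


groups = merge_similar_pairs([(1, 2), (2, 3), (4, 10), (11, 12), (12, 16)])
# groups == [[1, 2, 3], [4, 10], [11, 12, 16]]
```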

In the event that the system begins to produce inaccurate responses, the system may go through the training phase again.

The method can be implemented on a server or a computer. The training and processing phases can run on the same computer or on different computers. Typically, the training phase could be implemented on a computer with high computing power, in order to calculate the set of intermediate probabilities (as depicted in Figure 1). These probability values can be stored in a file and then transmitted to another computer that implements the processing phase. Several different computers can thus each execute an instance of the processing phase, making it possible to process several classification requests in parallel.
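The hand-off of probability values between the training computer and the processing computers can be sketched as a simple file round trip. JSON is used here as an assumption for the file format of this particular file, and the function names, file name and dictionary keys are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def save_probabilities(probabilities: Dict[str, float], path: Path) -> None:
    """Write the probabilities computed during training to a JSON file."""
    path.write_text(json.dumps(probabilities, indent=2), encoding="utf-8")


def load_probabilities(path: Path) -> Dict[str, float]:
    """Read the probabilities back on the machine that runs the processing phase."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Each processing instance would load the same file once at start-up, so that several instances can classify requests in parallel without repeating the training calculations.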

The method implemented in a server can be used in a real estate search engine website. It allows the search engine to group together identical ads so that a user quickly obtains all the information about the same property for sale.
[1] Stuart Russell and Peter Norvig. Artificial Intelligence. Pearson, 3rd edition, 2010.
[2] Eibe Frank and Remco R. Bouckaert. Naive Bayes for text classification with unbalanced classes. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin (Germany), September 2006.
[3] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian approach to filtering junk e-mail. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998.
[4] Cohen Omar J. Management system and method for placing property announcements, WO 2013005124, 2012.
[5] Gabriel-Antoine Brouze, Didier Durand, Didier Rey, Maurice Sarrasin, and Jean-Luc Vuattoux. Method and system for processing classified advertisements, EP 1361524, 2003.

Claims

A computer-implemented real estate advertisement classification method comprising at least:
- a training phase comprising:
* the retrieval of advertisements, their identification by a unique number, the extraction of their characteristics, and the storage of the advertisements and their characteristics,
* the identification and counting, by a user, of pairs of similar advertisements and pairs of dissimilar advertisements,
* the generation of a table describing, for each characteristic, the list of pairs of similar advertisements having equal values and the list of pairs of dissimilar advertisements having equal values,
* the calculation of the total number of similar ad pairs and of the total number of dissimilar ad pairs,
* the calculation of the probability that two ads are similar and the probability that two ads are not similar,
* the calculation of the probability of having n equal criteria between the two ads of a pair,
* the calculation of the probability of having n equal criteria between the two ads of a pair knowing that the two ads are similar,
* the calculation of the probability of having n equal criteria between the two ads of a pair knowing that the two ads are not similar,
- a processing phase comprising:
* the retrieval of a set of M advertisements to be processed,
* the grouping of the M advertisements into M² pairs of advertisements, then for each pair of advertisements:
@ the determination of the criteria common to the two advertisements,
@ the determination of a probability for the two advertisements to be similar, based on the criteria common to the two advertisements and the probabilities calculated during the training phase,
@ the determination of a probability for the two advertisements to be different, based on the criteria common to the two advertisements and the probabilities calculated during the training phase,
@ the making of a decision on whether the two advertisements designate an identical property, by comparing the probability that the two advertisements of the pair are similar and the probability that the two advertisements of the pair are different,
* the generation of a list of identical properties, from the decisions taken on the pairs of advertisements.

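A method of classifying real estate advertisements according to claim 1, wherein the step of determining a probability that the two advertisements are similar comprises calculating the probability

$$P(\mathrm{sim} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n) = \frac{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim}) \, P(\mathrm{sim})}{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}$$

with
- $\mathrm{crit}_1, \dots, \mathrm{crit}_n$, a set of criteria common to the two ads in the pair of ads,
- $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{sim}) = N_{\mathrm{sim}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n) / N_{\mathrm{sim}}$, the probability of having a set of common criteria knowing that the ads are similar, calculated during the training phase,
- $P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$, the probability of having a set of common criteria, calculated during the training phase,
- $P(\mathrm{sim}) = N_{\mathrm{sim}} / N$, the probability of obtaining a pair of similar ads, calculated during the training phase,
- $N_{\mathrm{sim}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is calculated by summing the pairs of similar ads verifying crit1 and crit2 … and critn,
- $N_{\mathrm{sim}}$ is the sum of all similar ad pairs and
- $N$ is the sum of all ad pairs.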

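Method for classifying real estate advertisements according to one of the preceding claims, in which the step of determining a probability that the two advertisements are different given a set of common criteria comprises calculating the probability

$$P(\mathrm{dis} \mid \mathrm{crit}_1, \dots, \mathrm{crit}_n) = \frac{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{dis}) \, P(\mathrm{dis})}{P(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}$$

with $\mathrm{crit}_1, \mathrm{crit}_2, \dots, \mathrm{crit}_n$, a set of criteria common to both ads in the pair of ads, and this calculation uses
* the probability of having this set of common criteria knowing that the ads are different, calculated during the training phase as follows

$$P(\mathrm{crit}_1, \dots, \mathrm{crit}_n \mid \mathrm{dis}) = \frac{N_{\mathrm{dis}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)}{N_{\mathrm{dis}}}$$

* the probability of obtaining a pair of dissimilar ads, calculated during the training phase as follows

$$P(\mathrm{dis}) = \frac{N_{\mathrm{dis}}}{N}$$

and in which:

- $N_{\mathrm{dis}}(\mathrm{crit}_1, \dots, \mathrm{crit}_n)$ is calculated by summing the pairs of dissimilar ads verifying crit1 and crit2 … and critn,
- $N_{\mathrm{dis}}$ is the sum of all dissimilar ad pairs,
- $N$ is the sum of all ad pairs.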

Method of classifying real estate advertisements according to one of the preceding claims, in which it is decided that two advertisements designate an identical property when the probability that they are similar is greater than the probability that they are different, and vice versa.

Method for classifying real estate advertisements according to one of the preceding claims, wherein during the training phase the data is stored in a database or in files in JSON format.

Method for classifying real estate advertisements according to one of the preceding claims, wherein during the training phase similar advertisements are indicated by the user in a file in CSV format.

Method for classifying real estate advertisements according to one of the preceding claims, wherein during the training phase the advertisements are retrieved by means of robot software.

Device configured to implement at least one of a training phase and a processing phase of a method for classifying real estate advertisements according to one of the preceding claims.