FR3011108A1

FR3011108A1 - METHOD FOR MEMORY MANAGEMENT IN HYBRID SIMULATION FOR SYSTEMS-ON-CHIP

Info

Publication number: FR3011108A1
Application number: FR1359163A
Authority: FR
Inventors: Eric Paire; Anthony Cuccaro
Original assignee: STMicroelectronics Grenoble 2 SAS
Current assignee: STMicroelectronics Grenoble 2 SAS
Priority date: 2013-09-24
Filing date: 2013-09-24
Publication date: 2015-03-27

Abstract

The invention relates to a method for simulating a digital electronic circuit comprising a host processor (CPU), a peripheral device (UC), and a memory (MEM) shared between the host processor and the peripheral device over a bus (B). The method comprises the following steps: partitioning the circuit into a high-level modeling domain comprising functional models (TLM) and a low-level modeling domain comprising clock-cycle-accurate models (RTL); defining the host processor in the high-level domain (TLM); defining the shared memory (MEM) and the peripheral device (UC) in the low-level domain (RTL); defining a phantom memory (10) in the high-level domain (TLM); intercepting accesses by the host processor to the shared memory and redirecting them to the phantom memory (10), without using the bus, through a direct memory interface (DMI) defined for the high-level modeling domain; and, when an access to the shared memory (MEM) by the host processor concerns data shared with the peripheral device, reproducing that access in the shared memory using the bus.

Description

METHOD FOR MEMORY MANAGEMENT IN HYBRID SIMULATION FOR SYSTEMS-ON-CHIP

Technical field of the invention

The invention relates to the validation of systems-on-chip using hybrid analysis techniques, which reproduce the behavior of the system by using several different analysis tools together. Such techniques make it possible, for example, to simulate some parts of the system in software while emulating other parts in hardware.

State of the art

The entire digital part of a system-on-chip is generally defined in a low-level functional description language, such as Verilog or VHDL, generically referred to as RTL ("Register Transfer Level"). RTL languages are cycle accurate, that is, they describe every event that occurs at each cycle of a clock. A file written in such a language allows synthesis tools to automatically generate the logic gates and other elementary components needed to perform the described functions.

A simulator for this level of description reproduces the functionality in software and proves far too slow when the circuit is complex. An emulator, which can be programmed to perform the logic functions in hardware, is preferred. However, since the clock driving an emulator is notably slower than the one intended for the system-on-chip, emulation also reaches its limits when the system-on-chip becomes very complex. In that case, hybrid simulation or emulation techniques are used.

These techniques consist of partitioning a system into two modeling domains: a low-level modeling domain, at the level of an RTL language, groups the circuits whose fine-grained, cycle-accurate behavior is to be analyzed; and a high-level modeling domain groups the circuits whose generic functions are to be used, in particular processors executing a program.

The SystemC language, as defined in the IEEE 1666-2011 standard, includes a class or extension called TLM ("Transaction Level Modeling") that can be used in the high-level modeling domain. This extension makes it possible, in particular, to quickly simulate the functionality of a processor system executing a program.

To link the two modeling domains, the high-level domain, referred to as TLM, and the low-level domain, referred to as RTL, a standardized interface called SCE-MI has been developed. It is described in the manual entitled "Standard Co-Emulation Modeling Interface (SCE-MI) Reference Manual", published on the Internet by the Accellera Systems Initiative consortium.

Figure 1 is a block diagram of an example system-on-chip, or SoC, partitioned to implement these techniques. The system comprises a general-purpose host processor CPU, a shared memory MEM, a peripheral device IP, and an input/output interface IO. All these elements are interconnected by a bus B. The processor CPU, the memory MEM and the interface IO are in the TLM domain, while the peripheral device IP is in the RTL domain. The bus B crosses the domains through an SCE-MI interface, for example.

This configuration makes it possible to analyze in fine detail the behavior of the device IP, for example a new circuit that has not yet been implemented in silicon. This analysis can be performed by emulating or simulating its behavior as described in RTL. In general, the TLM domain receives circuits whose reliability has been verified or which do not need fine analysis. In particular, as shown, it receives a general-purpose processor system CPU, with its memory MEM and an interface for supplying the program to the processor.

In such a configuration, any transaction on the bus B, initiated by the processor CPU or the device IP, is passed from one domain to the other through the SCE-MI interface. In RTL simulation or emulation, a bus transaction is slower, by several orders of magnitude, than the same transaction in the real circuit. Accesses to the memory MEM, which in practice occupy most of the bus bandwidth, would slow the simulation considerably if every transaction were actually simulated. To avoid such a slowdown, the TLM extension provides a direct memory interface DMI, which bypasses the bus for memory accesses performed by the processor CPU.

The simulation time saved by the DMI interface is considerable, but it requires the shared memory to be placed in the TLM domain. In some situations, it is desirable to place the shared memory in the RTL domain, for example to analyze a new memory structure, or to analyze circuits that interact closely with the memory, such as a memory management circuit or a DMA circuit.

Figure 2 is a block diagram illustrating an example configuration in which the shared memory is placed in the RTL domain, with, for example, a memory management circuit MCTRL and a DMA circuit. The peripheral device (IP) may comprise a microcontroller UC designed to share data with the processor CPU through the shared memory MEM. In this situation, the DMI interface, which is specific to the TLM domain, can no longer be used, and all accesses by the processor CPU to the memory go through the bus B to be taken into account in the RTL domain. This can slow the simulation by a factor of 50 to 100.

Summary of the invention

It would be desirable to speed up the simulation when the shared memory is placed in the RTL domain.

This need is addressed by a method for simulating a digital electronic circuit comprising a host processor, a peripheral device, and a memory shared between the host processor and the peripheral device over a bus. The method comprises the following steps: partitioning the circuit into a high-level modeling domain comprising functional models and a low-level modeling domain comprising clock-cycle-accurate models; defining the host processor in the high-level domain; defining the shared memory and the peripheral device in the low-level domain; defining a phantom memory in the high-level domain; intercepting accesses by the host processor to the shared memory and redirecting them to the phantom memory, without using the bus, through a direct memory interface (DMI) defined for the high-level modeling domain; and, when an access to the shared memory by the host processor concerns data shared with the peripheral device, reproducing that access in the shared memory using the bus.

According to one implementation of the method, a program executed by the host processor comprises cache management instructions having memory locations as parameters. The method then comprises the following steps: defining the phantom memory as a cache memory for the host processor, at least as large as the shared memory; and identifying the data shared with the peripheral device using the memory location supplied with each cache management instruction.

According to one implementation of the method, each cache management instruction is one of:
- a clean instruction, designed to synchronize the shared memory with the cache memory,
- an invalidate instruction, originally designed to mark data in the cache memory as requiring a refresh from the shared memory on a subsequent access, and
- a combined clean and invalidate instruction, originally designed to synchronize the shared memory with the dirty data of the cache memory, then invalidate data in the cache memory.

According to one implementation, the method comprises, to process an invalidate instruction, the step of refreshing the cache memory from the shared memory as soon as the invalidate instruction is processed.

According to one implementation, the method comprises, to process a combined clean and invalidate instruction, the following steps: defining a tracking memory in the high-level modeling domain, at least as large as the shared memory; when executing a clean instruction, duplicating in the tracking memory the resulting data transferred from the cache memory to the shared memory; when executing an invalidate instruction, duplicating in the tracking memory the resulting data transferred from the shared memory to the cache memory; and when executing the combined clean and invalidate instruction, comparing the data in the cache memory with the corresponding data in the tracking memory. Where the data differ, the cache data is transferred to the shared memory and to the tracking memory. Where the data are equal, the shared memory data is transferred to the cache memory and to the tracking memory.

Brief description of the drawings

Embodiments are set out in the following description, given without limitation in relation to the appended figures, in which:
- Figure 1, described above, shows an example of a hybrid simulation configuration in which a shared memory is placed in a high-level modeling domain;
- Figure 2, described above, shows an example of a hybrid simulation configuration in which the shared memory is placed in a low-level modeling domain;
- Figure 3 shows an embodiment of a hybrid simulation configuration that reduces simulation time when the shared memory is placed in the low-level modeling domain; and
- Figures 4A to 4C symbolize method steps illustrating various ways of using the configuration of Figure 3.

Description of a preferred embodiment of the invention

Some vendors of RTL code for embedded processors, such as ARM, offer models of their processors encapsulated at the TLM level, which also simulate the cache memory. Providing a cache memory in the TLM domain reduces memory accesses over the bus B, which would tend to reduce simulation time when the shared memory is in the RTL domain (Figure 2). However, a cache memory does not offer predictable behavior, because its effectiveness depends on the properties of the processing performed by the processor, and it is not effective when the processor handles non-repetitive data streams. It has thus been observed that simulation time does not decrease significantly in most cases, and can even increase in some cases, because simulating a cache memory is complex.

In practice, in a real system, most of the bus bandwidth may be occupied by the processor CPU running a resource-hungry program, such as a graphical user interface. The data actually shared between the processor CPU and the peripheral device, or microcontroller UC, may thus represent a tiny fraction of bus occupancy. It is proposed below to take advantage of this situation so that only transactions concerning shared data appear on the bus.
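To give a feel for the potential gain, the rough estimate below compares a run in which every processor access crosses the bus with a run in which only shared accesses do. The per-access costs and the shared fraction are illustrative assumptions, not figures from this description; the bus cost of 100 units is merely chosen in the range of the slowdown factor quoted above.

```python
def relative_cost(n_accesses, shared_fraction, bus_cost=100.0, dmi_cost=1.0):
    """Estimated simulation cost when only shared accesses use the bus.

    bus_cost and dmi_cost are hypothetical per-access costs in arbitrary units.
    """
    n_bus = n_accesses * shared_fraction   # accesses that must cross the bus
    n_dmi = n_accesses - n_bus             # accesses served locally via DMI
    return n_bus * bus_cost + n_dmi * dmi_cost


n = 1_000_000
all_on_bus = relative_cost(n, shared_fraction=1.0)    # shared memory in RTL, no phantom memory
mostly_dmi = relative_cost(n, shared_fraction=0.01)   # assume 1% of accesses touch shared data
speedup = all_on_bus / mostly_dmi
print(f"estimated speedup: {speedup:.1f}x")
```

Under these assumed numbers the estimate is dominated by the small shared fraction, which is the situation the description proposes to exploit.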

Figure 3 is a block diagram of a hybrid simulation configuration, of the type shown in Figure 2, incorporating in the TLM domain a phantom memory 10 that is managed in a particular way. In general, any memory transaction performed by the processor CPU, normally intended for the shared memory MEM located in the RTL domain, is intercepted and redirected to the phantom memory 10 using the DMI interface. However, a few transactions, those marked as having to reach the shared memory MEM without delay or those addressed to a peripheral device such as the DMA circuit, are actually placed on the bus B and are therefore handled directly by their recipient (MEM or DMA).

As a result, the shared memory MEM ultimately contains only the shared data produced either by the peripheral device DMA or by the transactions placed on the bus B. The other data (or program instructions) used exclusively in the TLM domain, in particular by the processor CPU, are held only in the phantom memory, where they are handled exclusively through the DMI interface.
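The routing described for Figure 3 can be pictured with a small behavioral model. This is only a sketch in plain Python, not SystemC or TLM code: the class name `PhantomMemoryRouter` and the `to_bus` flag are hypothetical, and the flag simply stands in for whatever mechanism marks a transaction as needing to reach the shared memory.

```python
class PhantomMemoryRouter:
    """Toy model: CPU accesses hit a TLM-side phantom memory; only flagged ones reach the bus."""

    def __init__(self, size):
        self.phantom = bytearray(size)   # phantom memory 10 (TLM domain)
        self.shared = bytearray(size)    # shared memory MEM (RTL domain)
        self.bus_transactions = 0        # counts the slow bus crossings

    def cpu_write(self, addr, data, to_bus=False):
        # Fast path: the write always lands in the phantom memory (DMI-like access).
        self.phantom[addr:addr + len(data)] = data
        if to_bus:
            # Slow path: reproduce the access in the shared memory over the bus.
            self.shared[addr:addr + len(data)] = data
            self.bus_transactions += 1

    def cpu_read(self, addr, length):
        # Reads are served from the phantom memory without using the bus.
        return bytes(self.phantom[addr:addr + length])


router = PhantomMemoryRouter(64)
router.cpu_write(0, b"private")              # stays in the TLM domain
router.cpu_write(16, b"cmd", to_bus=True)    # shared data, also sent over the bus
```

After these two writes the shared memory holds only the flagged data, while the phantom memory holds everything the processor wrote.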

The phantom memory 10 in fact behaves like a cache memory of unlimited size, or at least of a size equal to that of the shared memory, configured to keep a local view of the shared memory in RTL. That is, it contains a local copy of the data read or written by the processor in the RTL memory; this copy is not necessarily coherent at every instant with the RTL memory, because it does not reflect modifications made by models internal to the RTL domain.

Every processor architecture capable of managing a cache memory provides a way to identify data that the developer does not want cached. Marking the shared data in this way could achieve the desired functionality, since shared data would always be exchanged through transactions on the bus B, and non-shared data through DMI transactions. However, this solution would require modifying the program executed by the processor, which conflicts with the aim of testing the system with its original program, whose source code is not necessarily available.

In order to use the original program without modification, it is proposed to exploit cache management instructions, which are necessarily present in the original program, since any complex processor system today includes a cache memory. These instructions may be:

- A "clean" instruction, used to force the synchronization of the shared memory with the cache memory. This instruction is used in particular to update the shared memory with "dirty" data, that is, data that has just been written to the cache memory and has not yet been written to the shared memory. Such an instruction is used, for example, when the program executed by the processor has written shared data to memory and is about to signal its availability to the peripheral device. The "clean" instruction thus guarantees that this data is up to date in the shared memory before the peripheral device reads it.
- An "invalidate" instruction, used to mark cached data as invalid and to force the processor to read the shared memory, bypassing the cache memory, the next time this data is read. Following this read, the cache memory is updated with the contents of the shared memory. Such an instruction is used, for example, when the peripheral device has made shared data available in the memory and has signaled this to the processor by an interrupt. The data held in the cache memory then no longer matches the contents of the shared memory; the invalidation instruction ensures that the processor reads the most recent data from the shared memory.
- A combined "clean and invalidate" instruction. This instruction performs the two previous operations atomically. The clean function, however, only affects data marked as "dirty" in the cache memory. Such an instruction is used, for example, when the processor writes and reads data at the same memory location. The written data may be a command for the peripheral device, which returns a result for the processor at the same location.
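As an illustration of the conventional behavior of these three instructions on a real write-back cache, the following C++ sketch models a single cache line in front of a single shared-memory word. The `TinySystem` and `CacheLine` types, the flag names, and the write-back policy are assumptions made for the example; they are not taken from the described method.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-line cache in front of a one-word shared memory.
struct CacheLine {
    std::uint32_t data = 0;
    bool valid = false;  // the line holds a usable copy
    bool dirty = false;  // the line was written but not yet copied back
};

struct TinySystem {
    std::uint32_t shared = 0;  // shared-memory word backing the line
    CacheLine line;

    // Processor write: lands in the cache only (write-back policy assumed).
    void cpu_write(std::uint32_t v) {
        line.data = v;
        line.valid = true;
        line.dirty = true;
    }

    // Processor read: a miss refills the line from the shared memory.
    std::uint32_t cpu_read() {
        if (!line.valid) {
            line.data = shared;
            line.valid = true;
            line.dirty = false;
        }
        return line.data;
    }

    // "clean": copy dirty data back to the shared memory.
    void clean() {
        if (line.valid && line.dirty) {
            shared = line.data;
            line.dirty = false;
        }
    }

    // "invalidate": drop the cached copy so that the next read refetches.
    void invalidate() {
        line.valid = false;
        line.dirty = false;
    }

    // "clean and invalidate": both operations in sequence.
    void clean_and_invalidate() {
        clean();
        invalidate();
    }
};
```

In this sketch, a value written by the peripheral directly into `shared` stays invisible to `cpu_read()` until `invalidate()` is called, which is the situation the "invalidate" instruction addresses.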

Each of the above cache management instructions is executed with a parameter identifying a memory location. Depending on the processor architecture, the memory location may be an individual cache line or all the cache lines corresponding to an address range.
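When the parameter designates an address range, the affected cache lines can be found by aligning the range to line boundaries. The helper below is a minimal sketch of that arithmetic; the names `covering_lines` and `LineSpan` and the 64-byte line size `kLineSize` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical line size, chosen only for the example.
constexpr std::uint64_t kLineSize = 64;

struct LineSpan {
    std::uint64_t base;  // first byte covered, aligned to kLineSize
    std::uint64_t size;  // number of bytes covered, a multiple of kLineSize
};

// Expand [addr, addr + len) to the smallest run of whole cache lines.
inline LineSpan covering_lines(std::uint64_t addr, std::uint64_t len) {
    assert(len > 0);
    const std::uint64_t first = addr / kLineSize * kLineSize;
    const std::uint64_t last = (addr + len - 1) / kLineSize * kLineSize;
    return LineSpan{first, last - first + kLineSize};
}
```

For instance, an 8-byte range starting at `0x13C` straddles a line boundary and therefore expands to two lines starting at `0x100`.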

Each of these instructions thus identifies a memory location that can be considered to hold shared data. The phantom memory 10 may be designed as a cache memory that responds to these instructions in a more or less repurposed manner, as described below. Since the phantom memory 10 is at least the same size as the shared memory, its model can omit the mechanisms that make a cache memory complex, such as indexing, eviction, and cache-line flag management, which speeds up the simulation. The model of the phantom memory 10 is therefore no longer a conventional cache model; it will nevertheless continue to be called the "cache memory" below.

The functionality desired for a "clean" instruction is that offered by a conventional cache memory. The "invalidate" and "clean and invalidate" instructions, however, are repurposed.

To handle the "clean" and "invalidate" instructions, a coherence management module CCTRL is provided. To handle the "clean and invalidate" instruction, a tracing memory TRK is also provided, at least the same size as the shared memory MEM. The memory 10 is used by the processor CPU through the DMI interface. When required, the management module CCTRL performs the transfers between the memory 10 and the bus B, and between the memories 10 and TRK.

Figures 4A to 4C illustrate the use of each of the three cache management instructions above.

Figure 4A illustrates one way of processing the "clean" instruction by means of an example. The cache memory 10 contains data written by the processor, in particular "CPU DATA" data exclusive to the processor, and data at the location of address A1, intended to be shared with the peripheral device UC.

The instruction CLEAN(A1) is executed by the processor to prepare the sharing of this data with the peripheral device. This instruction is passed to the management module CCTRL, which processes it by transferring the contents of location A1 from the cache memory to the shared memory MEM over the bus B.

The module CCTRL also duplicates these contents in the tracing memory TRK, whose role is described later. The reading of the cache memory by the module CCTRL and the duplication in the memory TRK need not follow a standardized simulation procedure; these operations can be performed internally in software, in a way that does not slow down the simulation.

At the end of the processing, the shared memory MEM therefore contains the same data as the cache memory 10; a DMA peripheral device can therefore use this data reliably.

Note that the shared memory MEM may contain "UC DATA" data written by the microcontroller of the peripheral device over the bus B. This data, if exclusive to the peripheral device, will never be duplicated in the cache memory. In the real system, as shown in dashed lines, it would share the same memory with the "CPU DATA" data and location A1. Consequently, the "CPU DATA" data, the "UC DATA" data, and location A1 use disjoint address ranges, as illustrated.
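The processing of Figure 4A can be sketched as follows in C++. This is only an illustration under stated assumptions: `BusMemory` and `CleanController` are hypothetical names, the bus B is reduced to a counted function call rather than a real TLM or RTL transaction, and the memories are plain byte arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical stand-in for the shared memory MEM reached over the bus B.
// Each call represents slow, bus-accurate traffic, so it is counted.
struct BusMemory {
    Bytes mem;
    std::size_t transactions = 0;

    explicit BusMemory(std::size_t size) : mem(size, 0) {}

    void write(std::size_t addr, const std::uint8_t* src, std::size_t len) {
        assert(addr + len <= mem.size());
        for (std::size_t i = 0; i < len; ++i) mem[addr + i] = src[i];
        ++transactions;
    }
};

// Hypothetical coherence controller handling only the CLEAN instruction.
struct CleanController {
    Bytes shadow;    // phantom memory 10, written by the CPU through DMI
    Bytes trk;       // tracing memory TRK
    BusMemory& bus;  // path to the shared memory MEM

    CleanController(std::size_t size, BusMemory& b)
        : shadow(size, 0), trk(size, 0), bus(b) {}

    // Fast path: a CPU store goes to the phantom memory, with no bus traffic.
    void cpu_store(std::size_t addr, std::uint8_t value) {
        shadow.at(addr) = value;
    }

    // CLEAN(addr, len): replay the range on the bus and mirror it in TRK.
    void clean(std::size_t addr, std::size_t len) {
        assert(addr + len <= shadow.size());
        bus.write(addr, shadow.data() + addr, len);
        for (std::size_t i = 0; i < len; ++i) trk[addr + i] = shadow[addr + i];
    }
};
```

In this sketch, stores to processor-exclusive addresses never increment `transactions`; only the range named in `clean` reaches the bus, which is where the simulation speed-up is expected to come from.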

Figure 4B illustrates one way of processing the "invalidate" instruction by means of an example. The microcontroller UC of the peripheral device has just written data to be shared at a location of address A2 in the memory MEM. This write was performed over the bus B. Depending on the design of the program, the microcontroller may issue an interrupt to the processor CPU to signal the availability of this data, or the processor may regularly poll a particular location to find the new data (using transactions to the shared memory MEM). The processor prepares to read location A2 by executing the instruction INVALIDATE(A2), originally intended to mark the corresponding data in the cache memory as "invalid".

In a real system, when the processor reads location A2, the cache memory signals a cache miss, causing the data to be read directly from the shared memory and, at the same time, the corresponding cache lines to be refreshed. In the simulated system, to avoid implementing this complex mechanism, the module CCTRL processes the instruction by immediately transferring the data of location A2 from the shared memory MEM to the cache memory over the bus B. Location A2 is at the same time duplicated in the tracing memory TRK, whose role is described later.

At the end of the processing, the cache memory 10 therefore contains the same data as the shared memory MEM; the processor CPU can therefore use this data reliably with "fast" DMI transactions.

Figure 4C illustrates one way of processing the "clean and invalidate" instruction by means of an example. The processor CPU and the DMA peripheral device are assumed to share the same location of address A1. The contents of the different copies of location A1 are shown side by side in three successive steps.

In a first step, the processor writes data CPU-D to location A1. This data first appears in the cache memory. The locations A1 in the shared memory MEM and in the memory TRK keep their original contents.

The processor executes an instruction CLEAN&INVALIDATE(A1). In a real system, this instruction would check a flag associated with the data in the cache memory and would cause the shared memory to be updated only if the flag indicates that the data is dirty. In all cases, the real system would then mark the data as invalid by means of another flag.

In the simulated system, these flags are not implemented; to determine whether the data is dirty, the contents of location A1 in the cache memory are compared with the contents of the same location in the tracing memory TRK. Indeed, the function of the memory TRK, as defined so far, is to keep a trace of the last shared data exchanged over the bus B so that this comparison can be made; it is in effect an image of the shared data as it would be in the real shared memory.

In this first step, the contents of the cache memory and of the memory TRK differ. This causes the data CPU-D to be transferred from the cache memory to the shared memory over the bus B, and to the memory TRK, as for the processing of a simple "clean" instruction (Figure 4A). The "invalidate" component of the instruction is not carried out. The different memories then contain the data CPU-D at location A1, as illustrated.

The processor waits for a response at location A1. This response may be signaled by an interrupt issued by the peripheral device, or the processor may regularly poll a particular location to find out whether the contents of location A1 have been modified. Before reading the contents of location A1, the processor executes a new instruction CLEAN&INVALIDATE(A1). This time, the locations A1 of the cache memory and of the tracing memory contain the same data CPU-D; the data in the cache memory is considered "clean" and is not written to the shared memory. Instead, location A1 is invalidated in the cache memory, as is done for an "invalidate" instruction alone (Figure 4B): the contents of location A1 of the shared memory are duplicated in the cache memory, over the bus B, and in the tracing memory TRK. If the peripheral device has previously modified the contents of this location, the modified contents UC-D are thus duplicated in the cache memory and in the tracing memory. The processor can then continue its processing.
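The handling of the "invalidate" and "clean and invalidate" instructions described for Figures 4B and 4C can be sketched in C++ as follows. The sketch rests on several assumptions that are not in the text: `CoherenceCtrl` is a hypothetical name, the shared memory MEM is a plain array whose accesses are merely counted as bus transfers, and the comparison with TRK is made on the whole range at once rather than datum by datum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Buf = std::vector<std::uint8_t>;

// Hypothetical controller modelled on the module CCTRL.
struct CoherenceCtrl {
    Buf shadow;  // phantom memory 10 (CPU side, reached through DMI)
    Buf trk;     // tracing memory TRK: last data exchanged over the bus
    Buf mem;     // shared memory MEM (peripheral side)
    std::size_t bus_transfers = 0;

    explicit CoherenceCtrl(std::size_t size)
        : shadow(size, 0), trk(size, 0), mem(size, 0) {}

    // CLEAN: phantom memory -> shared memory, mirrored in TRK.
    void clean(std::size_t a, std::size_t n) {
        assert(a + n <= shadow.size());
        std::copy_n(shadow.begin() + a, n, mem.begin() + a);
        std::copy_n(shadow.begin() + a, n, trk.begin() + a);
        ++bus_transfers;
    }

    // INVALIDATE: eager refresh, shared memory -> phantom memory and TRK.
    void invalidate(std::size_t a, std::size_t n) {
        assert(a + n <= shadow.size());
        std::copy_n(mem.begin() + a, n, shadow.begin() + a);
        std::copy_n(mem.begin() + a, n, trk.begin() + a);
        ++bus_transfers;
    }

    // CLEAN&INVALIDATE: the range counts as dirty when it differs from TRK.
    // A dirty range is only cleaned; a clean range is refreshed.
    void clean_and_invalidate(std::size_t a, std::size_t n) {
        assert(a + n <= shadow.size());
        const bool dirty = !std::equal(shadow.begin() + a,
                                       shadow.begin() + a + n,
                                       trk.begin() + a);
        if (dirty) {
            clean(a, n);
        } else {
            invalidate(a, n);
        }
    }
};
```

Replaying the steps of Figure 4C on this sketch, a first `clean_and_invalidate` after the processor has written CPU-D pushes the data to `mem` and `trk`, and a second call, made after the peripheral has written UC-D into `mem`, pulls UC-D back into `shadow` and `trk`.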

Many variants and modifications of the method described here will be apparent to those skilled in the art. The method has been described by way of example in connection with cache management instructions that may be specific to certain processors. The teachings described are nevertheless applicable to any cache management instruction, or other instruction, that makes it possible to discriminate between exclusive data and shared data.

Claims

1. A method of simulating a digital electronic circuit comprising a host processor (CPU), a peripheral device (UC), and a memory (MEM) shared between the host processor and the peripheral device by a bus (B), the method comprising the following steps:
- partitioning the circuit into a high-level modeling domain comprising functional models (TLM) and a low-level modeling domain comprising cycle-accurate models (RTL);
- defining the host processor in the high-level domain (TLM);
- defining the shared memory (MEM) and the peripheral device (UC) in the low-level domain (RTL);
- defining a phantom memory (10) in the high-level domain (TLM);
- intercepting the accesses of the host processor to the shared memory, and redirecting the accesses to the phantom memory (10), without using the bus, through a direct memory interface (DMI) defined for the high-level modeling domain; and
- when an access to the shared memory (MEM) by the host processor concerns data shared with the peripheral device, reproducing this access in the shared memory using the bus.

2. The method of claim 1, wherein a program executed by the host processor includes cache management instructions having memory locations as parameters, the method comprising the following steps:
- defining the phantom memory (10) as a cache memory for the host processor, at least the same size as the shared memory (MEM); and
- identifying the data shared with the peripheral device using the memory location provided as a parameter of each cache management instruction.

3. The method of claim 2, wherein each cache management instruction is one of:
- a clean instruction, designed to synchronize the shared memory with the cache memory;
- an invalidation instruction, originally designed to mark data of the cache memory as requiring a refresh from the shared memory upon a subsequent access; and
- a combined clean-and-invalidate instruction, originally designed to synchronize the shared memory with the dirty data of the cache memory, and then to invalidate the data of the cache memory.

4. The method of claim 3, comprising, for processing an invalidation instruction, the following step:
- refreshing the cache memory from the shared memory as soon as the invalidation instruction is processed.

5. The method of claim 3, comprising, for processing a combined clean-and-invalidate instruction, the following steps:
- defining a tracing memory (TRK) in the high-level modeling domain, at least the same size as the shared memory (MEM);
- when executing a clean instruction, duplicating in the tracing memory (TRK) the resulting data transferred from the cache memory (10) to the shared memory (MEM);
- when executing an invalidation instruction, duplicating in the tracing memory (TRK) the resulting data transferred from the shared memory (MEM) to the cache memory (10); and
- when executing the combined clean-and-invalidate instruction: comparing the data of the cache memory (10) with the corresponding data of the tracing memory (TRK); for unequal data, transferring the data of the cache memory to the shared memory and to the tracing memory; and for equal data, transferring the data of the shared memory to the cache memory and to the tracing memory.