mjs 12 Jan 2001 Early Draft
mjs 14 Feb 2001 XScale survey revised, ARMop reentrancy defined

RISC OS Kernel ARM core support
===============================

This document is concerned with the design of open-ended support for
multiple ARM cores within the RISC OS kernel, as part of the work loosely
termed hardware abstraction. Note that the ARM core support is part of the
OS kernel, and so is not part of the hardware abstraction layer (HAL)
itself.

Background
----------

ARM core support (including caches and MMU) has historically been coded in
a tailored way for one or two specific variants. Since version 3.7 this has
meant just two variants: ARM 6/7 and StrongARM SA110. A more generic
approach is required for the next generation. This aims both to support
several cores in a more structured way, and to cover minor variants (eg.
cache size) with the same support code. The natural approach is to set up
run-time vectors to a set of ARM support routines.

Note that it is currently assumed that the ARM MMU architecture will not
change radically in future ARM cores. Hence, the kernel memory management
algorithms remain largely unchanged. This is believed to be a reasonable
assumption, since the last major memory management change was with Risc PC
and ARM 610 (when the on-chip MMU was introduced).

Note that all ARM support code must be 32-bit clean, as part of the 32-bit
clean kernel.

Survey of ARM core requirements
-------------------------------

At present, five broad ARM core types can be considered to be of interest:
ARM7 (and ARM6), ARM9, ARM10, StrongARM (SA1) and XScale. These divide
primarily in terms of cache types, and cache and TLB maintenance
requirements. They also span a range of defined ARM architecture variants,
which introduce variants for system operations (primarily coprocessor 15
instructions).

The current ARM architecture is version 5. This (and version 4) has some
open-ended definitions to allow code to determine cache size and types from
CP15 registers. Hence, the design of the support code can hope to be at
least tolerant of near-future variants as they are introduced.
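As an illustration, on architecture 4 and 5 cores that implement the cache
type register, the data cache geometry might be recovered along the
following lines (a sketch only, with field positions as defined for
architecture 5 and the multiplier (M) bit assumed zero; register usage is
arbitrary):

  ; Read and decode the CP15 cache type register (architecture 4/5)
          MRC     p15, 0, r0, c0, c0, 1   ; r0 = cache type register
          MOV     r1, r0, LSR #12         ; r1 = Dsize field (bits 23-12)
          AND     r2, r1, #2_11           ; len subfield (bits 1-0)
          MOV     r3, #8
          MOV     r2, r3, LSL r2          ; r2 = D line length (8 << len bytes)
          MOV     r3, r1, LSR #6
          AND     r3, r3, #2_1111         ; size subfield (bits 9-6)
          MOV     r1, #512
          MOV     r3, r1, LSL r3          ; r3 = D size (512 << size bytes)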
ARM7
----

ARM7 cores may be architecture 3 or 4. They differ in the coprocessor 15
operations required for the same cache and TLB control. ARM6 cores are much
the same as architecture 3 ARM7. The general character of all these cores
is of unified write-through caches that can only be invalidated on a global
basis. The TLBs are also unified, and can be invalidated per entry or
globally.

ARM9
----

ARM9 cores are architecture 4. We ignore ARM9 variants without an MMU. The
kernel can read cache size and features. The ARM 920 and 922 have Harvard
caches, with data caches capable of writeback or writethrough operation (on
a page or section granularity). Data and instruction caches can be
invalidated by individual lines or globally. The data cache can be cleaned
by virtual address or by cache segment/index, allowing for efficient cache
maintenance. Data and instruction TLBs can be invalidated by entry or
globally.

ARM10
-----

ARM10 is architecture 5. Few details are available at present. It is likely
to be similar to ARM9 in terms of cache features and available operations.

StrongARM
---------

StrongARM is architecture 4. StrongARMs have Harvard caches, the data cache
being writeback only (no writethrough option). The data cache can only be
globally cleaned in an indirect manner, by reading from otherwise unused
address space. This is inefficient because it requires external (to the
core) reads on the bus. In particular, the minimum cost of a clean, for a
nearly clean cache, is high. The data cache supports clean and invalidate
of individual virtual lines, so this is reasonably efficient for small
address ranges. The data TLB can be invalidated by entry or globally.

The instruction cache can only be invalidated globally. This is inefficient
for cases such as IMBs over a small range (dynamic code). The instruction
TLB can only be invalidated globally.

Some StrongARM variants have a mini data cache. This is selected over the
main cache on a section or page by setting the cachable/bufferable bits to
C=1,B=0 in the MMU (this is not standard ARM architecture). The mini data
cache is writeback and must be cleaned in the same manner as the main data
cache.

XScale
------

XScale is architecture 5. It implements Harvard caches, the data cache
being writeback or writethrough (on a page or section granularity). Data
and instruction caches can be invalidated by individual lines or globally.
The data cache can be fully cleaned by allocating lines from otherwise
unused address space. Unlike StrongARM, no external reads are needed for
the clean operation, so cache maintenance is efficient.

XScale has a mini data cache. This is only available by using extension
bits in the MMU. This extension is not documented in the current manual for
architecture 5, but will presumably be properly recognised by ARM. It
should be a reasonably straightforward extension for RISC OS. The mini data
cache can only be cleaned by inefficient indirect reads, as on StrongARM.

For XScale, the whole mini data cache can be configured as writethrough.
The most likely use for RISC OS is to map screen memory as mini-cacheable,
so writethrough caching will be selected to prevent problems with delayed
screen update (and hence the intricate screen/cache management code seen in
Ursula for StrongARM). With writethrough configured, most operations can
ignore the mini cache, because invalidation by virtual address will
invalidate mini or main cache entries as appropriate.

Unfortunately, for global cache invalidation, things are very awkward. RISC
OS cannot use the global cache invalidate operation (which globally
invalidates both data caches) unless it is very careful to 100% clean the
main cache with all interrupts (IRQs and FIQs) disabled. This is to avoid
fatal loss of uncleaned lines from the writeback main cache. Disabling
interrupts for the duration of a main cache clean is an unacceptable
latency. Therefore, reluctantly, RISC OS must do the equivalent of cleaning
the mini cache (slow physical reads) in order to globally invalidate it as
a side effect.

The instruction and data TLBs can each be invalidated by entry or globally.

Kernel ARM operations
---------------------

This section lists the definitions and API of the set of ARM operations
(ARMops) required by the kernel for each major ARM type that is to be
supported. Some operations may be very simple on some ARMs. Others may need
support from the kernel environment - for example, readable parameters that
have been determined at boot, or address space available for cache clean
operations.

The general rules for register usage and preservation in calling these
ARMops are:

  - any parameters are passed in r0, r1 etc. as required
  - r0 may be used as a scratch register
  - the routines see a valid stack via sp, with at least 16 words available
  - lr is the return link
  - on exit, all registers except r0 and lr must be preserved

Note that where register values are given as logical addresses, these are
RISC OS logical addresses. The equivalent ARM terminology is virtual
address (VA), or modified virtual address (MVA) for architectures with the
fast context switch extension.

Note also that where cache invalidation is required, it is implicit that
any associated operations for a particular ARM should be performed also.
The most obvious example is for an ARM with branch prediction, where it may
be necessary to invalidate a branch cache wherever instruction cache
invalidation is to be performed.

Any operation that is a null operation on the given ARM should be
implemented as a single return instruction:

  MOV pc, lr

ARMop reentrancy
----------------

In general, the operations will be called from SVC mode with interrupts
enabled. However, some use of some operations from interrupt mode is
expected. Notably, it is desirable for the IMB operations to be available
from interrupt mode. Therefore, it is intended that all implementations of
all ARMops be reentrant. Most will be so with no difficulty. For ARMs with
writeback data caches, the cleaning algorithm may have to be constructed
carefully to handle reentrancy (and to avoid turning off interrupts for the
duration of a clean).
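For example, a StrongARM global data cache clean might look roughly as
follows (a minimal sketch only; DCacheCleanAddress is an assumed
kernel-provided base of otherwise unused cacheable space, and a real
implementation would need to rotate or double-map that space to remain
safe against reentrant calls):

  Cache_CleanAll_SA110
          STMFD   sp!, {r1, r2}
          LDR     r0, =DCacheCleanAddress ; base of unused cacheable space
          ADD     r1, r0, #16*1024        ; SA-110 main data cache is 16k
  10      LDR     r2, [r0], #32           ; load one 32-byte line, evicting
                                          ; (and so writing back) a dirty line
          CMP     r0, r1
          BLO     %BT10
          MCR     p15, 0, r0, c7, c10, 4  ; drain write buffer
          LDMFD   sp!, {r1, r2}
          MOV     pc, lr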
Cache ARMops
------------

-- Cache_CleanInvalidateAll

The cache or caches are to be globally invalidated, with cleaning of any
writeback data being properly performed.

entry: -
exit:  -

Note that any write buffer draining should also be performed by this
operation, so that memory is fully updated with respect to any writeback
data.

The OS only expects the invalidation to be with respect to
instructions/data that are not involved in any currently active interrupts.
In other words, it is expected and desirable that interrupts remain enabled
during any extended clean operation, in order to avoid impact on interrupt
latency.

-- Cache_CleanInvalidateRange

The cache or caches are to be invalidated for (at least) the given range,
with cleaning of any writeback data being properly performed.

entry: r0 = logical address of start of range
       r1 = logical address of end of range (exclusive)
       Note that r0 and r1 are aligned on cache line boundaries
exit:  -

Note that any write buffer draining should also be performed by this
operation, so that memory is fully updated with respect to any writeback
data.

The OS only expects the invalidation to be with respect to
instructions/data that are not involved in any currently active interrupts.
In other words, it is expected and desirable that interrupts remain enabled
during any extended clean operation, in order to avoid impact on interrupt
latency.
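On an ARM920-class core, for instance, this might reduce to a loop over
cache lines by virtual address (a sketch only; a 32-byte line is assumed,
and a real implementation should compare the range against
Cache_RangeThreshold and fall back to Cache_CleanInvalidateAll for large
ranges):

  Cache_CleanInvalidateRange_ARM920
          STMFD   sp!, {r1}
  10      MCR     p15, 0, r0, c7, c14, 1  ; clean+invalidate DCache line (MVA)
          MCR     p15, 0, r0, c7, c5, 1   ; invalidate ICache line (MVA)
          ADD     r0, r0, #32             ; step to next cache line
          CMP     r0, r1
          BLO     %BT10
          MCR     p15, 0, r0, c7, c10, 4  ; drain write buffer
          LDMFD   sp!, {r1}
          MOV     pc, lr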
-- Cache_CleanAll

The unified cache or data cache is to be globally cleaned (any writeback
data updated to memory). Invalidation is not required.

entry: -
exit:  -

Note that any write buffer draining should also be performed by this
operation, so that memory is fully updated with respect to any writeback
data.

The OS only expects the cleaning to be with respect to data that are not
involved in any currently active interrupts. In other words, it is expected
and desirable that interrupts remain enabled during any extended clean
operation, in order to avoid impact on interrupt latency.

-- Cache_CleanRange

The cache or caches are to be cleaned for (at least) the given range.
Invalidation is not required.

entry: r0 = logical address of start of range
       r1 = logical address of end of range (exclusive)
       Note that r0 and r1 are aligned on cache line boundaries
exit:  -

Note that any write buffer draining should also be performed by this
operation, so that memory is fully updated with respect to any writeback
data.

The OS only expects the cleaning to be with respect to data that are not
involved in any currently active interrupts. In other words, it is expected
and desirable that interrupts remain enabled during any extended clean
operation, in order to avoid impact on interrupt latency.

-- Cache_InvalidateAll

The cache or caches are to be globally invalidated. Cleaning of any
writeback data is not to be performed.

entry: -
exit:  -

This call is only required for special restart use, since it implies that
any writeback data are either irrelevant or not valid. It should be a very
simple operation on all ARMs.

-- Cache_InvalidateRange

The cache or caches are to be invalidated for the given range. Cleaning of
any writeback data is not to be performed.

entry: r0 = logical address of start of range
       r1 = logical address of end of range (exclusive)
       Note that r0 and r1 are aligned on cache line boundaries
exit:  -

This call is intended for use in situations where another bus master (e.g.
DMA) has written to an area of cacheable memory, and stale data is to be
cleared from the ARM's cache so that software can see the new values. It is
important that only the indicated region is invalidated - neighbouring
cache lines may contain valid data that has not yet been written back.
Because software should not have been writing to the DMA buffer while the
DMA was in progress, it is permissible for this operation to both clean and
invalidate. E.g. if a writeback cache is in use, it would be incorrect to
promote a large invalidate to a global invalidate, but an implementation
could instead perform a global clean+invalidate.

The OS only expects the invalidation to be with respect to
instructions/data that are not involved in any currently active interrupts.
In other words, it is expected and desirable that interrupts remain enabled
during any extended clean operation, in order to avoid impact on interrupt
latency.

-- Cache_RangeThreshold

Return a threshold value for an address range, above which it is advisable
to globally clean and/or invalidate caches, for performance reasons. For a
range less than or equal to the threshold, a ranged cache operation is
recommended.

entry: -
exit:  r0 = threshold value (bytes)

This call returns a value that the kernel may use to select between
strategies in some cache operations. This threshold may also be of use to
some of the ARM operations themselves (although they should typically be
able to read the parameter more directly). The exact value is unlikely to
be critical, but a sensible value may depend on both the ARM and external
factors such as memory bus speed.

-- Cache_Examine

Return information about a given cache level.

entry: r1 = cache level (0-based)
exit:  r0 = flags
            bits 0-2: cache type:
              000 -> none
              001 -> instruction
              010 -> data
              011 -> split
              100 -> unified
              101-111 -> reserved
            other bits: reserved
       r1 = D line length
       r2 = D size
       r3 = I line length
       r4 = I size
       r0-r4 = zero if cache level not present

For unified caches, r1-r2 will match r3-r4. This call mainly exists for the
benefit of OS_PlatformFeatures 33.
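As an illustration, a core with fixed, known cache geometry can simply
return constants. A hypothetical implementation for an ARM920 (16k+16k
Harvard caches, 32-byte lines, a single cache level) might be:

  Cache_Examine_ARM920
          CMP     r1, #0                  ; only cache level 0 is present
          MOVNE   r0, #0                  ; level absent: return zeroes
          MOVNE   r1, #0
          MOVNE   r2, #0
          MOVNE   r3, #0
          MOVNE   r4, #0
          MOVNE   pc, lr
          MOV     r0, #2_011              ; flags: split (Harvard) caches
          MOV     r1, #32                 ; D line length
          MOV     r2, #16*1024            ; D size
          MOV     r3, #32                 ; I line length
          MOV     r4, #16*1024            ; I size
          MOV     pc, lr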
-- ICache_InvalidateAll

The instruction cache is to be globally invalidated.

entry: -
exit:  -

This operation should only act on instruction caches - not data or unified
caches. If only data or unified caches are present then the operation can
be implemented as a NOP.

-- ICache_InvalidateRange

The instruction cache is to be invalidated for the given range.

entry: r0 = logical address of start of range
       r1 = logical address of end of range (exclusive)
       Note that r0 and r1 are aligned on cache line boundaries
exit:  -

This operation should only act on instruction caches - not data or unified
caches. If only data or unified caches are present then the operation can
be implemented as a NOP.

Memory barrier ARMops
---------------------

-- DSB_ReadWrite (previously WriteBuffer_Drain)

This call is roughly equivalent to the ARMv7 "DSB SY" instruction:

  * Write buffers are drained
  * Full read/write barrier - no data load/store will cross the instruction
  * Instructions following the barrier will only begin execution once the
    barrier is passed - but any prefetched instructions are not flushed

entry: -
exit:  -

-- DSB_Write

This call is roughly equivalent to the ARMv7 "DSB ST" instruction:

  * Write buffers are drained
  * Write barrier - reads may cross the instruction
  * Instructions following the barrier will only begin execution once the
    barrier is passed - but any prefetched instructions are not flushed

entry: -
exit:  -

-- DSB_Read

There is no direct equivalent to this in ARMv7 (barriers are either W or
RW). However it's useful to define a read barrier, as (e.g.) on Cortex-A9 a
RW barrier would require draining the write buffer of the external PL310
cache, while a R barrier can simply be an ordinary DSB instruction.

  * Read barrier - writes may cross the instruction
  * Instructions following the barrier will only begin execution once the
    barrier is passed - but any prefetched instructions are not flushed

entry: -
exit:  -

-- DMB_ReadWrite

This call is roughly equivalent to the ARMv7 "DMB SY" instruction:

  * Ensures in-order operation of data load/store instructions
  * Does not stall instruction execution
  * Does not guarantee that any preceding memory operations complete in a
    timely manner (or at all)

entry: -
exit:  -

Although this call doesn't guarantee that any memory operation completes,
it's usually all that's required when interacting with hardware devices
which use memory-mapped IO. E.g. fill a buffer with data, issue a DMB, then
write to a hardware register to start some external DMA. The writes to the
buffer are guaranteed to have completed by the time the write to the
hardware register completes.

-- DMB_Write

This call is roughly equivalent to the ARMv7 "DMB ST" instruction:

  * Ensures in-order operation of data store instructions
  * Does not stall instruction execution
  * Does not guarantee that any preceding memory operations complete in a
    timely manner (or at all)

entry: -
exit:  -

Although this call doesn't guarantee that any memory operation completes,
it's usually all that's required when interacting with hardware devices
which use memory-mapped IO. E.g. fill a buffer with data, issue a DMB, then
write to a hardware register to start some external DMA. The writes to the
buffer are guaranteed to have completed by the time the write to the
hardware register completes.

-- DMB_Read

There is no direct equivalent to this in ARMv7 (barriers are either W or
RW). However it's useful to define a read barrier, as (e.g.) on Cortex-A9 a
RW barrier would require draining the write buffer of the external PL310
cache, while a R barrier can simply be an ordinary DMB instruction.

  * Ensures in-order operation of data load instructions
  * Does not stall instruction execution
  * Does not guarantee that any preceding memory operations complete in a
    timely manner (or at all)

entry: -
exit:  -

Although this call doesn't guarantee that any memory operation completes,
it's usually all that's required when interacting with hardware devices
which use memory-mapped IO. E.g. after reading a hardware register to
detect that a DMA write to RAM has completed, issue a read barrier to
ensure that any reads from the data buffer see the final data.
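On pre-ARMv7 cores these barriers typically collapse onto the CP15 write
buffer drain operation, or onto a null operation. A plausible sketch for a
StrongARM/XScale-class core (an assumption for illustration, not a
definitive implementation):

  DSB_ReadWrite_SA
          MOV     r0, #0
          MCR     p15, 0, r0, c7, c10, 4  ; drain write buffer
          MOV     pc, lr

On the same cores, DMB_Read could reasonably be the null operation,
MOV pc, lr.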
TLB ARMops
----------

-- TLB_InvalidateAll

The TLB or TLBs are to be globally invalidated.

entry: -
exit:  -

-- TLB_InvalidateEntry

The TLB or TLBs are to be invalidated for the entry at the given logical
address.

entry: r0 = logical address of entry to invalidate (page aligned)
exit:  -

The address will always be page aligned (4k).

IMB ARMops
----------

-- IMB_Full

A global instruction memory barrier (IMB) is to be performed.

entry: -
exit:  -

An IMB is an operation that should be performed after new instructions have
been stored and before they are executed. It guarantees correct operation
for code modification (eg. something as simple as loading code to be
executed). On some ARMs, this operation may be null. On ARMs with Harvard
architecture it typically consists of:

  1) clean data cache
  2) drain write buffer
  3) invalidate instruction cache

There may be other considerations, such as invalidating branch prediction
caches.
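For a Harvard core this structure might look as follows (a sketch only;
Cache_CleanAll here stands for whichever internal clean routine the
particular ARM's support code provides, and is assumed to be callable with
BL and to corrupt only r0, per the general ARMop rules):

  IMB_Full_Harvard
          STMFD   sp!, {lr}
          BL      Cache_CleanAll          ; 1) clean data cache
          MOV     r0, #0
          MCR     p15, 0, r0, c7, c10, 4  ; 2) drain write buffer
          MCR     p15, 0, r0, c7, c5, 0   ; 3) invalidate instruction cache
          LDMFD   sp!, {pc}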
-- IMB_Range

An instruction memory barrier (IMB) is to be performed over a logical
address range.

entry: r0 = logical address of start of range
       r1 = logical address of end of range (exclusive)
       Note that r0 and r1 are aligned on cache line boundaries
exit:  -

An IMB is an operation that should be performed after new instructions have
been stored and before they are executed. It guarantees correct operation
for code modification (eg. something as simple as loading code to be
executed). On some ARMs, this operation may be null. On ARMs with Harvard
architecture it typically consists of:

  1) clean data cache over the range
  2) drain write buffer
  3) invalidate instruction cache over the range

There may be other considerations, such as invalidating branch prediction
caches.

Note that the range may be very large. The implementation of this call is
typically expected to use a threshold (related to Cache_RangeThreshold) to
decide when to perform IMB_Full instead, that being faster for large
ranges.

-- IMB_List

A variant of IMB_Range that accepts a list of address ranges.

entry: r0 = pointer to word-aligned list of (start, end) address pairs
       r1 = pointer to end of list (past the last valid entry)
       r2 = total amount of memory to be synchronised
exit:  -

If you have several areas to synchronise then using this call may result in
significant performance gains, both from reducing the function call
overhead and from optimisations in the algorithm itself (e.g. only flushing
the instruction cache once for StrongARM). As with IMB_Range, start and end
addresses are inclusive-exclusive and must be cache line aligned. The list
must contain at least one entry, and must not contain zero-length entries.

MMU mapping ARMops
------------------

-- MMU_Changing

The global MMU mapping has just changed.

entry: -
exit:  -

The operation must typically perform the following:

  1) globally invalidate TLB or TLBs
  2) globally clean and invalidate all caches
  3) drain write buffer

Note that it should not be necessary to disable IRQs. The OS ensures that
remappings do not affect currently active interrupts.

This operation should typically be used when a large number of cacheable
pages have had their attributes changed in a way which will affect cache
behaviour.

-- MMU_ChangingEntry

The MMU mapping has just changed for a single page entry (4k).

entry: r0 = logical address of entry (page aligned)
exit:  -

The operation must typically perform the following:

  1) invalidate TLB or TLBs for the entry
  2) clean and invalidate all caches over the 4k range of the page
  3) drain write buffer

Note that it should not be necessary to disable IRQs. The OS ensures that
remappings do not affect currently active interrupts.

This operation should typically be used when a cacheable page has had its
attributes changed in a way which will affect cache behaviour.

-- MMU_ChangingUncached

The MMU mapping has just changed in a way that globally affects uncacheable
space.

entry: -
exit:  -

The operation must typically globally invalidate the TLB or TLBs. The OS
guarantees that cacheable space is not affected, so cache operations are
not required. However, there may still be considerations such as fill
buffers that operate in uncacheable space on some ARMs.

-- MMU_ChangingUncachedEntry

The MMU mapping has just changed for a single uncacheable page entry (4k).

entry: r0 = logical address of entry (page aligned)
exit:  -

The operation must typically invalidate the TLB or TLBs for the entry. The
OS guarantees that cacheable space is not affected, so cache operations are
not required. However, there may still be considerations such as fill
buffers that operate in uncacheable space on some ARMs.

-- MMU_ChangingEntries

The MMU mapping has just changed for a contiguous range of page entries
(multiple of 4k).

entry: r0 = logical address of first page entry (page aligned)
       r1 = number of page entries ( >= 1)
exit:  -

The operation must typically perform the following:

  1) invalidate TLB or TLBs over the range of the entries
  2) clean and invalidate all caches over the range of the pages
  3) drain write buffer

Note that it should not be necessary to disable IRQs. The OS ensures that
remappings do not affect currently active interrupts.

Note that the number of entries may be large. The operation is typically
expected to use a reasonable threshold, above which it performs a global
operation instead, for speed reasons.

This operation should typically be used when cacheable pages have had their
attributes changed in a way which will affect cache behaviour.

-- MMU_ChangingUncachedEntries

The MMU mapping has just changed for a contiguous range of uncacheable page
entries (multiple of 4k).

entry: r0 = logical address of first page entry (page aligned)
       r1 = number of page entries ( >= 1)
exit:  -

The operation must typically invalidate the TLB or TLBs over the range of
the entries. The OS guarantees that cacheable space is not affected, so
cache operations are not required. However, there may still be
considerations such as fill buffers that operate in uncacheable space on
some ARMs.

Note that the number of entries may be large. The operation is typically
expected to use a reasonable threshold, above which it performs a global
operation instead, for speed reasons.
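A sketch of such a thresholded implementation for an ARM920-class core
(the threshold of 32 entries is an arbitrary assumption for illustration):

  MMU_ChangingUncachedEntries_ARM920
          CMP     r1, #32                 ; assumed threshold
          BHS     %FT20
          STMFD   sp!, {r1}
  10      MCR     p15, 0, r0, c8, c6, 1   ; invalidate DTLB entry (MVA)
          MCR     p15, 0, r0, c8, c5, 1   ; invalidate ITLB entry (MVA)
          ADD     r0, r0, #4096           ; step to next page
          SUBS    r1, r1, #1
          BNE     %BT10
          LDMFD   sp!, {r1}
          MOV     pc, lr
  20      MCR     p15, 0, r0, c8, c7, 0   ; large range: invalidate both TLBs
          MOV     pc, lr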