442 lines
16 KiB
ReStructuredText
442 lines
16 KiB
ReStructuredText
===============================
|
|
Porting Drivers to the STM32 F7
|
|
===============================
|
|
|
|
.. warning::
|
|
Migrated from:
|
|
https://cwiki.apache.org/confluence/display/NUTTX/Porting+Drivers+to+the+STM32+F7
|
|
|
|
Problem Statement
|
|
=================
|
|
|
|
I recently completed a port to the STMicro STM32F746G Discovery board.
|
|
That MCU is clearly a derivative of the STM32 F3/F4 and many peripherals
|
|
are, in fact, essentially identical to the STM32F429. The biggest
|
|
difference is that the STM32F746 sports a Cortex-M7 which includes
|
|
several improvements over the Cortex-M4 and including, most relevant
|
|
to this discussion, a fully integrated data cache (`D-Cache`).
|
|
|
|
Because of this one difference, I chose to provide the STM32 F7 code its
|
|
own directories separate from the STM32 F1, F2, F3, and F4.
|
|
|
|
Porting Simple Drivers
|
|
======================
|
|
|
|
Some of the STM32 F4 drivers can be used with the STM32 F7 can be ported
|
|
very simply; many ports would just be a matter of copying files and some
|
|
search-and-replacement. Like:
|
|
|
|
* Compare the two register definitions files; make sure that the STM32
|
|
F4 peripheral is identical (or nearly identical) to the F7 peripheral.
|
|
If so then,
|
|
* Copy the register definition file from the ``stm32/hardware`` to the
|
|
``stm32f7/hardware`` directory, making name changes as appropriate and
|
|
updating any minor register differences.
|
|
* Copy the corresponding C file (and possibly a ``.h`` file) from the
|
|
``stm32/`` directory to the ``stm32f7/`` directory, again making any naming
|
|
changes and modifications for any register differences.
|
|
* Update the ``Make.defs`` file to include the new C file in the build.
|
|
|
|
Porting Complex Drivers
|
|
=======================
|
|
|
|
The Cortex-M7 D-Cache, however, does raise issues with the compatibility
|
|
of most complex STM32 F4 and F7 drivers. Even though the peripheral
|
|
registers may be essentially the same between the STM32F429 and the
|
|
the STM32F746, many drivers for the STM32F429 will not be directly
|
|
compatible with the STM32F746, particularly drivers that use DMA.
|
|
And that includes most complex STM32 drivers!
|
|
|
|
Cache Coherency
|
|
===============
|
|
|
|
With DMA, physical RAM memory contents is accessed directly by peripheral
|
|
hardware without intervention from the CPU. The CPU itself deals only the
|
|
indirectly with RAM through the D-Cache: When you read data from RAM, it
|
|
is first loaded in the D-Cache then accessed by the CPU. If the RAM
|
|
contents is already in the D-Cache, then physical RAM is not accessed
|
|
at all! Similarly, when you write data into RAM (with write buffering
|
|
enabled), it may actually not be written to physical RAM but may just
|
|
remain in the D-Cache in a `dirty` cache line until that cache line is
|
|
flushed to memory. Thus, there may be inconsistencies in the contents
|
|
of the D-Cache and in the contents of contents of physical RAM due
|
|
related to DMA. Such issues are referred to as `Cache Coherency` problems.
|
|
|
|
DMA
|
|
===
|
|
|
|
DMA Read Accesses
|
|
-----------------
|
|
|
|
A DMA read access occurs when we program DMA hardware to read data
|
|
from a peripheral and store that data into RAM. This happens, for
|
|
example, when we read a packet from the network, when we read a
|
|
serial byte of data from a UART, when we read a block from an
|
|
MMC/SD card, and so on.
|
|
|
|
In this case, the DMA hardware will change the contents of physical
|
|
RAM without knowledge of the CPU. So if that same memory that was
|
|
modified by the DMA read operation is also in the D-Cache, then
|
|
the contents of the D-Cache will no longer be valid; it will no
|
|
longer match the physical contents of the memory. In order to fix
|
|
this, the Cortex-M7 supports a special `cache operation` that can be
|
|
used to `invalidate` the D-Cache contents associate with the read DMA
|
|
buffer address range. Invalidation simply means discarding the
|
|
currently cached D-Cache lines so that they will be refetched
|
|
from physical RAM. **Rule 1a**: Always invalidate RX DMA buffers
|
|
sometime before or after starting the read DMA but certainly `before`
|
|
accessing the read buffer data. **Rule 1b**: Never read from the read
|
|
DMA buffer before the read DMA buffer completes, or otherwise you
|
|
will re-cache the DMA buffer content.
|
|
|
|
`What if the D-Cache line is also dirty? What if we have writes to
|
|
the DMA buffer that were never flushed to physical RAM?` Those writes
|
|
will then never make it to physical memory if the D-Cache is
|
|
invalidated. **Rule 2**: Never write to read DMA buffer memory!
|
|
**Rule 3**: Make sure that all DMA read buffers are aligned to the
|
|
D-Cache line size so that there are no spill-over cache effects
|
|
at the boarders of the invalidated cache line.
|
|
|
|
DMA Write Accesses
|
|
------------------
|
|
|
|
A DMA write access occurs when we program DMA hardware to write data from
|
|
RAM into a peripheral. This happen for example, when we send a packet on
|
|
a network or when we write a block of data to an MMC/SD card. In this,
|
|
the hardware expects the correct data to be in physical RAM when write
|
|
DMA is performed. If not then, the wrong data will be sent.
|
|
|
|
We assure that we do not have pending writes in a `dirty` cache line by
|
|
`cleaning` (or `flushing`) the `dirty` cache lines; i.e., for forcing any
|
|
pending writes in the D-Cache lines to be written to physical RAM.
|
|
**Rule 4**: Always `clean` (or `flush`) the D-Cache to force all data to
|
|
be written from the D-Cache into physical RAM.
|
|
|
|
`What if you had two adjacent DMA buffers side-by-side? Couldn't the
|
|
cleaning of the write buffer force writing into the adjacent read
|
|
buffer?`` Yes! **Rule 5**: Make sure that all DMA write buffers are
|
|
aligned to the D-Cache line size so that there are no spill-over
|
|
cache effects at the borders of the cleaned cache line.
|
|
|
|
Write-back vs. Write-through D-Cache
|
|
------------------------------------
|
|
|
|
The Cortex-M7 supports both `write-back` and `write-through` data cache
|
|
configurations. The write-back D-Cache works just as described above:
|
|
`dirty` cache lines are not written to physical memory until the cache
|
|
line is flushed. But write-through D-Cache works just as without the
|
|
D-Cache. Writes always go directly to physical RAM.
|
|
|
|
`If I am using a write-through D-Cache, can't I just forget about
|
|
cleaning the D-Cache?` No, because you don't know how a user is going
|
|
to configuration the D-Cache. **Rule 6**: Always assume that `write-back`
|
|
caching is being performed; otherwise, your driver will not be portable.
|
|
|
|
You may notice in ``/arch/arm/src/armv7-m/cache.h``:
|
|
|
|
.. code-block:: c
|
|
|
|
#if defined(CONFIG_ARMV7M_DCACHE) && !defined(CONFIG_ARMV7M_DCACHE_WRITETHROUGH)
|
|
void arch_clean_dcache(uintptr_t start, uintptr_t end);
|
|
#else
|
|
# define arch_clean_dcache(s,e)
|
|
#endif
|
|
|
|
NOTE: I have experienced other cases (on the SAMV7) where write buffering
|
|
`must` be disabled: In one case, a certain peripheral used 16-byte DMA
|
|
descriptors in an array. Clearly it is impossible to manage the
|
|
caching of the 16-byte DMA descriptors with a 32-byte cache line in
|
|
this case: I think that the only option is to disabled the write buffer.
|
|
|
|
And what if the driver receives arbitrarily aligned buffers from the
|
|
application? Then what? Should write buffering be disabled in that
|
|
case too? And what is the performance cost for disabling the write
|
|
buffer?
|
|
|
|
|
|
DMA Module
|
|
----------
|
|
|
|
Some STM32 F7 peripherals have built in DMA. The STM32 F7 Ethernet
|
|
driver discussed below is a good example of such a peripheral with
|
|
built in DMA capability. Most STM32 F7 peripherals, however, have
|
|
no built-in DMA capability and, instead, must use a common STM32
|
|
F7 DMA module to perform DMA data transfers. The interfaces to that
|
|
common DMA module are described in ``arch/arm/src/stm32f7/stm32_dma.h``.
|
|
|
|
The DMA modules `does not do any cache operations`. Rather, the client
|
|
of the DMA module must perform the cache operations. Here are the
|
|
basic rules:
|
|
|
|
* TX DMA Transfers. Before calling ``stm32_dmastart()`` to start an TX
|
|
transfer, the DMA client must clean the DMA buffer so that the
|
|
content to be DMA'ed is present in physical memory.
|
|
* RX DMA transfers. At the completion of all DMAs, the DMA client
|
|
will receive a callback providing the final status of the DMA
|
|
transfer. For the case of RX DMA completion callbacks, logic in
|
|
the callback handler should invalidate the RX buffer before any
|
|
attempt is made to access new RX buffer content.
|
|
|
|
Converting an STM32F429 Driver for the STM32F746
|
|
================================================
|
|
|
|
Since the STM32 F7 is so similar to the STM32 F4, we have a wealth
|
|
of working drivers to port from. Only a little effort is required.
|
|
Below is a summary of the kinds of things that you would have to do
|
|
to convert an STM32F429 driver to the STM32F746.
|
|
|
|
An Example
|
|
----------
|
|
|
|
There is a good example in the STM32 Ethernet driver. The STM32 F7
|
|
Ethernet driver (``arch/arm/src/stm32f7/stm32_ethernet.c``) derives
|
|
directly from the STM32 F4 Ethernet driver
|
|
(``arch/arm/src/stm32/stm32_eth.c``). These two Ethernet MAC peripherals
|
|
are nearly identical. Only changes that are a direct consequence of the
|
|
STM32 F7 D-Cache were required to make the driver work on the STM32 F7.
|
|
Those changes are summarized below.
|
|
|
|
Reorganize DMA Data Structure
|
|
-----------------------------
|
|
|
|
The STM32 Ethernet driver has four different kinds DMA buffers:
|
|
|
|
* RX DMA descriptor,
|
|
* TX DMA descriptors,
|
|
* RX packet buffers, and
|
|
* TX packet buffers,
|
|
|
|
In the STM32F429 driver, these are simply implemented as part of the
|
|
driver data structure:
|
|
|
|
.. code-block:: c
|
|
|
|
struct stm32_ethmac_s
|
|
{
|
|
...
|
|
/* Descriptor allocations */
|
|
|
|
struct eth_rxdesc_s rxtable[CONFIG_STM32_ETH_NRXDESC];
|
|
struct eth_txdesc_s txtable[CONFIG_STM32_ETH_NTXDESC];
|
|
|
|
/* Buffer allocations */
|
|
|
|
uint8_t rxbuffer[CONFIG_STM32_ETH_NRXDESC*CONFIG_STM32_ETH_BUFSIZE];
|
|
uint8_t alloc[STM32_ETH_NFREEBUFFERS*CONFIG_STM32_ETH_BUFSIZE];
|
|
};
|
|
|
|
There are potentially three problems with this: (1) We don't know what
|
|
kind of memory the data structure will be defined in. What if it is
|
|
DTCM memory? Then the DMAs will fail. (2) We don't know the alignment
|
|
of the DMA buffers. They must be aligned on D-Cache line boundaries.
|
|
(3a) The size of RX or TX descriptor is either 16- or 32-bytes. In
|
|
order to individually clean or invalidate the cache line, they must
|
|
be sized in multiples of the cache line size and (3b) the same applies
|
|
to the DMA buffers.
|
|
|
|
To fix this, several things were done:
|
|
|
|
* The buffer allocations were moved from the device structure into
|
|
separate declarations that can have attributes.
|
|
* One attribute that could be added would be a section name to assure
|
|
that the structures are linked into DMA-able memory (via definitions
|
|
in the linker script).
|
|
* Another attribute is that we can force the alignment of the structure
|
|
to the D-Cache line size.
|
|
|
|
The following definitions were added to support aligning the sizes of
|
|
the buffers to the Cortex-M7 D-Cache line size:
|
|
|
|
.. code-block:: c
|
|
|
|
/* Buffers use fro DMA access must begin on an address aligned with the
|
|
* D-Cache line and must be an even multiple of the D-Cache line size.
|
|
* These size/alignment requirements are necessary so that D-Cache flush
|
|
* and invalidate operations will not have any additional effects.
|
|
*
|
|
* The TX and RX descriptors are normally 16 bytes in size but could be
|
|
* 32 bytes in size if the enhanced descriptor format is used (it is not).
|
|
*/
|
|
|
|
#define DMA_BUFFER_MASK (ARMV7M_DCACHE_LINESIZE - 1)
|
|
#define DMA_ALIGN_UP(n) (((n) + DMA_BUFFER_MASK) & ~DMA_BUFFER_MASK)
|
|
#define DMA_ALIGN_DOWN(n) ((n) & ~DMA_BUFFER_MASK)
|
|
|
|
#ifndef CONFIG_STM32F7_ETH_ENHANCEDDESC
|
|
# define RXDESC_SIZE 16
|
|
# define TXDESC_SIZE 16
|
|
#else
|
|
# define RXDESC_SIZE 32
|
|
# define TXDESC_SIZE 32
|
|
#endif
|
|
|
|
#define RXDESC_PADSIZE DMA_ALIGN_UP(RXDESC_SIZE)
|
|
#define TXDESC_PADSIZE DMA_ALIGN_UP(TXDESC_SIZE)
|
|
#define ALIGNED_BUFSIZE DMA_ALIGN_UP(ETH_BUFSIZE)
|
|
|
|
#define RXTABLE_SIZE (STM32F7_NETHERNET * CONFIG_STM32F7_ETH_NRXDESC)
|
|
#define TXTABLE_SIZE (STM32F7_NETHERNET * CONFIG_STM32F7_ETH_NTXDESC)
|
|
|
|
#define RXBUFFER_SIZE (CONFIG_STM32F7_ETH_NRXDESC * ALIGNED_BUFSIZE)
|
|
#define RXBUFFER_ALLOC (STM32F7_NETHERNET * RXBUFFER_SIZE)
|
|
|
|
#define TXBUFFER_SIZE (STM32_ETH_NFREEBUFFERS * ALIGNED_BUFSIZE)
|
|
#define TXBUFFER_ALLOC (STM32F7_NETHERNET * TXBUFFER_SIZE)
|
|
|
|
The RX and TX descriptor types are replace with a union type
|
|
that assures that the allocations will be aligned in size:
|
|
|
|
.. code-block:: c
|
|
|
|
/* This union type forces the allocated size of RX descriptors to be the
|
|
* padded to a exact multiple of the Cortex-M7 D-Cache line size.
|
|
*/
|
|
|
|
union stm32_txdesc_u
|
|
{
|
|
uint8_t pad[TXDESC_PADSIZE];
|
|
struct eth_txdesc_s txdesc;
|
|
};
|
|
|
|
union stm32_rxdesc_u
|
|
{
|
|
uint8_t pad[RXDESC_PADSIZE];
|
|
struct eth_rxdesc_s rxdesc;
|
|
};
|
|
|
|
Then, finally, the new buffers are defined by the following globals:
|
|
|
|
.. code-block:: c
|
|
|
|
/* DMA buffers. DMA buffers must:
|
|
*
|
|
* 1. Be a multiple of the D-Cache line size. This requirement is assured
|
|
* by the definition of RXDMA buffer size above.
|
|
* 2. Be aligned a D-Cache line boundaries, and
|
|
* 3. Be positioned in DMA-able memory (*NOT* DTCM memory). This must
|
|
* be managed by logic in the linker script file.
|
|
*
|
|
* These DMA buffers are defined sequentially here to best assure optimal
|
|
* packing of the buffers.
|
|
*/
|
|
|
|
/* Descriptor allocations */
|
|
|
|
static union stm32_rxdesc_u g_rxtable[RXTABLE_SIZE]
|
|
__attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));
|
|
static union stm32_txdesc_u g_txtable[TXTABLE_SIZE]
|
|
__attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));
|
|
|
|
/* Buffer allocations */
|
|
|
|
static uint8_t g_rxbuffer[RXBUFFER_ALLOC]
|
|
__attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));
|
|
static uint8_t g_txbuffer[TXBUFFER_ALLOC]
|
|
__attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));
|
|
|
|
This does, of course, force additional changes to the functions
|
|
that initialize the buffer chains, but I will leave that to the
|
|
interested reader to discover.
|
|
|
|
Add Cache Operations
|
|
--------------------
|
|
|
|
The Cortex-M7 cache operations are available the following file is included:
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
#include "cache.h"
|
|
|
|
Here is an example where the RX descriptors are invalidated:
|
|
|
|
.. code-block:: c
|
|
|
|
static int stm32_recvframe(struct stm32_ethmac_s *priv)
|
|
{
|
|
...
|
|
/* Scan descriptors owned by the CPU. */
|
|
|
|
rxdesc = priv->rxhead;
|
|
|
|
/* Forces the first RX descriptor to be re-read from physical memory */
|
|
|
|
arch_invalidate_dcache((uintptr_t)rxdesc,
|
|
(uintptr_t)rxdesc + sizeof(struct eth_rxdesc_s));
|
|
|
|
for (i = 0;
|
|
(rxdesc->rdes0 & ETH_RDES0_OWN) == 0 &&
|
|
i < CONFIG_STM32F7_ETH_NRXDESC &&
|
|
priv->inflight < CONFIG_STM32F7_ETH_NTXDESC;
|
|
i++)
|
|
{
|
|
...
|
|
/* Try the next descriptor */
|
|
|
|
rxdesc = (struct eth_rxdesc_s *)rxdesc->rdes3;
|
|
|
|
/* Force the next RX descriptor to be re-read from physical memory */
|
|
|
|
arch_invalidate_dcache((uintptr_t)rxdesc,
|
|
(uintptr_t)rxdesc + sizeof(struct eth_rxdesc_s));
|
|
}
|
|
...
|
|
}
|
|
|
|
Here is an example where a TX descriptor is cleaned:
|
|
|
|
.. code-block:: c
|
|
|
|
static int stm32_transmit(struct stm32_ethmac_s *priv)
|
|
{
|
|
...
|
|
/* Give the descriptor to DMA */
|
|
|
|
txdesc->tdes0 |= ETH_TDES0_OWN;
|
|
|
|
/* Flush the contents of the modified TX descriptor into physical
|
|
* memory.
|
|
*/
|
|
|
|
arch_clean_dcache((uintptr_t)txdesc,
|
|
(uintptr_t)txdesc + sizeof(struct eth_txdesc_s));
|
|
...
|
|
}
|
|
|
|
Here is where the read buffer is invalidated just after
|
|
completed a read DMA:
|
|
|
|
.. code-block:: c
|
|
|
|
static int stm32_recvframe(struct stm32_ethmac_s *priv)
|
|
{
|
|
...
|
|
/* Force the completed RX DMA buffer to be re-read from
|
|
* physical memory.
|
|
*/
|
|
|
|
arch_invalidate_dcache((uintptr_t)dev->d_buf,
|
|
(uintptr_t)dev->d_buf + dev->d_len);
|
|
|
|
nllvdbg("rxhead: %p d_buf: %p d_len: %d\n",
|
|
priv->rxhead, dev->d_buf, dev->d_len);
|
|
|
|
/* Return success*/
|
|
|
|
return OK;
|
|
...
|
|
}
|
|
|
|
Here is where the write buffer in clean prior to starting a write DMA:
|
|
|
|
.. code-block:: c
|
|
|
|
static int stm32_transmit(struct stm32_ethmac_s *priv)
|
|
{
|
|
...
|
|
/* Flush the contents of the TX buffer into physical memory */
|
|
|
|
arch_clean_dcache((uintptr_t)priv->dev.d_buf,
|
|
(uintptr_t)priv->dev.d_buf + priv->dev.d_len);
|
|
...
|
|
} |