docs/fs/mnemofs: Update new design and NAND flash structure

Add documentation about the new mnemofs design and the NAND flash structure.

Signed-off-by: Saurav <resyfer.dev@gmail.com>
you can mount it with ``mnemofs`` to a location like ``/mydir`` using::

    mount -t mnemofs /dev/nand /mydir

The above command will only work if the device was already formatted using
mnemofs. For a brand new device, or if you want to switch from an existing
file system, this won't work, and would need a format::

    mount -t mnemofs -o forceformat /dev/nand /mydir

system after all... to hide the storage device's peculiarities behind an
abstraction. A file system is considered good if you don't have to think
about its existence during regular usage.

NAND Flashes
============
Programmatically, NAND flashes come with some quirks. The whole device can
be viewed as three layers: blocks, pages and cells.
Cells represent the smallest unit of storage in NAND flashes, but are often
ignored, as direct access to them is not allowed. If a cell stores one bit,
it's a Single Level Cell (SLC); MLC, TLC, etc. store more bits per cell.
Often, the more bits per cell, the lower the wear resilience, so cells that
store more bits wear out faster and become unreliable sooner.
Pages are the smallest readable or writable unit of the NAND flash. A page
is made up of several cells, and can be expected to have a size on the
order of 512 B.
Blocks are the smallest erasable unit of NAND flash. They are made up of
several pages. If a page is already written, it needs to be erased before it
can be written again. And since blocks are the smallest erasable unit, the
entire block needs to be erased if the user wants to update the contents of
one page.
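
To make these three layers concrete, here is a small illustrative sketch in
C with a hypothetical geometry (real values come from the chip's datasheet
or the driver). It shows why updating a single page in place implies
dealing with a whole block:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   /* Hypothetical NAND geometry, for illustration only. */

   #define PAGE_SIZE       2048u  /* Smallest readable/writable unit (B) */
   #define PAGES_PER_BLOCK 64u    /* Pages per erasable block */
   #define BLOCK_COUNT     1024u  /* Erasable blocks in the device */

   #define BLOCK_SIZE  (PAGE_SIZE * PAGES_PER_BLOCK)
   #define DEVICE_SIZE ((uint64_t)BLOCK_SIZE * BLOCK_COUNT)

   int main(void)
   {
     printf("Block size: %u B\n", BLOCK_SIZE);  /* 128 KiB */
     printf("Device size: %llu B\n", (unsigned long long)DEVICE_SIZE);

     /* Updating one page in place means erasing its whole block, so
      * every other live page in the block must be relocated first.
      */

     printf("Pages to relocate for a one-page update: %u\n",
            PAGES_PER_BLOCK - 1u);
     return 0;
   }
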
The erase operation is what causes a block to wear out. If a block is worn
out too much, it loses its ability to reliably store data: it can no longer
guarantee that the data read from its pages is the same as what was written
to them. Such a block is called a bad block.
A manufacturer can also deem blocks to be unreliable from their testing,
and mark them as bad blocks right from manufacture.
A good file system will aim to level out the wear between blocks as much as
it can.
Design
======
mnemofs is designed to be a middle ground between flash storage consumption,
memory consumption, wear and speed. It sacrifices a little bit of everything,
and ends up being acceptably good in all of them, instead of sacrificing
multiple aspects, and being good in one.
mnemofs consists of several components; however, a walkthrough of the
process by which a change requested by a user ends up being written to the
NAND flash serves well as an introduction. The details are explained
further below.
The user requests some changes, say, writing ``x`` bytes at offset ``y`` in
a file. This change is copied into the LRU cache of mnemofs. This LRU cache
exists in memory, and serves as a tool for wear reduction.
R/W Layer
---------
This layer works directly with the NAND flash device driver. It can write
an entire page, read an entire page, erase an entire block, check if a
block is bad (from its bad block marker), or mark a block as bad. It's the
simplest layer.
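
As a sketch, the interface of such a layer might look like the following
(illustrative names, not the actual mnemofs symbols; in NuttX these calls
would sit on top of the NAND MTD driver):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   struct nand_dev; /* Opaque handle to the NAND flash device driver */

   /* Read or program exactly one page; `pg` is a device-wide page
    * number.
    */

   int rw_read_page(struct nand_dev *dev, uint32_t pg, uint8_t *buf);
   int rw_write_page(struct nand_dev *dev, uint32_t pg,
                     const uint8_t *buf);

   /* Erase exactly one block; `blk` is a device-wide block number. */

   int rw_erase_block(struct nand_dev *dev, uint32_t blk);

   /* Bad block handling, through the block's bad block marker. */

   bool rw_block_is_bad(struct nand_dev *dev, uint32_t blk);
   int  rw_block_mark_bad(struct nand_dev *dev, uint32_t blk);
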
Block Allocator
---------------
The block allocator contains two arrays. One is a bitmap for tracking the
free pages, while the other is an array of numbers, one for each block,
denoting the number of pages in that block that are ready to be erased.
The block allocator allocates pages or blocks in a sequential manner to
keep it fair for all pages, thus ensuring wear levelling. It also starts
from a random offset, to prevent bias towards the front of the device in
case of multiple power losses and reinitializations. If a whole block is
required, it skips pages up to the start of the next block. Since block
allocations happen only for the journal, they happen in bulk, and the
number of skipped pages is minimal.
Upon reaching the end of the device, the allocator cycles back to the
front, so any skipped pages get the chance to be allocated in the next
cycle.
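
A minimal sketch of this allocator state and of the sequential,
wrap-around page allocation (all names are hypothetical, and the random
initial cursor and whole-block allocation are omitted for brevity):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   struct alloc_state
   {
     uint32_t  n_pages;   /* Total pages in the device */
     uint32_t  cursor;    /* Next page to consider; initialized to a
                           * random offset at mount */
     uint8_t  *in_use;    /* Bitmap: one bit per page */
     uint32_t *to_erase;  /* Per block: pages ready to be erased */
   };

   static bool page_in_use(struct alloc_state *s, uint32_t pg)
   {
     return (s->in_use[pg / 8] >> (pg % 8)) & 1u;
   }

   /* Allocate one free page, scanning sequentially from the cursor and
    * cycling back to the front of the device for even wear.
    */

   static int alloc_page(struct alloc_state *s, uint32_t *out)
   {
     for (uint32_t i = 0; i < s->n_pages; i++)
       {
         uint32_t pg = (s->cursor + i) % s->n_pages;
         if (!page_in_use(s, pg))
           {
             s->in_use[pg / 8] |= (uint8_t)(1u << (pg % 8));
             s->cursor = (pg + 1) % s->n_pages;
             *out = pg;
             return 0;
           }
       }

     return -1; /* No free pages left */
   }
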
CTZ Layer
---------
This works with the R/W Layer, and acts as an abstraction layer for other
components in mnemofs. mnemofs uses
`CTZ skip lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
to represent both files and directories on the flash. CTZ lists of files
contain only the data of the file, while CTZ lists of directories contain
directory entries (direntries) for the FS objects (files or directories)
inside them.
This layer abstracts away the complex division of flash space that's
present in CTZ skip lists, and allows its users to not worry about the
complexities of a CTZ skip list and, in fact, to treat the data as if it
were one contiguous space.
This layer allows the user to specify a data offset, which refers to the
offset into the actual data stored in the CTZ skip list (i.e. excluding
the pointers), and the number of bytes, and to perform operations on the
CTZ list almost as if it were a single array.
In mnemofs, each CTZ block takes up the space of exactly 1 page, and each
pointer takes up 4 bytes.
The littlefs design document shows how a CTZ list can be identified using
just the index of the last CTZ block and the page number of that block.
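
Following the littlefs scheme, a CTZ block with index ``i`` holds
``ctz(i) + 1`` pointers for ``i > 0`` (block 0 holds none), pointing to
blocks ``i - 2^j`` for every ``0 <= j <= ctz(i)``, where ``ctz`` is
count-trailing-zeros. A traversal step towards an earlier block can then
always take the largest jump that does not overshoot. A sketch (GCC
builtins assumed):

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Number of 4-byte pointers stored in CTZ block `i`. */

   static uint32_t ctz_npointers(uint32_t i)
   {
     return i == 0 ? 0 : (uint32_t)__builtin_ctz(i) + 1;
   }

   /* One traversal step from block `src` towards block `dst`, with
    * dst < src: follow the largest jump that does not overshoot.
    */

   static uint32_t ctz_step(uint32_t src, uint32_t dst)
   {
     uint32_t j;

     assert(src > 0 && dst < src);

     j = (uint32_t)__builtin_ctz(src);      /* Largest exponent held */
     while (src - (UINT32_C(1) << j) < dst) /* Shrink until it fits */
       {
         j--;
       }

     return src - (UINT32_C(1) << j);
   }

Repeatedly applying ``ctz_step`` from the last block reaches any earlier
block in a logarithmic number of jumps, which is what makes backwards
traversal of the list cheap.
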
Journal
-------
The journal in mnemofs is made out of ``n + 2`` blocks. The last two
blocks concern the master node. These blocks are arranged in a singly
linked list.
Due to the CoW policy, when a CTZ list is updated, it gets a new location.
The first ``n`` blocks of the journal are responsible for storing logs
containing information about such updates. Each log contains the old
location of the CTZ skip list, and the new location.
Thus, when the user requires the updated location of a CTZ list, they will
first find the old location by traversing the FS tree on the flash, and
then traverse the journal to find the latest location.
So, the FS tree on the flash acts like a "base" state, with the updates to
it stored in the journal. Each log in the journal is followed by a checksum
to verify that all of it was written properly. This helps make the journal
resilient to power loss.
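
A sketch of that lookup with hypothetical types, where a location is the
pair described earlier (index of the last CTZ block plus its page number):

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical location of a CTZ skip list on the flash. */

   struct ctz_loc
   {
     uint32_t idx; /* Index of the last CTZ skip list block */
     uint32_t pg;  /* Page number of that block */
   };

   /* Hypothetical journal log: a CTZ list moved from `from` to `to`. */

   struct jrnl_log
   {
     struct ctz_loc from;
     struct ctz_loc to;
   };

   static bool loc_eq(struct ctz_loc a, struct ctz_loc b)
   {
     return a.idx == b.idx && a.pg == b.pg;
   }

   /* Start from the "base" location found in the on-flash FS tree, and
    * replay the journal logs in order to reach the latest location.
    */

   static struct ctz_loc resolve_loc(struct ctz_loc base,
                                     const struct jrnl_log *logs,
                                     size_t n)
   {
     struct ctz_loc cur = base;

     for (size_t i = 0; i < n; i++)
       {
         if (loc_eq(logs[i].from, cur))
           {
             cur = logs[i].to;
           }
       }

     return cur;
   }
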
The journal, when combined with CoW, plays another important role. In pure
CoW, any update to a CTZ list will result in it having a new location. This
new location will need to be updated in its parent, which itself will have
a new location after the update, and so on till it reaches the root. The
journal stops this propagation immediately. When the journal fills up
beyond a certain limit, it is flushed, and all of these changes are applied
to the FS tree in one go. This helps in wear reduction.
The journal mainly works with the CTZ layer; any update to a CTZ list
through this layer automatically adds a log for it in the journal.
The journal starts with a magic sequence, then the number of blocks in the
journal (excluding master blocks), and then follows an array with the block
numbers of the blocks in the journal (including the master blocks). Following
this, logs are stored in the blocks.
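
The start of the journal's first block could thus be pictured as below.
The field widths here are assumptions of this sketch; the design fixes
the ordering, and an 8-byte magic sequence:

.. code-block:: c

   #include <stdint.h>

   #define JRNL_MAGIC_LEN 8

   /* Illustrative layout of the beginning of the journal. */

   struct jrnl_header
   {
     uint8_t  magic[JRNL_MAGIC_LEN]; /* Found by scanning on mount */
     uint32_t n;                     /* Blocks, sans the master pair */
     uint32_t blocks[];              /* The n + 2 block numbers */
   };

   /* Logs, each followed by its checksum, fill the remaining area of
    * the first n blocks.
    */
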
Master Node and Root
--------------------
The root of the file system is treated like any other directory as far as
its storage on the flash is concerned. This is because the master node acts
as a parent of the root, and contains information about the root in a form
identical to a direntry.
The master node is stored in the master blocks. There are two master
blocks, and they are duplicates of each other for backup. Each master block
is a full block, and thus has multiple pages in it. Each page contains one
revision of the master node. The master node revisions are written to these
master blocks in a sequential manner, and carry a timestamp as well.
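
Since the revisions fill consecutive pages, finding the latest master node
at mount time can be sketched as follows (``rw_read_page`` as in the R/W
layer sketch above; a freshly erased NAND page reads back as all ``0xFF``
bytes):

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   struct nand_dev; /* From the R/W layer sketch */

   int rw_read_page(struct nand_dev *dev, uint32_t pg, uint8_t *buf);

   static bool page_erased(const uint8_t *buf, size_t len)
   {
     for (size_t i = 0; i < len; i++)
       {
         if (buf[i] != 0xff)
           {
             return false;
           }
       }

     return true;
   }

   /* The latest master node sits on the last programmed page before
    * the first erased one. Returns its page index within the master
    * block, or -1 if no master node has been written yet.
    */

   static int latest_master(struct nand_dev *dev, uint32_t first_pg,
                            uint32_t pages_per_block, size_t page_size,
                            uint8_t *buf)
   {
     int latest = -1;

     for (uint32_t i = 0; i < pages_per_block; i++)
       {
         if (rw_read_page(dev, first_pg + i, buf) < 0 ||
             page_erased(buf, page_size))
           {
             break;
           }

         latest = (int)i;
       }

     return latest;
   }
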
When a CTZ list is moved to a new location, the obsolete pages of the old
CTZ list are marked for deletion.
LRU Cache
---------
Mnemofs has a Least Recently Used (LRU) Cache component. The main use of this
component is to reduce wear on the flash at the expense of memory.
The LRU is a kernel list of nodes. Each node represents an FS object. Each
node also contains a kernel list of deltas. Each delta contains information
about an update or deletion from the user (which is what all of the VFS write
operations can be condensed to).
There's a pre-configured limit for both deltas per node and nodes in the LRU.
If the delta list in a node is full, and another is to be added, all the
existing deltas in the list are clubbed together and written to the flash
using the CTZ layer. The layer also automatically adds a log for this update.
When a node receives a delta, it is bumped from its current position in the
LRU to the front. This way, the last node in the LRU is always the least
recently used node.
If the node limit is reached in the LRU, and a new node is to be added to the
LRU, then the final node (which is also the least recently used node), is
popped from the LRU to make space for the new node. This popped node is then
written to the flash using the CTZ layer as well.
The LRU helps in clubbing updates to a single FS object and thus helps in
reducing the wear of the flash.
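
The shape of the LRU and its bump-to-front behaviour can be sketched as
follows (hypothetical field names; the real structures use NuttX kernel
lists and carry more bookkeeping):

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* One accumulated change requested by the user. */

   struct delta
   {
     struct delta *next; /* Next delta of the same node */
     uint32_t off;       /* Offset the change applies to */
     uint32_t len;       /* Number of bytes affected */
     uint8_t *data;      /* New bytes, or NULL for a deletion */
   };

   /* One FS object (file or directory) in the LRU. */

   struct lru_node
   {
     struct lru_node *prev;
     struct lru_node *next;
     struct delta *deltas; /* Changes since the node was added */
   };

   struct lru
   {
     struct lru_node *head; /* Most recently used */
     struct lru_node *tail; /* Least recently used: popped when full */
   };

   /* On every new delta, move the node to the front so that the tail
    * always holds the least recently used node.
    */

   static void lru_bump(struct lru *l, struct lru_node *n)
   {
     if (l->head == n)
       {
         return;
       }

     /* Unlink the node from its current position... */

     if (n->prev)
       {
         n->prev->next = n->next;
       }

     if (n->next)
       {
         n->next->prev = n->prev;
       }

     if (l->tail == n)
       {
         l->tail = n->prev;
       }

     /* ...and relink it at the front. */

     n->prev = NULL;
     n->next = l->head;
     if (l->head)
       {
         l->head->prev = n;
       }

     l->head = n;
     if (l->tail == NULL)
       {
         l->tail = n;
       }
   }
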
Journal Flush
-------------
The latest master node revision is the most useful of the revisions. Since
in CoW it's prudent to update the FS tree from the bottom up, the root is
the last to be updated during a journal flush.
The logs are just location updates. So, when a journal flush occurs, it will
update the locations of all the children in the parent. This updates the
parent, and then this update goes through the same procedure as any other
update.
This is why it's best to start the flush operation when the journal is
filled up beyond a certain limit, instead of waiting for it to be full.
Why? Any log of a parent makes every log of its children written
**before** it useless, as the updated location of the parent can be read
to get the updated location of the child up to that point in the logs.
So, it is best to move up the FS tree from the bottom during a flush and
update the root last, since the root is the parent of every FS object.
Once the root is updated, all other journal logs become useless, and can be
erased. The root's log is not written in the first ``n`` blocks of the
journal, but written as a new master node entry in the master blocks.
Once the new root is written, the first ``n`` blocks can be erased, and
reallocated (due to the rules of wear levelling). The master blocks however
have some conditions for reallocation. This is called moving of the journal.
Every time the first ``n`` blocks are cleared, a new master node is added.
The only time a master block needs to be erased is when it becomes full.
Thus, if there are ``p`` pages in a block, the master blocks will be moved
along with the rest of the journal once every ``p`` journal flushes. For
example, with 64 pages per block, the master blocks move once for every 64
journal flushes.
Before the new master node is written, none of the old pages should be
erased, to allow rollback to the previous FS tree state. The moment the new
master node is written, any block that has all of its pages ready for
deletion will be erased to make space.
FS Object Layer
---------------
This layer provides an abstraction for iterating, adding, deleting or
reading direntries.
This works with the LRU and the journal to get the latest data and thus the
user of this layer does not have to worry about these underlying mnemofs
components.
VFS Method Layer
----------------
The VFS method layer contains the methods exposed to the VFS. It works with
the FS Object layer for direntry-related tasks, and with the LRU for
file-level read/write tasks.
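
For instance, a write could funnel down roughly like this (hypothetical
helper and simplified signature; the actual NuttX VFS methods and mnemofs
internals differ in detail):

.. code-block:: c

   #include <stddef.h>
   #include <sys/types.h>

   struct file; /* NuttX VFS open-file handle */

   /* Hypothetical LRU-layer helper: record an update as a delta. */

   int lru_add_delta(struct file *filep, off_t off, const char *data,
                     size_t len);

   /* Sketch of a VFS write method: it only records a delta in the LRU;
    * actually touching the flash is deferred to the LRU, the CTZ layer
    * and the journal underneath.
    */

   static ssize_t mnemofs_write(struct file *filep, off_t pos,
                                const char *buffer, size_t buflen)
   {
     int ret = lru_add_delta(filep, pos, buffer, buflen);
     return ret < 0 ? ret : (ssize_t)buflen;
   }
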