docs/fs/mnemofs: Update new design and NAND flash structure
Add documentation about the new mnemofs design and the NAND flash structure.

Signed-off-by: Saurav <resyfer.dev@gmail.com>

you can mount it with ``mnemofs`` to a location like ``/mydir`` using::

    mount -t mnemofs /dev/nand /mydir

The above command will only work if the device was already formatted using
mnemofs. For a brand new device, or if you want to switch from an existing
file system, this won't work, and would need a format::

    mount -t mnemofs -o forceformat /dev/nand /mydir

system after all... to hide the storage device's peculiarities behind an
abstraction. A file system is considered good if you don't have to think
about its existence during regular usage.

NAND Flashes
============

Programmatically, the NAND flash has some problems. The whole device can be
condensed into three layers: blocks, pages and cells.

Cells are the smallest unit of storage in NAND flashes, but are often
ignored, as direct access to them is not allowed. If a cell stores one bit,
it is a Single Level Cell (SLC); there are MLC, TLC, etc. for more bits per
cell. Often, the more bits per cell, the lower the wear resilience, so cells
with more bits per cell wear out more easily and become unreliable sooner.

Pages are the smallest readable or writable unit of the NAND flash. A page
is made up of several cells, and can be expected to have a size on the
order of 512 B.

Blocks are the smallest erasable unit of NAND flash. They are made up of
several pages. If a page is already written, it needs to be erased before it
can be written again, and since blocks are the smallest erasable unit, the
entire block needs to be erased if the user wants to update the contents of
even one page.

The erase operation is what causes a block to wear out. If a block is worn
out too much, it will lose its ability to reliably store data. An unreliable
block can not guarantee that the data read from the pages in it is the same
as what was written to it. This state is called a bad block.

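To make the cost concrete, here is a minimal, self-contained sketch (not
mnemofs code; the geometry values and names are made up) of what updating a
single page in place amounts to on NAND, and why every such update wears
the block:

.. code-block:: c

   /* Simulated NAND: updating one page forces erasing its whole block. */

   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   #define PAGE_SIZE       512          /* example geometry */
   #define PAGES_PER_BLOCK 64
   #define NUM_BLOCKS      8

   static uint8_t  flash[NUM_BLOCKS * PAGES_PER_BLOCK][PAGE_SIZE];
   static unsigned erase_count[NUM_BLOCKS];       /* wear per block */

   static void erase_block(uint32_t blk)
   {
     /* An erase resets every page in the block to 0xff and wears it. */

     memset(flash[blk * PAGES_PER_BLOCK], 0xff,
            PAGES_PER_BLOCK * PAGE_SIZE);
     erase_count[blk]++;
   }

   /* In-place update: read, modify, erase the block, write it all back. */

   static void update_page(uint32_t page, const uint8_t *data)
   {
     static uint8_t buf[PAGES_PER_BLOCK][PAGE_SIZE];
     uint32_t blk   = page / PAGES_PER_BLOCK;
     uint32_t first = blk * PAGES_PER_BLOCK;

     memcpy(buf, flash[first], sizeof(buf));   /* save the other pages */
     memcpy(buf[page - first], data, PAGE_SIZE);
     erase_block(blk);                         /* smallest erasable unit */
     memcpy(flash[first], buf, sizeof(buf));   /* write everything back */
   }

   int main(void)
   {
     uint8_t data[PAGE_SIZE] = {42};

     update_page(5, data);                           /* one page updated... */
     printf("block 0 erases: %u\n", erase_count[0]); /* ...one full erase */
     return 0;
   }
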
A manufacturer can also deem a block to be unreliable from their testing,
and can mark it as a bad block right from manufacture.

A good file system will aim to level out the wear between blocks as much as
it can.

Design
======

mnemofs is designed to be a middle ground between flash storage consumption,
memory consumption, wear and speed. It sacrifices a little bit of everything,
and ends up being acceptably good in all of them, instead of sacrificing
multiple aspects and being good in only one.

There are various layers and components in mnemofs, and they interact with
each other through layers of abstraction.

mnemofs consists of several components; however, a walkthrough of the process
by which a change requested by a user ends up being written to the NAND flash
serves well as an introduction. The details are explained further below.

mnemofs works on a Copy-On-Write (CoW) basis, which means that if a page
needs to be updated, it is copied over in memory, the change is applied, and
the new data is written to a new location.

The user requests some changes, say, add ``x`` bytes at offset ``y`` in a
file. This change is copied into the LRU cache of mnemofs. This LRU cache
exists in-memory, and serves as a tool for wear reduction.

This LRU cache is a kernel list of nodes. Each node represents a file or a
directory. When the LRU is full, the last node is popped from this list, and
the changes it contains, which are an accumulation of the changes requested
by the user for that particular file or directory since the node was added
to the LRU cache, are written to the flash.

Each file or directory is represented by a `CTZ skip list <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_,
and the only attributes required to access the list are the index of the
last CTZ skip list block, the page number of that CTZ skip list block, and
the size of the file. In mnemofs, CTZ skip list blocks take up exactly one
page on the flash.

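For illustration, such a location can be described by a small structure; the
names here are hypothetical, not the actual mnemofs definitions:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>

   /* Everything needed to find and read a file's CTZ skip list. */

   struct ctz_loc
   {
     uint32_t idx;   /* index of the last CTZ skip list block */
     uint32_t pg;    /* page number holding that block        */
     size_t   sz;    /* size of the file in bytes             */
   };
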
mnemofs works in a Copy-On-Write manner, similar to littlefs. When a CTZ
skip list is updated, the new location is added to the journal of mnemofs
as a log. This log contains some information about the location of the new
CTZ list, the path it belongs to, etc., and then the updated location is
added as an update to its parent's CTZ skip list, which undergoes the same
process. The log is appended with a checksum of the entire log, which gives
an assurance that the saved log was indeed saved completely before a power
loss.

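The idea of "a log followed by its checksum" can be sketched as follows; the
layout and the checksum function are stand-ins, not the actual mnemofs
on-flash format:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>
   #include <stdio.h>

   /* Stand-in checksum; mnemofs would use its own. */

   static uint32_t checksum(const uint8_t *p, size_t n)
   {
     uint32_t c = 5381;

     while (n--)
       {
         c = (c << 5) + c + *p++;   /* djb2-style hash */
       }

     return c;
   }

   struct log
   {
     uint32_t old_pg;  /* old location of the CTZ list  */
     uint32_t new_pg;  /* new location after the update */
     uint32_t cksum;   /* checksum of the fields above  */
   };

   /* A log whose stored checksum does not match was cut short by a
    * power loss, and is ignored on mount.
    */

   static int log_valid(const struct log *l)
   {
     return l->cksum == checksum((const uint8_t *)l,
                                 offsetof(struct log, cksum));
   }

   int main(void)
   {
     struct log l = { .old_pg = 7, .new_pg = 42 };

     l.cksum = checksum((const uint8_t *)&l, offsetof(struct log, cksum));
     printf("valid: %d\n", log_valid(&l));
     return 0;
   }
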
The journal is a modified singly linked list of blocks on the flash that
contains logs of changes in the file system. The last two blocks of the
journal are reserved for master blocks, hence the number of blocks in the
journal will be referred to as ``n + 2`` blocks.

The area on storage other than the journal contains a certain "base" state
of the file system. All changes to the base state since then are written to
the journal. The first block of the journal starts with an 8 byte magic
sequence that identifies the start of the journal (on mount), followed by
the number of blocks in the journal, and finally an array of all the
``n + 2`` block numbers that are part of the journal. After this part, the
entire area in the ``n`` blocks contains logs and their checksums.

The last two blocks of a journal are called the master blocks, and they
store multiple instances of the master node. The two blocks are duplicates
of each other, each instance of the master node takes up one page, and the
instances are written to the master blocks in a sequential manner. The
master node points to the root.

When the first ``n`` blocks of the journal are full, they are flushed, and
since the root is updated in the process as well, a new master node is
written. Once the new master node is written, the file system's base state
is updated, and thus the old, obsolete pages can be erased (if possible).
The first ``n`` blocks of the journal therefore move more often than the
master blocks.

The block allocator of mnemofs is heavily inspired by littlefs's. It starts
from a random block, and allocates pages or blocks sequentially in a
circular manner. It skips pages when a whole block is required, but since
blocks are required only by internal structures, they are always requested
in bulk, which minimizes wastage. However, unlike littlefs, mnemofs keeps
an in-memory bitmap of the pages that are currently in use, as well as a
count, for each block, of the pages in it that want to be erased.

In mnemofs, the bind might take a lot of time in the worst theoretical
case, as mounting is an ``O(n)`` process; however, this is not the case in
real life. mnemofs only needs to scan the first page of every block in the
device to look for the start of the journal. Leaving the actual location of
the journal aside, this is quite fast in practice: the larger the storage
capacity of a device, the larger its pages and the larger the number of
pages per block, so the number of blocks in the device does not increase at
a rate similar to the increase in storage capacity. Further, the journal
array, which contains the block numbers of every block in the journal, sits
very close to the start of the journal, and mnemofs can quickly jump from
there to the latest master node and scan the file system for used pages.

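A sketch of that mount-time scan, with hypothetical driver hooks and a
placeholder magic value:

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define PAGES_PER_BLOCK 64
   #define PAGE_SIZE       512

   extern int nand_read_page(uint32_t page, uint8_t *buf);

   static const uint8_t g_magic[8] = "mnemoFS!";   /* placeholder value */

   /* Returns the block whose first page starts the journal, or -1.
    * Only one page per block is read, so the scan is O(blocks).
    */

   int32_t find_journal(uint32_t num_blocks)
   {
     uint8_t buf[PAGE_SIZE];
     uint32_t b;

     for (b = 0; b < num_blocks; b++)
       {
         nand_read_page(b * PAGES_PER_BLOCK, buf);  /* first page only */
         if (memcmp(buf, g_magic, sizeof(g_magic)) == 0)
           {
             return (int32_t)b;
           }
       }

     return -1;
   }
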
R/W Layer
---------

This layer works with the NAND flash device driver directly. It can write
an entire page, read an entire page, erase an entire block, check if a
block is bad (from its bad block marker), or set a block as bad. It's the
simplest layer.

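As an illustration, the operations this layer needs from the driver can be
summarized as below; the names are illustrative, though NuttX NAND drivers
expose similar operations through the MTD interface:

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>

   /* The small set of primitives the R/W layer is built on. */

   struct rw_ops
   {
     int  (*read_page)(uint32_t page, uint8_t *buf);
     int  (*write_page)(uint32_t page, const uint8_t *buf);
     int  (*erase_block)(uint32_t block);
     bool (*is_bad)(uint32_t block);    /* reads the bad block marker */
     int  (*mark_bad)(uint32_t block);  /* sets the bad block marker  */
   };
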
Block Allocator
---------------

The block allocator contains two arrays. One is a bitmask for tracking the
free pages, while the other is an array of numbers, one for each block,
denoting the number of pages in that block that are ready to be erased.

The block allocator allocates pages or blocks in a sequential manner to
keep it fair for all pages, thus ensuring wear levelling. It also starts
from a random offset to prevent bias towards the front of the device in
case of multiple power losses and reinitializations. If a block is
required, it skips pages up to the start of the next block. Since block
allocations happen only in the journal, they happen in bulk, and the number
of skipped pages is minimal.

Once the allocator reaches the end of the device, it cycles back to the
front. Thus any skipped pages get the chance to be allocated in the next
cycle.

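A self-contained sketch of this bookkeeping (simplified, with made-up
geometry; not the actual mnemofs implementation):

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>
   #include <stdlib.h>

   #define NUM_BLOCKS      1024
   #define PAGES_PER_BLOCK 64
   #define NUM_PAGES       (NUM_BLOCKS * PAGES_PER_BLOCK)

   static uint8_t  g_used[NUM_PAGES / 8];   /* bitmask of in-use pages  */
   static uint16_t g_todel[NUM_BLOCKS];     /* pages ready to be erased */
   static uint32_t g_cursor;                /* sequential scan position */

   static bool page_used(uint32_t pg)
   {
     return (g_used[pg / 8] >> (pg % 8)) & 1;
   }

   /* Allocate one page, scanning sequentially and circularly so every
    * page gets allocated equally often over time.
    */

   static int32_t alloc_page(void)
   {
     uint32_t i;

     for (i = 0; i < NUM_PAGES; i++)
       {
         uint32_t pg = (g_cursor + i) % NUM_PAGES;  /* wraps around */

         if (!page_used(pg))
           {
             g_used[pg / 8] |= 1 << (pg % 8);
             g_cursor = (pg + 1) % NUM_PAGES;
             return (int32_t)pg;
           }
       }

     return -1;                                     /* flash is full */
   }

   /* A CoW update obsoletes a page: its block moves closer to erasure. */

   static void page_obsolete(uint32_t pg)
   {
     g_todel[pg / PAGES_PER_BLOCK]++;
   }

   int main(void)
   {
     int32_t pg;

     g_cursor = (uint32_t)rand() % NUM_PAGES;  /* random start: no bias */
     pg = alloc_page();
     if (pg >= 0)
       {
         page_obsolete((uint32_t)pg);
       }

     return 0;
   }
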
CTZ Layer
---------

This layer works with the R/W Layer, and acts as an abstraction layer for
other components in mnemofs. mnemofs uses
`CTZ skip lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
to represent both files and directories on the flash. CTZ lists of files
contain only the data of the file, while CTZ lists of directories contain
directory entries (direntries) for each FS object (file or directory)
grouped into them.

This layer abstracts away the complex division of flash space that's
present in CTZ skip lists, and allows users of this layer to not worry
about the complexities of a CTZ skip list, and, in fact, to treat the data
as if it were one contiguous space.

This layer allows the user to specify a data offset, which refers to the
offset into the actual data stored in the CTZ skip list (i.e. excluding the
pointers), and a number of bytes, and to perform operations on the CTZ list
almost as if it were a single array.

In mnemofs, each CTZ block takes up the space of exactly 1 page, and each
pointer takes up 4 bytes.

The littlefs design document shows how a CTZ list can be identified using
the index of the last CTZ block, and the page number of that CTZ block.

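The offset translation this layer performs can be sketched as below,
assuming the layout described above: each CTZ block is one page, and block
``k`` (for ``k >= 1``) starts with ``ctz(k) + 1`` pointers of 4 bytes each.
littlefs derives a closed-form expression for this; a loop keeps the idea
visible:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   #define PAGE_SIZE 512u
   #define PTR_SIZE  4u

   /* Number of trailing zero bits in k (k > 0). */

   static unsigned ctz32(uint32_t k)
   {
     unsigned n = 0;

     while (!(k & 1))
       {
         k >>= 1;
         n++;
       }

     return n;
   }

   /* Bytes of file data that CTZ block k can hold. */

   static uint32_t data_in_block(uint32_t k)
   {
     uint32_t nptrs = (k == 0) ? 0 : ctz32(k) + 1;
     return PAGE_SIZE - PTR_SIZE * nptrs;
   }

   /* Map a data offset (pointers excluded) to a CTZ block index and an
    * offset inside that block's data area.
    */

   static void ctz_locate(uint32_t off, uint32_t *blk, uint32_t *boff)
   {
     uint32_t k = 0;

     while (off >= data_in_block(k))
       {
         off -= data_in_block(k);
         k++;
       }

     *blk  = k;
     *boff = off;
   }

   int main(void)
   {
     uint32_t blk;
     uint32_t off;

     ctz_locate(1000, &blk, &off);
     printf("data offset 1000 -> block %u, byte %u\n", blk, off);
     return 0;
   }
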
Journal
-------

The journal in mnemofs is made out of ``n + 2`` blocks. The last two blocks
concern the master node. These blocks are arranged in a singly linked list.

Due to the CoW policy, when a CTZ list is updated, it gets a new location.
The first ``n`` blocks of the journal are responsible for storing logs
containing information about this very update. Each log contains the old
location of the CTZ skip list, and the new location.

Thus, when the user requires the updated location of a CTZ list, they will
first find the old location by traversing the FS tree on the flash, and
then traverse the journal to find the latest location.

So, the FS tree on the flash acts like a "base" state, with updates stored
in the journal. Each log in the journal is followed by a checksum to verify
that all of it was written properly. This helps in making it power loss
resilient.

The journal, when combined with CoW, plays another important role. In pure
CoW, any update to a CTZ file will result in it having a new location. This
new location will need to be updated in the parent, which itself will have
a new location after the update, and so on till it reaches the root. The
journal stops this propagation immediately. When the journal is filled
above a certain limit, it is flushed, and all of these changes are applied
to the FS tree in one go. This helps in wear reduction.

The journal mainly works with the CTZ layer, and any update to a CTZ list
through that layer automatically adds a log for it to the journal.

The journal starts with a magic sequence, then the number of blocks in the
journal (excluding master blocks), and then an array of the block numbers
of the blocks in the journal (including the master blocks). Following this,
logs are stored in the blocks.

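Put together, the start of the journal's first block could look like this;
the field names and types are assumptions for illustration, not the real
on-flash format:

.. code-block:: c

   #include <stdint.h>

   struct jrnl_start
   {
     uint8_t  magic[8];   /* 8 byte magic sequence, searched for on mount */
     uint16_t n;          /* number of log blocks, master blocks excluded */

     /* Followed on flash by:
      *   - the n + 2 block numbers making up the journal
      *   - logs, each trailed by its checksum, filling the n log blocks
      */
   };
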
Master Node and Root
--------------------

The root of the file system is treated like any other directory as far as
its storage on the flash is concerned. This is because the master node acts
as a parent to the root, and contains information about the root in a way
identical to direntries.

The master node is stored in the master blocks. There are two master
blocks, and they are duplicates of each other for backup. Each master block
is a full block, and thus has multiple pages in it. Each page contains one
revision of the master node. The master nodes are written sequentially, and
have a timestamp on them as well.

When a CTZ list is moved to a new location, the obsolete pages of the old
CTZ list are marked for deletion.

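A sketch of what one master node revision (one page in a master block)
records, per the description above; the names and types are illustrative:

.. code-block:: c

   #include <stdint.h>

   struct master_node
   {
     uint32_t root_idx;   /* index of the last CTZ block of the root */
     uint32_t root_pg;    /* page number of that CTZ block           */
     uint32_t root_sz;    /* size of the root directory              */
     int64_t  ts;         /* timestamp; the newest revision wins     */
   };
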
LRU Cache
---------

mnemofs has a Least Recently Used (LRU) cache component. The main use of
this component is to reduce wear on the flash at the expense of memory.

The LRU is a kernel list of nodes. Each node represents an FS object. Each
node also contains a kernel list of deltas. Each delta contains information
about an update or a deletion requested by the user (which is what all of
the VFS write operations can be condensed to).

There is a pre-configured limit on both the deltas per node and the nodes
in the LRU.

If the delta list in a node is full, and another delta is to be added, all
the existing deltas in the list are clubbed together and written to the
flash using the CTZ layer. That layer also automatically adds a log for
this update.

When a node receives a delta, it is bumped from its current location in the
LRU to the front. This way, the last node in the LRU is always the least
recently used node.

If the node limit of the LRU is reached, and a new node is to be added, the
final node (which is also the least recently used node) is popped from the
LRU to make space for the new node. This popped node is then written to the
flash using the CTZ layer as well.

The LRU helps in clubbing together updates to a single FS object, and thus
helps in reducing the wear of the flash.

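The structures involved can be sketched as below; plain pointers stand in
for the kernel list macros, and all names are hypothetical:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>

   struct delta
   {
     uint32_t      off;    /* file offset the change applies to */
     uint32_t      len;    /* number of bytes in the change     */
     uint8_t      *buf;    /* the new bytes (NULL for deletes)  */
     struct delta *next;
   };

   struct lru_node
   {
     struct delta    *deltas;   /* accumulated changes for one FS object */
     struct lru_node *next;
   };

   static struct lru_node *g_lru;   /* front = most recently used */

   /* Bump a node to the front on every new delta; the tail is then
    * always the least recently used node, the one popped and written
    * to flash when the LRU is full.
    */

   static void lru_bump(struct lru_node *node)
   {
     struct lru_node **pp = &g_lru;

     while (*pp != NULL && *pp != node)
       {
         pp = &(*pp)->next;
       }

     if (*pp != NULL)
       {
         *pp = node->next;   /* unlink from the current position */
       }

     node->next = g_lru;     /* push to the front */
     g_lru      = node;
   }
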
Journal Flush
-------------

The latest master node revision is the most useful of the revisions. Since
in CoW it's prudent to update the FS tree bottom-up, the root is the last
thing to be updated during a journal flush.

The logs are just location updates. So, when a journal flush occurs, it
updates the locations of all the children in the parent. This updates the
parent, and this update then goes through the same procedure as any other
update.

This is why it's best to start the flush operation when the journal is
filled above a certain limit, instead of waiting for it to be full. Why?
Any log of a parent makes any log of its children written **before** it
useless, as the updated location of the parent can be read to get the
updated locations of the children up to that point in the logs.

So, it is best to move up the FS tree from the bottom during an update, and
to update the root last, since the root is the parent of every FS object.

Once the root is updated, all other journal logs become useless, and can be
erased. The root's log is not written in the first ``n`` blocks of the
journal, but written as a new master node entry in the master blocks.

Once the new root is written, the first ``n`` blocks can be erased and
reallocated (due to the rules of wear levelling). The master blocks,
however, have some conditions for reallocation. This is called moving of
the journal.

Every time the first ``n`` blocks are cleared, a new master node is added.
The only time a master block needs to be erased is when it becomes full.
Thus, if there are ``p`` pages in a block, the master blocks will be moved
along with the rest of the journal once every ``p`` journal flushes.

Before the new master node is written, none of the old pages should be
erased, to allow rollback to the previous FS tree state. The moment the new
master node is written, any block which has all of its pages ready for
deletion is erased to make space.

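In outline, the flush could read as below; every helper here is
hypothetical pseudocode, not the mnemofs API:

.. code-block:: c

   struct log;                                  /* one journal entry      */

   extern int  depth_of(const struct log *log); /* distance from the root */
   extern void apply_to_parent(const struct log *log);
   extern void write_master_node(void);         /* publishes the new root */
   extern void erase_obsolete_blocks(void);     /* fully-dead blocks only */

   /* Apply logs bottom-up: deepest FS objects first, root last, so each
    * parent is rewritten only once per flush.
    */

   void journal_flush(struct log **logs, int nlogs, int max_depth)
   {
     int d;
     int i;

     for (d = max_depth; d >= 1; d--)
       {
         for (i = 0; i < nlogs; i++)
           {
             if (depth_of(logs[i]) == d)
               {
                 apply_to_parent(logs[i]);   /* makes a log one level up */
               }
           }
       }

     write_master_node();          /* the root's "log" is a master node  */
     erase_obsolete_blocks();      /* safe only after the new master     */
   }
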
FS Object Layer
---------------

This layer provides an abstraction for iterating over, adding, deleting and
reading direntries.

It works with the LRU and the journal to get the latest data, and thus the
user of this layer does not have to worry about these underlying mnemofs
components.

VFS Method Layer
----------------

The VFS method layer contains the methods exposed to the VFS. This layer
works with the FS Object Layer for direntry related tasks, and with the LRU
for file level read/write tasks.