docs/fs/mnemofs: Update new design and NAND flash structure
Add documentation about the new mnemofs design and the NAND flash structure.

Signed-off-by: Saurav <resyfer.dev@gmail.com>

The above command will only work if the device was already formatted using
mnemofs. For a brand new device, or if you want to switch from an existing
file system, this won't work, and would need a format::

    mount -t mnemofs -o forceformat /dev/nand /mydir

system after all... to hide the storage device's peculiarities behind an
abstraction. A file system is considered good if you don't have to think
about its existence during regular usage.

NAND Flashes
============

Programmatically, the NAND flash has some quirks. The whole device can be
condensed into three layers: blocks, pages and cells.

Cells represent the smallest unit of storage in NAND flashes, but are often
ignored, as direct access is not allowed. If a cell stores one bit, it's a
Single Level Cell (SLC). There are MLC, TLC, etc. for more bits per cell.
Often, the more bits per cell, the lower the wear resilience. Thus, cells
with more bits per cell wear out faster and become unreliable sooner.

Pages are the smallest readable or writable unit of the NAND flash. A page
is made up of several cells, and can be expected to have a size on the
order of 512 B.

Blocks are the smallest erasable unit of NAND flash. They are made up of
several pages. If a page is already written, it needs to be erased before
it can be written again. And since blocks are the smallest erasable unit,
the entire block needs to be erased even if the user wants to update the
contents of just one page.

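As an illustration of this constraint, below is a minimal sketch of what a
naive in-place update of a single page would cost. The ``PAGES_PER_BLOCK``
constant and the ``nand_*`` helpers are hypothetical, purely for
illustration::

    /* Naive in-place update of one page on a NAND flash: the whole
     * containing block must be read out, erased and rewritten.
     */

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE       512
    #define PAGES_PER_BLOCK 64

    extern void nand_read_page(uint32_t page, uint8_t *buf);
    extern void nand_write_page(uint32_t page, const uint8_t *buf);
    extern void nand_erase_block(uint32_t block);

    static uint8_t g_blkbuf[PAGES_PER_BLOCK * PAGE_SIZE];

    void update_page_inplace(uint32_t page, const uint8_t *new_data)
    {
      uint32_t blk  = page / PAGES_PER_BLOCK;
      uint32_t base = blk * PAGES_PER_BLOCK;

      /* Read every page of the block into RAM. */

      for (uint32_t i = 0; i < PAGES_PER_BLOCK; i++)
        {
          nand_read_page(base + i, g_blkbuf + i * PAGE_SIZE);
        }

      /* Apply the update in RAM. */

      memcpy(g_blkbuf + (page - base) * PAGE_SIZE, new_data, PAGE_SIZE);

      nand_erase_block(blk); /* This erase is what wears the block out. */

      /* Write every page back. */

      for (uint32_t i = 0; i < PAGES_PER_BLOCK; i++)
        {
          nand_write_page(base + i, g_blkbuf + i * PAGE_SIZE);
        }
    }

A Copy-on-Write design like the one described below avoids this pattern by
writing updated data to a fresh page instead.
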
The erase operation is what causes a block to wear out. If a block is worn
out too much, it will lose its ability to reliably store data. An
unreliable block cannot guarantee that the data read from its pages is the
same as what was written to them. Such a block is called a bad block.

A manufacturer can also deem a block to be unreliable from their testing,
and can mark it as a bad block right from manufacture.

A good file system will aim to level out the wear between blocks as much as
it can.

Design
======

There are various layers and components in mnemofs, and they interact with
each other through various levels of abstraction.

Mnemofs works on a Copy-On-Write (CoW) basis, which means that if a page
needs to be updated, it is copied over into memory, the change is applied
there, and the new data is written to a new location on the flash.

R/W Layer
---------

This layer works with the NAND flash device driver directly. It can write
an entire page, read an entire page, erase an entire block, check if a
block is bad (from its bad block marker), or set a block as bad. It's the
simplest layer.

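Conceptually, the surface of this layer can be summed up in a handful of
prototypes. The names and signatures below are illustrative assumptions,
not the actual mnemofs symbols::

    /* Hypothetical sketch of the R/W layer's interface. */

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>

    struct mfs_sb_s; /* Mount-wide state; contents omitted here. */

    ssize_t mfs_read_page(struct mfs_sb_s *sb, char *buf, size_t nbytes,
                          off_t page, off_t pgoff);
    ssize_t mfs_write_page(struct mfs_sb_s *sb, const char *buf,
                           size_t nbytes, off_t page, off_t pgoff);
    int     mfs_erase_blk(struct mfs_sb_s *sb, off_t blk);
    bool    mfs_blk_isbad(struct mfs_sb_s *sb, off_t blk);
    int     mfs_blk_markbad(struct mfs_sb_s *sb, off_t blk);

Every other component sits on top of operations of this shape.
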
Block Allocator
---------------

The block allocator contains two arrays. One is a bitmap for tracking the
free pages, while the other is an array of numbers, one number for each
block, denoting the number of pages in that block that are ready to be
erased.

The block allocator allocates pages or blocks in a sequential manner to
keep it fair for all pages, thus ensuring wear levelling. It also starts
from a random offset, to prevent bias towards the front of the device in
case of multiple power losses and reinitializations. If a whole block is
required, it skips pages up to the start of the next block. Since block
allocations happen only for the journal, they happen in bulk, and the
number of skipped pages is minimal.

Once it reaches the end of the device, the allocator cycles back to the
front. Thus any skipped pages get the chance to be allocated in the next
cycle.

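A minimal sketch of this allocator's state and of a single page allocation
is shown below; the structure and names are assumptions for illustration::

    /* Illustrative block allocator state and sequential allocation. */

    #include <stdbool.h>
    #include <stdint.h>

    struct mfs_ba_s
    {
      uint8_t  *pg_bmap;   /* One bit per page: 1 means in use.      */
      uint16_t *blk_delpg; /* Per block: pages ready to be erased.   */
      uint32_t  c_pg;      /* Next candidate page (random at start). */
      uint32_t  n_pgs;     /* Total pages on the device.             */
    };

    static bool ba_pg_inuse(const struct mfs_ba_s *ba, uint32_t pg)
    {
      return (ba->pg_bmap[pg >> 3] >> (pg & 7)) & 1;
    }

    /* Allocate one free page, scanning sequentially and wrapping
     * around, so skipped pages get a chance in the next cycle.
     */

    int ba_alloc_pg(struct mfs_ba_s *ba, uint32_t *pg_out)
    {
      for (uint32_t i = 0; i < ba->n_pgs; i++)
        {
          uint32_t pg = (ba->c_pg + i) % ba->n_pgs;
          if (!ba_pg_inuse(ba, pg))
            {
              ba->pg_bmap[pg >> 3] |= 1 << (pg & 7);
              ba->c_pg = (pg + 1) % ba->n_pgs;
              *pg_out  = pg;
              return 0;
            }
        }

      return -1; /* Device full. */
    }

A block allocation would be similar, but would first advance ``c_pg`` to
the next block boundary and then claim every page of that block.
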
CTZ Layer
---------

This layer works with the R/W Layer, and acts as an abstraction layer for
other components in mnemofs. Mnemofs uses
`CTZ skip lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
to represent both files and directories on the flash. CTZ lists of files
contain only the data of the file, while CTZ lists of directories contain
directory entries (direntries) for each FS object (file or directory)
grouped into them.

This layer abstracts away the complex division of flash space that's
present in CTZ skip lists, and allows users of this layer to not worry
about the complexities of a CTZ skip list, and in fact, to treat the data
as if it were one contiguous space.

This layer allows the user to specify a data offset, which refers to the
offset into the actual data stored in the CTZ skip list (i.e. excluding the
pointers), along with a number of bytes, and to perform operations on the
CTZ list almost as if it were a single array.

In mnemofs, each CTZ block takes up the space of exactly one page, and each
pointer takes up 4 bytes.

The littlefs design document shows how a CTZ list can be identified using
just the index of its last CTZ block and the page number of that CTZ block.

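In a CTZ skip list, block ``n`` holds pointers to blocks ``n - 2^0``,
``n - 2^1``, ..., ``n - 2^ctz(n)``, so seeking backwards through the list
takes a logarithmic number of hops. Below is a sketch of that backwards
walk, assuming a hypothetical ``ctz_read_ptr()`` helper that reads the
``k``-th pointer stored in a CTZ block::

    /* Walk from CTZ block 'from' (stored at page 'pg') back to block
     * 'to', taking the largest hop that does not overshoot each time.
     */

    #include <stdint.h>

    extern uint32_t ctz_read_ptr(uint32_t pg, unsigned k);

    uint32_t ctz_seek(uint32_t from, uint32_t pg, uint32_t to)
    {
      while (from > to)
        {
          unsigned k = __builtin_ctz(from); /* Largest hop is 2^k. */
          while (from - (1u << k) < to)
            {
              k--;
            }

          pg    = ctz_read_ptr(pg, k);
          from -= 1u << k;
        }

      return pg; /* Page storing CTZ block 'to'. */
    }
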
Journal
-------

The journal in mnemofs is made out of ``n + 2`` blocks. The last two blocks
concern the master node. These blocks are arranged in a singly linked list.

Due to the CoW policy, when a CTZ list is updated, it ends up at a new
location. The first ``n`` blocks of the journal are responsible for storing
logs containing information about this very update. A log contains the old
location of the CTZ skip list, and the new location.

Thus, when the user requires the updated location of a CTZ list, they will
first find the old location by traversing the FS tree on the flash, and
then traverse the journal to find the latest location.

So, the FS tree on the flash acts like a "base" state, with the updates
stored in the journal. Each log in the journal is followed by a checksum to
verify that all of it was written properly. This helps in making it power
loss resilient.

The journal, when combined with CoW, plays another important role. In pure
CoW, any update to a CTZ file will result in it having a new location. This
new location will need to be updated in the parent, which itself will have
a new location after the update, and so on till it reaches the root. The
journal stops this propagation immediately. When the journal is filled
above a certain limit, it will flush, and apply all of these changes to the
FS tree in one go. This helps in wear reduction.

The journal mainly works with the CTZ layer, and any update to a CTZ list
through that layer automatically adds a log for it in the journal.

The journal starts with a magic sequence, then the number of blocks in the
journal (excluding the master blocks), and then follows an array with the
block numbers of the blocks in the journal (including the master blocks).
Following this, the logs are stored in the blocks.

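Laid out as a C struct, the start of the first journal block might look
like the sketch below. The exact field widths are assumptions, though an
8-byte magic sequence and 4-byte block numbers are consistent with the
4-byte CTZ pointers mentioned above::

    /* Illustrative on-flash layout of the journal's first block. */

    #include <stdint.h>

    struct mfs_jrnl_hdr_s
    {
      uint8_t  magic[8]; /* Locates the journal start during mount.   */
      uint32_t n_blks;   /* 'n': log blocks, excluding master blocks. */

      /* Followed on flash by the n + 2 block numbers making up the
       * journal (uint32_t each, the last two being the master
       * blocks), and then by the logs, each trailed by a checksum.
       */
    };
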
Master Node and Root
--------------------

The root of the file system is treated like any other directory as far as
its storage on the flash is concerned. This is because the master node acts
as a parent to the root, and contains information about the root in a way
identical to direntries.

The master node is stored in the master blocks. There are two master
blocks, and they are duplicates of each other for backup. Each master block
is a full block, and thus has multiple pages in it. Each page contains one
revision of the master node. The master node revisions are written
sequentially, and carry a timestamp on them as well.

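Since the master node describes the root much like a direntry does, a
revision can stay small. The fields below are assumptions sketched from the
description above, not the actual mnemofs layout::

    /* Illustrative master node revision: it pins down the root CTZ
     * list, and the newest valid revision wins during mount.
     */

    #include <stdint.h>

    struct mfs_mn_s
    {
      uint32_t root_idx; /* Index of the root's last CTZ block.  */
      uint32_t root_pg;  /* Page number of that CTZ block.       */
      uint32_t root_sz;  /* Size of the root directory in bytes. */
      int64_t  ts;       /* Timestamp of this revision.          */
    };
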
When a CTZ list is moved to a new location, the obsolete pages of the old
CTZ list are marked for deletion.

LRU Cache
---------

Mnemofs has a Least Recently Used (LRU) cache component. The main use of
this component is to reduce wear on the flash at the expense of memory.

The LRU is a kernel list of nodes. Each node represents an FS object. Each
node also contains a kernel list of deltas. Each delta contains information
about an update or deletion requested by the user (which is what all of the
VFS write operations can be condensed to).

There's a pre-configured limit on both the deltas per node and the nodes in
the LRU.

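The shape of the cache is roughly the one sketched below; the names are
illustrative, and the kernel lists are shown as plain ``next`` pointers for
brevity::

    /* Illustrative LRU structures: a list of per-object nodes, each
     * holding a list of pending deltas.
     */

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct mfs_delta_s
    {
      struct mfs_delta_s *next;
      off_t    off;  /* Offset of the change in the file's data. */
      size_t   n;    /* Number of bytes updated or deleted.      */
      uint8_t *buf;  /* New data, or NULL to denote a deletion.  */
    };

    struct mfs_lru_node_s
    {
      struct mfs_lru_node_s *next;   /* Towards the LRU's tail. */
      struct mfs_delta_s    *deltas; /* Pending changes.        */
      size_t                 n_deltas;

      /* ...plus enough information (path, CTZ location) to identify
       * the FS object this node represents.
       */
    };
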
If the delta list in a node is full, and another delta is to be added, all
the existing deltas in the list are clubbed together and written to the
flash using the CTZ layer. That layer also automatically adds a log for
this update.

When a node receives a delta, it is bumped from its current location in the
LRU to the front. This way, the last node in the LRU is always the least
recently used node.

If the node limit of the LRU is reached, and a new node is to be added,
then the final node (which is also the least recently used node) is popped
from the LRU to make space for the new node. This popped node is then
written to the flash using the CTZ layer as well.

The LRU helps in clubbing together updates to a single FS object, and thus
helps in reducing the wear of the flash.

Journal Flush
-------------

The latest master node revision is the most useful of all the revisions.
Since in CoW it's prudent to update the FS tree from the bottom up, the
root is the last one to get updated in the case of a journal flush.

The logs are just location updates. So, when a journal flush occurs, it
will update the locations of all the children in their parent. This updates
the parent, and then that update goes through the same procedure as any
other update.

This is why it's best to start the flush operation when the journal is
filled above a certain limit, instead of waiting for it to be full. Why?
Any log of a parent makes any log of its children written **before** it
useless, as the updated location of the parent can be read to get the
updated location of the child till that point in the logs.

So, it is best to move up the FS tree from the bottom during the update,
and to update the root last, since the root is the ancestor of every FS
object.

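In outline, a flush might proceed as below; the helpers are hypothetical,
and only the ordering matters::

    /* Hypothetical outline of a journal flush, bottom-up. */

    extern int  fs_tree_depth(void);
    extern void apply_logs_at_depth(int depth);
    extern void write_new_master_node(void);
    extern void erase_obsolete_blks(void);

    void jrnl_flush(void)
    {
      /* Apply logs from the deepest FS objects upwards, so each
       * parent is rewritten once, with all of its children's new
       * locations folded in.
       */

      for (int depth = fs_tree_depth(); depth > 0; depth--)
        {
          apply_logs_at_depth(depth);
        }

      /* The root's update is not an ordinary log: it is written as
       * a new master node revision in the master blocks.
       */

      write_new_master_node();

      /* Only now, with rollback no longer needed, can blocks whose
       * pages are all obsolete be erased and reused.
       */

      erase_obsolete_blks();
    }
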
Once the root is updated, all other journal logs become useless, and can be
erased. The root's log is not written in the first ``n`` blocks of the
journal, but rather as a new master node entry in the master blocks.

Once the new root is written, the first ``n`` blocks can be erased and
reallocated (due to the rules of wear levelling). The master blocks,
however, have some conditions for reallocation. This is called moving the
journal.

Every time the first ``n`` blocks are cleared, a new master node is added.
The only time a master block needs to be erased is when it becomes full.
Thus, if there are ``p`` pages in a block, the master blocks will be moved
along with the rest of the journal once every ``p`` journal flushes.

Before the new master node is written, none of the old pages should be
erased, to allow rollback to the previous FS tree state. The moment the new
master node is written, any block which has all of its pages ready for
deletion will be erased to make space.

FS Object Layer
---------------

This layer provides an abstraction for iterating over, adding, deleting or
reading direntries.

It works with the LRU and the journal to get the latest data, and thus the
user of this layer does not have to worry about those underlying mnemofs
components.

VFS Method Layer
----------------

The VFS method layer contains the methods exposed to the VFS. This layer
works with the FS Object layer for direntry related tasks, and with the LRU
for file-level read/write tasks.

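As a rough illustration of this division of labour, a write funnels into
the LRU, while a direntry operation like unlink goes through the FS Object
layer. The function names here are illustrative, not the actual NuttX or
mnemofs symbols::

    /* Illustrative delegation from VFS-facing methods to the layers
     * described above.
     */

    #include <stddef.h>
    #include <sys/types.h>

    extern ssize_t lru_add_delta(const char *path, off_t off, size_t n,
                                 const char *buf);     /* LRU cache      */
    extern int     fsobj_rm_direntry(const char *path); /* FS Obj layer  */

    ssize_t mnemofs_write_sketch(const char *path, off_t off, size_t n,
                                 const char *buf)
    {
      /* write(2) becomes a delta in the LRU. */

      return lru_add_delta(path, off, n, buf);
    }

    int mnemofs_unlink_sketch(const char *path)
    {
      /* unlink(2) is a direntry-level task for the FS Object layer,
       * which consults the LRU and the journal underneath.
       */

      return fsobj_rm_direntry(path);
    }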