docs/fs/mnemofs: Update new design and NAND flash structure
Add documentation about the new mnemofs design and the NAND flash structure.

Signed-off-by: Saurav <resyfer.dev@gmail.com>

you can mount it with ``mnemofs`` to a location like ``/mydir`` using::

    mount -t mnemofs /dev/nand /mydir

The above command will only work if the device was already formatted using
mnemofs. For a brand new device, or if you want to switch from an existing
file system, this won't work, and would need a format::

    mount -t mnemofs -o forceformat /dev/nand /mydir

system after all... to hide the storage device's peculiarities behind an
abstraction. A file system is considered good if you don't have to think
about its existence during regular usage.

NAND Flashes
============

Programmatically, the NAND flash has some problems. The whole device can be
condensed into three layers: blocks, pages and cells.

Cells are the smallest unit of storage in NAND flashes, but are often
ignored, as direct access to them is not allowed. If a cell stores one bit,
it is a Single Level Cell (SLC); there are MLC, TLC, etc. for more bits per
cell. Often, the more bits per cell, the lower the wear resilience, so cells
with more bits per cell wear out more easily and become unreliable sooner.

Pages are the smallest readable or writable unit of the NAND flash. A page
is made up of several cells, and can be expected to have a size on the
order of 512 B.

Blocks are the smallest erasable unit of NAND flash. They are made up of
several pages. If a page is already written, it needs to be erased before it
can be written again, and since blocks are the smallest erasable unit, the
entire block needs to be erased if the user wants to update the contents of
even one page.

The erase operation is what causes a block to wear out. If a block is worn
out too much, it will lose its ability to reliably store data. An unreliable
block can not guarantee that the data read from the pages in it is the same
as what was written to it. This state is called a bad block.

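To make the cost concrete, here is a minimal, self-contained sketch (not
mnemofs code; the geometry values and names are made up) of what updating a
single page in place amounts to on NAND, and why every such update wears
the block:

.. code-block:: c

   /* Simulated NAND: updating one page forces erasing its whole block. */

   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   #define PAGE_SIZE       512          /* example geometry */
   #define PAGES_PER_BLOCK 64
   #define NUM_BLOCKS      8

   static uint8_t  flash[NUM_BLOCKS * PAGES_PER_BLOCK][PAGE_SIZE];
   static unsigned erase_count[NUM_BLOCKS];       /* wear per block */

   static void erase_block(uint32_t blk)
   {
     /* An erase resets every page in the block to 0xff and wears it. */

     memset(flash[blk * PAGES_PER_BLOCK], 0xff,
            PAGES_PER_BLOCK * PAGE_SIZE);
     erase_count[blk]++;
   }

   /* In-place update: read, modify, erase the block, write it all back. */

   static void update_page(uint32_t page, const uint8_t *data)
   {
     static uint8_t buf[PAGES_PER_BLOCK][PAGE_SIZE];
     uint32_t blk   = page / PAGES_PER_BLOCK;
     uint32_t first = blk * PAGES_PER_BLOCK;

     memcpy(buf, flash[first], sizeof(buf));   /* save the other pages */
     memcpy(buf[page - first], data, PAGE_SIZE);
     erase_block(blk);                         /* smallest erasable unit */
     memcpy(flash[first], buf, sizeof(buf));   /* write everything back */
   }

   int main(void)
   {
     uint8_t data[PAGE_SIZE] = {42};

     update_page(5, data);                           /* one page updated... */
     printf("block 0 erases: %u\n", erase_count[0]); /* ...one full erase */
     return 0;
   }
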
A manufacturer can also deem a block to be unreliable from their testing,
and can mark it as a bad block right from manufacture.

A good file system will aim to level out the wear between blocks as much as
it can.

Design
======

mnemofs is designed to be a middle ground between flash storage consumption,
memory consumption, wear and speed. It sacrifices a little bit of everything,
and ends up being acceptably good in all of them, instead of sacrificing
multiple aspects and being good in only one.

There are various layers and components in mnemofs, and they interact with
each other through layers of abstraction.

mnemofs consists of several components; however, a walkthrough of the process
by which a change requested by a user ends up being written to the NAND flash
serves well as an introduction. The details are explained further below.

mnemofs works on a Copy-On-Write (CoW) basis, which means that if a page
needs to be updated, it is copied over in memory, the change is applied, and
the new data is written to a new location.

The user requests some changes, say, add ``x`` bytes at offset ``y`` in a
file. This change is copied into the LRU cache of mnemofs. This LRU cache
exists in-memory, and serves as a tool for wear reduction.

This LRU cache is a kernel list of nodes. Each node represents a file or a
directory. When the LRU is full, the last node is popped from this list, and
the changes it contains, which are an accumulation of the changes requested
by the user for that particular file or directory since the node was added
to the LRU cache, are written to the flash.

Each file or directory is represented by a `CTZ skip list <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_,
and the only attributes required to access the list are the index of the
last CTZ skip list block, the page number of that CTZ skip list block, and
the size of the file. In mnemofs, CTZ skip list blocks take up exactly one
page on the flash.

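For illustration, such a location can be described by a small structure; the
names here are hypothetical, not the actual mnemofs definitions:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>

   /* Everything needed to find and read a file's CTZ skip list. */

   struct ctz_loc
   {
     uint32_t idx;   /* index of the last CTZ skip list block */
     uint32_t pg;    /* page number holding that block        */
     size_t   sz;    /* size of the file in bytes             */
   };
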
mnemofs works in a Copy-On-Write manner, similar to littlefs. When a CTZ
skip list is updated, the new location is added to the journal of mnemofs
as a log. This log contains some information about the location of the new
CTZ list, the path it belongs to, etc., and then the updated location is
added as an update to its parent's CTZ skip list, which undergoes the same
process. The log is appended with a checksum of the entire log, which gives
an assurance that the saved log was indeed saved completely before a power
loss.

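The idea of "a log followed by its checksum" can be sketched as follows; the
layout and the checksum function are stand-ins, not the actual mnemofs
on-flash format:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>
   #include <stdio.h>

   /* Stand-in checksum; mnemofs would use its own. */

   static uint32_t checksum(const uint8_t *p, size_t n)
   {
     uint32_t c = 5381;

     while (n--)
       {
         c = (c << 5) + c + *p++;   /* djb2-style hash */
       }

     return c;
   }

   struct log
   {
     uint32_t old_pg;  /* old location of the CTZ list  */
     uint32_t new_pg;  /* new location after the update */
     uint32_t cksum;   /* checksum of the fields above  */
   };

   /* A log whose stored checksum does not match was cut short by a
    * power loss, and is ignored on mount.
    */

   static int log_valid(const struct log *l)
   {
     return l->cksum == checksum((const uint8_t *)l,
                                 offsetof(struct log, cksum));
   }

   int main(void)
   {
     struct log l = { .old_pg = 7, .new_pg = 42 };

     l.cksum = checksum((const uint8_t *)&l, offsetof(struct log, cksum));
     printf("valid: %d\n", log_valid(&l));
     return 0;
   }
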
The journal is a modified singly linked list of blocks on the flash that
contains logs of changes in the file system. The last two blocks of the
journal are reserved for master blocks, hence the number of blocks in the
journal will be referred to as ``n + 2`` blocks.

The area on storage other than the journal contains a certain "base" state
of the file system. All changes to the base state since then are written to
the journal. The first block of the journal starts with an 8 byte magic
sequence that identifies the start of the journal (on mount), followed by
the number of blocks in the journal, and finally an array of all the
``n + 2`` block numbers that are part of the journal. After this part, the
entire area in the ``n`` blocks contains logs and their checksums.

The last two blocks of a journal are called the master blocks, and they
store multiple instances of the master node. The two blocks are duplicates
of each other, each instance of the master node takes up one page, and the
instances are written to the master blocks in a sequential manner. The
master node points to the root.

When the first ``n`` blocks of the journal are full, they are flushed, and
since the root is updated in the process as well, a new master node is
written. Once the new master node is written, the file system's base state
is updated, and thus the old, obsolete pages can be erased (if possible).
The first ``n`` blocks of the journal therefore move more often than the
master blocks.

The block allocator of mnemofs is heavily inspired by littlefs's. It starts
from a random block, and allocates pages or blocks sequentially in a
circular manner. It skips pages when a whole block is required, but since
blocks are required only by internal structures, they are always requested
in bulk, which minimizes wastage. However, unlike littlefs, mnemofs keeps
an in-memory bitmap of the pages that are currently in use, as well as a
count, for each block, of the pages in it that want to be erased.

In mnemofs, the bind might take a lot of time in the worst theoretical
case, as mounting is an ``O(n)`` process; however, this is not the case in
real life. mnemofs only needs to scan the first page of every block in the
device to look for the start of the journal. Leaving the actual location of
the journal aside, this is quite fast in practice: the larger the storage
capacity of a device, the larger its pages and the larger the number of
pages per block, so the number of blocks in the device does not increase at
a rate similar to the increase in storage capacity. Further, the journal
array, which contains the block numbers of every block in the journal, sits
very close to the start of the journal, and mnemofs can quickly jump from
there to the latest master node and scan the file system for used pages.

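A sketch of that mount-time scan, with hypothetical driver hooks and a
placeholder magic value:

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define PAGES_PER_BLOCK 64
   #define PAGE_SIZE       512

   extern int nand_read_page(uint32_t page, uint8_t *buf);

   static const uint8_t g_magic[8] = "mnemoFS!";   /* placeholder value */

   /* Returns the block whose first page starts the journal, or -1.
    * Only one page per block is read, so the scan is O(blocks).
    */

   int32_t find_journal(uint32_t num_blocks)
   {
     uint8_t buf[PAGE_SIZE];
     uint32_t b;

     for (b = 0; b < num_blocks; b++)
       {
         nand_read_page(b * PAGES_PER_BLOCK, buf);  /* first page only */
         if (memcmp(buf, g_magic, sizeof(g_magic)) == 0)
           {
             return (int32_t)b;
           }
       }

     return -1;
   }
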
R/W Layer
---------

This layer works with the NAND flash device driver directly. It can write
an entire page, read an entire page, erase an entire block, check if a
block is bad (from its bad block marker), or set a block as bad. It's the
simplest layer.

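As an illustration, the operations this layer needs from the driver can be
summarized as below; the names are illustrative, though NuttX NAND drivers
expose similar operations through the MTD interface:

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>

   /* The small set of primitives the R/W layer is built on. */

   struct rw_ops
   {
     int  (*read_page)(uint32_t page, uint8_t *buf);
     int  (*write_page)(uint32_t page, const uint8_t *buf);
     int  (*erase_block)(uint32_t block);
     bool (*is_bad)(uint32_t block);    /* reads the bad block marker */
     int  (*mark_bad)(uint32_t block);  /* sets the bad block marker  */
   };
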
Block Allocator
---------------

The block allocator contains two arrays. One is a bitmask for tracking the
free pages, while the other is an array of numbers, one for each block,
denoting the number of pages in that block that are ready to be erased.

The block allocator allocates pages or blocks in a sequential manner to
keep it fair for all pages, thus ensuring wear levelling. It also starts
from a random offset to prevent bias towards the front of the device in
case of multiple power losses and reinitializations. If a block is
required, it skips pages up to the start of the next block. Since block
allocations happen only in the journal, they happen in bulk, and the number
of skipped pages is minimal.

Once the allocator reaches the end of the device, it cycles back to the
front. Thus any skipped pages get the chance to be allocated in the next
cycle.

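A self-contained sketch of this bookkeeping (simplified, with made-up
geometry; not the actual mnemofs implementation):

.. code-block:: c

   #include <stdint.h>
   #include <stdbool.h>
   #include <stdlib.h>

   #define NUM_BLOCKS      1024
   #define PAGES_PER_BLOCK 64
   #define NUM_PAGES       (NUM_BLOCKS * PAGES_PER_BLOCK)

   static uint8_t  g_used[NUM_PAGES / 8];   /* bitmask of in-use pages  */
   static uint16_t g_todel[NUM_BLOCKS];     /* pages ready to be erased */
   static uint32_t g_cursor;                /* sequential scan position */

   static bool page_used(uint32_t pg)
   {
     return (g_used[pg / 8] >> (pg % 8)) & 1;
   }

   /* Allocate one page, scanning sequentially and circularly so every
    * page gets allocated equally often over time.
    */

   static int32_t alloc_page(void)
   {
     uint32_t i;

     for (i = 0; i < NUM_PAGES; i++)
       {
         uint32_t pg = (g_cursor + i) % NUM_PAGES;  /* wraps around */

         if (!page_used(pg))
           {
             g_used[pg / 8] |= 1 << (pg % 8);
             g_cursor = (pg + 1) % NUM_PAGES;
             return (int32_t)pg;
           }
       }

     return -1;                                     /* flash is full */
   }

   /* A CoW update obsoletes a page: its block moves closer to erasure. */

   static void page_obsolete(uint32_t pg)
   {
     g_todel[pg / PAGES_PER_BLOCK]++;
   }

   int main(void)
   {
     int32_t pg;

     g_cursor = (uint32_t)rand() % NUM_PAGES;  /* random start: no bias */
     pg = alloc_page();
     if (pg >= 0)
       {
         page_obsolete((uint32_t)pg);
       }

     return 0;
   }
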
CTZ Layer
---------

This layer works with the R/W Layer, and acts as an abstraction layer for
other components in mnemofs. mnemofs uses
`CTZ skip lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
to represent both files and directories on the flash. CTZ lists of files
contain only the data of the file, while CTZ lists of directories contain
directory entries (direntries) for each FS object (file or directory)
grouped into them.

This layer abstracts away the complex division of flash space that's
present in CTZ skip lists, and allows users of this layer to not worry
about the complexities of a CTZ skip list, and, in fact, to treat the data
as if it were one contiguous space.

This layer allows the user to specify a data offset, which refers to the
offset into the actual data stored in the CTZ skip list (i.e. excluding the
pointers), and a number of bytes, and to perform operations on the CTZ list
almost as if it were a single array.

In mnemofs, each CTZ block takes up the space of exactly 1 page, and each
pointer takes up 4 bytes.

The littlefs design document shows how a CTZ list can be identified using
the index of the last CTZ block, and the page number of that CTZ block.

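The offset translation this layer performs can be sketched as below,
assuming the layout described above: each CTZ block is one page, and block
``k`` (for ``k >= 1``) starts with ``ctz(k) + 1`` pointers of 4 bytes each.
littlefs derives a closed-form expression for this; a loop keeps the idea
visible:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   #define PAGE_SIZE 512u
   #define PTR_SIZE  4u

   /* Number of trailing zero bits in k (k > 0). */

   static unsigned ctz32(uint32_t k)
   {
     unsigned n = 0;

     while (!(k & 1))
       {
         k >>= 1;
         n++;
       }

     return n;
   }

   /* Bytes of file data that CTZ block k can hold. */

   static uint32_t data_in_block(uint32_t k)
   {
     uint32_t nptrs = (k == 0) ? 0 : ctz32(k) + 1;
     return PAGE_SIZE - PTR_SIZE * nptrs;
   }

   /* Map a data offset (pointers excluded) to a CTZ block index and an
    * offset inside that block's data area.
    */

   static void ctz_locate(uint32_t off, uint32_t *blk, uint32_t *boff)
   {
     uint32_t k = 0;

     while (off >= data_in_block(k))
       {
         off -= data_in_block(k);
         k++;
       }

     *blk  = k;
     *boff = off;
   }

   int main(void)
   {
     uint32_t blk;
     uint32_t off;

     ctz_locate(1000, &blk, &off);
     printf("data offset 1000 -> block %u, byte %u\n", blk, off);
     return 0;
   }
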
Journal
-------

The journal in mnemofs is made out of ``n + 2`` blocks. The last two blocks
concern the master node. These blocks are arranged in a singly linked list.

Due to the CoW policy, when a CTZ list is updated, it gets a new location.
The first ``n`` blocks of the journal are responsible for storing logs
containing information about this very update. Each log contains the old
location of the CTZ skip list, and the new location.

Thus, when the user requires the updated location of a CTZ list, they will
first find the old location by traversing the FS tree on the flash, and
then traverse the journal to find the latest location.

So, the FS tree on the flash acts like a "base" state, with updates stored
in the journal. Each log in the journal is followed by a checksum to verify
that all of it was written properly. This helps in making it power loss
resilient.

The journal, when combined with CoW, plays another important role. In pure
CoW, any update to a CTZ file will result in it having a new location. This
new location will need to be updated in the parent, which itself will have
a new location after the update, and so on till it reaches the root. The
journal stops this propagation immediately. When the journal is filled
above a certain limit, it is flushed, and all of these changes are applied
to the FS tree in one go. This helps in wear reduction.

The journal mainly works with the CTZ layer, and any update to a CTZ list
through that layer automatically adds a log for it to the journal.

The journal starts with a magic sequence, then the number of blocks in the
journal (excluding master blocks), and then an array of the block numbers
of the blocks in the journal (including the master blocks). Following this,
logs are stored in the blocks.

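Put together, the start of the journal's first block could look like this;
the field names and types are assumptions for illustration, not the real
on-flash format:

.. code-block:: c

   #include <stdint.h>

   struct jrnl_start
   {
     uint8_t  magic[8];   /* 8 byte magic sequence, searched for on mount */
     uint16_t n;          /* number of log blocks, master blocks excluded */

     /* Followed on flash by:
      *   - the n + 2 block numbers making up the journal
      *   - logs, each trailed by its checksum, filling the n log blocks
      */
   };
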
Master Node and Root
--------------------

The root of the file system is treated like any other directory as far as
its storage on the flash is concerned. This is because the master node acts
as a parent to the root, and contains information about the root in a way
identical to direntries.

The master node is stored in the master blocks. There are two master
blocks, and they are duplicates of each other for backup. Each master block
is a full block, and thus has multiple pages in it. Each page contains one
revision of the master node. The master nodes are written sequentially, and
have a timestamp on them as well.

When a CTZ list is moved to a new location, the obsolete pages of the old
CTZ list are marked for deletion.

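A sketch of what one master node revision (one page in a master block)
records, per the description above; the names and types are illustrative:

.. code-block:: c

   #include <stdint.h>

   struct master_node
   {
     uint32_t root_idx;   /* index of the last CTZ block of the root */
     uint32_t root_pg;    /* page number of that CTZ block           */
     uint32_t root_sz;    /* size of the root directory              */
     int64_t  ts;         /* timestamp; the newest revision wins     */
   };
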
LRU Cache
---------

mnemofs has a Least Recently Used (LRU) cache component. The main use of
this component is to reduce wear on the flash at the expense of memory.

The LRU is a kernel list of nodes. Each node represents an FS object. Each
node also contains a kernel list of deltas. Each delta contains information
about an update or a deletion requested by the user (which is what all of
the VFS write operations can be condensed to).

There is a pre-configured limit on both the deltas per node and the nodes
in the LRU.

If the delta list in a node is full, and another delta is to be added, all
the existing deltas in the list are clubbed together and written to the
flash using the CTZ layer. That layer also automatically adds a log for
this update.

When a node receives a delta, it is bumped from its current location in the
LRU to the front. This way, the last node in the LRU is always the least
recently used node.

If the node limit of the LRU is reached, and a new node is to be added, the
final node (which is also the least recently used node) is popped from the
LRU to make space for the new node. This popped node is then written to the
flash using the CTZ layer as well.

The LRU helps in clubbing together updates to a single FS object, and thus
helps in reducing the wear of the flash.

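The structures involved can be sketched as below; plain pointers stand in
for the kernel list macros, and all names are hypothetical:

.. code-block:: c

   #include <stdint.h>
   #include <stddef.h>

   struct delta
   {
     uint32_t      off;    /* file offset the change applies to */
     uint32_t      len;    /* number of bytes in the change     */
     uint8_t      *buf;    /* the new bytes (NULL for deletes)  */
     struct delta *next;
   };

   struct lru_node
   {
     struct delta    *deltas;   /* accumulated changes for one FS object */
     struct lru_node *next;
   };

   static struct lru_node *g_lru;   /* front = most recently used */

   /* Bump a node to the front on every new delta; the tail is then
    * always the least recently used node, the one popped and written
    * to flash when the LRU is full.
    */

   static void lru_bump(struct lru_node *node)
   {
     struct lru_node **pp = &g_lru;

     while (*pp != NULL && *pp != node)
       {
         pp = &(*pp)->next;
       }

     if (*pp != NULL)
       {
         *pp = node->next;   /* unlink from the current position */
       }

     node->next = g_lru;     /* push to the front */
     g_lru      = node;
   }
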
Journal Flush
-------------

The latest master node revision is the most useful of the revisions. Since
in CoW it's prudent to update the FS tree bottom-up, the root is the last
thing to be updated during a journal flush.

The logs are just location updates. So, when a journal flush occurs, it
updates the locations of all the children in the parent. This updates the
parent, and this update then goes through the same procedure as any other
update.

This is why it's best to start the flush operation when the journal is
filled above a certain limit, instead of waiting for it to be full. Why?
Any log of a parent makes any log of its children written **before** it
useless, as the updated location of the parent can be read to get the
updated locations of the children up to that point in the logs.

So, it is best to move up the FS tree from the bottom during an update, and
to update the root last, since the root is the parent of every FS object.

Once the root is updated, all other journal logs become useless, and can be
erased. The root's log is not written in the first ``n`` blocks of the
journal, but written as a new master node entry in the master blocks.

Once the new root is written, the first ``n`` blocks can be erased and
reallocated (due to the rules of wear levelling). The master blocks,
however, have some conditions for reallocation. This is called moving of
the journal.

Every time the first ``n`` blocks are cleared, a new master node is added.
The only time a master block needs to be erased is when it becomes full.
Thus, if there are ``p`` pages in a block, the master blocks will be moved
along with the rest of the journal once every ``p`` journal flushes.

Before the new master node is written, none of the old pages should be
erased, to allow rollback to the previous FS tree state. The moment the new
master node is written, any block which has all of its pages ready for
deletion is erased to make space.

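In outline, the flush could read as below; every helper here is
hypothetical pseudocode, not the mnemofs API:

.. code-block:: c

   struct log;                                  /* one journal entry      */

   extern int  depth_of(const struct log *log); /* distance from the root */
   extern void apply_to_parent(const struct log *log);
   extern void write_master_node(void);         /* publishes the new root */
   extern void erase_obsolete_blocks(void);     /* fully-dead blocks only */

   /* Apply logs bottom-up: deepest FS objects first, root last, so each
    * parent is rewritten only once per flush.
    */

   void journal_flush(struct log **logs, int nlogs, int max_depth)
   {
     int d;
     int i;

     for (d = max_depth; d >= 1; d--)
       {
         for (i = 0; i < nlogs; i++)
           {
             if (depth_of(logs[i]) == d)
               {
                 apply_to_parent(logs[i]);   /* makes a log one level up */
               }
           }
       }

     write_master_node();          /* the root's "log" is a master node  */
     erase_obsolete_blocks();      /* safe only after the new master     */
   }
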
FS Object Layer
---------------

This layer provides an abstraction for iterating over, adding, deleting and
reading direntries.

It works with the LRU and the journal to get the latest data, and thus the
user of this layer does not have to worry about these underlying mnemofs
components.

VFS Method Layer
----------------

The VFS method layer contains the methods exposed to the VFS. This layer
works with the FS Object Layer for direntry related tasks, and with the LRU
for file level read/write tasks.