docs/fs/mnemofs: Update new design and NAND flash structure
Add documentation about the new mnemofs design and the NAND flash structure.

Signed-off-by: Saurav <resyfer.dev@gmail.com>

The above command will only work if the device was already formatted using
mnemofs. For a brand new device, or if you want to switch from an existing
file system, this won't work, and would need a format::

    mount -t mnemofs -o forceformat /dev/nand /mydir

system after all... to hide the storage device's peculiarities behind an
abstraction. A file system is considered good if you don't have to think
about its existence during regular usage.

NAND Flashes
============

Programmatically, the NAND flash has some quirks. The whole device can be
condensed into three layers: blocks, pages and cells.

Cells represent the smallest unit of storage in NAND flashes, but are often
ignored, as direct access is not allowed. If a cell stores one bit, it's a
Single Level Cell (SLC). There are MLC, TLC, etc. for more bits per cell.
Often, the more bits per cell, the lower the wear resilience. Thus, cells
with more bits per cell wear out faster and become unreliable sooner.

Pages are the smallest readable or writable unit of the NAND flash. A page
is made up of several cells, and can be expected to have a size on the
order of 512 B.

Blocks are the smallest erasable unit of NAND flash. They are made up of
several pages. If a page is already written, it needs to be erased before
it can be written again. And since blocks are the smallest erasable unit,
the entire block needs to be erased even if the user wants to update the
contents of just one page.

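As an illustration of this constraint, below is a minimal sketch of what a
naive in-place update of a single page would cost. The ``PAGES_PER_BLOCK``
constant and the ``nand_*`` helpers are hypothetical, purely for
illustration::

    /* Naive in-place update of one page on a NAND flash: the whole
     * containing block must be read out, erased and rewritten.
     */

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE       512
    #define PAGES_PER_BLOCK 64

    extern void nand_read_page(uint32_t page, uint8_t *buf);
    extern void nand_write_page(uint32_t page, const uint8_t *buf);
    extern void nand_erase_block(uint32_t block);

    static uint8_t g_blkbuf[PAGES_PER_BLOCK * PAGE_SIZE];

    void update_page_inplace(uint32_t page, const uint8_t *new_data)
    {
      uint32_t blk  = page / PAGES_PER_BLOCK;
      uint32_t base = blk * PAGES_PER_BLOCK;

      /* Read every page of the block into RAM. */

      for (uint32_t i = 0; i < PAGES_PER_BLOCK; i++)
        {
          nand_read_page(base + i, g_blkbuf + i * PAGE_SIZE);
        }

      /* Apply the update in RAM. */

      memcpy(g_blkbuf + (page - base) * PAGE_SIZE, new_data, PAGE_SIZE);

      nand_erase_block(blk); /* This erase is what wears the block out. */

      /* Write every page back. */

      for (uint32_t i = 0; i < PAGES_PER_BLOCK; i++)
        {
          nand_write_page(base + i, g_blkbuf + i * PAGE_SIZE);
        }
    }

A Copy-on-Write design like the one described below avoids this pattern by
writing updated data to a fresh page instead.
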
The erase operation is what causes a block to wear out. If a block is worn
out too much, it will lose its ability to reliably store data. An
unreliable block cannot guarantee that the data read from its pages is the
same as what was written to them. Such a block is called a bad block.

A manufacturer can also deem a block to be unreliable from their testing,
and can mark it as a bad block right from manufacture.

A good file system will aim to level out the wear between blocks as much as
it can.

Design
======

There are various layers and components in mnemofs, and they interact with
each other through various levels of abstraction.

Mnemofs works on a Copy-On-Write (CoW) basis, which means that if a page
needs to be updated, it is copied over into memory, the change is applied
there, and the new data is written to a new location on the flash.

R/W Layer
---------

This layer works with the NAND flash device driver directly. It can write
an entire page, read an entire page, erase an entire block, check if a
block is bad (from its bad block marker), or set a block as bad. It's the
simplest layer.

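Conceptually, the surface of this layer can be summed up in a handful of
prototypes. The names and signatures below are illustrative assumptions,
not the actual mnemofs symbols::

    /* Hypothetical sketch of the R/W layer's interface. */

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>

    struct mfs_sb_s; /* Mount-wide state; contents omitted here. */

    ssize_t mfs_read_page(struct mfs_sb_s *sb, char *buf, size_t nbytes,
                          off_t page, off_t pgoff);
    ssize_t mfs_write_page(struct mfs_sb_s *sb, const char *buf,
                           size_t nbytes, off_t page, off_t pgoff);
    int     mfs_erase_blk(struct mfs_sb_s *sb, off_t blk);
    bool    mfs_blk_isbad(struct mfs_sb_s *sb, off_t blk);
    int     mfs_blk_markbad(struct mfs_sb_s *sb, off_t blk);

Every other component sits on top of operations of this shape.
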
Block Allocator
---------------

The block allocator contains two arrays. One is a bitmap for tracking the
free pages, while the other is an array of numbers, one number for each
block, denoting the number of pages in that block that are ready to be
erased.

The block allocator allocates pages or blocks in a sequential manner to
keep it fair for all pages, thus ensuring wear levelling. It also starts
from a random offset, to prevent bias towards the front of the device in
case of multiple power losses and reinitializations. If a whole block is
required, it skips pages up to the start of the next block. Since block
allocations happen only for the journal, they happen in bulk, and the
number of skipped pages is minimal.

Once it reaches the end of the device, the allocator cycles back to the
front. Thus any skipped pages get the chance to be allocated in the next
cycle.

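A minimal sketch of this allocator's state and of a single page allocation
is shown below; the structure and names are assumptions for illustration::

    /* Illustrative block allocator state and sequential allocation. */

    #include <stdbool.h>
    #include <stdint.h>

    struct mfs_ba_s
    {
      uint8_t  *pg_bmap;   /* One bit per page: 1 means in use.      */
      uint16_t *blk_delpg; /* Per block: pages ready to be erased.   */
      uint32_t  c_pg;      /* Next candidate page (random at start). */
      uint32_t  n_pgs;     /* Total pages on the device.             */
    };

    static bool ba_pg_inuse(const struct mfs_ba_s *ba, uint32_t pg)
    {
      return (ba->pg_bmap[pg >> 3] >> (pg & 7)) & 1;
    }

    /* Allocate one free page, scanning sequentially and wrapping
     * around, so skipped pages get a chance in the next cycle.
     */

    int ba_alloc_pg(struct mfs_ba_s *ba, uint32_t *pg_out)
    {
      for (uint32_t i = 0; i < ba->n_pgs; i++)
        {
          uint32_t pg = (ba->c_pg + i) % ba->n_pgs;
          if (!ba_pg_inuse(ba, pg))
            {
              ba->pg_bmap[pg >> 3] |= 1 << (pg & 7);
              ba->c_pg = (pg + 1) % ba->n_pgs;
              *pg_out  = pg;
              return 0;
            }
        }

      return -1; /* Device full. */
    }

A block allocation would be similar, but would first advance ``c_pg`` to
the next block boundary and then claim every page of that block.
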
CTZ Layer
---------

This layer works with the R/W Layer, and acts as an abstraction layer for
other components in mnemofs. Mnemofs uses
`CTZ skip lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
to represent both files and directories on the flash. CTZ lists of files
contain only the data of the file, while CTZ lists of directories contain
directory entries (direntries) for each FS object (file or directory)
grouped into them.

This layer abstracts away the complex division of flash space that's
present in CTZ skip lists, and allows users of this layer to not worry
about the complexities of a CTZ skip list, and in fact, to treat the data
as if it were one contiguous space.

This layer allows the user to specify a data offset, which refers to the
offset into the actual data stored in the CTZ skip list (i.e. excluding the
pointers), along with a number of bytes, and to perform operations on the
CTZ list almost as if it were a single array.

In mnemofs, each CTZ block takes up the space of exactly one page, and each
pointer takes up 4 bytes.

The littlefs design document shows how a CTZ list can be identified using
just the index of its last CTZ block and the page number of that CTZ block.

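In a CTZ skip list, block ``n`` holds pointers to blocks ``n - 2^0``,
``n - 2^1``, ..., ``n - 2^ctz(n)``, so seeking backwards through the list
takes a logarithmic number of hops. Below is a sketch of that backwards
walk, assuming a hypothetical ``ctz_read_ptr()`` helper that reads the
``k``-th pointer stored in a CTZ block::

    /* Walk from CTZ block 'from' (stored at page 'pg') back to block
     * 'to', taking the largest hop that does not overshoot each time.
     */

    #include <stdint.h>

    extern uint32_t ctz_read_ptr(uint32_t pg, unsigned k);

    uint32_t ctz_seek(uint32_t from, uint32_t pg, uint32_t to)
    {
      while (from > to)
        {
          unsigned k = __builtin_ctz(from); /* Largest hop is 2^k. */
          while (from - (1u << k) < to)
            {
              k--;
            }

          pg    = ctz_read_ptr(pg, k);
          from -= 1u << k;
        }

      return pg; /* Page storing CTZ block 'to'. */
    }
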
Journal
-------

The journal in mnemofs is made out of ``n + 2`` blocks. The last two blocks
concern the master node. These blocks are arranged in a singly linked list.

Due to the CoW policy, when a CTZ list is updated, it ends up at a new
location. The first ``n`` blocks of the journal are responsible for storing
logs containing information about this very update. A log contains the old
location of the CTZ skip list, and the new location.

Thus, when the user requires the updated location of a CTZ list, they will
first find the old location by traversing the FS tree on the flash, and
then traverse the journal to find the latest location.

So, the FS tree on the flash acts like a "base" state, with the updates
stored in the journal. Each log in the journal is followed by a checksum to
verify that all of it was written properly. This helps in making it power
loss resilient.

The journal, when combined with CoW, plays another important role. In pure
CoW, any update to a CTZ file will result in it having a new location. This
new location will need to be updated in the parent, which itself will have
a new location after the update, and so on till it reaches the root. The
journal stops this propagation immediately. When the journal is filled
above a certain limit, it will flush, and apply all of these changes to the
FS tree in one go. This helps in wear reduction.

The journal mainly works with the CTZ layer, and any update to a CTZ list
through that layer automatically adds a log for it in the journal.

The journal starts with a magic sequence, then the number of blocks in the
journal (excluding the master blocks), and then follows an array with the
block numbers of the blocks in the journal (including the master blocks).
Following this, the logs are stored in the blocks.

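Laid out as a C struct, the start of the first journal block might look
like the sketch below. The exact field widths are assumptions, though an
8-byte magic sequence and 4-byte block numbers are consistent with the
4-byte CTZ pointers mentioned above::

    /* Illustrative on-flash layout of the journal's first block. */

    #include <stdint.h>

    struct mfs_jrnl_hdr_s
    {
      uint8_t  magic[8]; /* Locates the journal start during mount.   */
      uint32_t n_blks;   /* 'n': log blocks, excluding master blocks. */

      /* Followed on flash by the n + 2 block numbers making up the
       * journal (uint32_t each, the last two being the master
       * blocks), and then by the logs, each trailed by a checksum.
       */
    };
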
Master Node and Root
--------------------

The root of the file system is treated like any other directory as far as
its storage on the flash is concerned. This is because the master node acts
as a parent to the root, and contains information about the root in a way
identical to direntries.

The master node is stored in the master blocks. There are two master
blocks, and they are duplicates of each other for backup. Each master block
is a full block, and thus has multiple pages in it. Each page contains one
revision of the master node. The master node revisions are written
sequentially, and carry a timestamp on them as well.

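Since the master node describes the root much like a direntry does, a
revision can stay small. The fields below are assumptions sketched from the
description above, not the actual mnemofs layout::

    /* Illustrative master node revision: it pins down the root CTZ
     * list, and the newest valid revision wins during mount.
     */

    #include <stdint.h>

    struct mfs_mn_s
    {
      uint32_t root_idx; /* Index of the root's last CTZ block.  */
      uint32_t root_pg;  /* Page number of that CTZ block.       */
      uint32_t root_sz;  /* Size of the root directory in bytes. */
      int64_t  ts;       /* Timestamp of this revision.          */
    };
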
When a CTZ list is moved to a new location, the obsolete pages of the old
CTZ list are marked for deletion.

LRU Cache
---------

Mnemofs has a Least Recently Used (LRU) cache component. The main use of
this component is to reduce wear on the flash at the expense of memory.

The LRU is a kernel list of nodes. Each node represents an FS object. Each
node also contains a kernel list of deltas. Each delta contains information
about an update or deletion requested by the user (which is what all of the
VFS write operations can be condensed to).

There's a pre-configured limit on both the deltas per node and the nodes in
the LRU.

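The shape of the cache is roughly the one sketched below; the names are
illustrative, and the kernel lists are shown as plain ``next`` pointers for
brevity::

    /* Illustrative LRU structures: a list of per-object nodes, each
     * holding a list of pending deltas.
     */

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct mfs_delta_s
    {
      struct mfs_delta_s *next;
      off_t    off;  /* Offset of the change in the file's data. */
      size_t   n;    /* Number of bytes updated or deleted.      */
      uint8_t *buf;  /* New data, or NULL to denote a deletion.  */
    };

    struct mfs_lru_node_s
    {
      struct mfs_lru_node_s *next;   /* Towards the LRU's tail. */
      struct mfs_delta_s    *deltas; /* Pending changes.        */
      size_t                 n_deltas;

      /* ...plus enough information (path, CTZ location) to identify
       * the FS object this node represents.
       */
    };
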
If the delta list in a node is full, and another delta is to be added, all
the existing deltas in the list are clubbed together and written to the
flash using the CTZ layer. That layer also automatically adds a log for
this update.

When a node receives a delta, it is bumped from its current location in the
LRU to the front. This way, the last node in the LRU is always the least
recently used node.

If the node limit of the LRU is reached, and a new node is to be added,
then the final node (which is also the least recently used node) is popped
from the LRU to make space for the new node. This popped node is then
written to the flash using the CTZ layer as well.

The LRU helps in clubbing together updates to a single FS object, and thus
helps in reducing the wear of the flash.

Journal Flush
-------------

The latest master node revision is the most useful of all the revisions.
Since in CoW it's prudent to update the FS tree from the bottom up, the
root is the last one to get updated in the case of a journal flush.

The logs are just location updates. So, when a journal flush occurs, it
will update the locations of all the children in their parent. This updates
the parent, and then that update goes through the same procedure as any
other update.

This is why it's best to start the flush operation when the journal is
filled above a certain limit, instead of waiting for it to be full. Why?
Any log of a parent makes any log of its children written **before** it
useless, as the updated location of the parent can be read to get the
updated location of the child till that point in the logs.

So, it is best to move up the FS tree from the bottom during the update,
and to update the root last, since the root is the ancestor of every FS
object.

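In outline, a flush might proceed as below; the helpers are hypothetical,
and only the ordering matters::

    /* Hypothetical outline of a journal flush, bottom-up. */

    extern int  fs_tree_depth(void);
    extern void apply_logs_at_depth(int depth);
    extern void write_new_master_node(void);
    extern void erase_obsolete_blks(void);

    void jrnl_flush(void)
    {
      /* Apply logs from the deepest FS objects upwards, so each
       * parent is rewritten once, with all of its children's new
       * locations folded in.
       */

      for (int depth = fs_tree_depth(); depth > 0; depth--)
        {
          apply_logs_at_depth(depth);
        }

      /* The root's update is not an ordinary log: it is written as
       * a new master node revision in the master blocks.
       */

      write_new_master_node();

      /* Only now, with rollback no longer needed, can blocks whose
       * pages are all obsolete be erased and reused.
       */

      erase_obsolete_blks();
    }
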
Once the root is updated, all other journal logs become useless, and can be
erased. The root's log is not written in the first ``n`` blocks of the
journal, but rather as a new master node entry in the master blocks.

Once the new root is written, the first ``n`` blocks can be erased and
reallocated (due to the rules of wear levelling). The master blocks,
however, have some conditions for reallocation. This is called moving the
journal.

Every time the first ``n`` blocks are cleared, a new master node is added.
The only time a master block needs to be erased is when it becomes full.
Thus, if there are ``p`` pages in a block, the master blocks will be moved
along with the rest of the journal once every ``p`` journal flushes.

Before the new master node is written, none of the old pages should be
erased, to allow rollback to the previous FS tree state. The moment the new
master node is written, any block which has all of its pages ready for
deletion will be erased to make space.

FS Object Layer
---------------

This layer provides an abstraction for iterating over, adding, deleting or
reading direntries.

It works with the LRU and the journal to get the latest data, and thus the
user of this layer does not have to worry about those underlying mnemofs
components.

VFS Method Layer
----------------

The VFS method layer contains the methods exposed to the VFS. This layer
works with the FS Object layer for direntry related tasks, and with the LRU
for file-level read/write tasks.

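As a rough illustration of this division of labour, a write funnels into
the LRU, while a direntry operation like unlink goes through the FS Object
layer. The function names here are illustrative, not the actual NuttX or
mnemofs symbols::

    /* Illustrative delegation from VFS-facing methods to the layers
     * described above.
     */

    #include <stddef.h>
    #include <sys/types.h>

    extern ssize_t lru_add_delta(const char *path, off_t off, size_t n,
                                 const char *buf);     /* LRU cache      */
    extern int     fsobj_rm_direntry(const char *path); /* FS Obj layer  */

    ssize_t mnemofs_write_sketch(const char *path, off_t off, size_t n,
                                 const char *buf)
    {
      /* write(2) becomes a delta in the LRU. */

      return lru_add_delta(path, off, n, buf);
    }

    int mnemofs_unlink_sketch(const char *path)
    {
      /* unlink(2) is a direntry-level task for the FS Object layer,
       * which consults the LRU and the journal underneath.
       */

      return fsobj_rm_direntry(path);
    }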