
AoC2022 Day7: No Space Left On Device

The Problem

Advent of Code 2022 – Day 7. This puzzle is about managing disk space on a device with a Unix-like filesystem. A system update needs to be installed on the device, but as the disk is nearing its capacity, there is not enough space left on it to run the update.

Our task is to find directories that are suitable candidates for deletion to make space for the update. To do this, we must browse the filesystem and determine the total size of each directory, which is the sum of the sizes of all the files it contains, either directly or indirectly through its subdirectories. Directories themselves take up no space, so they don’t contribute to the count.

Part One

For Part One, our goal is to identify directories with a total size of at most 100000, and compute the sum of their sizes. Let’s dive into the problem and explore the solution step by step using the example:

$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k

A Unix-like filesystem is a singly-rooted tree of directories. The root is denoted /; each directory can contain files and other directories¹. The filesystem in the example is a simplified version of such a system that supports two basic commands (a command starts with $):

  • cd (change directory): this command considers the current working directory as the starting point and moves to a relative directory, depending on the argument. With an argument of x it will move into the x subdirectory; with an argument of .. it will move up to the parent directory; and with an argument of / it will move to the root.
  • ls (list): this command lists all files and subdirectories in the current working directory. In the output, files show their size and their name; subdirectories show a dir prefix and no size (see the small classification sketch right after this list).
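
To make these line formats concrete, here is a tiny classification sketch of my own (the actual parsing happens later, inside build_fs_tree):

def classify(line: str) -> str:
    """Label an input line as a command or as part of an ls listing."""
    if line.startswith("$ cd"):
        return "cd command"
    if line.startswith("$ ls"):
        return "ls command"
    if line.startswith("dir "):
        return "directory entry"
    return "file entry"

assert classify("$ cd /") == "cd command"
assert classify("dir a") == "directory entry"
assert classify("14848514 b.txt") == "file entry"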

After applying these rules to the stream of commands & outputs in the example, the resulting tree should look like this:

- / (dir)
  - a (dir)
    - e (dir)
      - i (file, size=584)
    - f (file, size=29116)
    - g (file, size=2557)
    - h.lst (file, size=62596)
  - b.txt (file, size=14848514)
  - c.dat (file, size=8504156)
  - d (dir)
    - j (file, size=4060174)
    - d.log (file, size=8033020)
    - d.ext (file, size=5626152)
    - k (file, size=7214296)

This example filesystem has four directories: /, /a, /a/e, and /d:

  • /d: holds files j, d.log, d.ext, and k—total size: 24933642.
  • /a/e: holds file i—total size: 584.
  • /a: holds files f, g, h.lst, and directory /a/e—total size: 94853.
  • /: the root, which holds b.txt, c.dat, and the directories /a and /d—total size: 48381165.

The answer for Part One is “the sum of the sizes of directories with a total size of at most 100000”. That is to say, the sum of the sizes of directories /a and /a/e: 94853 + 584 = 95437 (yes, this implies that the same file can be counted more than once, as is the case here: file i contributes to both /a/e and /a).

Generalizing, the two main subproblems we’ll have to solve are: (1) building the tree, and (2) traversing it (to total the size of each directory).

Solution Design

To build the tree, we’ll need to devise an algorithm that parses the input stream into a tree data structure. Trees are sparse by definition: a tree with N nodes has exactly N − 1 edges, so even if the number of directories (i.e. nodes) on a disk is vast, the number of parent-child relationships (i.e. edges) grows only linearly with it. Given this characteristic, we’ll be using an adjacency list to efficiently store the tree.

The adjacency list for the example looks like this: {"/": ["/a", "/d"], "/a": ["/a/e"]}. That is to say, node / has an edge to node /a and another one to node /d; and node /a has an edge to node /a/e. In Python terms, the structure is of type dict[str, list[str]].

Note that each node’s label must be globally unique: a node labeled simply e, for instance, is ambiguous, since it doesn’t specify its parent nodes. To resolve this ambiguity, we must use the entire path from the root to label nodes—this guarantees that any and all nodes named e have their own space in the adjacency list, e.g. /a/e, or /b/e, or /z/b/d/.../e.
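
For instance, joining a directory entry’s name onto its parent’s full path keeps the labels distinct (a small illustration with hypothetical directory names):

import os

# Two unrelated directories that both happen to be named "e" (hypothetical example).
print(os.path.join("/a", "e"))  # /a/e
print(os.path.join("/b", "e"))  # /b/e  -> two distinct keys in the adjacency list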

At this point, there is another design choice to be made for the adjacency list. Since the ultimate goal is to traverse the tree to compute the size of each directory, we also need to store the node sizes somewhere. One choice could be inside the same structure, something like this:

from typing import Any

adj_list_w_sizes: dict[str, dict[str, Any]] = {
    "/": {"children": ["/a", "/d"], "size": 23352670},
    "/a": {"children": ["/a/e"], "size": 94269},
    "/a/e": {"children": [], "size": 584},
    "/d": {"children": [], "size": 24933642},
}

I think that’s a bit bloated though, so I’ll be keeping track of sizes separately. Note that these are the directories’ direct sizes (the sum of the files sitting immediately inside each one); the recursive totals will be computed later during traversal:

adj_list: dict[str, list[str]] = {"/": ["/a", "/d"], "/a": ["/a/e"]}
node_sizes: dict[str, int] = {
    "/": 23352670,
    "/a": 94269,
    "/a/e": 584,
    "/d": 24933642,
}

Finally, to traverse the tree, rather than reinventing the wheel, we’ll be using DFS (depth-first search), if for no other reasons than its simplicity and recursive elegance. I won’t go into too much detail regarding tree traversal algorithms, given the extensive literature on the topic.

Building the Tree

First let’s review the code and then discuss it.

import os

def cd(pwd: str, to: str) -> str:
    """Change from current work dir to target dir."""
    match to:
        case "..":
            return os.path.dirname(pwd)
        case _:
            return os.path.join(pwd, to)
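
# A few illustrative checks of the helper (my own additions). Note that the
# "case _" branch also covers "$ cd /": os.path.join discards everything before
# an absolute component, so joining with "/" resets the path to the root.
assert cd("", "/") == "/"        # the initial "$ cd /" lands at the root
assert cd("/", "a") == "/a"      # "$ cd a" moves into a subdirectory
assert cd("/a/e", "..") == "/a"  # "$ cd .." moves up to the parent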

from collections import defaultdict
from typing import DefaultDict

from src.utils import read_data_lazily

DAY: int = 7

def build_fs_tree(
    example: bool = False,
) -> tuple[dict[str, list[str]], dict[str, int]]:
    """Build filesystem tree from a stream of commands."""
    pwd: str = ""
    dir_size: int = 0
    dir_sizes: dict[str, int] = {}
    last_cmd: str | None = None
    tree: DefaultDict[str, list[str]] = defaultdict(list)
    for line in read_data_lazily(DAY, example):
        if line.startswith("$ cd"):
            cmd, to = line.split()[1:3]
            if last_cmd == "ls":  # End of ls output stream.
                dir_sizes[pwd], dir_size = dir_size, 0
            pwd = cd(pwd, to)
        elif line.startswith("$ ls"):
            cmd = "ls"
            dir_size = 0  # Avoid double counting.
        elif line[0].isdigit():
            dir_size += int(line.split()[0])
        else:
            _, dir_name = line.split()
            tree[pwd].append(cd(pwd, dir_name))
        last_cmd = cmd
    dir_sizes[pwd] = dir_size  # Handle EoF.
    return tree, dir_sizes

Let’s go over build_fs_tree. First we have read_data_lazily, which uses Python generators to read the input data one line at a time. This saves memory and is a great tool to use for unbounded, streaming datasets. Each line in the input stream falls into one of four cases:

  1. A $ cd x command => the first thing we want to do is mark the current command with cmd = "cd". Then we check whether last_cmd was ls; if so, we must be done reading its output, so we add dir_size to dir_sizes and reset the counter to 0. Only then do we change directories using the cd subroutine.
  2. An $ ls command => then we mark the current command with cmd = "ls" and set dir_size to 0. Resetting the counter is important because it guarantees that each time ls is issued, we start counting from 0.
  3. The (partial) output of an $ ls command, a file => then the size of the file is added to the dir_size counter.
  4. The (partial) output of an $ ls command, a directory => then we add the subdirectory to the tree adjacency list as a child of the present working directory. Note that I’m using defaultdict(list), so that if the current working directory is not yet tracked, and we access it, the value defaults to an empty list, which we can append to right away.

At the end of each iteration we take care of bookkeeping with last_cmd = cmd, so that the current command becomes the last command for the next iteration; this is how we can tell when the output of an ls is over. After the loop ends, there is no incoming input line to trigger this check, so we offload the final size counter manually via dir_sizes[pwd] = dir_size.
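
For completeness: build_fs_tree relies on read_data_lazily from src.utils, which isn’t shown in this post. Below is a minimal sketch of what such a generator could look like, assuming a simple day-numbered file layout (the path scheme and exact signature are my guesses, not the real helper):

from collections.abc import Iterator

def read_data_lazily(day: int, example: bool = False) -> Iterator[str]:
    """Yield the day's input lines one at a time instead of loading the whole file."""
    # Assumed layout: data/day07.txt for the real input, data/day07_example.txt for the example.
    path = f"data/day{day:02d}{'_example' if example else ''}.txt"
    with open(path) as fh:
        for line in fh:
            yield line.rstrip("\n")

With the example file in place, build_fs_tree(example=True) reproduces exactly the adj_list and node_sizes dictionaries from the design section.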

Traversing the Tree

def DFS_recursive(
    graph: dict[str, list[str]],
    sizes: dict[str, int],
    start: str = "/",
    visited: list[str] | None = None,
) -> dict[str, int]:
    """Traverse graph with DFS. Add up sizes while backtracking to root."""
    if visited is None:  # Avoid sharing a mutable default list across calls.
        visited = []
    visited.append(start)
    # print(start, end=" ")
    for child in graph.get(start, []):  # Leaf dirs may have no entry in the adjacency list.
        if child not in visited:
            sizes = DFS_recursive(graph, sizes, child, visited)
            sizes = {**sizes, start: sizes[start] + sizes[child]}
    return sizes
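
# The recursion adds each child's total onto its parent while unwinding. As a quick
# check (my own illustration), feeding it the adjacency list and direct sizes from
# the design section reproduces the totals from the example walkthrough:
example_tree = {"/": ["/a", "/d"], "/a": ["/a/e"]}
example_sizes = {"/": 23352670, "/a": 94269, "/a/e": 584, "/d": 24933642}
print(DFS_recursive(example_tree, example_sizes))
# {'/': 48381165, '/a': 94853, '/a/e': 584, '/d': 24933642}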

Finally, we can filter by the size requirement, at most 100000.

MAX: int = 100000

dfs_sizes = DFS_recursive(*build_fs_tree())
print(f"P1: {sum(v for (_, v) in dfs_sizes.items() if v <= MAX)}")
# P1: 1443806
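
# Sanity check (my own addition): applying the same filter to the example totals
# keeps /a (94853) and /a/e (584), which sum to 95437, as in the walkthrough above.
example_totals = {"/": 48381165, "/a": 94853, "/a/e": 584, "/d": 24933642}
assert sum(v for v in example_totals.values() if v <= MAX) == 95437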

Part Two

For Part Two, we’re given two new parameters: the total disk capacity, 70000000, and the free space required to run the update, 30000000. We’re then asked to find the smallest directory that, if deleted, would allow the update to go through.

We can break this problem in two: first, determine the full set of directories that, if deleted, would free enough space for the update to run; then choose the smallest of them.

def select_deletable_dirs(
    dir_sizes: dict[str, int],
    disk_cap: int = 70000000,
    free_space_req: int = 30000000,
) -> list[tuple[str, int]]:
    """Return dirs which if deleted, space reqs are met."""
    disk_free = disk_cap - dir_sizes["/"]  # Root contains all.
    min_to_del = free_space_req - disk_free
    return [(k, v) for (k, v) in dir_sizes.items() if v >= min_to_del]
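
# Worked example (my own illustration) on the example filesystem's totals:
#   disk_free  = 70000000 - 48381165 = 21618835
#   min_to_del = 30000000 - 21618835 =  8381165
# Only / (48381165) and /d (24933642) clear that bar, and /d is the smaller one.
example_totals = {"/": 48381165, "/a": 94853, "/a/e": 584, "/d": 24933642}
print(select_deletable_dirs(example_totals))
# [('/', 48381165), ('/d', 24933642)]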

After retrieving the qualifying directories with select_deletable_dirs, it suffices to sort them by size in ascending order and pick the first one in the list.

deletable_dirs = select_deletable_dirs(dfs_sizes)
print(f"P2: {sorted(deletable_dirs, key=lambda x: x[1])[0]}")
# P2: ('/jssbn/tlhttrgs/zjghthcb/lfrctthp', 942298)

Notes


  1. This is not technically correct. Unix directories do not contain files themselves, but rather their names and references to so-called inodes, which in turn hold each file’s metadata (owner, permissions, time of last access, etc., but no name) and point to its data. Directories are implemented as special files that contain lists of other files and their inodes. An in-depth discussion can be found here: What are directories, if everything on Linux is a file? ↩︎