basically what I learned while making catsh
Till the point I got to implementing multi-cmd pipes I had a much uglier implementation, but just for the pipes feature I refactored the whole thing and wrote it cleanly in a model that makes more sense for pipes — I think such rewrites are very much a part of making any software, there’s never a good v1.
posix part
int fd[2];
pipe(fd);- populates
fdwith file descriptors for the read and write ends. - note that this is “UNIDIRECTIONAL” and meant to be used as such (even though you can kind of do fancy stuff with forked processes both having access to the same pipe)
A file description is the actual kernel resource of the file/pipe/whatever.
A file descriptor is a handle to refer to a file description.
- non-negative integer
- each process has own set, stored in
/proc/<pid>/fd/aka the “file descriptor table” - these ints can be reused (like 1 is stdout for a lot of processes)
- same process can have multiple fds pointing to the same file description
a two-command pipe
It took me a while to build an intuition for the dup2 sys call.
You need to get rid of analogies and the “wordies” like applies or copies src to dest and think instead in terms of “file descriptors” and “file descriptions”.
The man page has it much better written.
dup2(oldfd, newfd) makes newfd point to the same file description as oldfd.
this above is the better and technically correct model for it.
So now for a two-command pipe
stdin -> 0
stdout -> 1
pipe[0] = read end
pipe[1] = write end
Note that these dup calls are inside the forked children
for cmd1
make the stdout of cmd1 forked the same description as pipe[1]
-> dup2(pipe[1], stdout)
make the stdin of cmd2 forked the same description as pipe[0]
-> dup2(pipe[0], stdin)
then close the pipe fds you no longer need in each process
so if you think of this as a graph
it makes a directed edge from the 2nd arg to the first arg
cmd1
stdin (same as parent)
stdout -> pipe[1]
cmd2
stdin -> pipe[0]
stdout (same as parent)extension to multi-command
for k cmds there are k-1 pipes and you join ends
in in the fork must use same description as
cmd 0:
dup2(pipe[0][1], STDOUT_FILENO)
cmd i, where 0 < i < k - 1:
dup2(pipe[i - 1][0], STDIN_FILENO)
dup2(pipe[i][1], STDOUT_FILENO)
cmd k - 1:
dup2(pipe[k - 2][0], STDIN_FILENO)Idea being the middle k-2 cmds need to have both their in and outs remapped
but the ends just need one remap so you do dup2 calls
code flow
implementing this in an abstracted manner was a challenge
I wanted a high-level wrapper to which I can just give the descriptors I want
the stdin, out and err to refer to and it does the forks, dups, closes and exec with
C++ style vector<string> args.
This is where the fork_and_exec and exec (my exec) come in.
The POSIX style of forking and modifying the child is then followed automatically.
There are basically 3 different executions in this
- builtin (same shell)
- builtin (subshell)
- usual fork and exec of a binary
Each has a slightly different flow.
builtin (same shell)
just calling the builtin function is trivial, what’s of note is the fd remapping for supporting redirections and then reverting them back.
To handle this cleanly there is RAII-based ScopedFDRedirect that takes in the 3
stdfds and internally maps and unmaps them back.
so this just becomes
fds = what stdin,out and err are remapped to
{
sutils::ScopedFDRedirect _{fds};
call the builtin
}builtin (subshell)
this does the same thing as above except in the forked shell process
usual fork and exec of a binary
This uses the fork_and_exec and that forks and handles the fd redirections internally
RAII for file-based FDs (for redirects)
Before control goes to execution via any of the 3 ways above, for redirects I need to open a file and get the descriptor of it (to remap stdin/out/err as needed).
FileFd is just a scoped RAII-based class for it, ctor takes in the path and automatically closes
on going out of scope.
background jobs
just some POSIX bits.
bool done = waitpid(j.pid, &status, WNOHANG) > 0 && WIFEXITED(status);
read the man too but WNOHANG will not wait and if retval 0 that means process
is running still, the second part WIFEXITED are POSIX macros to query the
status.
what’s broken (zombies)
a bit on putting pipelines to background → I treat it as an equivalent as waiting on the pid of the last cmd in pipeline.
when running in foreground, I do wait on all pids though.
now this feels like it’s all working fine but the bg handling is wrong.
if you don’t call waitpid then even if the process exited cleanly, unless
the parent “reaps” (aka calls waitpid) an entry remains in the process
table ⇒ such a state is called a “zombie process”. It ofc does
not take as much resource as a usual running process but is a leak
nonetheless.
so currently the shell’s pipeline when run in bg WILL create zombies
I haven’t fixed because the ideal way to fix is via process groups instead of an ugly vector of pids being maintained but someday I guess.
cpp bits
just making a note of overall stuff used
- RAII usage clearly simplified a lot of the code
- very heavy use of ranges and views too
- std::filesystem and related funcs
- std::optional (I don’t use it enough)
- when needing C-based pointers for arrays and strings, put them in a
std vector/array and string and then use
.data. Ofc, then you need to ensure they don’t go out of scope while their pointers are still being used. - try to use spans and string_views all the way once the initial data is stored
in some top-level function. They are non-owning, incredibly cheap to do
subranges on and to pass around
- but mutating will mutate original (you can pass as const)
- is better than passing references / const refs around for subrange semantics
- you need to ensure the backing original data remains in scope
- can just build actual objects as and when needed in local scope
- personal preference to be honest and I don’t see the downsides (yet)
std::less<>is not the same asstd::less<infer type>. It’s actuallystd::less<void>. What this allows is “transparent comparator” so now for a map of strings,string_viewswork.- to avoid abusing OOP as just a way to “organise” loosely related vars and functions, it’s better to use namespaces to group them and leave classes to where it actually makes sense to use them.
nice ranges and views snippets
auto lcs = matches.front();
for (const auto &match : matches | std::views::drop(1)) {
auto [common_end, _] = std::ranges::mismatch(lcs, match);
lcs.erase(common_end, lcs.end());
}auto matches = candidates | std::views::filter([&](const auto &candidate) {
return candidate.starts_with(prefix);
}) |
std::ranges::to<std::vector>();POSIX bits
readjust puts n bytes and does not place any null terminator. cpp will not complain if you dostring s += char bufferbut that’s plain wrong as it will read memory well beyond.closeon fds is essential. a read end will be blocked until the write end is closed.- by default the terminal is what is called as “cooked mode” which means as I
type the chars won’t hit the program at all, the kernel will buffer them for me and only send them to the program once I press enter. This is where the
whole
WithRawModecomes in. Note that these settings are applied on the device itself and not the process or fd but the fd is just used to find the device since that’s all that the process has, an fd.