body is a small bash script that replaces a two-shot command I find myself running from time to time: head -n5000 file.txt | tail -n1. Basically, I want to view approximately the middle of a file. I realized there must be a better way, so I wrote a small script that does what I want: it's not head, it's not tail, it's body: for when you want to print (around) the middle of a file.
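The 5000 there is just half of that particular file's line count; in general, you have to count the lines first and halve the count, something like this (file.txt is a placeholder):

lines=$(wc -l < file.txt)                   # count the file's lines
head -n $((lines / 2)) file.txt | tail -n1  # print the middle one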
The source code is available at https://github.com/megamansec/body.
I ended up fitting the script out with a bit more functionality, more in line with what I would usually use head | tail for. Namely, the following flags are available:
Context control:
  -A NUM        print NUM lines of trailing context
  -B NUM        print NUM lines of leading context
  -C NUM        print NUM lines of output context
  --color=WHEN  use markers to highlight the matching strings;
                WHEN is 'always', 'never', or 'auto'
Output control:
  -n            do not print line number with output lines
  -N            do not print file name with output lines
These flags generally follow their usage in GNU grep: -A prints lines after the middle line, -B prints lines before it, and -C prints lines in both directions. --color is supported similarly to grep. If multiple files are passed to body, their filenames are shown in the results, which -N can disable. All results include line numbers unless -n is specified, in which case no line numbers are printed.
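As a quick illustration (the file names here are just placeholders), typical invocations look like:

body -C 3 /var/log/auth.log    # middle line plus three lines of context on each side
body -n notes.txt              # middle line, without the line number prefix
body -N file1.txt file2.txt    # middle of each file, without file name prefixes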
How does it actually work? There are a million ways to get the middle contents of a file (middle measured in newlines, not bytes), and I wasn't sure which would be fastest from a simple bash script. So, I benchmarked. On a Linux server, I ran the following:
#!/bin/bash
file="/var/log/auth.log.1"

# Run a command 20 times, dropping the page cache between runs
# (requires root), and print the average time per run.
benchmark_command() {
    local start_time=$(date +%s.%N)
    for i in {1..20}; do
        eval "$1" >/dev/null
        echo 3 > /proc/sys/vm/drop_caches
    done
    local end_time=$(date +%s.%N)
    local elapsed_time=$(echo "$end_time - $start_time" | bc -l)
    local average_time=$(echo "$elapsed_time / 20" | bc -l)
    printf "%.6f " "$average_time"
}

echo 3 > /proc/sys/vm/drop_caches

# 1: head | tail
benchmark_command "lines=\$(wc -l < $file); lines=\$((lines/2)); head -n\$lines $file | tail -n1"
# 2: cat -n | head | tail
benchmark_command "lines=\$(wc -l < $file); lines=\$((lines/2)); cat -n $file | head -n\$lines | tail -n1"
# 3: sed
benchmark_command "lines=\$(wc -l < $file); lines=\$((lines/2)); sed \$lines,\$lines'!d;=' $file | sed 'N;s/\\n/ /'"
# 4: awk
benchmark_command "lines=\$(wc -l < $file); lines=\$((lines/2)); awk -v n=\$lines 'NR==n{print NR\" \"\$0}' $file"
# 5: cat -n | sed
benchmark_command "lines=\$(wc -l < $file); lines=\$((lines/2)); cat -n $file | sed -n \$lines,\${lines}p"
echo
I ran that script 20 times and averaged the results:
for i in {1..20}; do bash /tmp/test; done | awk '{
for (i = 1; i <= NF; i++) {
sum[i] += $i
count[i]++
}
}
END {
for (i = 1; i <= NF; i++) {
printf "avg %d: %.2f\n", i, sum[i] / count[i]
}
}'
avg 1: 0.12   # head | tail
avg 2: 0.14   # cat -n | head | tail
avg 3: 0.13   # sed
avg 4: 0.17   # awk
avg 5: 0.17   # cat -n | sed
Perhaps unsurprisingly, using head | tail was the fastest. However, I wanted to print line numbers, so that wouldn't work. Piping the output of cat -n (which prefixes each line of a file with its line number) into head | tail was also quick, but the quickest of the line-numbering approaches was using sed directly on the file. So in its simplest form, body runs this:
sed "$start_line,$end_line"'!d;=' "$filename" | sed "N;s/\n/:/"
In reality, the speed of these commands doesn't matter at all; they're all going to be fast either way, but it's interesting to see which is fastest. The ranking also likely depends on the file size.
I imagine this could be an excellent optimization challenge in a competition where people compete for the lowest processing time for this kind of functionality (like The One Billion Row Challenge), and I'm sure there's some cool bit-twiddling that could really shine. But in the meantime, I'm happy with my bash script.