
Self Extracting Tar Files


It has long been common practice to append a blob of data to the end of a shell script to build a self-extracting tar file. I've used various methods for this over the years, but I recently started seeing Segmentation Faults when creating files over 2 GiB.

The process I was using when I hit the segfaults was kind of neat because it kept the shell script 100% text by base64-encoding the embedded data in the script:


# Create test data to tar up
truncate -s 3G bigfile

# Name of the generated script (placeholder name; choose your own).
OUT=selfextract.sh

# ## Create the header of a self-extracting script.
# This header will use a _Here document_ (multi-line
# input) to inject the base64 encoded data.
cat > "$OUT" <<SH_EOF
#!/bin/sh
base64 -d <<TAR_EOF | tar -xf -
SH_EOF

# Tar the data, base64, and append to the header above.
tar -cf - bigfile | base64 -w 72 >> "$OUT"

# Add the end marker of the Here document.
cat >> "$OUT" <<SH_EOF
TAR_EOF
SH_EOF

# Make our self-extracting shell script executable.
chmod +x "$OUT"

OK, great! It's clean because it's all printable text, and we know the TAR_EOF marker cannot show up in the data because it contains a _, which is not part of the base64 alphabet. The problem is that once the data goes over 2 GiB (I'm assuming the limit is literally 2^31 bytes), running the shell script ends in a Segmentation Fault!
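The claim about the marker can be spot-checked. The snippet below is a minimal sketch, assuming GNU coreutils (`base64 -w`) and `grep`; it encodes 1 MiB of random bytes and counts the encoded lines that contain an underscore. The variable name `underscore_lines` is made up for this example.

```shell
# Encode 1 MiB of random data the same way the script does and count
# how many encoded lines contain "_". The standard base64 alphabet is
# A-Z, a-z, 0-9, "+", "/" and "=", so no line should match.
# grep -c exits non-zero when nothing matches, hence the "|| true".
underscore_lines=$(head -c 1048576 /dev/urandom | base64 -w 72 | grep -c '_' || true)
echo "encoded lines containing '_': $underscore_lines"
```

Since no encoded line can contain an underscore, no line of the payload can equal TAR_EOF and end the Here document early.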

In troubleshooting this, I've ruled out base64 and tar as the culprits. While I don't have evidence from the code for this, I suspect that a Bash Here document in a script can only handle up to 2 GiB of data. (... more to investigate later.)
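One way to take base64 and tar out of the picture is to run the same round trip through a plain pipe, with no Here document involved. This is only a sketch of such a check, not necessarily how it was done for this post. `SIZE`, `workdir` and `listing` are names invented here, and the size defaults to something small so it runs quickly.

```shell
# Round-trip a tar archive through base64 using only pipes.
# Set SIZE=3G to exercise the large-payload case from the article.
SIZE=${SIZE:-1M}
workdir=$(mktemp -d)
truncate -s "$SIZE" "$workdir/bigfile"

# Archive, encode, decode, and list the archive contents again.
listing=$(tar -C "$workdir" -cf - bigfile | base64 -w 72 | base64 -d | tar -tf -)
echo "archive members after round trip: $listing"
```

If this still lists bigfile with `SIZE=3G`, the encode and decode steps cope with the large payload on their own, which points at how the script feeds the data in.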


Another technique for self-extracting shell scripts is to put a marker line ahead of the data and have sed use it as an EOF marker for the script portion.


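# Create test data to tar up
truncate -s 3G bigfile

# Name of the generated script (placeholder name; choose your own).
OUT=selfextract.sh

# ## Create the header of a self-extracting script.
# Here we use an end of file marker (#EOF#)
cat > "$OUT" <<SH_EOF
#!/bin/sh
sed '0,/^#EOF#$/d' "\$0" | tar zx; exit 0
#EOF#
SH_EOF

# Tar and gzip the data (the header extracts with tar zx),
# then append it after the marker line.
tar -cz bigfile >> "$OUT"

# Make our self-extracting shell script executable.
chmod +x "$OUT"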

This is better than the base64 case because it doesn't segfault. But there is still something that bothers me about this solution. When I am embedding gigabytes of data, it becomes more likely that the byte sequence #EOF# shows up somewhere in the payload. Is there a way to eliminate this edge case?
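A payload can be checked for the marker before it is embedded. The snippet below is a rough sketch, assuming GNU `grep`; the `payload` file and its contents are invented for the demonstration.

```shell
# Build a small stand-in payload that happens to contain the marker line.
payload=$(mktemp)
printf 'some data\n#EOF#\nmore data\n' > "$payload"

# -a treats binary input as text, -F matches a fixed string, -x requires
# the whole line to match, -c prints the number of matching lines.
hits=$(LC_ALL=C grep -a -c -F -x '#EOF#' "$payload" || true)
echo "marker lines found in payload: $hits"
```

Because the sed pattern is anchored with ^ and $, only a line consisting of exactly #EOF# counts as a marker, which is why the check uses `-x`.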



# Create test data to tar up
truncate -s 3G bigfile

# Name of the generated script (placeholder name; choose your own).
OUT=selfextract.sh

# ## Create the header of a self-extracting script.
# Here we skip a fixed number of header lines instead of
# searching for a marker.
cat > "$OUT" <<SH_EOF
#!/bin/sh
sed '1,3d' "\$0" | tar x ; exit 0
# Verbatim tar data following this 3rd line.
SH_EOF

# Tar the data and append it to the header above.
tar -c bigfile >> "$OUT"

# Make our self-extracting shell script executable.
chmod +x "$OUT"

This is probably a good sweet spot. It uses sed to stream the embedded data out to tar, but instead of relying on a marker that could potentially show up in the payload, we explicitly tell sed to drop the top 3 lines of the script and treat everything after them as embedded data. The only thing to remember is that the line count in the sed command has to match the length of the header. This is by far the cleanest way to handle this in a repeatable manner.
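To gain some confidence in the line-count approach, it can be exercised end to end on a small payload. The following is a sketch under a few assumptions: GNU sed and tar, temporary directories from `mktemp`, and invented names (`workdir`, `outdir`, `OUT`, `want`, `got`). It also uses `tar -xf -` rather than the bare `tar x` above, so that tar reads standard input regardless of its built-in default archive.

```shell
workdir=$(mktemp -d)             # where the script is built
outdir=$(mktemp -d)              # empty directory to extract into
OUT="$workdir/selfextract.sh"    # hypothetical output name

# Small random stand-in payload; the article uses a 3G sparse file.
head -c 65536 /dev/urandom > "$workdir/bigfile"

# Three-line header: shebang, extraction command, comment.
cat > "$OUT" <<SH_EOF
#!/bin/sh
sed '1,3d' "\$0" | tar -xf - ; exit 0
# Verbatim tar data following this 3rd line.
SH_EOF

# Append the raw tar stream directly after the header.
tar -C "$workdir" -cf - bigfile >> "$OUT"
chmod +x "$OUT"

# Run the script from the empty directory and compare checksums.
(cd "$outdir" && sh "$OUT")
want=$(cksum < "$workdir/bigfile")
got=$(cksum < "$outdir/bigfile")
echo "original:  $want"
echo "extracted: $got"
```

Matching checksums indicate that sed passed the binary tar stream through unchanged after dropping the three header lines.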
