Tuesday, September 28, 2010

A Shell Script to Find and Remove the BOM Marker

Edited:
As Omri pointed out, the script fails on OS X, apparently because of an idiosyncrasy in Apple's sed implementation. As a temporary fix, the script now switches from sed to perl on OS X: perl also ships with OS X by default, so this shouldn't cause any problems. However, on OS X this version of the script always scans the entire file, not just the first line as it does with other sed implementations.

Introduction

Have you ever seen these characters while dumping the contents of some of your text files?



If you have, you found a BOM marker! The BOM marker is a Unicode character with code point U+FEFF that specifies the endianness of a Unicode text stream.

Since Unicode characters can be encoded as multibyte sequences with a specific endianness, and since different architectures may adopt different endianness, it's fundamental to tell the receiver the endianness of the data stream being sent. Dealing with the BOM, then, is part of the game.

If you want to know more about when to use the BOM you can start by reading this official Unicode FAQ.

This post has been modified to solve some problems and improve the script according to your comments:

  • Solved some mktemp inconsistencies across UNIX flavours (such as Solaris and Mac OS X).
  • Files can now be filtered by extension using the -e option, as suggested by Goldan.
  • BOM can be removed throughout the file using the -a option, as suggested by Goldan.
  • An arbitrary number of files can be safely passed as a parameter.
  • The script behaves correctly even with file names that contain whitespace.
Safe Harbour Statement: I try to test the script on as many systems as I can, but I can't guarantee that it works correctly on yours. I'd be glad to hear your feedback: any suggestion or bug report is appreciated.

UTF-8

UTF-8 is one of the most widely used Unicode character encodings in software and protocols that have to deal with textual data streams. UTF-8 represents each Unicode character with a sequence of 1 to 4 octets. Each octet contains control bits that are used to identify the beginning and the length of an octet sequence. The Unicode code point is simply the concatenation of the non-control bits in the sequence. One of the advantages of UTF-8 is that it retains backwards compatibility with ASCII in the ASCII [0-127] range, since such characters are represented with the same octet in both encodings.
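
To make the bit layout concrete, here is a minimal sketch that encodes a code point in the three-octet range (U+0800 to U+FFFF) using shell arithmetic. The function name utf8_encode_3 is made up for this illustration, and the sketch deliberately ignores the one-, two- and four-octet cases.

```shell
#!/bin/bash
# Hypothetical helper: encode a code point in U+0800..U+FFFF as three UTF-8 octets.
# Layout: 1110xxxx 10xxxxxx 10xxxxxx (the x bits are the code point bits).
utf8_encode_3() {
  local cp=$(( $1 ))
  printf '%02X %02X %02X\n' \
    $(( 0xE0 | (cp >> 12) )) \
    $(( 0x80 | ((cp >> 6) & 0x3F) )) \
    $(( 0x80 | (cp & 0x3F) ))
}

# The BOM code point, U+FEFF
utf8_encode_3 0xFEFF
```

For U+FEFF this prints EF BB BF, which is the octet sequence the rest of this post deals with.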

If you feel curious about how the UTF-8 encoding works, I've written an introductory post about it.

Common Problems
Because of its design, the UTF-8 encoding is not endianness-sensitive, and the Unicode standard neither requires nor recommends using the BOM with it. Unfortunately some common utilities, notably Microsoft Notepad, keep adding a BOM to your UTF-8 files, thus breaking those applications that aren't prepared to deal with it.

Some programs could, for example, display the following characters at the beginning of your file:



A more serious problem is that a BOM will break a UNIX shell script by interfering with the shebang (#!).

A Shell Script to Check for BOMs and Remove Them

The Byte Order Mark (BOM) is a Unicode character with code point U+FEFF. Its UTF-8 representation is the following sequence of 3 octets:

1110 1111 1011 1011 1011 1111
E    F    B    B    B    F
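
Knowing the three octets, checking whether a file starts with a BOM is a matter of looking at its first three bytes. The following is only a sketch: has_bom is a name chosen for this example, and it assumes your head supports the -c option (the GNU and BSD versions do).

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) if the file starts with EF BB BF.
has_bom() {
  # Dump the first three bytes as hex and strip od's spacing.
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

You could then write something like: has_bom myscript.sh && echo "BOM found".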

The quickest way I know of to strip this sequence from a text file is sed. The following syntax instructs sed to remove the BOM from the first line of its input file (the \xHH escapes are understood by GNU sed; other implementations may not support them):

sed '1 s/\xEF\xBB\xBF//' < input > output
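
If your sed does not understand \xHH escapes, one possible workaround (suggested by a reader in the comments for OS X) is to let bash expand them with its $'...' quoting, so that sed receives the raw bytes. The sketch below wraps this in a function whose name, strip_bom_first_line, is made up for this example. Setting LC_ALL=C and anchoring the pattern with ^ are my own assumptions to keep sed working byte by byte, not something I have verified on every platform.

```shell
#!/bin/bash
# Hypothetical helper: read stdin, drop a leading BOM from line 1, write stdout.
# $'...' makes bash turn \xEF\xBB\xBF into raw bytes before sed sees the script.
strip_bom_first_line() {
  LC_ALL=C sed $'1s/^\xEF\xBB\xBF//'
}

# Example: the BOM in front of "hello" is removed.
printf '\xEF\xBB\xBFhello\nworld\n' | strip_bom_first_line
```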

A Warning for Solaris Users
I haven't found a way (yet) to correctly use a sed implementation bundled with Solaris 10 to perform this operation, neither using /usr/bin/sed nor /usr/xpg4/bin/sed. If you're a Solaris user, please consider installing GNU sed to use the following script.

The quickest way to install sed and a lot of fancy Solaris packages is using Blastwave or OpenCSW. I've also written a post about loopback-mounting Blastwave/OpenCSW installation directory in Solaris Zones to simplify Blastwave/OpenCSW software administration.

A Suggestion for Windows Users
If you want to execute this script in a Windows environment, you can install Cygwin. The base install with bash and the core utilities will be sufficient for this script to work in your Cygwin environment.

Source
This is the source code of a skeleton implementation of a bash shell script that will remove the BOM from its input files. The script supports recursive scanning of directories to "clean" an entire file system tree and a flag (-x) to avoid descending into a filesystem mounted elsewhere. The script uses temporary files while doing the conversion and the original file will be overwritten only if the -d option is not specified.



#!/bin/bash


set -o nounset
set -o errexit


DELETE_ORIG=true
DELETE_FLAG=""
RECURSIVE=false
PROCESSALLFILE=false
PROCESSING_FILES=false
PROCESSALLFILE_FLAG=""
SED_EXEC=sed
USE_EXT=false
FILE_EXT=""
TMP_CMD="mktemp"
TMP_OPTS="--tmpdir="
XDEV=""
ISDARWIN=false


if [ $(uname) == "SunOS" ] ; then
  if [ -x /usr/gnu/bin/sed ] ; then
    echo "Using GNU sed..."
    SED_EXEC=/usr/gnu/bin/sed
  fi 
  TMP_OPTS="-p "
fi


if [ $(uname) == "Darwin" ] ; then
  TMP_OPTS="-t tmp"

  SED_EXEC="perl -pe"
  echo "Using perl..."
  ISDARWIN=true

fi


function usage() {
  echo "bom-remove [-adrx] [-s sed-name] [-e ext] files..."
  echo ""
  echo "  -a    Remove the BOM throughout the entire file."
  echo "  -e    Look only for files with the chosen extensions."
  echo "  -d    Do not overwrite original files and do not remove temp files."
  echo "  -r    Scan subdirectories."
  echo "  -s    Specify an alternate sed implementation."
  echo "  -x    Don't descend directories in other filesystems."
}


function checkExecutable() {
  if ( ! which "$1" > /dev/null 2>&1 ); then
    echo "Cannot find executable:" $1
    exit 4
  fi
}


function parseArgs() {
  while getopts "adfrs:e:x" flag
  do
    case $flag in
      a) PROCESSALLFILE=true ; PROCESSALLFILE_FLAG="-a" ;;
      r) RECURSIVE=true ;;
      f) PROCESSING_FILES=true ;;
      s) SED_EXEC=$OPTARG ;;
      e) USE_EXT=true ; FILE_EXT=$OPTARG ;;
      d) DELETE_ORIG=false ; DELETE_FLAG="-d" ;;
      x) XDEV="-xdev" ;;
      *) echo "Unknown parameter." ; usage ; exit 2 ;; 
    esac
  done


  shift $(($OPTIND - 1))



  if [ $# == 0 ] ; then
    usage;
    exit 2;
  fi



  # fixing darwin
  if [[ $ISDARWIN == true && $PROCESSALLFILE == false ]] ; then
    PROCESSALLFILE=true
    echo "Process all file is implicitly set on Darwin."
  fi

  FILES=("$@")


  if [ ${#FILES[@]} -eq 0 ]; then
    echo "No files specified. Exiting."
    exit 2
  fi


  if [ $RECURSIVE == true ]  && [ $PROCESSING_FILES == true ] ; then
    echo "Cannot use -r and -f at the same time."
    usage
    exit 1
  fi


  checkExecutable $SED_EXEC
  checkExecutable $TMP_CMD
}


function processFile() {
  if [ $(uname) == "Darwin" ] ; then
    TEMPFILENAME=$($TMP_CMD $TMP_OPTS)
  else
    TEMPFILENAME=$($TMP_CMD $TMP_OPTS"$(dirname "$1")")
  fi
  echo "Processing $1 using temp file $TEMPFILENAME"


  if [ $PROCESSALLFILE == false ] ; then 
    cat "$1" | $SED_EXEC '1 s/\xEF\xBB\xBF//' > "$TEMPFILENAME"
  else
    cat "$1" | $SED_EXEC 's/\xEF\xBB\xBF//g' > "$TEMPFILENAME"
  fi


  if [ $DELETE_ORIG == true ] ; then
    if [ ! -w "$1" ] ; then
      echo "$1 is not writable. Leaving tempfile."
    else
      echo "Replacing $1 with the cleaned temp file..."
      mv "$TEMPFILENAME" "$1"
    fi
  fi
}


function doJob() {
  # Check if the script has been called from the outside.
  if [ $PROCESSING_FILES == true ] ; then
    for i in $(seq 1 ${#FILES[@]})
    do
      echo ${FILES[$i-1]}
      processFile "${FILES[$i-1]}"
    done


  else
    # processing every file
    for i in $(seq 1 ${#FILES[@]})
    do
      CURRFILE=${FILES[$i-1]}
      # checking if file or directory exist
      if [ ! -e "$CURRFILE" ] ; then echo "File not found: $CURRFILE. Skipping..." ; continue ; fi
      
      # if a parameter is a directory, process it recursively if RECURSIVE is set
      if [ -d "$CURRFILE" ] ; then
        if [ $RECURSIVE == true ] ; then
          if [ $USE_EXT == true ] ; then
            find "$CURRFILE" $XDEV -type f -name "*.$FILE_EXT" -exec "$0" $DELETE_FLAG $PROCESSALLFILE_FLAG -f "{}" \;
          else
            find "$CURRFILE" $XDEV -type f -exec "$0" $DELETE_FLAG $PROCESSALLFILE_FLAG -f "{}" \;
          fi
        else
          echo "$CURRFILE is a directory. Skipping..."
        fi
      else
        processFile "$CURRFILE"
      fi
    done
  fi
}


parseArgs "$@"
doJob
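
One reader reports in the comments that cleaned files do not keep their original permissions. That is a side effect of mv, which replaces the original with the freshly created temp file. A possible way around it, sketched here as an assumption and not as part of the script above, is to copy the cleaned content back into the original file, so that its mode and ownership are left alone. The name replace_preserving_mode is invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: overwrite the target's content with the temp file's
# content, then remove the temp file. The target itself is never replaced,
# so its permissions stay as they were.
replace_preserving_mode() {
  local tmp=$1 target=$2
  cat "$tmp" > "$target" && rm -f "$tmp"
}
```

The trade-off is that the update is no longer a single rename: if the copy is interrupted, the original may be left half-written.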




Examples
Assuming the script is in your $PATH and it's called bom-remove, you can "clean" a bunch of files invoking it this way:

$ bom-remove file-to-clean ...

If you want to clean the files in an entire directory, you can use the following syntax:

$ bom-remove -r dir-to-clean ...

If your sed installation is not in your $PATH or you have to use an alternate version, you can invoke the script with the following syntax:

$ bom-remove -s path/to/sed file-to-clean ...

If you want to clean a directory in which other file systems might be mounted, you can use the -x option so that the script does not descend them:

$ bom-remove -xr dir-to-clean ...
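
Before cleaning a whole tree, you may want to see which files are actually affected. The following sketch lists the files under a directory whose first three bytes are the UTF-8 BOM. It is independent of bom-remove, the name list_bom_files is made up for this example, and it assumes a find that supports -print0 and a head that supports -c.

```shell
#!/bin/bash
# Hypothetical helper: print the path of every regular file under "$1"
# that starts with the bytes EF BB BF.
list_bom_files() {
  find "$1" -type f -print0 |
  while IFS= read -r -d '' f; do
    # NUL-separated names keep paths with whitespace intact.
    if [ "$(head -c 3 "$f" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

For instance, list_bom_files dir-to-clean prints one affected path per line.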

Next Steps

The most effective way to fight the BOM is avoiding spreading it. Microsoft Notepad, if there's anybody out there using it, isn't the best tool to edit your UTF-8 files so, please, avoid it.

However, should your file system be affected by the BOM disease, I hope this script will be a good starting point to build a BOM-cleaning solution for your site.

Enjoy!






18 comments:

driver said...

Unfortunately I have to inform you that the script is not working! I have just tried it.

Grey said...

Thank you driver for your report.

Can you give me more detail about the problems you're experiencing and the platform you're running it on? I could only test the script in a limited set of systems.

Christian said...

output/error is


mktemp: invalid option -- -
Usage: mktemp [-V] | [-dqtu] [-p prefix] [template]

Grey said...

Thank you, Christian, it seems like an old mktemp version.

Which OS are you testing the script on?

Bowser said...

Great script... it worked just fine except that the temp file writes to 644 permissions. Is there any way you could modify it to hold the permissions on the file so that they are preserved? I would definitely use it all the time if so...

Grey said...

Thank you very much for pointing it out, Bowser.

I'll have a look at it.

Goldan said...

Thanks for the great post.

To remove BOM character from all lines of the file, use this command:
sed 's/\xEF\xBB\xBF//' < input > output

Grey said...

Thanks for your comment, Goldan.

I just wanted to point out that I'm intentionally removing the BOM marker from the first line of the file since it should appear at the beginning of a stream.

Bye,
-- Grey

Goldan said...

It should, but this is not always the case, unfortunately. I've got the BOM marker in the beginning of every \footnote{} in LaTeX file after opening and saving it with TeXMaker on Windows. As it is noted on unicode.org, the marker's usage in the middle of a file is deprecated: http://unicode.org/faq/utf_bom.html#bom6

Please correct my last comment. The command should be:
sed 's/\xEF\xBB\xBF//g' < input > output
(note the 'g' modifier)

Thanks again for the bash script, just tried it for recursively correct a directory and it worked perfectly. By the way, can I use it to recursively correct files with specified extension (e.g. .tex) in a directory?

And another suggestion: since your script is so well written, you could easily add an option to remove BOM marker not only from the beginning, but from the whole document. To do that, I just needed to replace sed command on line 76 with the one I mentioned above.

Grey said...

Thank you very much for the useful information!

I didn't know about that specific BOM usage and it will surely be useful to other people running into the same problem.

The quickest way to filter files by extensions is modifying the line where find execs the scripts adding this criterion:

find […] -type f -name "*.tex" -exec […]

Since your suggestions are useful, I'll modify the script to add the two options you're recommending and update the post as soon as it's done.

Bye,
-- Grey

Grey said...

Goldan,

I implemented your suggestions and also corrected some problems related with the management of file names containing white spaces.

Bye,
-- Grey

Omri said...

Hi Grey, thanks for posting this!

I wasn't able to get this to work on MacOSX Lion. I know UNIX but am pretty new to Apple's environment so bear with me :-) When running the script, everything appeared to run fine but the output file still had the BOM in it (I double-checked by running with the -x option and examining the temp file).

When trying to debug this and playing around with sed on MacOSX, it appears that it ignores the BOM at the beginning of the file. It's possible that this is an idiosyncrasy of the Apple implementation (or maybe something that's new in Lion, since it appears from the article that you tested on previous versions of OSX?)

I'm using the native /usr/bin/sed. You mentioned that you couldn't get this to work on the native solaris sed - so maybe this is a similar issue. You also mentioned substituting GNU sed - do you know by any chance whether there's a simple way to download standalone GNU utils for OSX as opposed to installing a big package?

Thanks!

Omri.

Grey said...

Hi Omri.

Although I'm on OS X most of the time, I don't really use it for shell scripting: I'm still a Solaris guy for that. And yes: I just tried and the script doesn't work correctly with Apple's sed.

This is a quick workaround: I put it here because it's not going to fix the entire script as it is. Instead of using sed, replace the line where sed is invoked with the following one:

cat "$1" | perl -pe 's/\xEF\xBB\xBF//' > "$TEMPFILENAME"

I preferred using sed instead of perl mainly because it's available on almost any system, even the most stripped-down installations. However, for OS X we've got to stick with perl and wait for me to fix the script.

Also, I just realized I missed an argument check during the last script refactoring. In a few minutes it will be updated.

Thank you very much,
-- Grey

Unknown said...

Thanks Grey! Substituting perl for sed worked well indeed. Much appreciated!

Juri Strumpflohner said...

thx, really useful

Anonymous said...

thanks for share.

Anonymous said...

I was also able to use dos2unix to remove it ... :)

Petri Sirkkala said...

On OSX (and possibly some other posix systems) you need to use:

sed $'1 s/\xEF\xBB\xBF//' < input > output

(note the dollar sign)