Your Bash Cheat Sheet for Data Exploration (Part I)

Maria Shaukat
5 min read · Sep 5, 2020

If you work with large datasets, you will definitely need these bash commands someday.

Do you need to read this article?

If you work in data science, you spend most of your time moving, exploring, comparing, editing and processing files (text, CSV, XML, JSON, images). Bash commands are robust, fast, interactive tools for manipulating files without the hassle of writing code. The commands shared in this article will help you do more with less effort.


Prerequisite:

I am assuming that you are familiar with common bash commands like echo, cd, ls, head, tail, cp, mv, grep, cut, sort, paste and sed. If your bash skills are a bit rusty, you can use --help to see what a command does:

$ mv --help
Usage: mv [OPTION]... [-T] SOURCE DEST
or: mv [OPTION]... SOURCE... DIRECTORY
or: mv [OPTION]... -t DIRECTORY SOURCE...
Rename SOURCE to DEST, or move SOURCE(s) to DIRECTORY.
Mandatory arguments to long options are mandatory for short options too.
--backup[=CONTROL] make a backup of each existing destination file
-b like --backup but does not accept an argument
-f, --force do not prompt before overwriting
-i, --interactive prompt before overwrite
-n, --no-clobber do not overwrite an existing file
If you specify more than one of -i, -f, -n, only the final one takes effect.
--strip-trailing-slashes remove any trailing slashes from each SOURCE
argument
-S, --suffix=SUFFIX override the usual backup suffix
-t, --target-directory=DIRECTORY move all SOURCE arguments into DIRECTORY
-T, --no-target-directory treat DEST as a normal file
-u, --update move only when the SOURCE file is newer
than the destination file or when the
destination file is missing
-v, --verbose explain what is being done
-Z, --context set SELinux security context of destination
file to default type
--help display this help and exit
--version output version information and exit

1) List absolute paths of files

We all know that ls lists the names of all the files in a directory.

maria@Inspiron:~/my-documents$ ls *
file1
file2.csv
file3.txt

Now suppose you have a directory with a million text or image files and you need to create a list of their complete paths. This list can be fed to your machine learning model as input.

  • Use $PWD
maria@Inspiron:~/my-documents$ ls -d $PWD/*
/home/maria/my-documents/file1
/home/maria/my-documents/file2.csv
/home/maria/my-documents/file3.txt

>$PWD is an environment variable holding the path of the current working directory. The -d flag makes ls list directories themselves instead of their contents; if you remove it, the files inside each sub-directory will be listed as well.

Simple but so useful. Right?
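Since the point of such a list is usually to feed it to another program, you can redirect it straight into a file. A minimal sketch (the directory and file names here are invented for illustration):

```shell
# Create a small throw-away directory to demonstrate (names are made up)
mkdir -p /tmp/demo-docs
touch /tmp/demo-docs/file1 /tmp/demo-docs/file2.csv
cd /tmp/demo-docs

# Write the absolute path of every file to filelist.txt
ls -d "$PWD"/* > filelist.txt
cat filelist.txt
```

The resulting filelist.txt contains one absolute path per line, ready to hand to a data loader.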

  • Use readlink

readlink can also be used to list the absolute path of all files in a directory.

maria@Inspiron:~/my-documents$ ls -d * | xargs -i readlink -f {}
/home/maria/my-documents/file1
/home/maria/my-documents/file2.csv
/home/maria/my-documents/file3.txt

>We will discuss xargs later in the article.

2) Zip all the files in a directory

Suppose your directory has a million files or sub-directories and you want to zip all these files individually.

maria@Inspiron:~/my-documents$ ls
course-books/
novels/
medium-articles/
  • Use For Loops in Bash
maria@Inspiron:~/my-documents$ for file in `ls`; do zip -r ${file%.*}.zip $file; done
course-books.zip
novels.zip
medium-articles.zip

>Each element of the backticked list is stored in turn in the variable named file, and each file is then zipped with the zip command (-r recurses into directories). $ is used to reference a variable's value in bash.

The for loop in bash is a powerful tool: it can iterate over all your data and perform operations on each file. Learn the syntax and see how else you can use it:

for variable in list;
do
    command1;
    command2;
    ...
    command-n;
done

> ‘for’, ‘do’ and ‘done’ are the keywords of the for loop. If your list comes from another bash command such as ls or grep, wrap that command in backticks: ` your bash command `

Try it yourself: Print the first 10 lines of all the text files in a directory and write them to an output file.
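One possible solution to this exercise, sketched with invented file names and contents:

```shell
# Set up two small text files to work with (contents are arbitrary)
mkdir -p /tmp/demo-heads
cd /tmp/demo-heads
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > a.txt
printf 'row %s\n' 1 2 3 > b.txt

# Collect the first 10 lines of every .txt file into one output file
for f in *.txt; do head -n 10 "$f"; done > heads.out
wc -l heads.out   # 10 lines from a.txt + 3 from b.txt = 13
```

Note that the redirection sits after `done`, so it captures the output of the whole loop, not of a single head call.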

3) Nested bash commands

This feature opens the door to a gazillion possibilities for pre-processing, cleaning and manipulating large data in text or csv format.

  • Use <()

Say you want to use the result of command1 as input to command2. For example, you have two files, file1 and file2, and you need to grep for the last 10 lines of file1 in file2.

grep -Ff <(tail -n 10 file1) file2 

>tail extracts the last 10 lines of file1, and process substitution <() feeds them to grep as if they were a file.

The relevant grep options:

-F, --fixed-strings    PATTERN is a set of newline-separated strings
-f, --file=FILE        obtain PATTERN from FILE

Try it yourself: Suppose you have two csv files with 2 columns in each. Use nested commands to grep the 2nd column of 2nd file from 1st column of 1st file.
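A hedged sketch of one way to solve this, combining cut with <() — the file names, delimiter and contents below are invented:

```shell
# Two toy CSV files with two columns each
mkdir -p /tmp/demo-csv
cd /tmp/demo-csv
printf 'apple,1\nbanana,2\ncherry,3\n' > first.csv
printf 'x,banana\ny,durian\nz,apple\n' > second.csv

# Patterns come from column 1 of first.csv;
# the search space is column 2 of second.csv
grep -Ff <(cut -d, -f1 first.csv) <(cut -d, -f2 second.csv)
# matches: banana, apple
```

Both sides of the comparison are nested commands here: grep never sees the CSV files themselves, only the cut-down columns.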

  • Use xargs

xargs allows you to use the results of one command as an argument for another command.

The grep task above can also be performed with xargs as:

tail -n 10 file1 | xargs -i grep {} file2

>The | is called a pipe: it connects the output of one process to the input of another. {} is the placeholder where each line of the first command’s output is substituted in turn.
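A tiny self-contained illustration of the placeholder (file contents invented): each line tail emits becomes a separate grep pattern via {}.

```shell
mkdir -p /tmp/demo-xargs
cd /tmp/demo-xargs
printf 'needle\nhay\n' > file1
printf 'needle found here\nhay bale\nnothing else\n' > file2

# Each of the last 10 lines of file1 is substituted for {} in turn,
# so grep runs once per pattern
tail -n 10 file1 | xargs -i grep {} file2
```

Unlike the <() version, which hands grep all patterns at once, xargs -i launches one grep per line, which can be slower on very large pattern lists.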

4) If-else in bash

Suppose that you have ten thousand text files containing names of objects. You want to select the files which contain “glasses” and move these files to directory1. All files without glasses will go to directory2. How will you do it? You cannot go through each file individually as it will take too much time. You can write a Python script to perform this task but the computer you are working on might not have a Python interpreter. However, if the computer uses Ubuntu, it will definitely have bash.

First, let’s work with one file:

if grep -q "glasses" file1; then mv file1 dir1/; fi;

The syntax of if-else is simple:

if condition; 
then command1;
else command2;
fi;

>‘if’, ‘then’, ‘else’ and ‘fi’ are the keywords of the if-else statement. You are probably familiar with all of them except fi, which closes the statement.

Try it yourself: Count the number of lines in a text file. If count is 0, aka file is empty, then remove the file.
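One possible solution, sketched with made-up file names: wc -l counts the lines, and the test command [ ] compares the count to zero.

```shell
mkdir -p /tmp/demo-empty
cd /tmp/demo-empty
touch empty.txt                 # zero lines
echo "hello" > full.txt         # one line

# Remove every .txt file whose line count is 0
for f in *.txt; do
    if [ "$(wc -l < "$f")" -eq 0 ]; then rm "$f"; fi
done
ls    # only full.txt survives
```

Redirecting into wc (`wc -l < file`) prints just the number, without the file name, which makes the count easy to compare.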

Now to the big showdown. Use for loop with this if-else statement to process all the ten thousand files with one bash command.

for file in *.txt; do if grep -q 'glasses' $file; then mv $file dir1/; else mv $file dir2/; fi; done

Closing remarks:

So that’s all for today. This was only a teeny tiny glimpse of the power of bash. If you found it useful, hit the clap button down there to encourage me to write parts 2, 3, 4, …, n of this bash cheat sheet.

Connect with me:


Maria Shaukat

Graduate Research Assistant, Electrical and Computer Engineering, University of Oklahoma, USA 🇺🇸 Intern at Intel AI Labs