
Mastering Unix Commands and Shell Scripting for Data Science Success
Learn essential Unix commands for data manipulation, including egrep, sed, sort, head, tail, cut, join, uniq, and more. Explore further tools such as diff, cmp, od, and find, along with loops and file redirection. Understand regular expressions and shell programming constructs, test Python regular expressions with the pythex utility, and discover how to use man pages effectively to enhance your data processing skills.
Bash Shell Scripting for Data Science Dr. Dale Parson, Kutztown University, Fall 2021
Unix commands for manipulating data streams
- egrep for matching (searching) regular expressions in data streams
- sed (stream editor) for matching & modifying regular expressions in data streams
- sort [ -n ] for sorting lines of data using alpha or numeric keys
- head -n and tail -n for selecting a subset of lines in data
- cut for selecting a subset of columns in data
- join for joining lines of two files on a common field
- uniq for counting unique (sorted) instances
- make for automating sequences of target-driven Unix commands
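A minimal sketch tying a few of these commands together, assuming a hypothetical scores.csv with a name,score header row; all file and column names here are illustrative:

$ egrep -v '^name,' scores.csv | sort -t, -k2 -n | head -3   # three lowest scores
$ cut -d, -f1 scores.csv | sort | uniq -c                    # count repeated names
$ sed -e 's/,/\t/g' scores.csv > scores.tsv                  # commas to TABs (GNU sed)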
Unix commands for manipulating data, continued
- diff for left, right line comparisons
- cmp for comparing non-textual (binary) data files
- od [ -c ] (octal dump) for examining a file for invalid characters
- find for locating files by names and properties
- cat and more and less for examining text data contents
- pr [ -n ] for printing with line numbers
- time for measuring real time & CPU time of a process
Basic programming constructs include:
- < input redirection, > output redirection, and >> append redirection
- | pipe redirection of the output of one process to the input of the next
- while loops, for loops, if-then-elif-else selection
- running a command and capturing its output as a data stream (sketched below)
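A minimal sketch of these constructs in bash; data.csv, sorted.csv, and the variable names are hypothetical:

$ wc -l < data.csv                  # < redirects the file to stdin
$ sort data.csv > sorted.csv        # > overwrites; >> would append
$ lines=$(wc -l < data.csv)         # capture a command's output in a variable
$ if [ "$lines" -gt 100 ]; then echo big; else echo small; fi
$ while read -r row; do echo "ROW: $row"; done < sorted.csv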
Highlights of man egrep
REGULAR EXPRESSIONS: A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any meta-character with special meaning may be quoted by preceding it with a backslash. (Much more follows.)
The pythex utility is for testing Python regular expressions.
Option -i uses case-insensitive matching, -l simply reports the matching file path, and -v inverts the match, selecting non-matching lines.
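A hedged sketch of those three options; notes.txt, script.py, and the pattern strings are hypothetical:

$ egrep -i 'csv' notes.txt     # case-insensitive match
$ egrep -l 'import csv' *.py   # print only the paths of matching files
$ egrep -v '^#' script.py      # invert: select lines that do NOT start with #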
Highlights of man sort
- Default is lexicographic sort. -n is numeric. -r is reversed.
- -t sets the field separator (the default key boundary is the transition from non-blank to blank), e.g., -t, for comma separation.
- -k field1[,field2], --key=field1[,field2]: Define a restricted sort key that has the starting position field1 and optional ending position field2 of a key field.
- -R, --random-sort, --sort=random: Sort in random order. This is a random permutation of the inputs, except that equal keys sort together.
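A brief sketch of these options, assuming a hypothetical comma-separated grades.csv (GNU sort for -R):

$ sort -t, -k3 -n grades.csv     # numeric sort on comma-separated field 3
$ sort -t, -k3 -nr grades.csv    # same key, descending
$ sort -R grades.csv | head -5   # random permutation, first 5 lines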
Highlights of man sed
sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.
-e script: add the script to the commands to be executed
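A short sketch of chaining -e scripts in a pipeline; the file names are hypothetical, and the \r removal anticipates the DOS line-ending cleanup in the join slides below:

$ sed -e 's/\r$//' dos.csv > unix.csv              # strip DOS carriage returns
$ head data.csv | sed -e 's/"//g' -e 's/,/\t/g'    # chain multiple -e scripts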
egrep and find
[:-) ~/DataMine] pwd
/home/kutztown.edu/parson/DataMine
[:-) ~/DataMine] egrep -li 'csv.*writer' $(find . -name '*.py')
./CSC523Example2/CSC523Example2.py
./CSC523Example2/bak/CSC523Example2.py
./csc523F20TCPUDP/csc523F20TCPUDP.py
./csc458ensemble5sp2021/csc458ensemble5sp2021.py
./csc458ensemble5sp2021/parallel/csc458ParallelEnsemble5sp2021.py
[:-) ~/DataMine] egrep -i 'csv.*writer' ./CSC523Example2/bak/CSC523Example2.py
csvwriter = csv.writer(csvf, delimiter=',', quotechar='"')
csvwriter.writerow(scolumnnames)
csvwriter.writerows(sdataset)
find and a for loop using head
[:-) ~/DataMine] for f in $(find . -name '*csc458**csv' -type f); do echo ; echo $f ; read j ; head -4 $f; done
./csc458ensemble5sp2021/csc458ensemble5sp2021.summary.csv
testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,1.7,0.001378
OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,2.14,0.001218
NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,3.99,0.00103
./csc458ensemble5sp2021/csc458ensemble5sp2021.summary.ref.csv
testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,1.7,0.001378
OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,2.14,0.001218
NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,3.99,0.00103
find and a for loop using head, continued
./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.csv
testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,9.0,6.35
OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,10.83,8.84
NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,17.7,13.07
./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv
testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,8.76,6.17
OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,9.99,7.55
NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,16.66,13.14
head and cut with delimiter , and field specs
[:-) ~/DataMine] head -4 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv
testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,8.76,6.17
OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,9.99,7.55
NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,16.66,13.14
[:-) ~/DataMine] head -4 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv | cut -d, -f2-4,8
testdatatype,kappa,MAE,cputime
10FoldCrossValidation,0.0,0.1522,6.17
10FoldCrossValidation,0.3975,0.089,7.55
10FoldCrossValidation,0.4138,0.1192,13.14
head, cut, sort with delimiter , and key spec
[:-) ~/DataMine] head -4 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv | cut -d, -f2-4,8 | sort -t, -k3
10FoldCrossValidation,0.3975,0.089,7.55
10FoldCrossValidation,0.4138,0.1192,13.14
10FoldCrossValidation,0.0,0.1522,6.17
testdatatype,kappa,MAE,cputime
head and egrep -v for separating the header row
# The ; runs the two commands sequentially.
[:-) ~/DataMine] head -1 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv ; egrep -v kappa ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv | head -3 | cut -d, -f2-4,8 | sort -t, -k3
testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
10FoldCrossValidation,0.3975,0.089,7.55
10FoldCrossValidation,0.4138,0.1192,13.14
10FoldCrossValidation,0.0,0.1522,6.17
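An equivalent idiom (not from the slides) uses tail -n +2 to select everything after the header row, avoiding the egrep -v pattern match; file.csv stands in for the long path above:

$ head -1 file.csv ; tail -n +2 file.csv | cut -d, -f2-4,8 | sort -t, -k3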
sed used to generate CSC558 grade sheets

# CSC 558 csc558maillist2021fall.py for generating grade sheets
import subprocess  # Updated from the commands module for Python 3.
people = (
    ("Student1Lastname,Firstname", "stu1email"),
    ("Student2Lastname,Firstname", "stu2email")
)
for pair in people:
    cmd = ("cat template.txt | sed -e 's/STUNAME/" + pair[0]
           + "/' -e 's/STUID/" + pair[1] + "/' > sheets/" + pair[1] + ".txt")
    print(cmd)
    (exitstatus, outtext) = subprocess.getstatusoutput(cmd)
    if exitstatus:
        print("ERROR ON " + str(pair))  # str() because pair is a tuple
sed used to generate grade sheets, continued
$ head template.txt
CSC558 Assignment 3 grade rubric, D. Parson, Fall 2021, due 11/11.
I will be flexible about wording & accurate concepts as usual.
Name: STUNAME
Email: STUID@live.kutztown.edu
Project Grade:
Grading rubrics appear below. Each of Q1 through Q15 is worth 6.6% of the project.
If late the usual 10% per day applies.
------------------------------------------------------------------------
Line numbers for subsequent join, slide 1
$ cat -n csc458ensemble5sp2021.summary.ref.csv | head -4
     1	testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
     2	ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,8.76,6.17
     3	OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,9.99,7.55
     4	NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,16.66,13.14
$ cat -n csc458ensemble5sp2021.summary.ref.csv | tail -1 | od -c
0000000                   3   9  \t   R   a   n   d   o   m   F   o   r
0000020   e   s   t   ,   E   x   t   e   r   n   a   l   T   e   s   t
0000040   F   i   l   e   ,   0   .   7   1   8   2   ,   0   .   0   6
0000060   1   8   ,   0   .   1   7   2   3   ,   4   1   8   9   4   9
0000100   ,   4   4   4   .   7   4   ,   4   7   0   .   1   1  \r  \n
# Convert the leading whitespace and \t to a single ,
# Eliminate the \r before the join
Line numbers for subsequent join, slide 2
$ cat -n csc458ensemble5sp2021.summary.ref.csv | tail -1 | sed -r -e 's/^ +//' -e 's/[ \t]+/,/' -e 's/\r//'
39,RandomForest,ExternalTestFile,0.7182,0.0618,0.1723,418949,444.74,470.11
# > redirect the above without tail to junk1.txt and junk2.txt for the join demo
$ diff junk1.txt junk2.txt
# No output because the files are identical
$ join -t, -j1 junk1.txt junk2.txt | tail -1
39,RandomForest,ExternalTestFile,0.7182,0.0618,0.1723,418949,444.74,470.11,RandomForest,ExternalTestFile,0.7182,0.0618,0.1723,418949,444.74,470.11
# Follow join with cut -d, -f2-1000000 to remove the join line number
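A minimal sketch of that last comment's pattern, reusing junk1.txt and junk2.txt from the demo above; cut -d, -f2- selects field 2 through the last field and is equivalent to the slide's -f2-1000000:

$ join -t, -j1 junk1.txt junk2.txt | cut -d, -f2- | tail -1   # drop the join-key column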