Unix Commands for Data Manipulation and Scripting with Dr. Dale Parson at Kutztown University Fall 2022


This course, taught by Dr. Dale Parson at Kutztown University, covers Unix commands for manipulating data streams, comparing files, and examining text data. It introduces egrep, sed, sort, head, tail, cut, and related tools, along with basic programming constructs and regular expressions, as they apply to shell scripting for data science.

  • Unix Commands
  • Data Science
  • Shell Scripting
  • Dr. Dale Parson
  • Kutztown University




Presentation Transcript


  1. Bash Shell Scripting for Data Science Dr. Dale Parson, Kutztown University, Fall 2022

  2. Unix commands for manipulating data streams
     • egrep for matching regular expressions in data streams
     • sed (stream editor) for matching regular expressions and modifying the matched text in data streams
     • sort [ -n ] for sorting lines of data using alpha or numeric keys
     • head -n and tail -n for selecting a subset of lines in data
     • cut for selecting a subset of columns in data
     • join for joining lines of two files on a common field
     • uniq for counting unique instances
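
The commands on slide 2 can be tried together on a tiny data set. The following is a minimal sketch, not taken from the slides: the file names and the duplicated OneR row are made up for illustration, and only the classifier names and numbers echo the later slides. It uses grep -E, which is the current spelling of egrep.

```shell
#!/usr/bin/env bash
# Hypothetical scratch files; only the classifier names and numbers echo the slides.
export LC_ALL=C                      # byte-order sorting, "." as decimal point
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# The repeated OneR row is made up so that uniq -c has something to count.
printf '%s\n' 'ZeroR,0.1522' 'OneR,0.089' 'NaiveBayes,0.1192' 'OneR,0.089' > "$workdir/mae.csv"
printf '%s\n' 'NaiveBayes,3.99' 'OneR,2.14' 'ZeroR,1.7' > "$workdir/runtime.csv"

# grep -E (egrep): keep rows whose name ends in "R".
ends_in_r=$(grep -E '^[A-Za-z]+R,' "$workdir/mae.csv")

# sed: replace the first comma on each line, then keep only the first line.
rewritten=$(sed 's/,/ = /' "$workdir/mae.csv" | head -n 1)

# sort -n on field 2 only, then head -n 1 for the smallest value.
smallest=$(sort -t, -k2,2n "$workdir/mae.csv" | head -n 1)

# cut the name column, then sort so that uniq -c sees duplicates as adjacent lines.
counts=$(cut -d, -f1 "$workdir/mae.csv" | sort | uniq -c)

# join needs both inputs sorted on the join field (field 1 by default).
sort -u "$workdir/mae.csv"  > "$workdir/mae.sorted.csv"
sort "$workdir/runtime.csv" > "$workdir/runtime.sorted.csv"
joined=$(join -t, "$workdir/mae.sorted.csv" "$workdir/runtime.sorted.csv")

printf '%s\n' "$ends_in_r" "$rewritten" "$smallest" "$counts" "$joined"
```

Each result is captured in a variable only so that it can be inspected separately; in day-to-day use the same pipelines would normally be typed directly at the prompt.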

  3. Unix commands for manipulating data, continued
     • diff for left, right line comparisons
     • cmp for comparing non-textual (binary) data files
     • od [ -c ] (octal dump) for examining a file for invalid characters
     • find for locating files by names and properties
     • cat and less for examining text data contents
     • pr [ -n ] for printing with line numbers
     • time for measuring real time & CPU time of a process
     Basic programming constructs include:
     • < input redirection, > output redirection, and >> appending output to a file (concatenation)
     • | pipe redirection of output of one process to input of the next
     • while loops, for (foreach-style) loops, if-then-elif-else selection
     • running a command and capturing its output as a data stream
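
The programming constructs listed on slide 3 might be combined as follows. This is a sketch under stated assumptions: the scratch file names, the label words, and the extra row are invented for the example, while the three runtime values come from slide 6.

```shell
#!/usr/bin/env bash
export LC_ALL=C
workdir=$(mktemp -d)                 # hypothetical scratch directory
trap 'rm -rf "$workdir"' EXIT
data="$workdir/runtime.csv"

# > creates or truncates the file; >> appends to it.
echo 'ZeroR,1.7'        >  "$data"
echo 'OneR,2.14'        >> "$data"
echo 'NaiveBayes,3.99'  >> "$data"

# < supplies the file as standard input; | passes output to the next process.
line_count=$(wc -l < "$data" | tr -d ' ')

# $(...) runs a pipeline and captures its output as a string.
slowest=$(sort -t, -k2,2n "$data" | tail -n 1 | cut -d, -f1)

# while loop over records, with if / elif / else selection.
labels=''
while IFS=, read -r name seconds; do
    if [ "$name" = "$slowest" ]; then
        labels="$labels$name=slowest "
    elif [ "$name" = 'ZeroR' ]; then
        labels="$labels$name=baseline "
    else
        labels="$labels$name=other "
    fi
done < "$data"

# cmp -s reports only through its exit status; diff shows the differing lines.
cp "$data" "$workdir/copy.csv"
echo 'Extra,5.0' >> "$workdir/copy.csv"     # made-up row to create a difference
if cmp -s "$data" "$workdir/copy.csv"; then same=yes; else same=no; fi
added=$(diff "$data" "$workdir/copy.csv" | grep -c '^>')

# for loop over a list of words.
sizes=''
for f in "$data" "$workdir/copy.csv"; do
    sizes="$sizes$(wc -l < "$f" | tr -d ' ') "
done

echo "$line_count rows, slowest=$slowest, labels: $labels"
echo "same=$same, added lines=$added, sizes: $sizes"
```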

  4. Highlights of man egrep
     REGULAR EXPRESSIONS
     A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. ...
     The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any meta-character with special meaning may be quoted by preceding it with a backslash.
     (Much more follows.)
     The pythex utility for testing Python regular expressions
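
The points quoted from the man page (characters that match themselves, meta-characters, backslash quoting, operators that combine smaller expressions) can be checked with a few patterns. The sample lines and the count_matches helper below are made up for illustration, and grep -E stands in for egrep.

```shell
#!/usr/bin/env bash
export LC_ALL=C
# Made-up sample lines; the first imitates the csv.writer match on slide 5.
sample=$(printf '%s\n' 'csvwriter = csv.writer(csvf)' 'import csv' 'price is 3.99' 'price is 3x99')

# count_matches is a hypothetical helper: number of sample lines matching regex $1.
count_matches() {
    printf '%s\n' "$sample" | grep -E -c -- "$1"
}

# Letters are regular expressions that match themselves.
plain=$(count_matches 'csv')

# . matches any single character and * repeats the preceding expression.
any_run=$(count_matches 'csv.*writer')

# An unquoted . is a meta-character, so 3.99 also matches 3x99.
loose=$(count_matches '3.99')

# A backslash quotes the meta-character, so only a literal dot matches.
strict=$(count_matches '3\.99')

# Operators combine smaller expressions: ^ anchor, ( ) grouping, | alternation.
alt=$(count_matches '^(import|price)')

echo "plain=$plain any_run=$any_run loose=$loose strict=$strict alt=$alt"
```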

  5. egrep and find
     [:-) ~/DataMine] pwd
     /home/kutztown.edu/parson/DataMine
     [:-) ~/DataMine] egrep -li 'csv.*writer' $(find . -name '*.py')
     ./CSC523Example2/CSC523Example2.py
     ./CSC523Example2/bak/CSC523Example2.py
     ./csc523F20TCPUDP/csc523F20TCPUDP.py
     ./csc458ensemble5sp2021/csc458ensemble5sp2021.py
     ./csc458ensemble5sp2021/parallel/csc458ParallelEnsemble5sp2021.py
     [:-) ~/DataMine] egrep -i 'csv.*writer' ./CSC523Example2/bak/CSC523Example2.py
     csvwriter = csv.writer(csvf, delimiter=',', quotechar='"')
     csvwriter.writerow(scolumnnames)
     csvwriter.writerows(sdataset)
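
The unquoted $(find ...) on slide 5 is split by the shell on whitespace, so it works as long as no path contains a space. One possible variant, shown here as a sketch on a made-up directory tree rather than the author's ~/DataMine, lets find pass the file names to grep itself:

```shell
#!/usr/bin/env bash
export LC_ALL=C
workdir=$(mktemp -d)                 # hypothetical scratch tree, not ~/DataMine
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/proj a" "$workdir/projb"
printf '%s\n' 'import csv' 'w = csv.writer(f)' > "$workdir/proj a/one.py"
printf '%s\n' 'print("hello")'                 > "$workdir/projb/two.py"
printf '%s\n' 'W = CSV_Writer()'               > "$workdir/projb/three.py"

# -exec ... {} + hands the file names to grep directly, so the shell never
# splits them on spaces; -l lists matching files, -i ignores case.
matches=$(cd "$workdir" && find . -name '*.py' -exec grep -E -li 'csv.*writer' {} + | sort)
printf '%s\n' "$matches"
```

The directory name containing a space is there on purpose: with the command-substitution form it would reach egrep as two separate arguments.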

  6. find, head, and tail
     [:-) ~/DataMine] for f in $(find . -name '*csc458**csv' -type f); do echo ; echo $f ; read j ; head -4 $f; done
     ./csc458ensemble5sp2021/csc458ensemble5sp2021.summary.csv
     testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
     ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,1.7,0.001378
     OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,2.14,0.001218
     NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,3.99,0.00103
     ./csc458ensemble5sp2021/csc458ensemble5sp2021.summary.ref.csv
     testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
     ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,1.7,0.001378
     OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,2.14,0.001218
     NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,3.99,0.00103

  7. find, head, and tail
     ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.csv
     testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
     ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,9.0,6.35
     OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,10.83,8.84
     NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,17.7,13.07
     ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv
     testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
     ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,8.76,6.17
     OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,9.99,7.55
     NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,16.66,13.14
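
The loop from slides 6 and 7 prints a blank line, the file name, and the first lines of every matching file, pausing at read j between files. A non-interactive sketch of the same idea follows; the directory and file names are invented, the pause is left out, and the file names are read line by line so that a space in a path does not split it.

```shell
#!/usr/bin/env bash
export LC_ALL=C
workdir=$(mktemp -d)                 # hypothetical scratch tree
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/run1" "$workdir/run 2"
printf '%s\n' 'testkey,kappa' 'ZeroR,0.0' 'OneR,0.3975' > "$workdir/run1/demo.summary.csv"
printf '%s\n' 'testkey,kappa' 'ZeroR,0.0'               > "$workdir/run 2/demo.summary.csv"

# find lists one path per line; while read consumes them one at a time.
report=$(
    cd "$workdir" || exit 1
    find . -name '*.summary.csv' -type f | sort |
    while IFS= read -r f; do
        echo                 # blank separator line, as in the slide
        echo "$f"            # which file the following lines come from
        head -n 2 "$f"       # header plus the first data row
    done
)
printf '%s\n' "$report"
```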

  8. find, head, and tail
     [:-) ~/DataMine] head -4 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv
     testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
     ZeroR,10FoldCrossValidation,0.0,0.1522,0.2759,49066,8.76,6.17
     OneR,10FoldCrossValidation,0.3975,0.089,0.2983,49066,9.99,7.55
     NaiveBayes,10FoldCrossValidation,0.4138,0.1192,0.2438,49066,16.66,13.14
     [:-) ~/DataMine] head -4 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv |cut -d, -f2-4,8
     testdatatype,kappa,MAE,cputime
     10FoldCrossValidation,0.0,0.1522,6.17
     10FoldCrossValidation,0.3975,0.089,7.55
     10FoldCrossValidation,0.4138,0.1192,13.14

  9. head, cut, and sort
     [:-) ~/DataMine] head -4 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv |cut -d, -f2-4,8 |sort -t, -k3
     10FoldCrossValidation,0.3975,0.089,7.55
     10FoldCrossValidation,0.4138,0.1192,13.14
     10FoldCrossValidation,0.0,0.1522,6.17
     testdatatype,kappa,MAE,cputime
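
On slide 9, sort -t, -k3 compares text from the third field to the end of the line, which happens to give the numeric order because every MAE value starts with "0.". With values of different widths the two orders diverge. The sketch below uses the runtime values from slide 7 in a made-up two-column layout to contrast a text key with a numeric key restricted to one field.

```shell
#!/usr/bin/env bash
export LC_ALL=C                      # byte-order comparison, "." as decimal point
# Runtime values from slide 7, in a hypothetical two-column layout.
rows=$(printf '%s\n' 'OneR,10.83' 'NaiveBayes,17.7' 'ZeroR,9.0')

# -k2 alone: text comparison from field 2 to end of line, so "10.83" < "9.0".
lexical=$(printf '%s\n' "$rows" | sort -t, -k2 | cut -d, -f1 | tr '\n' ' ')

# -k2,2n: the key is field 2 only, compared as a number.
numeric=$(printf '%s\n' "$rows" | sort -t, -k2,2n | cut -d, -f1 | tr '\n' ' ')

echo "text order:    $lexical"
echo "numeric order: $numeric"
```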

  10. head, egrep, cut, and sort
      [:-) ~/DataMine] head -1 ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv ; egrep -v kappa ./csc458ensemble5sp2021/parallel/csc458ensemble5sp2021.summary.ref.csv | head -3 | cut -d, -f2-4,8 |sort -t, -k3
      testkey,testdatatype,kappa,MAE,RMSE,Instances,runtime,cputime
      10FoldCrossValidation,0.3975,0.089,7.55
      10FoldCrossValidation,0.4138,0.1192,13.14
      10FoldCrossValidation,0.0,0.1522,6.17
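
Slide 10 keeps the header on top by printing it separately and sorting only the data rows; note that its header line still has all eight columns while the rows have four. An alternative sketch, assuming a hypothetical three-column file built from the slide 8 values, selects the header with head and the body with tail -n +2, so both pass through the same columns:

```shell
#!/usr/bin/env bash
export LC_ALL=C
workdir=$(mktemp -d)                 # hypothetical scratch directory
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' 'testkey,MAE,cputime' \
              'ZeroR,0.1522,6.17' \
              'OneR,0.089,7.55' \
              'NaiveBayes,0.1192,13.14' > "$workdir/summary.csv"

# Header first, untouched; then every line from the second onward,
# sorted numerically on the MAE field only.
sorted=$(
    head -n 1 "$workdir/summary.csv"
    tail -n +2 "$workdir/summary.csv" | sort -t, -k2,2n
)
printf '%s\n' "$sorted"
```

Selecting the body by position rather than with egrep -v kappa avoids dropping a data row that happens to contain the header word.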
