Master UNIX Text Processing Commands for Efficient Data Manipulation

Discover the power of UNIX text processing commands and streamline your data manipulation tasks efficiently. Learn about essential commands like awk, sed, grep, and more, and see how UNIX commands outperform traditional programming languages in file sorting and text processing. Take advantage of free resources provided by Stanford University to enhance your UNIX skills.

Presentation Transcript


  1. Introduction to UNIX Text Processing. Bo Yoo, January 20, 2021

  2. Announcements
     Homework 1 released: cs273a.Stanford.edu (schedule section). Read the instructions carefully! Due 11:59PM PST 02/01/2021.
     TA office hours: Tuesdays 12PM-2PM PST starting next week (1/26). For this week only, we will have one on Friday 4PM-6PM PST (1/22).
     Meet via Zoom (link on our course website).

  3. Stanford UNIX resources
     Host: cardinal.stanford.edu
     To connect from Unix/Linux/Mac, open a terminal and run (use your SUID as user):
         ssh user@myth.stanford.edu
         ssh user@cardinal.stanford.edu
     To connect from Windows, download a terminal emulator: PuTTY (http://goo.gl/s0itD)

  4. Stanford Resources
     There are many books, web tutorials, and videos on the material we will talk about in the tutorial today (e.g., UNIX, Python, Awk).
     Stanford gives you free access to many of these resources.

  5. Stanford Resources
     General Stanford search: https://searchworks.stanford.edu/
     All kinds of e-resources: https://library.stanford.edu/science/collections/mathematics-and-statistics-collection/ebooks-and-digital-projects
     Safari! https://searchworks.stanford.edu/view/4797413
     You can even ask a specialized librarian for the best resources Stanford currently has access to: https://library.stanford.edu/subjects/computer-science

  6. Huge suite of tools

  7. Many useful text processing UNIX commands
     awk, bzcat, cat, column, cut, grep, head, join, sed, sort, tail, tee, tr, uniq, wc, zcat
     UNIX commands work together via text streams.
     Example usage and other commands are available at:
         http://tldp.org/LDP/abs/html/textproc.html
         http://en.wikipedia.org/wiki/Cat_%28Unix%29#Other

  8. Knowing UNIX commands eliminates having to reinvent the wheel
     In the past homework, to perform a simple file sort, submissions used:
         35 lines of Python
         19 lines of Perl
         73 lines of Java
         1 line of UNIX commands

  9. Anatomy of a UNIX command
         command [options] [FILE1] [FILE2]
     Options can be combined: -n 1 -g -c is equivalent to -n1 -gc
     Output is directed to standard output (stdout).
     If no input file is specified, input comes from standard input (stdin).
     To view the usage: command --help

  10. The real power of UNIX commands comes from combinations through piping ( | )
      Pipes are used to pass the output of one program (stdout) as the input (stdin) to another.
      The pipe character is <Shift>-\
          grep CS273a grades.txt | sort -k 2,2gr | uniq
      grep CS273a grades.txt: find all lines in the file that have CS273a in them somewhere.
      sort -k 2,2gr: sort those lines by the second column, in numerical order, highest to lowest.
      uniq: remove duplicates and print to standard output.
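
To make the pipeline on this slide concrete, here is a minimal sketch run against a made-up grades.txt. The file contents and the column layout (course in column 1, score in column 2) are assumptions for illustration, not data from the course.

```shell
# Hypothetical grades.txt: column 1 is a course, column 2 is a score (made-up data).
tmp=$(mktemp -d)
printf 'CS273a\t88\nCS229\t95\nCS273a\t92\nCS273a\t88\n' > "$tmp/grades.txt"

# Keep the CS273a lines, sort them by column 2 numerically from high to low,
# then drop adjacent duplicate lines.
result=$(grep CS273a "$tmp/grades.txt" | sort -k 2,2gr | uniq)
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because the sort puts identical lines next to each other, uniq is able to collapse the repeated score of 88 into one line.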

  11. Dual Piping
      Pass the output of two commands to another command:
          sort <(cat file1) <(cat file2)
          join -1 1 -2 4 <(sort -k1 file1) <(sort -k4 file2)
      This is particularly useful for the join and comm commands.

  12. Output redirection (>, >>)
      Instead of writing everything to standard output, we can write (>) or append (>>) to a file:
          grep CS273a allClasses.txt > CS273aInfo.txt
          cat addlInfo.txt >> CS273aInfo.txt

  13. UCSC KENT SOURCE UTILITIES
      http://genomewiki.ucsc.edu/index.php/Kent_source_utilities

  14. /afs/ir/class/cs273a/bin/
      Many C programs in this directory manipulate sequences or chromosome ranges.
      Run a program with no arguments to see its help message.
          overlapSelect [OPTION] selectFile inFile outFile
      The output (outFile) is all inFile elements that overlap any selectFile elements.
      Many useful options alter how overlaps are computed.

  15. Kent Source and MySQL
      Linux + Mac binaries: http://hgdownload.soe.ucsc.edu/admin/exe/
      Using MySQL on the browser: http://genome.ucsc.edu/goldenPath/help/mysql.html

  16. Interacting with UCSC Genome Browser MySQL Tables
      Command line:
          mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne "<STMT>"
      e.g.
          mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne \
              "select count(*) from hg18.knownGene;"
          +-------+
          | 66803 |
          +-------+
      http://dev.mysql.com/doc/refman/8.0/en/tutorial.html

  17. BEDTOOLS
      https://bedtools.readthedocs.io/en/latest/index.html

  18. Bedtools
      Bedtools are useful for performing a wide range of genome analysis tasks.
      They take in multiple commonly used genomic file formats (e.g. BED, BAM, VCF).
      You can run bedtools on any Stanford machine.
      Command line:
          bedtools <subcommand> [options/input files]

  19. BED Files
      Browser Extensible Data (BED) is a tab-delimited text file format that stores genomic regions.
      It has 3 required fields: chromosome, start, and end (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
      Start is the starting position and is 0-based (i.e., the first base in a chromosome is 0).
      End is the ending position of the region; this base is not included in the region.

  20. BED Files
      The first 10 bases of chromosome 1:
          chr1 0 10
      This spans bases 0-9 (indices 0 through 9, inclusive).
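
Because the end coordinate is excluded, the length of a BED region is simply end minus start. A minimal sketch using awk on made-up regions follows; the file name regions.bed and its second row are illustrative assumptions.

```shell
tmp=$(mktemp -d)
# Hypothetical BED file: chromosome, 0-based start, exclusive end (tab-delimited).
printf 'chr1\t0\t10\nchr2\t100\t250\n' > "$tmp/regions.bed"

# Start is 0-based and end is not included, so length = end - start.
lengths=$(awk -F'\t' '{ print $1, $3 - $2 }' "$tmp/regions.bed")
printf '%s\n' "$lengths"

rm -rf "$tmp"
```

The first row reports a length of 10, matching the "first 10 bases" reading of chr1 0 10.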

  21. Bedtools intersect
      Returns intersecting regions in two or more BED files.
      TIP: bedtools sometimes has a hard time recognizing a file as a BED file if there are extra fields, so trim to the first 3 columns.
      Command line:
          bedtools intersect [options] -a file -b file1,file2,...

  22. Bedtools merge
      Combines overlapping or close-enough regions into one.
      Command line:
          bedtools merge [options] -i file

  23. SPECIFIC UNIX COMMANDS

  24. man, whatis, apropos
      man is the UNIX program that invokes the manual written for a particular program.
          man sort       Shows all info about the program sort. Hit <space> to scroll down, q to exit.
          whatis sort    Shows a short description of all programs that have sort in their names.
          apropos sort   Shows all programs that have sort in their names or short descriptions.

  25. cat
      Concatenates files and prints them to standard output.
          cat [OPTION] [FILE]
      For example, a file containing A B C D followed by a file containing 1 2 3 is printed as A B C D 1 2 3.
      Variants for compressed input files: zcat (.gz files), bzcat (.bz2 files)
      These are also useful if your zipped file is very large and you want to check its content quickly:
          zcat file1.gz | less (or | head)

  26. head, tail
      head: first ten lines
      tail: last ten lines
      -n option: number of lines. For tail, -n+K means line K to the end.
          head -n5                  first five lines
          tail -n3                  last 3 lines
          tail -n+2 | head -n 2     lines 2-3
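
A short sketch of head and tail, using seq to generate ten numbered lines in place of a real file. The variable names are only for illustration.

```shell
# Ten numbered lines stand in for a real file.
first=$(seq 1 10 | head -n 3)                 # first three lines
last=$(seq 1 10 | tail -n 2)                  # last two lines
middle=$(seq 1 10 | tail -n +4 | head -n 3)   # start at line 4, keep three lines
printf '%s\n' "$first" "$last" "$middle"
```

Combining tail -n +K with head -n N selects N lines starting at line K, which is a common way to pull a slice out of the middle of a file.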

  27. cut
      Prints selected parts of lines from each file to standard output.
          cut [OPTION] [FILE]
      -d   Choose the delimiter between columns (default TAB)
      -f   Fields to print
           -f1,7            fields 1 and 7
           -f1-4,7,11-13    fields 1,2,3,4,7,11,12,13

  28. cut example
      file.txt (line 1 is tab-delimited, line 2 is dot-delimited, line 3 has a tab and then a space):
          CS 273 a
          CS.273.a
          CS 273 a
      cut -f1,3 file.txt
          CS a
          CS.273.a
          CS
      cut -d . -f1,3 file.txt
          CS 273 a
          CS.a
          CS 273 a
      In general, you should make sure your file columns are all delimited with the same character(s) before applying cut!
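
The effect of mixed delimiters can be reproduced with a small sketch. The two-line file below is made up for illustration: its first line is tab-delimited and its second is dot-delimited.

```shell
tmp=$(mktemp -d)
# Made-up file: line 1 uses tabs, line 2 uses dots.
printf 'CS\t273\ta\nCS.273.a\n' > "$tmp/file.txt"

by_tab=$(cut -f1,3 "$tmp/file.txt")       # default delimiter is TAB
by_dot=$(cut -d. -f1,3 "$tmp/file.txt")   # delimiter set to "."
printf '%s\n' "$by_tab" "$by_dot"

rm -rf "$tmp"
```

In each run, the line that does not contain the chosen delimiter is printed unchanged, which is why inconsistent delimiters silently produce mixed output.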

  29. wc
      Prints line, word, and character (byte) counts for each file, and totals of each if more than one file is specified.
          wc [OPTION] [FILE]
      -l   Print only line counts

  30. sort
      Sorts the lines of a file. Fields are separated by whitespace by default; use -t to set another delimiter (e.g. a tab).
      -k m,n   sorts by columns m to n (1-based)
      -g       sorts by general numerical value (can handle scientific format)
      -r       sorts in descending order
          sort -k1,1gr -k2,3
      Sort on field 1 numerically (high to low because of r). Break ties on field 2 alphabetically. Break further ties on field 3 alphabetically.
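
A minimal sketch of the multi-key sort described on this slide. The three-line input is made up, and LC_ALL=C is set here only so that the alphabetical comparison is the same on every machine.

```shell
tmp=$(mktemp -d)
# Made-up data: a number in field 1, letters in fields 2 and 3.
printf '3 b x\n10 a y\n3 a z\n' > "$tmp/data.txt"

# Field 1 numerically from high to low, then ties broken on fields 2-3 alphabetically.
sorted=$(LC_ALL=C sort -k1,1gr -k2,3 "$tmp/data.txt")
printf '%s\n' "$sorted"

rm -rf "$tmp"
```

The two lines starting with 3 tie on the first key, so the second key decides their order.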

  31. uniq
      Discards all but one of successive identical lines from input and prints to standard output.
      -d   Only print duplicate lines
      -i   Ignore case in comparison
      -u   Only print unique lines

  32. uniq example
      file.txt:
          CS 273a
          CS 273a
          TA: Bo Yoo
          CS 273a
      uniq file.txt
          CS 273a
          TA: Bo Yoo
          CS 273a
      uniq -u file.txt
          TA: Bo Yoo
          CS 273a
      uniq -d file.txt
          CS 273a
      In general, you probably want to make sure your file is sorted before applying uniq!
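
The sketch below shows why sorting first matters, using a small made-up file in which the same line appears both adjacently and again later. LC_ALL=C is an assumption added to make the sort order predictable.

```shell
tmp=$(mktemp -d)
# Made-up file: "CS 273a" appears twice in a row and once more after another line.
printf 'CS 273a\nCS 273a\nTA: Bo Yoo\nCS 273a\n' > "$tmp/file.txt"

plain=$(uniq "$tmp/file.txt")                          # only adjacent repeats collapse
unique_only=$(uniq -u "$tmp/file.txt")                 # lines with no adjacent repeat
dup_only=$(uniq -d "$tmp/file.txt")                    # one copy of each adjacent repeat
sorted_first=$(LC_ALL=C sort "$tmp/file.txt" | uniq)   # every distinct line once
printf '%s\n' "$plain" "$unique_only" "$dup_only" "$sorted_first"

rm -rf "$tmp"
```

Without the sort, the later copy of the repeated line survives because uniq only compares each line with its immediate neighbor.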

  33. grep
      Searches for lines that contain a word or match a regular expression.
          grep [options] PATTERN [FILE]
      -i          Ignore case
      -v          Output lines that do not match
      -f <FILE>   Read patterns from a file (1 per line)
      -E          Extended regex grep (= egrep)
      -o          Print only the matching part of the line

  34. grep example
          grep -E '^CS[[:space:]]+273$' file
      Search through file for lines that start with CS, then have one or more spaces (or tabs), and end with 273.
      Of the lines CS 273a, CS273, CS 273, and cs 273, only CS 273 matches.
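
A small sketch of this pattern against made-up input lines. The file contents are assumptions chosen to exercise each part of the expression.

```shell
tmp=$(mktemp -d)
# Made-up lines: trailing "a", no space, one space, lower case, and a tab.
printf 'CS 273a\nCS273\nCS 273\ncs 273\nCS\t273\n' > "$tmp/file"

# Lines that start with CS, then one or more spaces or tabs, then end with 273.
matches=$(grep -E '^CS[[:space:]]+273$' "$tmp/file")
count=$(grep -c -E '^CS[[:space:]]+273$' "$tmp/file")
count_nocase=$(grep -c -i -E '^CS[[:space:]]+273$' "$tmp/file")
printf '%s\n' "$matches" "$count" "$count_nocase"

rm -rf "$tmp"
```

Quoting the pattern keeps the shell from interpreting characters such as $ and [ before grep sees them; adding -i also picks up the lower-case line.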

  35. sed: stream editor
      The most common use is a string replace.
          sed -e 's/SEARCH/REPLACE/g'
      Example, where file.txt contains "This is an Example.":
          cat file.txt | sed -e 's/is/EEE/g'
          ThEEE EEE an Example.
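
The role of the trailing g flag can be seen in a two-line sketch. The input string is the one from the slide; the variable names are illustrative.

```shell
# With g, every occurrence on the line is replaced.
replaced=$(printf 'This is an Example.\n' | sed -e 's/is/EEE/g')
# Without g, only the first occurrence on each line is replaced.
first_only=$(printf 'This is an Example.\n' | sed -e 's/is/EEE/')
printf '%s\n' "$replaced" "$first_only"
```

Note that the replacement is purely textual, so the "is" inside "This" is rewritten as well.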

  36. join
      Joins lines of two files on a common field.
          join [OPTION] FILE1 FILE2
      -1   Specify which column of FILE1 to join on
      -2   Specify which column of FILE2 to join on
      Important: FILE1 and FILE2 must already be sorted on their join fields!

  37. join example
      file1.txt:
          Bejerano CS273a
          Villeneuve DB210
          Batzoglou DB273a
      file2.txt:
          CS229 Machine Learning
          CS273a Comp Tour Hum Gen.
          DB210 Devel. Biol.
      join -1 2 -2 1 file1.txt file2.txt
          CS273a Bejerano Comp Tour Hum Gen.
          DB210 Villeneuve Devel. Biol.
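
A runnable sketch of this join, with both inputs written out already sorted on their join fields. The row order of file2.txt and the use of LC_ALL=C are assumptions made so that the sortedness requirement is met.

```shell
tmp=$(mktemp -d)
# file1 is sorted on column 2, file2 on column 1 (in the C locale).
printf 'Bejerano CS273a\nVilleneuve DB210\nBatzoglou DB273a\n' > "$tmp/file1.txt"
printf 'CS229 Machine Learning\nCS273a Comp Tour Hum Gen.\nDB210 Devel. Biol.\n' > "$tmp/file2.txt"

# Join column 2 of file1 with column 1 of file2; unmatched lines are dropped.
joined=$(LC_ALL=C join -1 2 -2 1 "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

The join field is printed first, followed by the remaining fields of file1 and then those of file2. CS229 and DB273a have no partner in the other file, so they do not appear.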

  38. SHELL SCRIPTING

  39. Common shells
      Two common shells: bash and tcsh. Run ps to see which you are using.

  40. Multiple UNIX commands can be combined into a single shell script.
      script.sh:
          #!/bin/bash
          cat "$1" "$2" > tmp.txt
          paste tmp.txt "$3" > "$4"
      Command prompt:
          % ./script.sh file1.txt file2.txt file3.txt out.txt
          % sh script.sh file1.txt file2.txt file3.txt out.txt
      Set scripts to be executable:
          % chmod u+x script.sh
      http://www.faqs.org/docs/bashman/bashref_toc.html

  41. for loop
          # BASH for loop to print 1,2,3 on separate lines
          for i in `seq 1 3`
          do
              echo ${i}
          done
      The backtick (`) is a special quote character, usually left of 1 on the keyboard, that indicates we should execute the command within the quotes.
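
As a side note to the backtick syntax, the $(...) form performs the same command substitution and is often easier to read and nest. The sketch below compares the two; the variable names are illustrative.

```shell
# Both forms run seq and capture its output.
with_backticks=`seq 1 3`
with_dollar=$(seq 1 3)

# Loop over the captured words, one per iteration.
for i in $with_dollar
do
    echo "line ${i}"
done
```

Either form can be used in the for loop; the choice is mostly a matter of style.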

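  42. Making the script executable
      You need to make the script executable and run it directly to use dual piping:
          $ sh script.sh          (runs the script with sh, which may not support dual piping)
          $ chmod u+x script.sh
          $ ./script.sh           (runs the script with the shell named on its #! line, e.g. bash)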

  43. SCRIPTING LANGUAGES

  44. awk
      A quick-and-easy shell scripting language
      https://www.gnu.org/software/gawk/manual/gawk.html
      Treats each line of a file as a record, and splits fields by whitespace.
      Fields are referenced as $1, $2, $3 ($0 is the entire line).

  45. Anatomy of an awk script
          awk 'BEGIN { ... } { ... } END { ... }'
      BEGIN { ... }   runs before the first line
      { ... }         runs once per line
      END { ... }     runs after the last line

  46. awk example
      Output the lines where column 3 is less than column 5 in a comma-delimited file. Output a summary line at the end.
          awk -F',' 'BEGIN{ct=0;} { if ($3 < $5) { print $0; ct=ct+1; } } END { print "TOTAL LINES: " ct; }'
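
The same awk program can be tried on a small made-up comma-delimited file. The file name data.csv and its contents are assumptions for illustration.

```shell
tmp=$(mktemp -d)
# Made-up data: column 3 is smaller than column 5 on lines 1 and 3 only.
printf 'a,b,1,c,5\nd,e,9,f,2\ng,h,3,i,4\n' > "$tmp/data.csv"

# BEGIN runs once before the input, the middle block once per line, END once after.
report=$(awk -F',' 'BEGIN { ct = 0 }
    { if ($3 < $5) { print $0; ct = ct + 1 } }
    END { print "TOTAL LINES: " ct }' "$tmp/data.csv")
printf '%s\n' "$report"

rm -rf "$tmp"
```

Because the compared fields look like numbers, awk compares them numerically rather than as strings.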

  47. Useful things from awk
      Make sure fields are delimited with tabs (to be used by cut, sort, join, etc.):
          awk '{print $1 "\t" $2 "\t" $3}' whiteDelim.txt > tabDelim.txt
      Good string processing using the substr, index, and length functions:
          awk '{print substr($1, 1, 10)}' longNames.txt > shortNames.txt
      substr(string to manipulate, start position, length)
          substr("helloworld", 4, 3) = "low"
          index("helloworld", "low") = 4
          length("helloworld") = 10
          index("helloworld", "notpresent") = 0
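
These awk idioms can be checked directly from the command line. The sketch below uses a BEGIN-only program for the string functions and a made-up whitespace-delimited line for the tab conversion.

```shell
# String functions: positions in awk are 1-based, and index returns 0 when nothing is found.
strings=$(awk 'BEGIN {
    s = "helloworld"
    print substr(s, 4, 3), index(s, "low"), length(s), index(s, "notpresent")
}')

# Re-delimit irregular whitespace as single tabs.
tabbed=$(printf 'a  b   c\n' | awk '{ print $1 "\t" $2 "\t" $3 }')
printf '%s\n' "$strings" "$tabbed"
```

Re-delimiting with tabs first makes the output safe to feed into cut, which uses TAB as its default delimiter.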

  48. Python
      A scripting language with many useful constructs
      http://wiki.python.org/moin/BeginnersGuide
      http://docs.python.org/tutorial/index.html
      Call a python program from the command line:
          python myProg.py

  49. Number types
      Numbers: int, float
          >>> f = 4.7
          >>> i = int(f)
          >>> j = round(f)
          >>> i
          4
          >>> j
          5.0
          >>> i*j
          20.0
          >>> 2**i
          16
      (Output shown is from Python 2. In Python 3, round(f) returns the int 5, so i*j is 20.)

  50. Strings
          >>> dir("")
          [..., 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
          >>> s = "hi how are you?"
          >>> len(s)
          15
          >>> s[5:10]
          'w are'
          >>> s.find("how")
          3
          >>> s.find("CS273")
          -1
          >>> s.split(" ")
          ['hi', 'how', 'are', 'you?']
          >>> s.startswith("hi")
          True
          >>> s.replace("hi", "hey buddy,")
          'hey buddy, how are you?'
          >>> "   extraBlanks   ".strip()
          'extraBlanks'
