Frequently Used Awk Commands

For fast text stream manipulations

Posted by Yuan on September 16, 2022

The subtlety of application lies in the mind.

Introduction

Awk is a program that you can use to select particular records in a file and perform operations upon them. It is installed by default on most Linux distributions and on macOS.

Awk is an interpreted programming language with a simple paradigm: find a pattern in the input, then perform an action. This often reduces complex or tedious data manipulation to a few lines of code. Since it is both easy and powerful, why not use one or a few lines of awk to avoid tedious opening, searching, matching, and editing of files?

These are my own notes and records, drawn mainly from Dougspeed and The GNU Awk User’s Guide. Awk is far more powerful than the examples listed in this blog suggest.

Basic composition of an awk program: pattern-action rules

When you run awk, you specify an awk program that tells awk what to do. The program consists of a series of rules (it may also contain user-defined functions). Each rule specifies one pattern to search for and one action to perform when the pattern is found. If a rule has no action, awk uses the default action {print} (equivalent to {print $0}); if a rule has no pattern, its action runs for every record.
Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in braces to separate it from the pattern. Newlines usually separate rules. Therefore, an awk program looks like this:

pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
pattern4 { action4 }
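
As a minimal sketch of the layout above (the sample names and numbers are made up for illustration), the following program has two rules. Every input line is tested against each rule in order, and a line can trigger more than one rule:

```shell
# Rule 1: print the name when the second field is greater than 28.
# Rule 2: print a message when the line starts with "bob".
out=$(printf 'alice 30\nbob 25\ncarol 41\n' | awk '
$2 > 28 { print $1 }
/^bob/  { print "found bob" }
')
printf '%s\n' "$out"
```

This prints alice, found bob, and carol, one per line.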

Running awk

No input files

With no input files, awk applies the program to the standard input, which usually means whatever you type on the keyboard. This continues until you signal end-of-file by typing Ctrl-d.

#No input files
awk 'program'

Input with files

# Input with single file
awk 'program' input-file1
awk <input-file1 'program'
zcat xxx.gz | awk 'program'

# Input with multiple files
# The '-' argument tells awk to read its standard input (here, the stream piped in by '|') at that position in the file list
awk 'program' input-file1 input-file2 …
zcat xxx.gz | awk 'program' input-file2 -
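
A small sketch of mixing a regular file with piped input through the '-' argument. The temporary file and its contents are placeholders created only for this demonstration:

```shell
# Create a throwaway input file (hypothetical content).
tmp=$(mktemp)
printf 'from-file\n' > "$tmp"

# awk reads "$tmp" first, then standard input where '-' appears.
# NR keeps counting across both inputs.
out=$(printf 'from-stdin\n' | awk '{ print NR ": " $0 }' "$tmp" -)
printf '%s\n' "$out"

rm -f "$tmp"
```

The file is read first because it comes first on the command line; swapping the two arguments swaps the order.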

Read ‘program’ from a script

# When the program is long, save it in a file and pass that file with -f
awk -f program-file input-file1 input-file2 …
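
A minimal sketch of the -f form. The program file is written to a temporary path here, and its single rule is an illustrative assumption rather than anything from a real project:

```shell
# Write a tiny awk program to a temporary file.
prog=$(mktemp)
cat > "$prog" <<'EOF'
# Print the second field of every line whose first field is "b".
$1 == "b" { print $2 }
EOF

# Run the stored program against some sample input.
out=$(printf 'a 1\nb 2\n' | awk -f "$prog")
printf '%s\n' "$out"

rm -f "$prog"
```

Keeping the program in a file also avoids shell quoting problems, since the awk code no longer has to sit inside single quotes.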

System Variables and Functions

System variables in awk

Awk defines a number of built-in variables that can be referenced or reset inside a program.

| Variable Name | Definition | Example | Meaning |
| --- | --- | --- | --- |
| FILENAME | Name of the current input file | awk 'NR==1{print FILENAME}' input-file | Print the file name |
| FS | Input field separator (whitespace by default) | awk 'BEGIN{FS=","}NR==1' xxx.csv | Set the separator to "," and print the first line |
| NR | Total number of lines/records read so far | awk '1;NR == 11{exit}' inputfile | Print the first 11 lines |
| FNR | Record number (typically the line number) within the current file | awk '{if(NR==FNR){arr[$1];next}}($1 in arr){print $1}' file1 file2 | First store column 1 of the first file as keys of arr, then print column 1 of the second file whenever it is in arr |
| NF | Number of fields in the current record | awk NF | Delete all blank lines from a file: if NF==0, print nothing; otherwise print $0 |
| OFS | Output field separator | awk 'BEGIN{FS=",";OFS=";"}NR==4, NR==8 {print NR,$1,$2}' sample_summary.csv | Print the line number, $1, and $2 for lines 4 to 8, separated by ";" |
| RS | Input record separator | awk 'BEGIN{RS="\n"};1;NR==2{exit}' file1 | Set the record separator to "\n" and print the first 2 records |
| ORS | Output record separator | awk 'BEGIN{ORS="\n\n"};1;NR==4{exit}' file1 | Set the output record separator to "\n\n" and print the first 4 records |
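
To make the interplay of these variables concrete, here is a small sketch on made-up comma-separated input. It sets FS and OFS in a BEGIN block and prints NR, NF, and the first field of each record:

```shell
# FS splits input on commas; OFS joins the printed values with semicolons.
out=$(printf 'a,b,c\nd,e\n' | awk 'BEGIN { FS = ","; OFS = ";" } { print NR, NF, $1 }')
printf '%s\n' "$out"
```

The first record has three fields and the second has two, so the output is 1;3;a followed by 2;2;d.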

System functions in awk

Awk also provides a number of built-in numeric and string functions that can be called inside a program.

| Function Name | Definition | Example | Meaning |
| --- | --- | --- | --- |
| int(num) | Integer part of num, truncated toward zero | awk 'BEGIN{print int(3.534);print int(4);print int(-5.223);print int(-5);}' | Prints 3, 4, -5, -5 |
| log(num) | Natural logarithm (base e) of num | awk 'BEGIN{print log(3.534);print log(4);print log(0);print log(-5);print log(-1);}' | Returns -inf for zero and nan for a negative number |
| exp(x) | Exponential of x (e ^ x); reports an error if x is out of range | awk 'BEGIN{print exp(2.1)}' | Get e^2.1 |
| sqrt(num) | Positive square root of num | awk 'BEGIN{print sqrt(16);print sqrt(1.21);print sqrt(0);print sqrt(-12);}' | Returns nan for a negative argument |
| sin(num) | Sine of num, with num in radians | awk 'BEGIN{print sin(-60);print sin(90);print sin(45);}' | Get sin(num) |
| cos(num) | Cosine of num, with num in radians | awk 'BEGIN{print cos(-60);print cos(90);print cos(45);}' | Get cos(num) |
| length(string) | Length of string in characters | awk 'BEGIN{print length("AAAA BBB\tCC")}' | Spaces and tabs count toward the length (11 here) |
| substr(s, p, n) | Substring of s starting at position p, at most n characters long | awk 'BEGIN{print substr("example aa bb cc", 4)}' | If n is not supplied, the rest of the string from p is used |
| index(str1, str2) | Position in str1 where the first occurrence of str2 begins, or 0 if str2 does not occur | awk 'BEGIN{print index("Graphic", "ph"); print index("University", "abc")}' | String indices in awk start from 1; prints 4 and 0 |
| tolower(s) | Copy of s with all uppercase characters converted to lowercase | awk 'BEGIN{print tolower("GEEKSFORGEEKS")}' | Prints geeksforgeeks |
| toupper(s) | Copy of s with all lowercase characters converted to uppercase | awk 'BEGIN{print toupper("geeksforgeeks")}' | Prints GEEKSFORGEEKS |
| split(string, array, fieldsep) | Divides string into pieces separated by fieldsep, stores the pieces in array, and returns the number of pieces | awk 'BEGIN{string="Hi world Hi AZ"; fieldsep=" "; n=split(string, array, fieldsep); for(i=1; i<=n; i++){printf("%s\n", array[i]);}}' | Split the string, store the pieces in array, get n, and print each piece |
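
The following sketch combines several of the string functions in one BEGIN block so the return values can be compared side by side. The sample string is reused from the substr example; the variable names s, parts, and n are arbitrary:

```shell
out=$(awk 'BEGIN {
    s = "example aa bb cc"
    print length(s)              # number of characters, spaces included
    print substr(s, 4, 4)        # 4 characters starting at position 4
    print index(s, "aa")         # 1-based position of the first "aa"
    n = split(s, parts, " ")     # n is the number of pieces
    print n, toupper(parts[1])
    print int(-5.223)            # truncation toward zero
}')
printf '%s\n' "$out"
```

The output is 16, mple, 9, 4 EXAMPLE, and -5, one per line.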

Regular Expressions in awk

Awk matches records against extended regular expressions written between slashes. A rule that consists only of a regex pattern prints every line that matches it.

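# . matches any single character: prints fun, fin, fan
echo -e "cat\nbat\nfun\nfin\nfan" | awk '/f.n/'
# ^ anchors the match at the start of the line: prints There, Their
echo -e "This\nThat\nThere\nTheir\nthese" | awk '/^The/'
# $ anchors the match at the end of the line: prints fun, fin, fan
echo -e "knife\nknow\nfun\nfin\nfan\nnine" | awk '/n$/'
# [CT] matches one character from the set: prints Call, Tall
echo -e "Call\nTall\nBall" | awk '/[CT]all/'
# [^CT] matches one character NOT in the set: prints Ball
echo -e "Call\nTall\nBall" | awk '/[^CT]all/'
# | is alternation: prints Call, Ball
echo -e "Call\nTall\nBall\nSmall\nShall" | awk '/Call|Ball/'
# ? makes the preceding character optional: prints Colour, Color
echo -e "Colour\nColor" | awk '/Colou?r/'
# * means zero or more of the preceding character: prints ca, cat, catt
echo -e "ca\ncat\ncatt" | awk '/cat*/'
# + means one or more of the preceding character: prints 22, 123, 234, 222
echo -e "111\n22\n123\n234\n456\n222"  | awk '/2+/'
# Grouping: prints Apple Juice, Apple Cake
echo -e "Apple Juice\nApple Pie\nApple Tart\nApple Cake" | awk '/Apple (Juice|Cake)/'

# Substitute: sub() replaces the first match in each record, gsub() replaces all matches.
# Both return the number of replacements, so used as a pattern they print only changed lines.
echo -e "apple apple\npineapple apple" | awk 'sub(/apple/, "nut")'
echo -e "apple apple\npineapple apple" | awk 'gsub(/apple/, "nut")'

# Regex with variables
## awk can match against a variable if you don't use the // regex markers
## build up the required regex as a string, then match with ~
echo -e "apple apple\npineapple apple" | awk 'BEGIN{r = "eapp"} $0~r'
# Using variables from the bash terminal
rex=eapp
echo -e "apple apple\npineapple apple" | awk -v r="$rex" '$0~r'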

Examples by Practice

Find the intersect (overlap) of two files

awk '{if(NR==FNR){arr[$1];next}}($1 in arr){print $1}' file1.txt file2.txt
# Or 
awk '(NR==FNR){arr[$1];next}($1 in arr){print $1}' file1.txt file2.txt
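
A runnable sketch of the intersection idiom on two small throwaway files. The identifiers rs1 to rs4 and the temporary directory are assumptions made only for this demonstration:

```shell
# Build two sample files in a temporary directory.
dir=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$dir/file1.txt"
printf 'rs3\nrs4\nrs1\n' > "$dir/file2.txt"

# While reading the first file NR equals FNR, so its keys are stored and the
# second rule is skipped. For the second file only stored keys are printed.
out=$(awk 'NR == FNR { arr[$1]; next } $1 in arr { print $1 }' "$dir/file1.txt" "$dir/file2.txt")
printf '%s\n' "$out"

rm -rf "$dir"
```

The matches come out in the order of the second file (rs3, then rs1). One caveat: if the first file is empty, NR == FNR also holds for the second file, so nothing is printed.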

Get details of the overlaps

awk '(NR==FNR){arr[$1]=$2;next}($1 in arr){print "SNP:",$1, "P-Value1:",arr[$1], "P-Value2:", $2}' file1.txt file2.txt
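
The same idea, sketched on made-up two-column data (an identifier followed by a value). The sample identifiers and numbers are invented for illustration:

```shell
dir=$(mktemp -d)
printf 'rs1 0.01\nrs2 0.5\n'  > "$dir/file1.txt"
printf 'rs2 0.04\nrs9 0.9\n' > "$dir/file2.txt"

# Store column 2 of the first file under its column 1 key, then report both
# values for every key that also appears in the second file.
out=$(awk 'NR == FNR { arr[$1] = $2; next }
           $1 in arr { print "SNP:", $1, "P-Value1:", arr[$1], "P-Value2:", $2 }' \
      "$dir/file1.txt" "$dir/file2.txt")
printf '%s\n' "$out"

rm -rf "$dir"
```

Only rs2 appears in both files, so a single line is printed with both of its values.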

Remove duplicates

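Remove records whose first column ($1) duplicates an earlier record in file1.txt, keeping only the first occurrence.
Step-by-step interpretation:

  1. Look up seen[$1]; if $1 has not been seen before, the entry is created empty, which counts as 0.
  2. Evaluate !seen[$1] with that current value; if it is true (first occurrence), the default action prints $0.
  3. The post-increment seen[$1]++ then raises the count, so later records with the same $1 are skipped.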
awk '(!seen[$1]++)' file1.txt >file1_rmDuplicate.txt
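
The steps above can be checked on a few made-up records piped in from printf instead of file1.txt:

```shell
# Keys a and b each appear twice; only their first records survive.
out=$(printf 'a 1\nb 2\na 3\nc 4\nb 5\n' | awk '!seen[$1]++')
printf '%s\n' "$out"
```

The original order of the surviving records is preserved, which is an advantage over sort -u when order matters.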

An example of awk script

From Steve

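BEGIN {
    FS = "[. ]"    # split input fields on dots or spaces
    OFS = "."      # join the rebuilt key with dots
}

# First file: remember each record, keyed by its first field
FNR == NR {
    domain[$1] = $0
    next
}

# Later files: only handle records whose second field was seen in the first file
FNR < NR {
    if ($2 in domain) {
        # Rebuild fields 2..NF-1 into one dot-separated key, skipping empty fields
        for ( i = 2; i < NF; i++ ) {
            if ($i != "") {
                line = (line ? line OFS : "") $i
            }
        }
        total[line] += $NF    # accumulate the last field under that key
        line = ""
    }
}

# Print each key and its total, tab-separated (for-in order is unspecified)
END {
    for (i in total) {
        printf "%s\t%s\n", i, total[i]
    }
}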
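
The source gives no sample input for this script, so the following is only a hedged sketch of how it might be run. The file names domains.txt, hits.txt, and sum.awk, their contents, and the reading of the script as summing counts per domain are all assumptions made for illustration:

```shell
dir=$(mktemp -d)

# Save the script shown above under a hypothetical name.
cat > "$dir/sum.awk" <<'EOF'
BEGIN {
    FS = "[. ]"
    OFS = "."
}
FNR == NR {
    domain[$1] = $0
    next
}
FNR < NR {
    if ($2 in domain) {
        for ( i = 2; i < NF; i++ ) {
            if ($i != "") {
                line = (line ? line OFS : "") $i
            }
        }
        total[line] += $NF
        line = ""
    }
}
END {
    for (i in total) {
        printf "%s\t%s\n", i, total[i]
    }
}
EOF

# Assumed inputs: a list of domains, then host names followed by a count.
printf 'example.com\n' > "$dir/domains.txt"
printf 'www.example.com 5\nmail.example.com 7\nwww.other.org 3\n' > "$dir/hits.txt"

out=$(awk -f "$dir/sum.awk" "$dir/domains.txt" "$dir/hits.txt")
printf '%s\n' "$out"

rm -rf "$dir"
```

With these inputs the two example.com hosts are folded into one key and their counts are added, giving example.com and 12 separated by a tab. The other.org line is ignored because its second field is not a key from the first file.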