The subtlety of application lies in one's own mind.
Introduction
Awk is a program that you can use to select particular records in a file and perform operations on them. It is installed on Linux and macOS by default.
Awk is an interpreted programming language with a simple paradigm: find a pattern in the input, then perform an action. This often reduces complex or tedious data manipulations to a few lines of code. Since it is both easy and powerful, why not use it to replace tedious opening, searching, matching, and editing with one or a few lines of awk?
This blog post is my own notes and records, drawn mainly from Dougspeed and The GNU Awk User's Guide. Awk is far more powerful than the examples listed here suggest.
Basic composition of an awk program: rule-action
When you run awk, you specify an awk program that tells awk what to do. The program consists of a series of rules (it may also contain user-defined functions). Each rule specifies one pattern to search for and one action to perform upon finding the pattern. If no action is specified, awk uses {print} (equivalently, {print $0}) by default.
Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in braces to separate it from the pattern. Newlines usually separate rules. Therefore, an awk program looks like this:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
pattern4 { action4 }
…
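Each input record is tested against every rule in order, and the action runs whenever the pattern matches. The same applies to pattern3, pattern4, and so on.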
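To make the rule structure concrete, here is a minimal sketch with made-up inline data (the names `data` and `out` and the sample scores are assumptions for illustration only). It shows a rule with both a pattern and an action, a rule with only a pattern, and a rule with only an action.

```shell
# Sample input invented for illustration: a name and a score per line.
data='alice 90
bob 55
carol 72'

# Rule 1 has a pattern and an action.
# Rule 2 has only a pattern, so the default action {print} applies.
# Rule 3 has only an action, so it runs for every record.
# END is a special pattern that fires once after all input is read.
out=$(printf '%s\n' "$data" | awk '
$2 > 80 { print "high:", $1 }
$1 == "bob"
{ n++ }
END { print "records:", n }
')
printf '%s\n' "$out"
```

Note that the opening brace of an action has to be on the same line as its pattern; otherwise awk treats the pattern and the braces as two separate rules, which is exactly what rules 2 and 3 rely on here.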
Running awk
No input files
With no input files, awk applies the program to the standard input, which usually means whatever you type on the keyboard. This continues until you indicate end-of-file by typing Ctrl-D.
#No input files
awk 'program'
Input with files
# Input with single file
awk 'program' input-file1
awk <input-file1 'program'
zcat xxx.gz | awk 'program'
# Input with multiple files
# The '-' argument tells awk to read standard input (here, the output of the pipe '|') at that position
awk 'program' input-file1 input-file2 …
zcat xxx.gz | awk 'program' input-file2 -
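The following sketch shows how a named file and piped standard input can be combined through the `-` argument. The temporary file and its contents are hypothetical stand-ins for input-file2, not files from this post.

```shell
# Hypothetical temporary file standing in for input-file2.
tmp=$(mktemp)
printf 'from-file\n' > "$tmp"

# awk reads the named file first, then standard input at the "-" argument.
# NR keeps counting across both inputs.
out=$(printf 'from-stdin\n' | awk '{ print NR, $0 }' "$tmp" -)
printf '%s\n' "$out"

rm -f "$tmp"
```

The inputs are processed in the order they appear on the command line, so moving `-` before the file name would make awk read the pipe first.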
Read ‘program’ from a script
# When the program is long, save it in a file and pass that file with -f
awk -f program-file input-file1 input-file2 …
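As a small sketch of the `-f` option, the snippet below writes a short program to a temporary file and runs it. The file name, the program, and the sample input are all invented for illustration.

```shell
# Write a small awk program to a hypothetical temporary program file.
prog=$(mktemp)
cat > "$prog" <<'EOF'
# Sum the second column and report it at the end.
{ total += $2 }
END { print "total:", total }
EOF

# Run the program file against made-up input supplied on standard input.
out=$(printf 'a 1\nb 2\nc 3\n' | awk -f "$prog")
printf '%s\n' "$out"

rm -f "$prog"
```

Keeping the program in a file also avoids shell quoting problems, since the awk code is no longer wrapped in single quotes.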
System Variables and Functions
System variables in awk
The awk program defines a number of special variables that can be referenced or reset inside a program.
Variable Name | Definition | Example | Meaning |
---|---|---|---|
FILENAME | Name of the current input file | awk 'NR==1{print FILENAME}' input-file | Print the name of the input file |
FS | Input field separator (a single space by default, which splits on runs of spaces and tabs) | awk 'BEGIN{FS=","}NR==1' xxx.csv | Set the field separator to "," and print the first line |
NR | Total number of records (lines) read so far | awk '1;NR == 11{exit}' inputfile | Print the first 11 lines |
FNR | Record number (typically the line number) within the current file | awk '{if(NR==FNR){arr[$1];next}}($1 in arr){print $1}' file1 file2 | First store column 1 of the first file as keys of the array arr, then print column 1 of each line of the second file whose column 1 is in arr |
NF | Number of fields in the current record | awk NF | Delete all blank lines from a file: if NF==0, print nothing; otherwise print $0 |
OFS | Output field separator | awk 'BEGIN{FS=",";OFS=";"}NR==4, NR==8 {print NR,$1,$2}' sample_summary.csv | Print the line number, \$1 and \$2 for lines 4 to 8, separated by ";" |
RS | Input record separator | awk 'BEGIN{RS="\n"};1;NR==2{exit}' file1 | Set the record separator to "\n" (the default) and print the first 2 records |
ORS | Output record separator | awk 'BEGIN{ORS="\n\n"};1;NR==4{exit}' file1 | Set the output record separator to "\n\n" and print the first 4 records |
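A short sketch that combines several of these variables on made-up CSV data (the column names and values are assumptions for illustration):

```shell
# Invented CSV sample: a header line followed by two data lines.
csv='id,name,score
1,alice,90
2,bob,55'

# FS splits the input on commas, OFS joins the printed values with semicolons.
# NR > 1 skips the header; NF is the number of fields in the current record.
out=$(printf '%s\n' "$csv" | awk '
BEGIN  { FS = ","; OFS = ";" }
NR > 1 { print NR, NF, $2 }
')
printf '%s\n' "$out"
```

OFS only takes effect between values separated by commas in `print`; values written next to each other without a comma are concatenated with nothing in between.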
System functions in awk
Awk also provides a number of built-in numeric and string functions that can be called inside a program.
Function Name | Definition | Example | Meaning |
---|---|---|---|
int(num) | Integer part of num, truncated toward zero | awk 'BEGIN{print int(3.534);print int(4);print int(-5.223);print int(-5);}' | Prints 3, 4, -5 and -5 |
log(num) | Natural logarithm (base e) of num | awk 'BEGIN{print log(3.534);print log(4);print log(0);print log(-5);print log(-1);}' | Returns -inf when given zero and nan when given a negative number |
exp(x) | Exponential of x (e ^ x); reports an error if x is out of range | awk 'BEGIN{print exp(2.1)}' | Prints e^2.1 |
sqrt(num) | Positive square root of num | awk 'BEGIN{print sqrt(16);print sqrt(1.21);print sqrt(0);print sqrt(-12);}' | Returns nan if the argument is negative |
sin(num) | Sine of num, with num in radians | awk 'BEGIN{print sin(-60);print sin(90);print sin(45);}' | Prints sin(num); the arguments are read as radians, not degrees |
cos(num) | Cosine of num, with num in radians | awk 'BEGIN{print cos(-60);print cos(90);print cos(45);}' | Prints cos(num); the arguments are read as radians, not degrees |
length(string) | Length of string in characters | awk 'BEGIN{print length("AAAA BBB\tCC")}' | Prints 11; the length includes the space and the tab |
substr(s, p, n) | Substring of s starting at position p, at most n characters long | awk 'BEGIN{print substr("example aa bb cc", 4)}' | If n is not supplied, the rest of the string from p is used; prints "mple aa bb cc" |
index(str1, str2) | Searches str1 for the first occurrence of str2 and returns the character position where it begins | awk 'BEGIN{print index("Graphic", "ph"); print index("University", "abc")}' | Prints 4 and 0; string indices in awk start from 1, and 0 means not found |
tolower(s) | Converts all uppercase characters in s to lowercase | awk 'BEGIN{print tolower("GEEKSFORGEEKS")}' | Prints geeksforgeeks |
toupper(s) | Converts all lowercase characters in s to uppercase | awk 'BEGIN{print toupper("geeksforgeeks")}' | Prints GEEKSFORGEEKS |
split(string, array, fieldsep) | Divides string into pieces separated by fieldsep, stores the pieces in array, and returns the number of pieces | awk 'BEGIN{string="Hi world Hi AZ"; fieldsep=" "; n=split(string, array, fieldsep); for(i=1; i<=n; i++){printf("%s\n", array[i]);}}' | Split the string, store the pieces in array, get n, and print each piece |
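The string functions are often used together. The sketch below takes apart a made-up identifier (the value `chr1:12345:A:G` and the variable names are assumptions for illustration) with split, length, index, substr and toupper.

```shell
out=$(awk 'BEGIN {
    s = "chr1:12345:A:G"                    # made-up identifier for illustration
    n = split(s, parts, ":")                # parts[1] is "chr1", parts[4] is "G"
    print n, parts[2]                       # number of pieces and the second piece
    print length(s)                         # number of characters in s
    print index(s, ":")                     # position of the first colon
    print substr(s, 1, index(s, ":") - 1)   # text before the first colon
    print toupper(substr(s, 1, 3))          # first three characters, uppercased
}')
printf '%s\n' "$out"
```

A program that consists only of a BEGIN rule runs without reading any input, which makes it convenient for trying out functions like these.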
Regular Expressions in awk
Awk has built-in support for regular expressions: a pattern written as /regex/ selects the records that match it.
echo -e "cat\nbat\nfun\nfin\nfan" | awk '/f.n/'                   # . matches any single character
echo -e "This\nThat\nThere\nTheir\nthese" | awk '/^The/'          # ^ anchors the match at the start of the line
echo -e "knife\nknow\nfun\nfin\nfan\nnine" | awk '/n$/'           # $ anchors the match at the end of the line
echo -e "Call\nTall\nBall" | awk '/[CT]all/'                      # [CT] matches either C or T
echo -e "Call\nTall\nBall" | awk '/[^CT]all/'                     # [^CT] matches any character except C and T
echo -e "Call\nTall\nBall\nSmall\nShall" | awk '/Call|Ball/'      # | means alternation (either side)
echo -e "Colour\nColor" | awk '/Colou?r/'                         # ? means zero or one of the preceding character
echo -e "ca\ncat\ncatt" | awk '/cat*/'                            # * means zero or more of the preceding character
echo -e "111\n22\n123\n234\n456\n222" | awk '/2+/'                # + means one or more of the preceding character
# Grouping
echo -e "Apple Juice\nApple Pie\nApple Tart\nApple Cake" | awk '/Apple (Juice|Cake)/'
# Substitute: sub replaces the first match in each line, gsub replaces every match
echo -e "apple apple\npineapple apple\n" | awk 'sub(/apple/, "nut")'
echo -e "apple apple\npineapple apple\n" | awk 'gsub(/apple/, "nut")'
# Regex with variables
## awk can match against a variable if you don't use the // regex markers
## the required regex has to be built up as a string
echo -e "apple apple\npineapple apple\n" | awk 'BEGIN{r = "eapp"} $0~r'
# Using variables from the bash terminal
rex=eapp
echo -e "apple apple\npineapple apple\n" | awk -v r="$rex" '$0~r'
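The examples above match against the whole record. As a sketch of matching a single field, the snippet below applies `~` and `!~` to column 1 only and combines the match with a numeric test; the sample data and the pattern are invented for illustration.

```shell
# Invented sample: an identifier and a value per line.
data='rs123 0.01
rs456 0.20
chr1:99 0.03'

# Keep lines whose first field matches the pattern and whose value is below 0.05.
out_match=$(printf '%s\n' "$data" |
    awk -v pat='^rs[0-9]+$' '$1 ~ pat && $2 < 0.05 { print $1 }')

# !~ selects the lines whose first field does not match the pattern.
out_nomatch=$(printf '%s\n' "$data" |
    awk -v pat='^rs[0-9]+$' '$1 !~ pat { print $1 }')

printf '%s\n' "$out_match" "$out_nomatch"
```

Passing the pattern with `-v` keeps the awk program itself fixed while the regular expression comes from the shell.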
Examples by Practice
Find the intersect (overlap) of two files
awk '{if(NR==FNR){arr[$1];next}}($1 in arr){print $1}' file1.txt file2.txt
# Or
awk '(NR==FNR){arr[$1];next}($1 in arr){print $1}' file1.txt file2.txt
Get details of the overlaps
awk '(NR==FNR){arr[$1]=$2;next}($1 in arr){print "SNP:",$1, "P-Value1:",arr[$1], "P-Value2:", $2}' file1.txt file2.txt
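Here is a worked sketch of the two-file idiom on small made-up files (the SNP names, the values, and the temporary file names are assumptions for illustration). While awk reads the first file, NR equals FNR, so the first rule stores column 2 keyed by column 1; for the second file only the second rule applies.

```shell
# Two hypothetical input files with an identifier and a value per line.
f1=$(mktemp)
f2=$(mktemp)
printf 'rs1 0.01\nrs2 0.50\nrs3 0.20\n' > "$f1"
printf 'rs2 0.04\nrs3 0.90\nrs4 0.10\n' > "$f2"

# First file: remember column 2 under the key in column 1, then skip to the next record.
# Second file: print the key with both values when the key was seen in the first file.
out=$(awk 'NR == FNR { p1[$1] = $2; next }
           $1 in p1  { print $1, p1[$1], $2 }' "$f1" "$f2")
printf '%s\n' "$out"

rm -f "$f1" "$f2"
```

One caveat: if the first file is empty, NR equals FNR while the second file is being read too, so the second file would be treated as the first.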
Remove duplicates
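Remove records with a duplicated first column ($1) in file1, keeping only the first occurrence.
Step-by-step interpretation:
- seen[\$1] is looked up; if \$1 has not been seen before, the element is created with an empty value that counts as 0;
- !seen[\$1] is then true, so the default action {print \$0} runs;
- seen[\$1]++ increments the counter afterwards, so later records with the same \$1 are skipped.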
awk '(!seen[$1]++)' file1.txt >file1_rmDuplicate.txt
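The steps above can be checked on a small made-up input (the keys and values are invented for illustration):

```shell
# Invented sample in which rs1 and rs2 each appear twice in column 1.
data='rs1 a
rs2 b
rs1 c
rs3 d
rs2 e'

# Print a record only the first time its first field is seen.
out=$(printf '%s\n' "$data" | awk '!seen[$1]++')
printf '%s\n' "$out"
```

Because the counter is keyed on `$1` alone, the lines `rs1 c` and `rs2 e` are dropped even though their second columns differ; using `$0` as the key instead would remove only fully identical lines.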
An example of awk script
From Steve
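BEGIN {
    FS = "[. ]"    # split fields on dots or spaces
    OFS = "."      # used below to re-join fields with dots
}
# First file: remember each record, keyed by its first field.
FNR == NR {
    domain[$1] = $0
    next
}
# Later files: only records whose second field was seen in the first file.
FNR < NR {
    if ($2 in domain) {
        # Join the non-empty fields 2..NF-1 with OFS to form the key.
        for (i = 2; i < NF; i++) {
            if ($i != "") {
                line = (line ? line OFS : "") $i
            }
        }
        total[line] += $NF    # accumulate the last field per key
        line = ""
    }
}
# Print each key and its total (iteration order is unspecified).
END {
    for (i in total) {
        printf "%s\t%s\n", i, total[i]
    }
}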
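The script accumulates totals in an array and prints them in the END rule. As a much smaller sketch of that same idiom (not a reproduction of the script above; the input lines are invented for illustration), the snippet below sums column 2 per key in column 1.

```shell
# Sum the second column per key, then print one tab-separated line per key.
# for (k in total) visits keys in an unspecified order, so the result is piped to sort.
out=$(printf 'a.example 3\nb.example 5\na.example 4\n' |
    awk '{ total[$1] += $2 }
         END { for (k in total) printf "%s\t%s\n", k, total[k] }' |
    sort)
printf '%s\n' "$out"
```

Sorting outside awk is one simple way to get a stable order without relying on features specific to one awk implementation.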