This article will describes how to change the file delimiter, let say from a comma(,) separated file to pipe(|) separated file provided comma(,) available in between double quote should not be replaced by pipe(|). We will see how gsub function works as well as we will see how print & printf are different in awk.
Input Data: F_Input_File.txt
"B1,A1",TRANS,Y,10000
"B2,A2",MED,Y,20000
"B3,A3",FIN,N,10000
"B4,A4",HR,Y,20000
"B5,A5",CSR,N,50000
"B6,A6",TRANS,Y,30000
"B7,A7",FIN,N,40000
Desired Output:
We want to update the file delimiter from comma(,) to pipe(|). At the same time we don't want to change the comma(,) available in between double quotes "" as that comma is not the delimiter.
"B1,A1"|TRANS|Y,10000
"B2,A2"|MED|Y|20000
"B3,A3"|FIN|N|10000
"B4,A4"|HR|Y|20000
"B5,A5"|CSR|N|50000
"B6,A6"|TRANS|Y|30000
"B7,A7"|FIN|N|40000
Solution:
$ awk -F, '{
for (k=1;k<NF;k++)
{
V_Count+=gsub(/\"/,"&",$k);
printf "%s" (V_Count%2?FS:"|"), $k
}
print $NF
}' F_Input_Data.txt
[OR]
$ awk -F, '{
for (k=1;k<NF;k++)
{
V_Count+=gsub(/\"/,"&",$k);
printf "%s%s", $k, (V_Count%2?FS:"|")
}
print $NF
}' F_Input_Data.txt
There is a very tricky difference in both the solution because of the way printf is used, see the highlighted part in above code. We will understand the same in below explanation.
Output: Above solution has provided the expected result as shown below. Solution seems tricky not to worry let see how it works.
"B1,A1"|TRANS|Y|10000
"B2,A2"|MED|Y|20000
"B3,A3"|FIN|N|10000
"B4,A4"|HR|Y|20000
"B5,A5"|CSR|N|50000
"B6,A6"|TRANS|Y|30000
"B7,A7"|FIN|N|40000
Explanation: We will see how this code is working and try to understand on first row/record as process will remain same for rest of the rows/records.
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print NF}'
Output: 5
So for first line for loop will execute 4 times (k<NF).
2. gsub function stands for global substitution, let see what it does:
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $1; V_Count=gsub(/\"/,"&",$1); print V_Count;}'
Output:
"B1
1
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $2; V_Count=gsub(/\"/,"&",$2); print V_Count;}'
Output:
A1"
1
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $3; c=gsub(/\"/,"&",$3); print c;}'
Output:
TRANS
0
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $4; c=gsub(/\"/,"&",$4); print c;}'
Output:
Y
0
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $5; c=gsub(/\"/,"&",$5); print c;}'
Output:
10000
0
3. Most important, why we have used [&] in gsub. what is the use of &? Let see.
gsub(<Match Pattern/RegExp>,<New value/Replacement> [,<String>])
$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $1; c=gsub(/\"/,"@@",$1);printf "%s" ,$1}'
Output:
"B1
@@B1
See clearly, double quotes " has been replaced by the @@, if you use & then " are not replaced because & is an predefined variable used to replace with the pattern to be matched. Here searching pattern is double quotes (") hence it is replacing double quotes with " only.
3. Now let see a very interesting syntax of if and else:
$ echo "" |awk 'BEGIN{c=2}{ a=c%2?0:1}END{print a}'
Output: 1
Here we have initialised value of c as 2. Variable a will hold the output of the expression
c%2?0:1
If c%2 returns a non zero value, it is considered as true and will return 0.
If c%2 returns a zero, it is considered as false and will return 1.
Here c=2, so 2%2 returns zero,condition become FALSE, hence expression 2%2?0:1 will return 1.
Read it like if(c%2){print 0}else{print 1};
$ echo "" |awk 'BEGIN{c=2}END{print (c==2?0:1)}'
Output: 0
Here we have initialised c=2. When we checked the expression 2==2?0:1, 2==2 is true hence will return value as 0;
Let combine both for clear conclusion:
$ echo "" |awk 'BEGIN{c=2}{ if (a=c%2) {print "hi"} else {print "bye"}}END{print (c==2?0:1), a}'
Output:
bye
0 0
We have already seen how if works without operator in awk, see the following post. See how express c==2?0:1 & a value is returned as 0.
Conclusion: if c=2 then (c==2?TRUE:FALSE) return true or else false
4. Lets visit our case again
printf "%s" (V_Count%2?FS:"|"),$i
When first time gsub will find double quote (") in first field $1( "B1,A1" ) V_Count value will be 1.
Hence our expression V_Count%2?FS:"|" would be look like 1%2?FS:"|" => 1%2 will return 1 which is non- zero value , condition becomes TRUE, hence condition is TRUE and will return current field separator(FS) comma (,) .
When gsub will read 2nd field(A1"), it will find double quote " and hence V_Count will become 2 (V_Count =V_count +1). Again
V_Count%2?FS:"|" would look like 2%2?FS:"|" => 2%2 will return 0 which is zero value,condition becomes FALSE, hence condition is FALSE and will return pipe (|).
5. Now question is, our code is << printf "%s" (V_Count%2?FS:"|"),$i >> hence delimiter should be concatenated in left side of the field but it didn't, why ? let see..
$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s" c,d}'
Output: A1|
See how | is concatenated in right side of the field.
$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s%s" c,d}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
`%s%s|'
^ ran out for this one
You tried to give %s%s it has produced a beautiful error :)
$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s%s", c,d}'
Output: |A1
See how pipe(|) has concatenated in left when we used an extra ,(highlighted in Red)
We can achieve the correct result in below way using familier syntax.
$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s%s", d,c}'
Output: A1|
6. Lets combine everything, and see how result is combining in a single line:
$ echo '"B1,A1",TRANS,Y,10000' |\
awk -F, '{for (k=1;i<5;k++)
{
V_Count+=gsub(/\"/,"&",$k);
printf "%s" (V_Count%2?FS:"|"), $k;
print "=>";
}
print "#=>" $NF
}'
Output: Below output is self explanatory how all the attributes are produced.
"B1,=>
A1"|=>
TRANS|=>
Y|=>
#=>10000
$ echo '"B1,A1",TRANS,Y,10000' |\
awk -F, '{for (i=1;i<5;i++)
{
V_Count+=gsub(/\"/,"&",$i);
printf "%s" (V_Count%2?FS:"|"), $i;
printf "=>";
}
print "#=>" $NF
}'
Output:
"B1,=>A1"|=>TRANS|=>Y|=>#=>10000
We are getting two different output when we are using print and printf. Let see the difference between print & printf.
printf("Hi, Visiotr!")
This prints "Hi, Visiotr!" to the display, just like "print" would, with one slight difference:
- The cursor remains at the end of the text, instead of skipping to the next line.
- In case of "print" cursor skips to the new line.
- A "newline" code ("\n") has to be added to force "printf()" to skips to the next line or we can say to behave like print.
Conclusion: Explanation exceed all the limits :) :) but to get the understanding of how things are working is most important. I hope you are clear how code is working.
Keep Learning, Keep Sharing....!!!
No comments:
Post a Comment