Monday, 4 January 2016

How To Change Delimiter Of The File Keeping Comma(,) In Double Quote("")

This article will describes how to change the file delimiter, let say from a comma(,) separated file to pipe(|) separated file provided comma(,) available in between double quote should not be replaced by pipe(|). We will see how gsub function works as well as we will see how print & printf are different in awk.

Input Data: F_Input_File.txt

"B1,A1",TRANS,Y,10000
"B2,A2",MED,Y,20000
"B3,A3",FIN,N,10000
"B4,A4",HR,Y,20000
"B5,A5",CSR,N,50000
"B6,A6",TRANS,Y,30000
"B7,A7",FIN,N,40000

Desired Output: 
We want to update the file delimiter from comma(,) to pipe(|). At the same time we don't want to change the comma(,) available in between double quotes "" as that comma is not the delimiter.

"B1,A1"|TRANS|Y,10000
"B2,A2"|MED|Y|20000
"B3,A3"|FIN|N|10000
"B4,A4"|HR|Y|20000
"B5,A5"|CSR|N|50000
"B6,A6"|TRANS|Y|30000
"B7,A7"|FIN|N|40000

Solution:

$ awk -F, '{
            for (k=1;k<NF;k++)
            {
                V_Count+=gsub(/\"/,"&",$k);
                printf "%s" (V_Count%2?FS:"|"), $k
            }
            print $NF

}' F_Input_Data.txt

[OR]

$ awk -F, '{
            for (k=1;k<NF;k++)
            {
                V_Count+=gsub(/\"/,"&",$k);
                printf "%s%s", $k, (V_Count%2?FS:"|")
            }
            print $NF

}' F_Input_Data.txt


There is a very tricky difference in both the solution because of the way printf is used, see the highlighted part in above code. We will understand the same in below explanation.

Output: Above solution has provided the expected result as shown below. Solution seems tricky not to worry let see how it works.

"B1,A1"|TRANS|Y|10000
"B2,A2"|MED|Y|20000
"B3,A3"|FIN|N|10000
"B4,A4"|HR|Y|20000
"B5,A5"|CSR|N|50000
"B6,A6"|TRANS|Y|30000
"B7,A7"|FIN|N|40000


Explanation:  We will see how this code is working and try to understand on first row/record as process will remain same for rest of the rows/records.

1. At first it will read first line from the file for which NF=4. see how:

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print NF}'
Output: 5

So for first line for loop will execute 4 times (k<NF).

2. gsub function stands for global substitution, let see what it does:

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $1; V_Count=gsub(/\"/,"&",$1); print V_Count;}'
Output:
"B1
1

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $2; V_Count=gsub(/\"/,"&",$2); print V_Count;}'
Output:
A1"
1

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $3; c=gsub(/\"/,"&",$3); print c;}'
Output:
TRANS
0

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $4; c=gsub(/\"/,"&",$4); print c;}'
Output:
Y
0

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $5; c=gsub(/\"/,"&",$5); print c;}'
Output:
10000
0

3. Most important, why we have used [&] in gsub. what is the use of &? Let see.

gsub(<Match Pattern/RegExp>,<New value/Replacement> [,<String>])

$ echo '"B1,A1",TRANS,Y,10000' | awk -F"," '{print $1; c=gsub(/\"/,"@@",$1);printf "%s" ,$1}'
Output:
"B1
@@B1


See clearly, double quotes " has been replaced by the @@, if you use & then " are not replaced because & is an predefined variable used to replace with the pattern to be matched. Here searching pattern is double quotes (") hence it is replacing double quotes with " only.

3. Now let see a very interesting syntax of if and else:

$ echo "" |awk 'BEGIN{c=2}{ a=c%2?0:1}END{print a}' 
Output: 1

Here we have initialised value of c as 2. Variable a will hold the output of the expression 
c%2?0:1

If c%2 returns a non zero value, it is considered as true and will return 0.
If c%2 returns a zero, it is considered as false and will return 1.

Here c=2, so 2%2 returns zero,condition become FALSE, hence expression 2%2?0:1 will return 1.

Read it like if(c%2){print 0}else{print 1};

$ echo "" |awk 'BEGIN{c=2}END{print (c==2?0:1)}'
Output: 0

Here we have initialised c=2. When we checked the expression 2==2?0:1, 2==2 is true hence will return value as 0;

Let combine both for clear conclusion:

$ echo "" |awk 'BEGIN{c=2}{ if (a=c%2) {print "hi"} else {print "bye"}}END{print (c==2?0:1), a}'
Output: 
bye 
0 0

We have already seen how if works without operator in awk, see the following post. See how express c==2?0:1 & a value is returned as 0.

Conclusion: if c=2 then (c==2?TRUE:FALSE) return true or else false

4. Lets visit our case again

printf "%s" (V_Count%2?FS:"|"),$i

When first time gsub will find double quote (") in first field $1( "B1,A1" ) V_Count value will be 1.

Hence our expression V_Count%2?FS:"|" would be look like 1%2?FS:"|" => 1%2 will return 1 which is non- zero value , condition becomes TRUE, hence condition is TRUE and will return current field separator(FS) comma (,) .

When gsub will read 2nd field(A1"), it will find double quote " and hence V_Count will become 2 (V_Count =V_count +1). Again

V_Count%2?FS:"|" would look like 2%2?FS:"|" => 2%2 will return 0 which is zero value,condition becomes FALSE, hence condition is FALSE and will return pipe (|).

5. Now question is, our code is << printf "%s" (V_Count%2?FS:"|"),$i >> hence delimiter should be concatenated in left side of the field but it didn't, why ? let see..

$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s" c,d}'
Output: A1|

See how | is concatenated in right side of the field.

$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s%s" c,d}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
        `%s%s|'
           ^ ran out for this one
You tried to give %s%s it has produced a beautiful error :)

$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s%s", c,d}'
Output: |A1

See how pipe(|) has concatenated in left when we used an extra ,(highlighted in Red)

We can achieve the correct result in below way using familier syntax.

$ echo "" | awk 'BEGIN{c="|"; d="A1";}END{printf "%s%s", d,c}'
Output: A1|

6. Lets combine everything, and see how result is combining in a single line:

$ echo '"B1,A1",TRANS,Y,10000' |\
                       awk -F, '{for (k=1;i<5;k++)
                     { 
V_Count+=gsub(/\"/,"&",$k);
printf "%s" (V_Count%2?FS:"|"), $k;
print "=>";
}
print "#=>" $NF
                     }'
Output: Below output is self explanatory how all the attributes are produced.
"B1,=>
A1"|=>
TRANS|=>
Y|=>
#=>10000

$ echo '"B1,A1",TRANS,Y,10000' |\
                   awk -F, '{for (i=1;i<5;i++)
                 { 
      V_Count+=gsub(/\"/,"&",$i);
      printf "%s" (V_Count%2?FS:"|"), $i;
      printf "=>";
      }
                   print "#=>" $NF
                 }'
Output:
"B1,=>A1"|=>TRANS|=>Y|=>#=>10000

We are getting two different output when we are using print and printf. Let see the difference between print & printf.

printf("Hi, Visiotr!")

This prints "Hi, Visiotr!" to the display, just like "print" would, with one slight difference: 

  • The cursor remains at the end of the text, instead of skipping to the next line. 
  • In case of "print" cursor skips to the new line. 
  • A "newline" code ("\n") has to be added to force "printf()" to skips to the next line or we can say to behave like print.
        print("Hi, there!") == printf("Hi, there!\n")

Conclusion: Explanation exceed all the limits :) :) but to get the understanding of how things are working is most important. I hope you are clear how code is working. 

Keep Learning, Keep Sharing....!!!

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...