Sunday, 22 November 2015

How To Find Duplicate Records in File Using AWK

Input File: F_Data_File.txt

EMPID|ENAME|DEPT|FLAG
10001|A1|TRANS|Y
10002|A2|MED|Y
10003|A3|FIN|N
10004|A4|HR|Y
10005|A5|CSR|N
10001|A1|TRANS|Y
10003|A3|FIN|N


1. Find out the duplicate record available in the file.

$ awk  '{V_Dup_Count[$0]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' F_Data_File.txt | awk -F" " '$3 > 1{print $0}'


10001|A1|TRANS|Y Rec_Count=> 2
10003|A3|FIN|N
Rec_Count=> 2

Explanation: 

V_Dup_Count holds the count of record associated that record which is available as INDEX to array V_Dup_Count.

V_Dup_Count[10001|A1|TRANS|Y] = 2
V_Dup_Count[10003|A3|FIN|N] = 2

Index value in red is k of for loop and 2 is V_Dup_Count[k]. Please follow post How to Use Associative Array in AWK

We have passed the output of first AWK to second AWK command which will receive 3 fields as input: 1st is complete record, 2nd is Rec_Count=> and 3rd field is count of record all are separated by space. If 3rd field value > 1 that means record is duplicate. 

2. Find out duplicate record based on key column, e.g. based on 1st column:

$ cat
F_Data_File.txt | cut -d"|" -f1 | awk  '{V_Dup_Count[$0]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' | awk -F" " '$3 > 1{print $0}'

(or)

awk -F"|" '{V_Dup_Count[$1]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' 1.txt | awk -F" " '$3 > 1{print $0}'

10001 Rec_Count=> 2
10003 Rec_Count=> 2


Explanation: 

Same as in case 1 but here we are generating the array V_Dup_Count based on field 1 of the record.

3. Find out duplicate record based on key column, e.g. based on 1st & 2nd column:

$ cat
F_Data_File.txt | cut -d"|" -f1,2 | awk  '{V_Dup_Count[$0]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' | awk -F" " '$3 > 1{print $0}'

(or)

awk -F"|" '{V_Dup_Count[$1"|"$2]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' 1.txt | awk -F" " '$3 > 1{print $0}'


10001|A1 Rec_Count=> 2
10003|A3 Rec_Count=> 2


Explanation: 

Same as in case 1 but here we are generating the array V_Dup_Count based on field 1 & 2 of the record.

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...