Input File: F_Data_File.txt
EMPID|ENAME|DEPT|FLAG
10001|A1|TRANS|Y
10002|A2|MED|Y
10003|A3|FIN|N
10004|A4|HR|Y
10005|A5|CSR|N
10001|A1|TRANS|Y
10003|A3|FIN|N
1. Find the duplicate records in the file.
$ awk '{V_Dup_Count[$0]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' F_Data_File.txt | awk -F" " '$3 > 1{print $0}'
10001|A1|TRANS|Y Rec_Count=> 2
10003|A3|FIN|N Rec_Count=> 2
Explanation:
V_Dup_Count holds the count of each record; the record itself is used as the INDEX (key) of the associative array V_Dup_Count.
V_Dup_Count[10001|A1|TRANS|Y] = 2
V_Dup_Count[10003|A3|FIN|N] = 2
The index value (e.g. 10001|A1|TRANS|Y) is k in the for loop, and 2 is V_Dup_Count[k]. For details, please follow the post How to Use Associative Array in AWK.
The output of the first AWK is piped to a second AWK command, which receives three space-separated fields: the 1st is the complete record, the 2nd is the literal string Rec_Count=>, and the 3rd is the count of that record. If the 3rd field's value is greater than 1, the record is a duplicate.
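As a side note (a minimal alternative sketch, not part of the approach above), the two-stage pipeline can be collapsed into a single AWK pass: seen[$0]++ evaluates to the previous count of the record, so it is exactly 1 on a record's second occurrence, which prints each duplicate record once without needing an END block.
$ awk 'seen[$0]++ == 1' F_Data_File.txt   # print a record the 2nd time it is seen
10001|A1|TRANS|Y
10003|A3|FIN|N
The classic sort F_Data_File.txt | uniq -d gives the same list of duplicated lines, since uniq -d prints one copy of each repeated (adjacent) line.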
2. Find duplicate records based on a key column, e.g. the 1st column:
$ cat F_Data_File.txt | cut -d"|" -f1 | awk '{V_Dup_Count[$0]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' | awk -F" " '$3 > 1{print $0}'
(or)
awk -F"|" '{V_Dup_Count[$1]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' 1.txt | awk -F" " '$3 > 1{print $0}'
10001 Rec_Count=> 2
10003 Rec_Count=> 2
Explanation:
Same as case 1, but here the array V_Dup_Count is keyed on field 1 of the record.
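If the full duplicate records are needed rather than just the key and its count, a common two-pass AWK idiom can be used (only a sketch, assuming the same sample file; it is not taken from the original commands above): the first pass over the file builds the counts keyed on field 1, and the second pass prints every record whose key occurred more than once.
$ awk -F"|" 'NR==FNR {V_Dup_Count[$1]++; next} V_Dup_Count[$1] > 1' F_Data_File.txt F_Data_File.txt   # pass 1 counts keys, pass 2 prints matching records
10001|A1|TRANS|Y
10003|A3|FIN|N
10001|A1|TRANS|Y
10003|A3|FIN|N
NR==FNR is true only while the first copy of the file is being read, so the counting block runs on pass 1 and the default print action runs on pass 2.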
3. Find out duplicate record based on key column, e.g. based on 1st & 2nd column:
$ cat F_Data_File.txt | cut -d"|" -f1,2 | awk '{V_Dup_Count[$0]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' | awk -F" " '$3 > 1{print $0}'
(or)
$ awk -F"|" '{V_Dup_Count[$1"|"$2]++} END{for(k in V_Dup_Count) {print k,"Rec_Count=>",V_Dup_Count[k]} }' 1.txt | awk -F" " '$3 > 1{print $0}'
10001|A1 Rec_Count=> 2
10003|A3 Rec_Count=> 2
Explanation:
Same as case 1, but here the array V_Dup_Count is keyed on fields 1 and 2 of the record.
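The same pattern generalizes to any set of key columns. The one-liner below is only a sketch (the keys variable and its comma-separated format are my own convention, not from the post): it splits the requested field numbers, builds a pipe-separated composite key for each record, and reports every key that appears more than once.
$ awk -F"|" -v keys="1,2" 'BEGIN{n=split(keys,f,",")} {k=$f[1]; for(i=2;i<=n;i++) k=k "|" $f[i]; V_Dup_Count[k]++} END{for(k in V_Dup_Count) if(V_Dup_Count[k] > 1) print k, "Rec_Count=>", V_Dup_Count[k]}' F_Data_File.txt
10001|A1 Rec_Count=> 2
10003|A3 Rec_Count=> 2
Note that the for (k in V_Dup_Count) loop does not guarantee ordering, so the two output lines may appear in either order.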