Saturday, 2 January 2016

How To Split A File Based On a Pattern In Unix/Linux

This article describes how to split a file based on a pattern. That is difficult to do with the split command, which splits by line count or size rather than by content. We will see how to do it easily with the awk command.

Input Data: F_Input_Data.txt

IP:127.0.0.1
server:status:logtime
kwdev001:succeed:21:21:10
kwdev002:succeed:21:31:10
kwdev003:succeed:21:31:10
IP:127.0.0.2
server:status:logtime
kwsit001:succeed:22:21:10
kwsit002:succeed:22:31:10
kwsit003:succeed:22:31:10
IP:127.0.0.3
server:status:logtime
kwuat001:succeed:23:21:10
kwuat002:succeed:23:31:10
kwuat003:succeed:23:31:10


Scenario 1: We want to split the above data into 3 different files, say ip_1.txt, ip_2.txt & ip_3.txt (including the IP address line).


Scenario 2: We want to split the above data into 3 different files, say ip_1.txt, ip_2.txt & ip_3.txt (excluding the IP address line).


Scenario 3: We want to split the above data into 3 different files with the IP address appended to the file name, e.g. ip_127.0.0.1.txt.



Scenario 1:

$ awk '/IP/{i++}{ print > "ip_"i".txt"}' F_Input_Data.txt

[OR]

$ awk '/IP/{i++}{ print $0 > "ip_"i".txt"}' F_Input_Data.txt

Output: The command creates 3 files.

$ cat ip_1.txt
IP:127.0.0.1
server:status:logtime
kwdev001:succeed:21:21:10
kwdev002:succeed:21:31:10
kwdev003:succeed:21:31:10

$ cat ip_2.txt
IP:127.0.0.2
server:status:logtime
kwsit001:succeed:22:21:10
kwsit002:succeed:22:31:10
kwsit003:succeed:22:31:10

$ cat ip_3.txt
IP:127.0.0.3
server:status:logtime
kwuat001:succeed:23:21:10
kwuat002:succeed:23:31:10
kwuat003:succeed:23:31:10

Explanation: Let's see how the above command works.
  • When the first line (IP:127.0.0.1) is read, awk checks whether the line contains the pattern IP.
  • IP is found in the first line, so the code inside the 1st pair of curly braces {} executes and i becomes 1.
  • The 2nd pair of curly braces {} has no pattern, so it also executes for the 1st line and redirects the line to the file ip_1.txt.
  • At this point the 2nd {} is effectively {print $0 > "ip_1.txt"}, so the first line is written to ip_1.txt.
  • When the 2nd line is read, IP is not found, so the condition fails and the 1st {} does not execute; i keeps its value (i.e. 1).
  • The 2nd {} still executes and writes the line (print $0 >) to ip_1.txt.
  • The same applies to lines 3 to 5.
  • When the 6th line (IP:127.0.0.2) is read, IP is found again, so the 1st {} executes and i becomes 2.
  • The 2nd {} is now effectively {print $0 > "ip_2.txt"}.
  • This continues, with i incremented each time the pattern matches again.
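
To make the walk-through above concrete, here is a small sketch that prints which file each line would go to instead of actually writing the split files. It uses only the first two blocks of the sample data, and the trace.txt file name is an assumption made for this illustration. Run it in an empty scratch directory, since it creates files in the current directory.

```shell
# Recreate a shortened version of the sample input (first two IP blocks only).
cat > F_Input_Data.txt <<'EOF'
IP:127.0.0.1
server:status:logtime
kwdev001:succeed:21:21:10
IP:127.0.0.2
server:status:logtime
kwsit001:succeed:22:21:10
EOF

# Same logic as Scenario 1, but report the target file name for every line.
# NR is awk's built-in record (line) number; trace.txt is a hypothetical name.
awk '/IP/{i++} { print NR ": " $0 " -> ip_" i ".txt" }' F_Input_Data.txt > trace.txt

cat trace.txt
# 1: IP:127.0.0.1 -> ip_1.txt
# 2: server:status:logtime -> ip_1.txt
# 3: kwdev001:succeed:21:21:10 -> ip_1.txt
# 4: IP:127.0.0.2 -> ip_2.txt
# 5: server:status:logtime -> ip_2.txt
# 6: kwsit001:succeed:22:21:10 -> ip_2.txt
```

The counter changes only on the lines that contain IP, while the second block runs for every line.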

Note: Think of the above command as follows:

if (pattern matching is TRUE)
{
i++
}
{Operation 1 independent of if block}
{Operation 2 independent of if block}

You should not think of it as an if-else block. It is a stand-alone if block ONLY.

You can rewrite the above solution in a more explicit way as follows:

$ awk -F":" '{ if ($0 ~ /^IP/){i++}{ print > "ip_"i".txt"} }' F_Input_Data.txt

Note: The tilde (~) is the regular-expression match operator in awk.


This form is clearer: the if block stands on its own, and the print block runs for every line, which is exactly the behavior explained above.
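
A portability note, offered as a sketch rather than a required change: some awk implementations parse an unparenthesized concatenation after > differently, so wrapping the file name expression in parentheses is the safer way to write it. The variant below also anchors the pattern with ^ so that only lines starting with IP advance the counter. Run it in an empty scratch directory, since it creates files in the current directory.

```shell
# Recreate the sample input from this article.
cat > F_Input_Data.txt <<'EOF'
IP:127.0.0.1
server:status:logtime
kwdev001:succeed:21:21:10
kwdev002:succeed:21:31:10
kwdev003:succeed:21:31:10
IP:127.0.0.2
server:status:logtime
kwsit001:succeed:22:21:10
kwsit002:succeed:22:31:10
kwsit003:succeed:22:31:10
IP:127.0.0.3
server:status:logtime
kwuat001:succeed:23:21:10
kwuat002:succeed:23:31:10
kwuat003:succeed:23:31:10
EOF

# Parentheses make it explicit that the whole concatenation is the file name.
awk '/^IP/{ i++ } { print > ("ip_" i ".txt") }' F_Input_Data.txt

# Each split file should hold 5 lines: the IP line, the header and 3 servers.
wc -l ip_1.txt ip_2.txt ip_3.txt
```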

Scenario 2:

$ awk '/IP/{i++} !/IP/{ print > "ip_"i".txt"}' F_Input_Data.txt

[OR]

$ awk '/IP/{i++} !/IP/{ print $0 > "ip_"i".txt"}' F_Input_Data.txt

[OR]

$ awk '{if ($0 ~ /^IP/){i++} if ($0 !~ /^IP/) { print $0 > "ip_"i".txt"}}' F_Input_Data.txt

Output: Produces the same files as in scenario 1; the only difference is that the split files no longer contain the IP address line.

$ cat ip_1.txt
server:status:logtime
kwdev001:succeed:21:21:10
kwdev002:succeed:21:31:10
kwdev003:succeed:21:31:10

$ cat ip_2.txt
server:status:logtime
kwsit001:succeed:22:21:10
kwsit002:succeed:22:31:10
kwsit003:succeed:22:31:10

$ cat ip_3.txt
server:status:logtime
kwuat001:succeed:23:21:10
kwuat002:succeed:23:31:10
kwuat003:succeed:23:31:10

Explanation:
  • For the first line, the pattern /IP/ matches, so the first pair of curly braces {} executes and sets i to 1.
  • For the same line, the condition !/IP/ fails, so the 2nd pair of curly braces {} does not execute and the line is not written to ip_1.txt.
  • When the 2nd line is read, the 1st condition /IP/ fails and the 2nd condition !/IP/ passes, so the line is written to ip_1.txt.
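
As an alternative sketch (not the form used in this article), the negated pattern can be replaced by awk's next statement: once an IP line has advanced the counter, next skips the remaining rules for that line, so it never reaches the print block. Run it in an empty scratch directory, since it creates files in the current directory.

```shell
# Recreate the sample input from this article.
cat > F_Input_Data.txt <<'EOF'
IP:127.0.0.1
server:status:logtime
kwdev001:succeed:21:21:10
kwdev002:succeed:21:31:10
kwdev003:succeed:21:31:10
IP:127.0.0.2
server:status:logtime
kwsit001:succeed:22:21:10
kwsit002:succeed:22:31:10
kwsit003:succeed:22:31:10
IP:127.0.0.3
server:status:logtime
kwuat001:succeed:23:21:10
kwuat002:succeed:23:31:10
kwuat003:succeed:23:31:10
EOF

# On an IP line: bump the counter, then skip straight to the next input line.
# Every other line falls through to the print block.
awk '/^IP/{ i++; next } { print > ("ip_" i ".txt") }' F_Input_Data.txt

cat ip_1.txt
```

This avoids testing the same pattern twice, which can matter when the pattern is long or when more rules are added later.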

Scenario 3:

$ awk -F":" '/IP/{i++;V_File_Name="ip_"$2".txt"} { print > V_File_Name}' F_Input_Data.txt #--Include IP in split files

[OR]

$ awk -F":" '{if( $0 ~ /^IP/){i++;V_File_Name="ip_"$2".txt"} { print > V_File_Name}}' F_Input_Data.txt #--Include IP in split files

[OR]

$ awk -F":" '/IP/{i++;V_File_Name="ip_"$2".txt"} !/IP/{ print > V_File_Name}' F_Input_Data.txt #--Exclude IP from split files

[OR]

$ awk -F":" '{if ($0 ~ /^IP/){i++;V_File_Name="ip_"$2".txt"} if( $0 !~ /^IP/){ print > V_File_Name}}' F_Input_Data.txt #--Exclude IP from split files


Output: The command produces the following 3 files:
ip_127.0.0.3.txt
ip_127.0.0.2.txt
ip_127.0.0.1.txt

Explanation:
  • We search for a line containing IP (/IP/); once it is found, we fetch the 2nd field of that line. (As the data shows, this line is ":" delimited, hence -F":".)
  • When the pattern matches, we derive the file name from the 2nd field, which gives file names like ip_127.0.0.1.txt, ip_127.0.0.2.txt, ip_127.0.0.3.txt.
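
For inputs with many IP blocks, awk may run into the limit on simultaneously open files. A minimal sketch of one way to handle this, assuming each IP appears only once in the input: close the previous output file whenever a new IP line starts. The variable name out is chosen for this illustration. Run it in an empty scratch directory, since it creates files in the current directory.

```shell
# Recreate the sample input from this article.
cat > F_Input_Data.txt <<'EOF'
IP:127.0.0.1
server:status:logtime
kwdev001:succeed:21:21:10
kwdev002:succeed:21:31:10
kwdev003:succeed:21:31:10
IP:127.0.0.2
server:status:logtime
kwsit001:succeed:22:21:10
kwsit002:succeed:22:31:10
kwsit003:succeed:22:31:10
IP:127.0.0.3
server:status:logtime
kwuat001:succeed:23:21:10
kwuat002:succeed:23:31:10
kwuat003:succeed:23:31:10
EOF

awk -F':' '
  /^IP:/ {
      if (out != "") close(out)   # release the previous output file
      out = "ip_" $2 ".txt"       # file name derived from the 2nd field
  }
  out != "" { print > out }       # ignore any lines before the first IP line
' F_Input_Data.txt

ls ip_127.0.0.*.txt
```

If the same IP could appear again later in the input, reopening its file with > after close() would overwrite the earlier content, so >> would be the safer redirection in that case.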

Conclusion: As explained above, we can split huge files or log files based on a pattern. We can use awk's regular-expression matching to search for a pattern and route the data accordingly. Along the way we have also seen how to use flow control (if blocks) in awk.


Keep Reading, Keep Learning, Keep Sharing...!!
