Cleaning data out of a text file?

estockly · July 19, 2022, 4:14pm

Will the end user have BBEdit installed?

dbrewood · July 19, 2022, 4:15pm

Ah ha, right. I can live with it if I have to, it’s not a problem.
I don’t want to put you to so much trouble…

dbrewood · July 19, 2022, 4:17pm

I do, and even better, TextMate.

Ideally though I’d prefer not to use another tool. The original script did sort in the required manner. But if this is a preferred way to go with things…

robertfern · July 20, 2022, 12:56am

Here is a version that will sort IPs numerically, not alphabetically.

I found my multi-item sort version of combSort

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

global IPsExisted

on run
	try
		set theFiles to choose file of type {"txt"} with multiple selections allowed
	on error
		return
	end try
	open theFiles
end run

on open theFiles
	local atid, dText, aFile, cFile, startTime, tmp
	script F
		property ipList : missing value
	end script
	set atid to text item delimiters
	set IPsExisted to read (((path to home folder as text) & "Library:Mobile Documents:deny-ip-list.txt") as alias)
	--set IPsExisted to paragraphs of IPsExisted -- left IPsExisted as text rather than list, sped up search
	set text item delimiters to {linefeed}
	set IPsExisted to linefeed & ((paragraphs of IPsExisted) as text) & linefeed -- gets rid of returns, only linefeeds left
	repeat with aFile in theFiles
		try
			set cFile to open for access aFile --with write permission
		on error
			display alert "Uh-oh! Error opening file…" giving up after 10
			return false
		end try
		try
			set dText to read cFile from 1 as text
		on error
			set dText to false
			display alert "File Empty!" giving up after 10
		end try
		close access cFile
		if class of dText is not boolean then
			set F's ipList to parseDosIPs(dText)
			set text item delimiters to {".txt"}
			set dText to ((text items 1 thru -2 of (aFile as text)) as text)
			set text item delimiters to {"."}
			set c to count of (text items in item 1 of F's ipList)
			repeat with j from 1 to count F's ipList
				set tmp to text items of item j of F's ipList
				repeat with i from 1 to c
					set item i of tmp to (item i of tmp) as integer
				end repeat
				set item j of F's ipList to tmp
			end repeat
			combSortM(F's ipList, {1, 2, 3, 4})
			--=ipList
			repeat with j from 1 to count F's ipList
				set item j of F's ipList to item j of F's ipList as text
			end repeat
			saveIPs(F's ipList, dText)
		end if
	end repeat
	set text item delimiters to atid
end open

on parseDosIPs(dosText)
	local atid, IPv4, ipRef, IPs, i
	script D
		property dosList : paragraphs of dosText
		property tmp : missing value
	end script
	set ipRef to a reference to IPsExisted
	set IPs to ""
	--set dosText to paragraphs of dosText
	set atid to text item delimiters
	set text item delimiters to {"remote] from ", "] from source: ", "LAN access", ", ", ":"} -- {"from source: ", ", "}
	repeat with i in D's dosList
		set IPv4 to contents of i
		if IPv4 ≠ "" then
			set D's tmp to text items of IPv4
			if (count D's tmp) > 2 then
				set IPv4 to linefeed & (item 3 of D's tmp) & linefeed --word 1 of 
				if IPv4 does not start with (linefeed & "192.168.1.") then
					if (IPv4 is not in IPs) then
						if (IPv4 is not in ipRef) then
							set IPs to IPs & IPv4 --set end of D's IPs to IPv4
						end if
					end if
				end if
			end if
		end if
	end repeat
	set text item delimiters to linefeed & linefeed
	set IPs to text items of (text 2 thru -2 of IPs) -- removes leading and trailing linefeed
	set text item delimiters to atid
	return IPs
end parseDosIPs

on combSortM(aList, sortList) -- FASTEST
	local i, j, cc, ns, js, gap, pgap, c, sw, sf, comp -- ns means No Swap
	script mL
		property nlist : aList
		property sList : {}
		property oList : {}
	end script
	set sf to 1.7 -- shrink factor
	set cc to count mL's nlist
	repeat with j in sortList
		if j > 0 then -- if positive, sort ascending
			set end of mL's sList to (contents of j)
		else -- if negative,sort descending
			set end of mL's sList to -(contents of j)
		end if
		set end of mL's oList to (j > 0)
	end repeat
	
	set gap to cc div sf
	repeat until gap = 0
		repeat with i from 1 to gap
			set js to cc - gap
			repeat until js < 1 -- repeat each gap pass until no more swaps
				set ns to gap
				repeat with j from i to js by gap
					set comp to false
					repeat with k from 1 to count of mL's sList -- do multiple comparisons (k, so the outer loop's i is not overwritten)
						set c to item k of mL's sList
						if (item c of item j of mL's nlist) < (item c of item (j + gap) of mL's nlist) then
							if not (item k of mL's oList) then set comp to true -- descending key: swap when left < right
							exit repeat
						else if (item c of item j of mL's nlist) > (item c of item (j + gap) of mL's nlist) then
							if (item k of mL's oList) then set comp to true -- ascending key: swap when left > right
							exit repeat
						end if
					end repeat
					if comp then -- do the swap
						set sw to (item j of mL's nlist)
						set (item j of mL's nlist) to (item (j + gap) of mL's nlist)
						set (item (j + gap) of mL's nlist) to sw
						set ns to j
					end if
				end repeat
				set js to ns - gap
			end repeat
		end repeat
		set pgap to gap
		set gap to gap div sf
		if gap = 0 then -- no while using as integer
			if pgap ≠ 1 then set gap to 1
		end if
	end repeat
end combSortM

on saveIPs(pList as list, pPath as string)
	local cFile, cEOF
	set cFile to pPath & "_IP_CLEANED.txt"
	try
		set cFile to open for access cFile with write permission
	on error
		display alert "Uh-oh! Error opening file…" giving up after 10
		return false
	end try
	set atid to text item delimiters
	set text item delimiters to linefeed
	try
		set cEOF to (get eof cFile) + 1
		write (pList as text) & linefeed to cFile as text starting at cEOF
	on error
		display alert "Error! Can't write to IP_CLEANED file…" giving up after 10
	end try
	set text item delimiters to atid
	close access cFile
	return true
end saveIPs

EDIT: I added another linefeed to the end of the text read from deny-ip-list.txt. This is so every IP is enclosed in linefeeds, to prevent finding, for example, “41.2.3.4” in “141.2.3.4” when using text searches.

EDIT I moved the ipList into a script object to speed up the parsing
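
The effect of enclosing every address in linefeeds can be seen in a minimal sketch. The sample addresses and the handler name `isListed` are made up for illustration and are not part of the script above:

```applescript
-- Hypothetical helper: true only if ip appears as a whole line in denyText.
on isListed(ip, denyText)
	-- Wrap both sides in linefeeds so only whole-line matches count
	set wrapped to linefeed & denyText & linefeed
	return wrapped contains (linefeed & ip & linefeed)
end isListed

set denyText to "141.2.3.4" & linefeed & "10.0.0.7"

denyText contains "41.2.3.4" -- true: a plain substring match, i.e. a false positive
isListed("41.2.3.4", denyText) -- false: no whole line equals 41.2.3.4
isListed("141.2.3.4", denyText) -- true
```

This sketch assumes the deny list already uses linefeeds only; the script above normalises returns to linefeeds first by going through `paragraphs of`.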

dbrewood · July 20, 2022, 7:50am

Sorry, but I got an error again:

Fixed it, I replaced the line above with this one:

set IPsExisted to read (((path to home folder as text) & "Library:Mobile Documents:com~apple~CloudDocs:_Daron Files:NAS:Deny List:deny-ip-list.txt") as alias)
   --set IPsExisted to paragraphs of IPsExisted -- left IPsExisted as text rather than list, sped up search

That is perfect now!

Many many thanks…

dbrewood · July 20, 2022, 8:11am

Okay, being a ‘data freak’ I did a compare of the output data from the original script I used (where I had to strip out manually the 192.168.1.xxx IP addresses) and the new script. There was a difference of 5 IP addresses, the new script output didn’t contain 5 addresses.
The IP addresses which were not picked up were from these lines:

[LAN access from remote] from 192.241.214.10:48856 to 192.168.1.202:80, Monday, July 18, 2022 12:14:50
[LAN access from remote] from 192.241.214.10:54378 to 192.168.1.202:80, Monday, July 18, 2022 12:16:15

[LAN access from remote] from 64.62.197.7:54397 to 192.168.1.202:80, Tuesday, July 19, 2022 13:07:12
[LAN access from remote] from 64.62.197.9:60333 to 192.168.1.202:80, Monday, July 18, 2022 02:25:26

[LAN access from remote] from 71.6.232.2:37319 to 192.168.1.202:80, Monday, July 18, 2022 19:58:58
[LAN access from remote] from 71.6.232.2:57042 to 192.168.1.202:80, Monday, July 18, 2022 19:58:59

[LAN access from remote] from 162.142.125.7:32998 to 192.168.1.202:32400, Sunday, July 17, 2022 05:38:29
[LAN access from remote] from 162.142.125.7:43402 to 192.168.1.202:32400, Sunday, July 17, 2022 05:38:30
[LAN access from remote] from 162.142.125.7:55248 to 192.168.1.202:32400, Sunday, July 17, 2022 05:38:28
[LAN access from remote] from 162.142.125.7:57904 to 192.168.1.202:32400, Sunday, July 17, 2022 05:38:27

Any ideas at all as to why these were missed?

Sorry about this…

robertfern · July 20, 2022, 1:43pm

Weird.

Of the sample you just gave only half were in the sample file you gave me.

But when I added them, they showed up for me.

Are any of them in the deny-ip-list?

p.s. I made some small edits to the script above.

dbrewood · July 20, 2022, 2:50pm

Much weirdness indeed. I’ve just run the script (the updated one) over the latest log file (and today’s deny list) and this time the resultant data ties up. Very strange indeed.

I’ve sent copies of both to you via email…

Oh, I did get the alias error again so had to modify the script, changing:

set IPsExisted to read (((path to home folder as text) & "Library:Mobile Documents:deny-ip-list.txt") as alias)
	--set IPsExisted to paragraphs of IPsExisted -- left IPsExisted as text rather than list, sped up search

to

set IPsExisted to read (((path to home folder as text) & "Library:Mobile Documents:com~apple~CloudDocs:_Daron Files:NAS:Deny List:deny-ip-list.txt") as alias)
--set IPsExisted to paragraphs of IPsExisted -- left IPsExisted as text rather than list, sped up search

estockly · July 20, 2022, 3:48pm

I was wondering about these.

Do you want just the first?

(I didn’t get the email)

dbrewood · July 20, 2022, 4:53pm

The IP in the output file only needs to be shown once, yes.

I’ve resent the files to you again using a different email program, please advise if they get to you okay this time…

dbrewood · July 21, 2022, 5:00pm

It all seems to be working well at the moment

CK11 · July 28, 2022, 11:20am

Just perform all string comparisons within:

considering numeric strings
		 ...
		 "179.154.236.75" > "18.166.33.109" # true
		 ...
end considering

That way, you can compare IP addresses numerically without the massive ball-ache.
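
As a rough illustration of that suggestion, here is a small sketch that sorts dotted-quad strings directly, with no splitting into octets. The handler name `sortIPs` is hypothetical, and a plain insertion sort is used only to keep the example short; it is not the combSort from earlier in the thread:

```applescript
-- Hypothetical helper: insertion sort of IP strings, compared numerically.
on sortIPs(ipList)
	set sorted to {}
	considering numeric strings
		repeat with ipRef in ipList
			set ip to contents of ipRef
			set n to count sorted
			set pos to n + 1 -- default: append at the end
			repeat with k from 1 to n
				if ip < item k of sorted then
					set pos to k -- first element that should come after ip
					exit repeat
				end if
			end repeat
			if pos = 1 then
				set sorted to {ip} & sorted
			else if pos > n then
				set end of sorted to ip
			else
				set sorted to (items 1 thru (pos - 1) of sorted) & {ip} & (items pos thru n of sorted)
			end if
		end repeat
	end considering
	return sorted
end sortIPs

sortIPs({"179.154.236.75", "18.166.33.109", "9.9.9.9"})
-- → {"9.9.9.9", "18.166.33.109", "179.154.236.75"}
```

An insertion sort is quadratic, so for long lists the same `considering numeric strings` block could instead wrap the comparisons inside a faster sort.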

dbrewood · July 28, 2022, 12:12pm

Would that make it more efficient / faster?

I’ve also found out that the script needs to stop creating the output file so that the output file ends on a line with the IP address, and not an IP address followed by a CR (if that makes sense)…

CK11 · July 29, 2022, 12:05am

Almost certainly, given what @robertfern described as the alternative, namely, decomposing the string into octets, coercing to integers, etc. Those are hefty operations, so if there’s a way to avoid all of that, it’d be desirable.

Why? It’s a fairly standard convention that text files end with a single trailing newline, i.e. that the final character is a newline character (more typically a linefeed than a carriage return, unless you’re working in Windows).

dbrewood · July 29, 2022, 8:03am

The reason for not requiring a blank line at the end of the created output list is that the other script (up thread) which submits the list entries to IP Abuse looks to try and submit the blank line. That then causes an ‘invalid data’ error from IP Abuse and the script falls over.

As I submit the data for checking once it hits around 980+ IP’s and the submission limit is 1000 IP’s a day, if I get an error I have to re-check the list and re-submit the next day. It’s just easiest to have a ‘good list’ I can submit easily.

Hoping that makes sense
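
One way to sidestep the problem on the reading side, whatever the file ends with, is to drop empty lines before submitting. The submitting script is not shown here, so this is only a sketch under that assumption, and the handler name `nonEmptyLines` is hypothetical:

```applescript
-- Hypothetical helper: returns the non-empty lines of theText as a list.
on nonEmptyLines(theText)
	set kept to {}
	repeat with lineRef in paragraphs of theText
		set aLine to contents of lineRef
		-- A trailing linefeed produces one final empty paragraph; skip it
		if aLine is not "" then set end of kept to aLine
	end repeat
	return kept
end nonEmptyLines

nonEmptyLines("1.2.3.4" & linefeed & "5.6.7.8" & linefeed)
-- → {"1.2.3.4", "5.6.7.8"}
```

With a filter like this in place, the output file can keep its conventional trailing linefeed.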

CK11 · July 29, 2022, 10:42pm

Ah, that’s probably because the script didn’t read in the file properly. I haven’t looked at the scripts, but, as I intimated previously, a trailing blank line is so standard that AppleScript can take care of it automatically, so it needn’t be an issue.

When reading in a file using AppleScript’s read command, if the intention is to split the contents of the file into a list of text items, such as when dealing with CSV files, or in your situation when you want to split the file into separate lines, then it is better for this to be done at the point of reading by the read command itself. Doing this manually after the data has been read usually means using text item delimiters, so that would naturally lead to errant items in the resulting list, namely empty string items (“”), which sounds like this is what’s happening in your case.

There are other reasons to have the read command process the data it reads for you, such as when the file contents isn’t plain text, or uses text encoding that isn’t ascii or utf-8. It’s also able to convert data from one type into another, to provide, for example, a list of boolean values or integer values instead of empty strings.

But sticking with plain text, you want to read your file containing the IP addresses (one per line) something like this;

set filepath to "/path/to/ipv4-file.txt"
set ip_addrs to read the filepath as "utf8" using delimiter linefeed

It’s always prudent to specify the type of data you want to be given, in this case utf8 encoded text (items). The delimiter supplied to the command can be a single character, a list, or a string. When supplying a string, it splits text on every instance of each character in the string, which is an important distinction between this and text item delimiters, where it would split on instances of the entire string.

You’ll find that whether the file has a trailing blank line or terminates at the end of a string, the list obtained won’t have an empty string item at the end of it.

If you’re wondering why trailing blank lines are a standard, one reason is because when writing to a text file that isn’t empty, you almost always would want new text to be appended to it starting on the line below the last lot of text that is already there. Without the trailing line, new text written to the file would carry on from the end of the previous sentence. Also, some programs actually use the presence of a terminating newline character as an end-of-file marker.

dbrewood · July 30, 2022, 8:09am

Sounds logical to me

Way beyond my abilities to fix the scripts though

Mockman · July 30, 2022, 10:41pm

CK’s recent posts got me playing around with the read command, and I couldn’t help but imagine that the shell’s sort command might be useful (surely some of the folks who worked on IPv4 were Unix users), so here is an alternate approach.

Sort requires an input file. You could save the intermediate results (e.g. blank lines removed) back to the original but I’ve created an empty text file to deposit them in. You also need an empty output file, although you could revise the script to create that automatically. I’ve included an option to show the results rather than save them.

set rFil to ((path to desktop) as text) & "rawfile.txt" as alias -- source data
set iFil to ((path to desktop) as text) & "intfile.txt" as alias -- intermediate file
set sFil to ((path to desktop) as text) & "sortedfile.txt" as alias -- sorted output file

set rawInput to read rFil using delimiter linefeed

-- exclude all lines without '.' in them
set rList to {}
repeat with z from 1 to length of rawInput
	if item z of rawInput contains "." then set end of rList to item z of rawInput
end repeat

-- deposit pruned results into intermediate file 'intfile.txt'
set AppleScript's text item delimiters to linefeed
iFil as text

write (rList as text) to iFil
set iFilpp to POSIX path of iFil
set sFilpp to POSIX path of sFil

-- sort and save in 'sortedfile.txt'
do shell script "sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 " & quoted form of iFilpp & " -o " & quoted form of sFilpp -- quoted so paths containing spaces work
read sFilpp -- optional, just so you can see resulting text in script editor

-- or do this to merely view the results
-- do shell script "sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 " & quoted form of iFilpp


-- origin of 'sort' arguments
-- https://stackoverflow.com/q/21980133/7410243

dbrewood · July 31, 2022, 11:40am

Looks interesting… I’ve no idea how to use it, though. I can’t work out what I need to change so that it works when I drag and drop my source file onto it as a compiled application, or how to change it so that the output file is named after the dropped file with ‘_IP_CLEANED’ added to it.
Sorry…
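
For what it’s worth, a droplet version of the sort approach might look roughly like the sketch below. It assumes, as the script above does, that each dropped file already holds one IP address per line; it does not parse the router log or check the deny list. The handler name `cleanedPath` is made up for illustration:

```applescript
-- Runs when files are dropped onto the script saved as an application.
on open droppedFiles
	repeat with aFile in droppedFiles
		set inPath to POSIX path of aFile
		set outPath to cleanedPath(inPath)
		-- Keep only lines containing a ".", then sort numerically by each octet
		do shell script "grep -F . " & quoted form of inPath & " | sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 -o " & quoted form of outPath
	end repeat
end open

-- Hypothetical helper: "/x/log.txt" becomes "/x/log_IP_CLEANED.txt".
on cleanedPath(p)
	if p ends with ".txt" then set p to text 1 thru -5 of p
	return p & "_IP_CLEANED.txt"
end cleanedPath
```

Saved from Script Editor with File Format set to Application, the `on open` handler receives whatever is dropped on the icon, and the sorted file is written next to the original.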

dbrewood · April 30, 2024, 9:25am

This works wonderfully well with one exception … after the CSV file has been saved to the desktop it leaves a ‘ReportX’ numbers file live inside the Numbers app and next time I open Numbers I’m prompted to save it. Is there a way of closing Numbers and either saving that file automatically or closing it without saving? The ‘close documents saving no’ seems not to do anything.