Cleaning data out of a text file?

dbrewood · July 20, 2022, 4:53pm

The IP in the output file only needs to be shown once, yes.

I’ve resent the files to you using a different email program; please advise if they get to you okay this time…

dbrewood · July 21, 2022, 5:00pm

It all seems to be working well at the moment

CK11 · July 28, 2022, 11:20am

Just perform all string comparisons within:

considering numeric strings
	...
	"179.154.236.75" > "18.166.33.109" # true
	...
end considering

That way, you can compare IP addresses numerically without the massive ball-ache.
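As a minimal sketch of how that might be wrapped up for reuse (the handler name ipIsGreater is hypothetical, not something from the earlier scripts), the comparison can live inside a small handler so the considering block only has to be written once:

```applescript
-- Hypothetical helper: true when ipA is numerically greater than ipB.
-- Inside "considering numeric strings", runs of digits are compared as numbers.
on ipIsGreater(ipA, ipB)
	considering numeric strings
		return ipA > ipB
	end considering
end ipIsGreater

-- 179 > 18 as numbers, although "179" sorts before "18" as plain text;
-- likewise 9 < 10 in the last octet of the second pair
{ipIsGreater("179.154.236.75", "18.166.33.109"), ipIsGreater("10.0.0.9", "10.0.0.10")}
-- → {true, false}
```

A plain text comparison of the same two pairs would give the opposite answers, which is the problem the considering block avoids.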

dbrewood · July 28, 2022, 12:12pm

Would that make it more efficient / faster?

I’ve also found out that the script needs to write the output file so that it ends with the last IP address, and not with an IP address followed by a CR (if that makes sense)…

CK11 · July 29, 2022, 12:05am

Almost certainly, given what @robertfern described as the alternative, namely, decomposing the string into octets, coercing to integers, etc. Those are hefty operations, so if there’s a way to avoid all of that, it’d be desirable.

Why? It’s a fairly standard convention that text files contain a single trailing blank line, i.e. that the final character is a newline character (typically a linefeed on macOS and other Unix systems; Windows uses a carriage return followed by a linefeed).

dbrewood · July 29, 2022, 8:03am

The reason for not requiring a blank line at the end of the created output list is that the other script (up thread) which submits the list entries to IP Abuse looks to try and submit the blank line. That then causes an ‘invalid data’ error from IP Abuse and the script falls over.

As I submit the data for checking once it hits around 980+ IPs and the submission limit is 1000 IPs a day, if I get an error I have to re-check the list and re-submit the next day. It’s just easiest to have a ‘good list’ I can submit easily.

Hoping that makes sense

CK11 · July 29, 2022, 10:42pm

Ah, that’s probably because the script didn’t read in the file properly. I haven’t looked at the scripts, but, as I intimated previously, a trailing blank line is so standard that AppleScript takes care of it automatically, so it needn’t be an issue.

When reading in a file using AppleScript’s read command, if the intention is to split the contents into a list of text items, such as when dealing with CSV files, or in your situation when you want to split the file into separate lines, then it is better for this to be done at the point of reading by the read command itself. Doing it manually after the data has been read usually means using text item delimiters, and that naturally leads to errant items in the resulting list, namely empty string items (""), which sounds like what’s happening in your case.

There are other reasons to have the read command process the data it reads for you, such as when the file’s contents aren’t plain text, or use a text encoding other than ASCII or UTF-8. It’s also able to convert data from one type into another, to provide, for example, a list of boolean values or integer values instead of strings.

But sticking with plain text, you want to read your file containing the IP addresses (one per line) something like this:

set filepath to "/path/to/ipv4-file.txt"
set ip_addrs to read the filepath as "utf8" using delimiter linefeed

It’s always prudent to specify the type of data you want to be given, in this case utf8 encoded text (items). The delimiter supplied to the command can be a single character, a list, or a string. When supplying a string, it splits text on every instance of each character in the string, which is an important distinction between this and text item delimiters, where it would split on instances of the entire string.

You’ll find that whether the file has a trailing blank line or terminates at the end of a string, the list obtained won’t have an empty string item at the end of it.

If you’re wondering why trailing blank lines are a standard, one reason is that when writing to a text file that isn’t empty, you almost always want new text to be appended starting on the line below the text that is already there. Without the trailing newline, new text written to the file would carry on from the end of the last line. Also, some programs use the presence of a terminating newline character as an end-of-file marker.
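For the case raised earlier in the thread, where the consuming script needs a list that stops right after the last IP address, here is a rough sketch of reading with a delimiter and writing the list back out with linefeeds between items only. Both paths are placeholders, the variable names are made up for illustration, and this is not taken from the earlier scripts.

```applescript
-- Placeholder paths; adjust to suit
set inPath to "/path/to/ipv4-file.txt"
set outPath to "/path/to/ipv4-file_IP_CLEANED.txt"

-- Split into lines at the point of reading
set ipAddrs to read (POSIX file inPath) as «class utf8» using delimiter linefeed

-- Join with linefeeds: the delimiter goes between items only,
-- so nothing follows the last IP address
set savedTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set outText to ipAddrs as text
set AppleScript's text item delimiters to savedTIDs

set fileRef to open for access (POSIX file outPath) with write permission
try
	set eof of fileRef to 0 -- discard any previous contents
	write outText to fileRef as «class utf8»
	close access fileRef
on error errMsg number errNum
	close access fileRef -- don't leave the file open if the write fails
	error errMsg number errNum
end try
```

Saving and restoring the text item delimiters keeps the change from leaking into any other part of a longer script.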

dbrewood · July 30, 2022, 8:09am

Sounds logical to me

Way beyond my abilities to fix the scripts though

Mockman · July 30, 2022, 10:41pm

CK’s recent posts got me playing around with the read command, and I couldn’t help but imagine that the shell’s sort command might be useful (surely some of the folks who worked on IPv4 were Unix users), so here is an alternate approach.

Sort requires an input file. You could save the intermediate results (e.g. blank lines removed) back to the original but I’ve created an empty text file to deposit them in. You also need an empty output file, although you could revise the script to create that automatically. I’ve included an option to show the results rather than save them.

set rFil to ((path to desktop) as text) & "rawfile.txt" as alias -- source data
set iFil to ((path to desktop) as text) & "intfile.txt" as alias -- intermediate file
set sFil to ((path to desktop) as text) & "sortedfile.txt" as alias -- sorted output file

set rawInput to read rFil using delimiter linefeed

-- exclude all lines without '.' in them
set rList to {}
repeat with z from 1 to length of rawInput
	if item z of rawInput contains "." then set end of rList to item z of rawInput
end repeat

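-- deposit pruned results into intermediate file 'intfile.txt'
-- (assumes intfile.txt starts out empty; write does not truncate existing contents)
set savedTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set rText to rList as text
set AppleScript's text item delimiters to savedTIDs -- restore the previous delimiters

write rText to iFil
set iFilpp to POSIX path of iFil
set sFilpp to POSIX path of sFil

-- sort and save in 'sortedfile.txt' (paths are quoted in case they contain spaces)
do shell script "sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 " & quoted form of iFilpp & " -o " & quoted form of sFilpp
read sFilpp -- optional, just so you can see resulting text in script editor

-- or do this to merely view the results
-- do shell script "sort -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4 " & quoted form of iFilpp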


-- origin of 'sort' arguments
-- https://stackoverflow.com/q/21980133/7410243

dbrewood · July 31, 2022, 11:40am

Looks interesting… I’ve no idea how to use it myself, though, as I can’t work out what I need to change so that it works as a compiled application I can drag and drop my source file onto, or how to change it so that the output file is named after the dropped file with ‘_IP_CLEANED’ added to it.
Sorry…
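One possible shape for such a droplet, offered as an untested sketch rather than a drop-in replacement: it assumes the dropped file is plain text with one entry per line, keeps only lines containing a dot (as the script above does), and leans on the same sort arguments. The handler name cleanedPathFor, the grep filter, the -u flag (to drop repeated addresses) and the printf wrapper (to avoid a final newline) are all additions for illustration.

```applescript
-- Hypothetical helper: put "_IP_CLEANED" before a trailing ".txt", otherwise append it
on cleanedPathFor(inPath)
	if inPath ends with ".txt" then
		return (text 1 thru -5 of inPath) & "_IP_CLEANED.txt"
	else
		return inPath & "_IP_CLEANED"
	end if
end cleanedPathFor

-- Runs when files are dropped onto the script saved as an application
on open droppedItems
	repeat with droppedItem in droppedItems
		set inPath to POSIX path of (contents of droppedItem)
		set outPath to cleanedPathFor(inPath)
		-- grep -F . keeps only lines containing a dot; sort orders them octet by octet
		-- and -u drops lines whose four sort keys repeat
		set sortCmd to "grep -F . " & quoted form of inPath & " | sort -u -n -t . -k 1,1 -k 2,2 -k 3,3 -k 4,4"
		-- printf %s writes the captured output with no final newline
		do shell script "printf %s \"$(" & sortCmd & ")\" > " & quoted form of outPath
	end repeat
end open
```

Saved from Script Editor with the file format set to Application, it should write the cleaned file next to each dropped file, e.g. rawfile.txt would give rawfile_IP_CLEANED.txt.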

dbrewood · April 30, 2024, 9:25am

This works wonderfully well with one exception… after the CSV file has been saved to the desktop it leaves a ‘ReportX’ Numbers file open inside the Numbers app, and the next time I open Numbers I’m prompted to save it. Is there a way of closing Numbers and either saving that file automatically or closing it without saving? The ‘close documents saving no’ command seems not to do anything.