Randomly Delete Multiple Records From A CSV File

Rob_Sternberger · February 28, 2014, 5:40pm

Hi everyone!

I need to randomly delete 5,000 records from a csv file that contains 20,000 records. Each record is on a single line of the csv file; the csv file contains 20,000 lines. I searched the forum for some starter code but have found nothing.

My scripting idea is:
open csv file and get number of paragraphs (20,000)
randomly select a number from 1 to 20,000
delete that record based on the random number
repeat another 4,999 times

I feel like that will work but it seems a little clunky. Any thoughts?

Thanks!

McUsrII · February 28, 2014, 8:39pm

Hello.

I figure you know to use this to assemble a solution. I did this in a hurry, or I’d use Nigel Garvey’s quick sort handler.

property random_deletions : 5000

# set your_csv_file_as_list to (read (choose file))'s paragraphs
set your_csv_file_as_list to {}
repeat 20000 times
	set end of your_csv_file_as_list to "hakakakakakaka"
end repeat

set cur_lin_count to (count your_csv_file_as_list)
# assertion
if cur_lin_count < random_deletions then ¬
	error "not enough entries in the file ( " & random_deletions & " )." number -128

set rand_list to {}
script o
	property l : missing value
	property m : missing value
end script
set o's l to rand_list
# I do this to ensure 5000 unique random numbers
repeat random_deletions times
	repeat
		set rand_num to ((random number) * (cur_lin_count - 1) + 1) as integer
		if rand_num is not in o's l then
			set end of rand_list to rand_num
			exit repeat
		end if
	end repeat
end repeat
set o's m to your_csv_file_as_list
repeat with i from 1 to random_deletions
	set item (item i of o's l) of o's m to missing value
	#delete item (item i of o's l) of your_csv_file
end repeat
set cleansed_csv_list to o's m's text

# you should now have a cleansed list with 15000 items...(paragraphs).
# creating new text file
tell (a reference to text item delimiters)
	set {tids, contents of it} to {contents of it, linefeed}
	set sifted_csv to cleansed_csv_list as text
	set contents of it to tids
end tell
set fref to open for access (choose file name) with write permission
try
	write sifted_csv to fref starting at 0 as «class utf8»
	close fref
on error
	close fref
end try

Edit
I mended the script to deliver 5000 unique random numbers. (I opted for a slower solution by actually removing rows from the list, when I settled for the first solution with nonunique random numbers, as the item number would then have been replaced, so it wouldn’t matter.)

I didn’t remove enough the second time around, but now it is consistent, (and abit faster) it also saves the file afterwards, It is an open quenstion to me whether it will work with a 20000 lines file, since AppleScript only can deal with about 14000 elements.

Marc_Anthony · March 1, 2014, 3:07am

Hi. I started playing around with this before McUser edited his post, and I’ll leave this here as an alternative. I’m actually not sure how the real data is meant to be incorporated into his approach, so maybe it’s ultimately faster; I tested mine on a text file with 30000 lines, and it completed in roughly 1.5 seconds.

I made it work but not perfectly; the caveat is that there are stack overflows when the lineRange is close to the destroyCount. If you use the current values, there is enough proportional distance for that to not be an issue.


set numList to {}
set destroyCount to 5000
property lineRange : 20000

repeat destroyCount times
	my makeList()
end repeat

set isLine to (read (choose file))'s paragraphs

repeat with thisitem from 1 to destroyCount
	set my isLine's item (my numList's item thisitem) to null
end repeat
set AppleScript's text item delimiters to return
isLine's text as string


to makeList()
	tell my (random number from 1 to lineRange) to if my numList does not contain it then
		set my numList's end to it
	else
		my makeList()
	end if
end makeList

Edited to rename two variables, for clarity, and to produce a vertical list as the result.

Nigel_Garvey · March 1, 2014, 9:49am

Here’s another variation:

-- Assumptions in this script:
-- The number of CSV records â‰¥ 5000.
-- None of the fields contain line breaks.

set zapCount to 5000

set csvRecordList to paragraphs of (read (choose file))

script o
	property csvRecordList : missing value
	property indexList : missing value
end script

set o's csvRecordList to csvRecordList
copy csvRecordList to o's indexList

repeat with i from 1 to (count csvRecordList)
	set item i of o's indexList to i
end repeat

repeat zapCount times
	set i to some integer of o's indexList
	set item i of o's csvRecordList to missing value
	set item i of o's indexList to missing value
end repeat

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to return & linefeed
set editedCSV to o's csvRecordList's text as text
set AppleScript's text item delimiters to astid

return editedCSV

McUsrII · March 1, 2014, 9:49am

Hello.

I didn’t remove enough code yester day, but now it should be okay, added reading and writing file.

I am not sure if this will work since the big list contains more than 2^14 elements.

Marc_Anthony · March 1, 2014, 2:21pm

Nigel, your version and mine are neck and neck in my 30000 line test, but using some, based on length, appears much less precarious than relying on recursion.

McUsrII · March 1, 2014, 3:24pm

Hello.

@Marc Anthony: I wonder if my version is slower by your test (excluding any reading and writing).

If you stumble upon a very rare sequence of random digits, then you may get more alike digits than there are room for handler calls on the stack, and your approach may crash. The likelihood is very unlikely, probably improbable.

My solution is almost unreadable, but I hope it to be the fastest, most robust and exact.

Edit
I was curious, so I timed my machine, which is working hard at the moment. The script took 7 seconds to complete in AppleScript Editor with the log window visible. (If that makes a difference).

Marc_Anthony · March 1, 2014, 4:52pm

McUser, I tested yours, and it was roughly equivalent in speed to Nigel’s and mine”about 2 seconds on a 33,774 line document. I haven’t read and examined exactly what is going on in your code, but, FYI, there appears to be no content in the written output file.

McUsrII · March 1, 2014, 5:01pm

Hello.

I’m sorry to hear that there is no content that is output, I just tested it here, and worked flawlessly, I have optimized it a bit, so now it uses one second less on my heater. I’ll rerun it and test the file writing part.

Edit

I have fixed the problems with the file write, the biggest problem was a mistyped variable name for the file handler, the second being the fact that didn’'t open for access with write permission. I also added “starting at 0” for the case that a file is to be overwritten.

Nigel_Garvey · March 1, 2014, 5:34pm

Hi Marc. It probably depends on which parts and how much of the scripts you’re timing.

For a like-for-like comparison, your ‘choose file’ line should go above the first repeat and the timing should be from immediately after it (including both repeats) to the production of the final string. Timing both our scripts from after the obtaining of the paragraphs to after the production of the final text, I get 4.45 seconds for your script and 1.21 seconds for mine. However, my test file only has 10659 lines. The speeds may be more nearly similar when there are considerably more lines because my index list has to be the same length as the paragraph list, whereas yours only has to be the length of the number of lines to delete.

Hi McUsrII.

I don’t ignore them, I prevent them. As each randomly chosen index number’s used, it’s repaced in the index list with ‘missing value’ so that ‘some integer of o’s indexList’ doesn’t choose it again.

It takes AppleScript Editor quite a while to render large lists, records, and text into human-readable form, so the log window shouldn’t be open, and list/records/text results shouldn’t be allowed to display, during timing tests.

McUsrII · March 1, 2014, 6:47pm

Hello.

I’m sorry, I wasn’t aware of the fact that you could do that!

So now I know that the “card-shuffling-technique” can be easily implemented in AppleScript. That some can kind of type-qualify what it returns is really nifty!

Thanks.

(I’ll edit my post above.)

Nigel_Garvey · March 1, 2014, 7:57pm

Hi McUsrII.

I think this should be either .

set rand_num to (random number from 1 to cur_lin_count)

. or .

set rand_num to (random number (cur_lin_count - 1)) + 1

‘random number’ by itself returns a real between 0.0 and 1.0. The closer this real is to 1.0, the more likely the result of your entire line is to be 1 more than the number of items in your list, which is occasionally happening.

‘random number from x to y’ returns a number between x and y inclusively. If x and y are both integers, the result’s an integer. Otherwise it’s a real. If either x or y are omitted, the default values are 0.0 and 1.0 respectively, the result being the same class as the parameter which isn’t omitted.

‘random number x’ returns a number between 0.0 and x inclusively, the result being the same class as x. So ‘(random number (x - 1)) + 1’ produces integer results in the same range as ‘random number from 1 to x’. It used to be a little faster than using ‘from’ and ‘to’, but it doesn’t seem to be reliably so now.

Edit: Deleted assertion about ‘some’ being faster than any of these ‘random number’ variants after testing DJ Bazzie Wazzie’s comments in post #16 below.

McUsrII · March 1, 2014, 10:01pm

Hello Nigel.

You are correct, I assumed to have it right with

	set end of l to ((random number) * end_range  + 1) as integer

But the correct version is indeed.

set end of l to ((random number) * (end_range - 1) + 1) as integer

It was nice with a walk through of the random number command. I’ll of course stick to “some” when appropriate, and the from . to syntax, with the return of the passed in type, seems also like a better alternative to random number x.

It was nice to learn this, thanks

@Marc Anthony
The use of item, and that a null value isn’t an item, was a good observation!
Maybe you can tweak the last “drop” of it, by using a property, so that you get o’s m’s some item?

Nigel_Garvey · March 2, 2014, 12:38am

Hi Marc. The idea of using ‘some integer’ is to stop indices being used more than once: use an index, change it to a non-integer, ‘some integer’ ignores it thereafter.

Marc_Anthony · March 2, 2014, 1:10am

Thanks, Nige. I read that earlier, but didn’t see that changing it in my script was admitting index dupes… my list was too big. :rolleyes:

Please disregard this mistaken point; my use of null is identical to Nigel’s use of missing value.

DJ_Bazzie_Wazzie · March 2, 2014, 2:54am

Don’t bother… some is slow, very slow when lists gets bigger than 5,000. It’s an real turning point though. Also you’re using the card-shuffle-technique for only 5,000 of the 20,000 cards. So the last card, card #5,000, has only 25% of a collision which makes the total random generated number somewhere around 5,700. Because random number is much faster than some for bigger lists with minor fine tuning you can make in this particular situation random command faster.

When you would have a real card shuffle deck, then indeed it would have so much collisions that the last card of a 5,000 has 99.95% chance of a colision and the average rng is then around 50,000. So like I said, for this particular situation when you only need to shuffle 25% of a complete deck, random number is the way to go. Then some is the way to go because some has no collisions at all and it will beat rng just by it’s efficiency, not by it’s performance.

However the overhead of reading, splitting, joining and writing to a file of 20.000 lines (2MB in file size) takes 90% of all execution time. So I could post a way that’s more than 2 times faster than using some, the total amount of execution time will be shortened by less than 5%.

edit: example codes to compare:
here an version using some:

script linearList
	property array : {}
end script

script randomList
	property array : {}
end script

set linearList's array to {}
set randomList's array to {}

repeat with x from 1 to 20000
	set end of linearList's array to x
end repeat

repeat 5000 times
	set seed to some integer of linearList's array
	set item seed of linearList's array to missing value
	set end of randomList's array to seed
end repeat

return

Here an version using random command and using an boolean array.

script linearList
	property array : {}
end script

script randomList
	property array : {}
end script

set linearList's array to {}
set randomList's array to {}

repeat 20000 times
	set end of linearList's array to false
end repeat

set countRandomNumbers to 0
repeat until countRandomNumbers = 5000
	set seed to random number from 1 to 20000
	if not item seed of linearList's array then
		set item seed of linearList's array to true
		set end of randomList's array to seed
		set countRandomNumbers to countRandomNumbers + 1
	end if
end repeat

return

Like I said, the solution work’s best with an 1:4 ratio like the TS asks. When you use a deck with 100,000 card and pick randomliy 25,000 the solution is still fast, unlike some.

McUsrII · March 2, 2014, 10:23am

Hello.

I tried to write a solution in shellscript, to see if I could come up with something faster, but the truth is that AppleScript, and its one thread is actually quite fast.

Believe it or not, a loop in Applescript generating 20000 intergers, are faster than “seq 1 2000” in a do shell script, (or terminal window) for that matter.

And it has beaten me, that the random command of Applescript, is quite the easy and versatile implementation of it, with seed argument, that lets you re-run a sequence of random digits, which is quite useful for testing purposes at least.

For prosperity:

set t0 to (current date)
set randlist to paragraphs of (do shell script " seq 1 20000 |perl -wnl -e '@f=<>; END{ foreach $i (reverse 0 .. $#f) { $r=int rand ($i+1); @f[$i, $r]=@f[$r,$i] unless ($i==$r); } chomp @f; foreach $line (@f){ print $line; }}' |head -5000 ")
set t1 to (current date) - t0

Nigel_Garvey · March 2, 2014, 11:10am

OK, DJ. You’ve convinced me. :lol: After playing around with this for most of the morning to try and find the catch, I have to admit that ‘random number’ certainly is faster than ‘some integer’ with large lists. Things have changed since I and others originally investigated this.

In the context of the OP’s enquiry ” randomly deleting items from a list of text ” it’s not even necessary to generate your list of booleans, since each randomly-indexed item in the text list can instead be checked to see if it’s become ‘missing value’. So a faster version of my script would be:

-- Assumptions in this script:
-- The number of CSV records â‰¥ 5000.
-- None of the fields contain line breaks.

set zapCount to 5000

set csvRecordList to paragraphs of (read (choose file))

script o
	property csvRecordList : missing value
end script

set o's csvRecordList to csvRecordList
set recordCount to (count csvRecordList)

set countRandomNumbers to 0
repeat until (countRandomNumbers = zapCount)
	set i to (random number from 1 to recordCount)
	if (item i of o's csvRecordList is not missing value) then
		set item i of o's csvRecordList to missing value
		set countRandomNumbers to countRandomNumbers + 1
	end if
end repeat

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to return & linefeed
set editedCSV to o's csvRecordList's text as text
set AppleScript's text item delimiters to astid

return editedCSV

Rob_Sternberger · March 2, 2014, 2:19pm

Wow! I didn’t expect so many solutions and the last thing I was concerned about was overall speed; I just needed a solution that works.

Thanks again for ALL the scripting help!

DJ_Bazzie_Wazzie · March 3, 2014, 1:56am

I remember that some was faster than random. That was also the time that avoiding AppleEvents and simply built-in AppleScript functions was faster. Splitting strings using delimiters and counting their length as an alternative to the offset command was always faster, and proven to be faster. Also in this case due to more efficient self targeted events that’s not always the case anymore (depending on lengths of the string and substring).

You’re correct, there has been some unannounced changes in AppleScript.