Parallel data fetching/processing

Hi All,

I have a question about whether I can run some handlers in parallel, since the script in question can take five minutes to run on large files.

I’m using a script (the heart of which I got from Nigel Garvey, who is awesome) to fetch hundreds or thousands of bits of information from a series of large text files. Currently there is a single variable holding a list of data sets. Each item in the list has the data set’s name and the location of one or two files containing the information needed for that set. One handler reads the specified text files, fetches many pieces of information, and appends to each item in the list a list of records containing everything needed for that data set. A second handler then goes through each data set’s list (still all in one variable), does some calculations, and appends the results to each set’s list.

The handlers that take the longest to run don’t contain any application tells, although one does use a do shell script. Is there any way to run, say, the data-fetching handler in multiple instances, one per specified file, with each instance building a variable of the needed information that I can then append to the variable holding everything, rather than have it all happen sequentially?

It seems the only parallelization you can do in AppleScript is application tells inside an “ignoring application responses” block.

Would my chances be better trying this with AppleScript Studio in Xcode?




If you reworked your script’s “architecture” into a shell script that runs osascript jobs, and found a mechanism for “serializing” the data from the parallel jobs, then I guess you would be good. This sounds far more complex than it actually is.

With shell scripts, you can start several instances of the same script and run them in the background, with the different processes appending to the same file. The processes would then have to set lock files and such, though, so that their exclusive writing isn’t disrupted.
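A minimal sketch of the fan-out part: one background job per input file, then wait for them all. Here process_one is a placeholder for the real per-file work (for example, osascript fetchdata.scpt "$f" — the script name is hypothetical), and each job writes to its own file to sidestep the locking problem entirely.

```shell
#!/bin/sh
# One background job per input file, then wait for all of them to exit.
# process_one stands in for the real work, e.g. osascript fetchdata.scpt "$f".
process_one() {
    echo "fetched from $1"          # placeholder for the actual data fetching
}

outdir=$(mktemp -d)                 # each job gets its own output file in here
for f in fileA fileB fileC; do
    process_one "$f" > "$outdir/$f.out" &   # trailing & runs it in the background
done
wait                                # block until every background job has finished
```

After the wait, the main script can safely read everything in $outdir and collate it.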

Look up “Mutex” on Wikipedia.
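If the processes really must append to one shared file, a common shell-level mutex is a lock directory: mkdir is atomic, so only one process can create it at a time. A sketch, assuming the file and lock paths are yours to choose:

```shell
#!/bin/sh
# Mutex via mkdir: creating a directory is atomic, so only one process at a
# time can "take" the lock, giving exclusive access to the shared file.
RESULTS=$(mktemp)
LOCK="$RESULTS.lock"

append_result() {
    until mkdir "$LOCK" 2>/dev/null; do
        sleep 1                   # another process holds the lock; wait and retry
    done
    echo "$1" >> "$RESULTS"       # critical section: exclusive write
    rmdir "$LOCK"                 # release the lock so others can proceed
}

# Several background writers contending for the same file:
for i in 1 2 3; do
    append_result "line from job $i" &
done
wait
```

All three lines end up in $RESULTS intact; without the lock, concurrent appends risk interleaving.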



A simpler scheme than resorting to mutexes and lock files would be to have the main process partition the input, create the subprocesses, and create an output folder where each subprocess puts its output for the main process to collate. Each subprocess drops a “finished” file, so the main process knows when all the subprocesses are done and can collate the output.
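The “finished files” scheme above might be sketched like this: each worker writes to its own output file, then touches an empty .done marker; the main process polls for the markers and only then collates. The job names and file layout here are illustrative.

```shell
#!/bin/sh
# Each worker writes its output to its own file in an output folder, then
# drops an empty <name>.done marker. The main process collates only after
# it has seen one marker per job, so no locking is needed.
outdir=$(mktemp -d)
jobs="alpha beta gamma"

for j in $jobs; do
    (
        echo "data for $j" > "$outdir/$j.out"   # placeholder for the real work
        touch "$outdir/$j.done"                 # signal: this job is finished
    ) &
done

# Main process: wait until every job has dropped its marker.
for j in $jobs; do
    while [ ! -e "$outdir/$j.done" ]; do sleep 1; done
done

cat "$outdir"/*.out > "$outdir/combined.txt"    # safe to collate now
```

Because each worker owns its output file, writes never collide, and the markers double as a completion signal for the collating step.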