Split 1.2 GB text file into separate, smaller files.

elsombreron · June 27, 2008, 4:48am

Hi all,

I’m new to this forum, so I’m not sure if this is the best place to ask this. If not, please let me know.

**Background**
I have a dataset that’s 1.2 GB big! The text file is delimited by the character “~” (tilde). The first row holds the field (column) names and every subsequent row holds the data. Below is an example of what it looks like. The actual dataset has more columns and rows.

Group_number~Price~Group_description
1~34.2~Base_Customer
2~23~Medium_Customer
3~89~High_Customer
1~90.21~Base_Customer
2~100.55~Medium_Customer
3~200.11~High_Customer

**Problem**
Now, I usually read this kind of file using Stata (a statistical package - www.stata.com). I then do all sorts of analyses in the same package. The problem is that Stata holds the data you load in memory, and the text file I’m trying to read is 1.2 GB! It has way, way, way more rows than the example I gave above, which is how it reaches such a size.

I’m no expert in computer hardware, but my MacBook Pro 2.16 GHz Intel Duo has only 1 GB of RAM. So, I’m pretty sure it won’t load! Furthermore, I’ve tried loading large files with Stata and all the OS is willing to give it is 900 MB.

**Request for help**
So, I figured that what I may be able to do is somehow open the text file as a stream and split it into, say, 3 separate files, one for each Group_number in my example above. This would hopefully yield 3 separate text files that would be under 900 MB each. I could then easily manage these. The problem is, I don’t know how to do this in OS X. I’ve seen/heard of people doing this in C# on Windows, but I’m wondering (and quietly hoping) whether an AppleScript could do the job. If so, I would love some pointers if anyone’s got any. I have very limited experience with AppleScript but have experience with procedural and (basic level) object-oriented programming, so I’m hoping I’ll be able to work out the scripts people pitch.

Thanks so much in advance for your help!!!

Model: MacBook Pro
AppleScript: 2.2
Browser: Safari 525.20
Operating System: Mac OS X (10.5)

Martin_Michel · June 27, 2008, 7:18am

Hi elsombreron,

I have written a small Python script for you, which might get you started. You can download and inspect it right here.

To test it on a backup of your file, you must first save the Python script to your Mac, then open a Terminal window and enter the following command:

/usr/bin/python pythonscriptpath largetextfilepath

The Python script will start a new file every 50,000 lines, but you can easily modify this within the code. The new text files will have a leading number in their file name following the scheme: 1_largetextfilename. They are created in the same folder as your large text file. The script will also add the first row of your large text file (the column names) to every subsequent text file.
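For readers who cannot get at the download, here is a minimal sketch of the behaviour described above. It is not the original script, just a reconstruction under a few assumptions: it is written for Python 3, the function name `split_by_lines` and its `lines_per_file` parameter are made up for illustration, and the file is read in binary mode so no text encoding has to be guessed.

```python
import os


def split_by_lines(path, lines_per_file=50000):
    """Split `path` into numbered chunk files, repeating the header row in each.

    Chunks are named 1_<name>, 2_<name>, ... and written next to the input file.
    Returns the list of chunk paths in the order they were created.
    """
    folder, name = os.path.split(os.path.abspath(path))
    created = []
    out = None
    try:
        # Binary mode streams the file line by line without decoding it,
        # so only one line is held in memory at a time.
        with open(path, "rb") as src:
            header = src.readline()  # first row: the column names
            for index, line in enumerate(src):
                if index % lines_per_file == 0:
                    # Time to start the next chunk file.
                    if out is not None:
                        out.close()
                    chunk_path = os.path.join(
                        folder, "%d_%s" % (len(created) + 1, name)
                    )
                    out = open(chunk_path, "wb")
                    out.write(header)
                    created.append(chunk_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return created
```

Calling `split_by_lines("largetextfilepath")` with the real path would write the chunks into the same folder; pass a different `lines_per_file` to change where the files break.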

Nevertheless, I cannot guarantee 100% success with my script. But you get the idea.
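If you would rather have one file per Group_number, as in the original request, the same streaming idea can be adapted. The following is only a sketch under stated assumptions: Python 3, the grouping value is the first "~"-separated field, the group values are plain ASCII and safe to use in file names, and there are few enough distinct groups that one open file per group is fine. The name `split_by_group` and the `group_<value>_<name>` naming scheme are hypothetical.

```python
import os


def split_by_group(path, delimiter=b"~", key_column=0):
    """Write one file per distinct value in `key_column`, each with the header row.

    Output files are named group_<value>_<name> and written next to the input.
    Returns the sorted list of output paths.
    """
    folder, name = os.path.split(os.path.abspath(path))
    outputs = {}  # group value (bytes) -> open output file
    try:
        with open(path, "rb") as src:
            header = src.readline()  # first row: the column names
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                key = line.rstrip(b"\r\n").split(delimiter)[key_column]
                out = outputs.get(key)
                if out is None:
                    # First time this group shows up: create its file.
                    chunk_path = os.path.join(
                        folder, "group_%s_%s" % (key.decode("ascii"), name)
                    )
                    out = open(chunk_path, "wb")
                    out.write(header)
                    outputs[key] = out
                out.write(line)
    finally:
        for out in outputs.values():
            out.close()
    return sorted(out.name for out in outputs.values())
```

Keep in mind that a per-group file is only as small as the group itself, so a very large group could still exceed the memory limit; the fixed line count approach avoids that.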

elsombreron · June 28, 2008, 4:01am

Martin, thanks so much for all that!
I have no experience with Python, but it wasn’t too hard to figure out where to change the file break parameter.
The script split the file into 13 MB files. I can easily loop through these in Stata now and work with them!
I was expecting an AppleScript, but this Python script did the job just as well!

Thanks again for your help, Martin.