Text manipulation - getting started question - convert text to csv

panini · March 12, 2020, 6:58pm

Hi Folks,

This is my first post, so here is a brief introduction:
I work in industrial automation, mainly installing equipment at larger food-industry factories around the world. That has given me some experience with electricity in general and automation in particular. I am not really a programmer, but I can step in from time to time and make small modifications when needed (at PLC level).

Today I’m looking for help in getting some text manipulation started.

Here’s the thing:
Today a lot of education can be found in online videos. I recently discovered transcripts, and for a non-native English speaker they can be really helpful.

I would like to be able to convert some copied text from these transcripts and execute some minor manipulations on this text in order to be able to use it in a spreadsheet such as Excel or Google Sheets.

What do I need to do?

after each MM:SS
add a comma (,)
add a space (_)
before the next MM:SS, insert a return / linefeed (jump to the next line)

Copied text would look like:

and would need to become:

00:23, questions about quite a bit
00:25, tracking setups which I think is a big
00:27, some more text

What would be the best tool at hand?
Automator?
Applescript?

Thanks for helping me get started!

peavine · March 12, 2020, 8:28pm

panini. Welcome to the forum.

There are many ways to do what you want and this is just one:

set theText to paragraphs of "00:23
questions about quite a bit
00:25
tracking setups which I think is a big
00:27"

set textCount to (count theText)

set newText to {}

repeat with i from 1 to textCount by 2
	if i is not equal to textCount then
		set the end of newText to (item i of theText) & ", " & item (i + 1) of theText
	else
		set the end of newText to (item i of theText) & ", some more text"
	end if
	
end repeat

set text item delimiters to linefeed
set newText to newText as text
set text item delimiters to ""

newText

If the source text is on the clipboard, the script should begin with:

set theText to paragraphs of (the clipboard)

This script needs some error handling in case the source text is not as expected.
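A minimal sketch of the full clipboard round trip, assuming the transcript was copied as plain text with times and captions on alternating lines. Writing the result back with `set the clipboard to` is an addition that is not part of the script above, and an unpaired final time is simply dropped here rather than padded.

```applescript
-- Read the copied transcript. "as text" asks for the plain-text flavor
-- and raises an error if the clipboard holds no text.
set theText to paragraphs of (the clipboard as text)
set newText to {}

-- Walk the lines in pairs: time on line i, caption on line i + 1.
-- Stopping at (count - 1) skips a trailing time that has no caption.
repeat with i from 1 to (count theText) - 1 by 2
	set the end of newText to (item i of theText) & ", " & (item (i + 1) of theText)
end repeat

-- Join the rows with linefeeds.
set AppleScript's text item delimiters to linefeed
set newText to newText as text
set AppleScript's text item delimiters to ""

-- Put the result back on the clipboard, ready to paste.
set the clipboard to newText
```

After running it, the converted lines can be pasted into a text file or straight into the spreadsheet.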

Yvan_Koenig · March 12, 2020, 8:57pm

I’m quite sure that it’s not the most efficient code, but starting from Shane Stanley’s sample code I built this script, which does the job.

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Shane's handler
on findPattern:thePattern inString:theString
	set theNSString to current application's NSString's stringWithString:theString
	set theOptions to ((current application's NSRegularExpressionDotMatchesLineSeparators) as integer) + ((current application's NSRegularExpressionAnchorsMatchLines) as integer)
	set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:theOptions |error|:(missing value)
	set theFinds to theRegEx's matchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()}
	set theRanges to (theFinds's valueForKey:"range")
	set theResult to {} -- we will add to this
	repeat with i from 1 to count of items of theRanges
		set end of theResult to (theNSString's substringWithRange:(item i of theRanges)) as string
	end repeat
	return theResult
end findPattern:inString:

#=====

set origText to "00:23
questions about quite a bit
00:25
tracking setups which I think is a big
blah blah blah
00:27
some more text
time : 00:30
one more"
-- standardize the line break
set newText to my remplace(origText, {linefeed, return}, linefeed)
-- extract the list of time values
set theTimes to its findPattern:"([0-9][0-9])(\\:)([0-9][0-9])" inString:newText
-- replace the time values + linefeed by time value +", "
repeat with aTime in theTimes
	set newText to my remplace(newText, aTime & linefeed, aTime & ", ")
end repeat
newText

#=====
(*
replaces every occurrence of d1 with d2 in the text t
*)
on remplace(t, d1, d2)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d1}
	set l to text items of t
	set AppleScript's text item delimiters to d2
	set t to l as text
	set AppleScript's text item delimiters to oTIDs
	return t
end remplace

#=====

As you can see, it also works with extraneous text.

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Thursday 12 March 2020 21:56:38

KniazidisR · March 13, 2020, 5:45am

Peavine and Yvan Koenig have already provided solutions, but here is a version with even fewer manipulations and lines of code:


set theText to "00:23
questions about quite a bit
00:25
tracking setups which I think is a big
00:27"

set textCount to count (paragraphs of theText)
set newText to ""

repeat with i from 1 to textCount by 2
	if i is textCount then
		set newText to newText & paragraph i of theText & ", " & "some more text"
	else
		set newText to newText & paragraph i of theText & ", " & paragraph (i + 1) of theText & linefeed
	end if
end repeat

Yvan_Koenig · March 13, 2020, 9:30am

@KniazidisR
Like Peavine’s, your code assumes that the data passed in exactly matches the structure used in the original example.
If it doesn’t, the result will be wrong.

Try with: “00:23
questions about quite a bit
00:25
tracking setups which I think is a big
blah blah blah
00:27
some more text
time : 00:30
one more”

which is used in my message.

You will get:
“00:23, questions about quite a bit
00:25, tracking setups which I think is a big
blah blah blah, 00:27
some more text, time : 00:30
one more, some more text”

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Friday 13 March 2020 10:29:45

peavine · March 13, 2020, 3:51pm

KniazidisR’s script avoids a number of steps, which is certainly good. I decided to test his script and my script in Script Geek for speed. For testing purposes, I created a text file with 354 lines of text and modified both scripts to read this text file.

My script took 5 milliseconds and KniazidisR’s script took 8 milliseconds, which, as a practical matter, are equivalent. I didn’t test Yvan’s script because it works differently and does more.

peavine · March 13, 2020, 4:22pm

Yvan. This is a legitimate concern, but it’s easily fixed in my script and in KniazidisR’s script. As you know, there are a large number of ways to do this, and the following is a very simple example. If the text data is not in the correct order, my personal preference would be to notify the user of the error and let them take whatever corrective action is appropriate.

set theText to paragraphs of "00:23
questions about quite a bit
00:25
tracking setups which I think is a big
00:27
some more text
00:30
one more"

set textCount to (count theText)
set newText to {}

repeat with i from 1 to textCount by 2
	try
		text 1 of (item i of theText) as integer
	on error
		errorDialog("The text is not formatted correctly")
	end try
	if i is not equal to textCount then
		set the end of newText to (item i of theText) & ", " & item (i + 1) of theText
	else
		set the end of newText to (item i of theText) & ", some more text"
	end if
end repeat

set text item delimiters to linefeed
set newText to newText as text
set text item delimiters to ""

newText

on errorDialog(dialogText)
	display dialog dialogText buttons {"OK"} cancel button 1 default button 1 with title "" with icon stop
end errorDialog

peavine · March 13, 2020, 4:31pm

Yvan. I opened and ran your script and the result was as shown below. Did I do something wrong?

panini · March 13, 2020, 4:46pm

A big thank you guys!
I’ll try to see what I can get going and report back.

Model: 2015 Macbook Pro
Browser: Firefox 68.0
Operating System: macOS 10.14

Yvan_Koenig · March 13, 2020, 6:22pm

It’s the designed behavior.
I deliberately inserted extraneous items that would fool your proposal or KniazidisR’s.
Both of them would return:
“00:23, questions about quite a bit
00:25, tracking setups which I think is a big
blah blah blah, 00:27
some more text, time : 00:30
one more, some more text”

which is not satisfactory.

With my proposal, I tried to treat anomalies in a “correct way”.
“00:23
questions about quite a bit” are gathered as “00:23, questions about quite a bit”

“00:25
tracking setups which I think is a big” are gathered as “00:25, tracking setups which I think is a big”

On entry, there is no time value before the string “blah blah blah”, so it remains orphaned.

“00:27
some more text” are gathered as “00:27, some more text”

“time : 00:30
one more” are gathered as “time : 00:30, one more”

I hoped that panini would say how he wishes these special cases to be treated.

I wondered whether “blah blah blah” should be appended to the end of “00:25, tracking setups which I think is a big”.

I wondered if “00:27
some more text
time : 00:30
one more”
was to be rendered as “00:27, some more text time :
00:30, one more”

Splitting before “00:30” is quite easy: it just requires 6 added instructions near the end of the main code, which would become:

repeat with aTime in theTimes -- ADDED 1
	set off7 to offset of aTime in newText -- ADDED 2
	if (off7 > 1) and character (off7 - 1) of newText is not linefeed then -- ADDED 3
		set newText to text 1 thru (off7 - 1) of newText & linefeed & text off7 thru -1 of newText -- ADDED 4
	end if -- ADDED 5
end repeat -- ADDED 6
-- replace the time values + linefeed by time value +", "
repeat with aTime in theTimes
	set newText to my remplace(newText, aTime & linefeed, aTime & ", ")
end repeat
newText

Edited this way, the script would return:
“00:23, questions about quite a bit
00:25, tracking setups which I think is a big
blah blah blah
00:27, some more text
time :
00:30, one more”

A third loop would be able to return:
“00:23, questions about quite a bit
00:25, tracking setups which I think is a big blah blah blah
00:27, some more text time :
00:30, one more”
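A rough sketch of what such a third loop could look like, assuming it runs on the text returned by the edited script and that any line not starting with an MM:SS value belongs to the line before it. The handler names `startsWithTime` and `mergeOrphanLines` are hypothetical, not taken from the script above.

```applescript
-- Hypothetical helper: true when aLine begins with a time value such as 00:27.
on startsWithTime(aLine)
	if (count aLine) < 5 then return false
	if character 3 of aLine is not ":" then return false
	-- Characters 1, 2, 4 and 5 must all be digits.
	set digitChars to (text 1 thru 2 of aLine) & (text 4 thru 5 of aLine)
	repeat with i from 1 to 4
		if character i of digitChars is not in "0123456789" then return false
	end repeat
	return true
end startsWithTime

-- Hypothetical handler: appends every line that does not start with a time
-- value to the previous line, separated by a space. Empty lines are skipped.
on mergeOrphanLines(theText)
	set mergedLines to {}
	repeat with aLine in paragraphs of theText
		set lineText to contents of aLine
		if lineText is not "" then
			if (mergedLines is {}) or startsWithTime(lineText) then
				set end of mergedLines to lineText
			else
				set item -1 of mergedLines to (item -1 of mergedLines) & " " & lineText
			end if
		end if
	end repeat
	-- Join the rows with linefeeds, restoring the delimiters afterwards.
	set {oldTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
	set mergedText to mergedLines as text
	set AppleScript's text item delimiters to oldTIDs
	return mergedText
end mergeOrphanLines
```

Calling `mergeOrphanLines(newText)` as the last line of the main code would then fold “blah blah blah” and “time :” into the rows that precede them.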

As I don’t know whether the original data may contain such anomalies, I didn’t propose these 6 instructions, although they would do no harm.

In fact, I assumed that somebody would post code making greater use of regular expressions.
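For illustration, a minimal sketch of a more regex-centred variant, assuming the goal is the same as in the first script: replace the line break that follows each MM:SS value with ", ". It relies on NSString's `stringByReplacingOccurrencesOfString:withString:options:range:` with the `NSRegularExpressionSearch` option; the handler name `joinTimes` is hypothetical.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- joinTimes is a hypothetical handler name, not taken from the thread.
on joinTimes(theString)
	set regexSearch to current application's NSRegularExpressionSearch
	set s to current application's NSString's stringWithString:theString
	-- Normalize CR and CRLF line endings to LF.
	set s to s's stringByReplacingOccurrencesOfString:"\\r\\n?" withString:linefeed options:regexSearch range:{location:0, |length|:s's |length|()}
	-- Replace the LF that follows each MM:SS value with ", ".
	set s to s's stringByReplacingOccurrencesOfString:"([0-9]{2}:[0-9]{2})\\n" withString:"$1, " options:regexSearch range:{location:0, |length|:s's |length|()}
	return s as text
end joinTimes

joinTimes("00:23
questions about quite a bit
00:25
tracking setups which I think is a big")
```

Like the first script, this leaves lines without a time value untouched, so the two replacements stand in for the find-then-replace loop rather than changing how anomalies are treated.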

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Friday 13 March 2020 19:14:18