Use grep, sed, awk to get content of kMDItemTextContent?

I need to parse the output of mdimport.

[format]mdimport -t -d3 $file [/format]

…dumps many attributes and values to stdout, including kMDItemTextContent.

How can I use grep/awk/sed/etc. to leave me with just the value of kMDItemTextContent? You can use this partial output of mdimport -t -d3 on one of my files as an example.

[format] kMDItemPageHeight = 842;
kMDItemPageWidth = 595;
kMDItemPhysicalSize = 12288;
kMDItemSecurityMethod = None;
kMDItemTextContent = "11/8 - Hash Tag Test Document #HashTag1 this is the first hash tag. #HashTag2 this is the second hash tag. The following hash tag is inside and at the end of a paragraph: #HashTag3 The next hash tag #HashTag4 is in the middle of a paragraph.";
kMDItemTitle = "11/8 - Hash Tag Test Document";
kMDItemVersion = "1.3";
}[/format]

Note - I have seen explicit \n\n characters within the text block for some of my files.

I guess my more specific question is: how can I get all of the text between these two delimiters?

[format]kMDItemTextContent = "
[/format]
AND

[format]";[/format]

From Apple’s documentation of kMDItemTextContent:

Hi Shane,

Thanks. Then I’ll need to parse the output of mdimport.

I’ll revise my original post.

  • John

FYI - I found my solution via external help.

[format]mdimport -t -d3 $file | awk -F\" '/kMDItemTextContent/{print $2}'[/format]
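For anyone who wants to try the one-liner without running mdimport, here is a minimal sketch. The `sample` variable is a hypothetical stand-in for real `mdimport -t -d3` output, trimmed from the listing in the first post. It assumes the value sits on a single line and contains no double quotes of its own.

```shell
# Hypothetical stand-in for `mdimport -t -d3` output (not real tool output).
sample='kMDItemSecurityMethod = None;
kMDItemTextContent = "11/8 - Hash Tag Test Document #HashTag1 this is the first hash tag.";
kMDItemTitle = "11/8 - Hash Tag Test Document";'

# Split each line on double quotes and print field 2 of the matching line,
# i.e. the text between the first and second double quote.
result=$(printf '%s\n' "$sample" | awk -F'"' '/kMDItemTextContent/{print $2}')
printf '%s\n' "$result"
```

Quoting the separator as `-F'"'` does the same job as `-F\"` when typed directly in a shell.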

In a do shell script context, I’m finding that the awk delimiter needs to be single-quoted too, otherwise the script throws an error:

do shell script "file=" & quoted form of POSIX path of (choose file) & " ; mdimport -t -d3 \"$file\" | awk -F'\"' '/kMDItemTextContent/{print $2}'"

However, the delimiter approach is no good if the text you’re trying to extract itself contains double-quotes. This sed alternative seems to handle things well:

do shell script "file=" & quoted form of POSIX path of (choose file) & " ; mdimport -t -d3 \"$file\" | sed -En '/kMDItemTextContent/ s/^[^\"]+\"(.+)\";$/\\1/p'"
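To see the difference between the two approaches outside AppleScript, here is a small sketch run against a made-up line. The `sample` value is hypothetical, and it assumes that embedded double quotes show up in the dump as `\"`, which I have not confirmed against real mdimport output.

```shell
# Hypothetical dump line with embedded, backslash-escaped double quotes.
sample='kMDItemTextContent = "He said \"hello\" twice.";'

# awk with a double-quote field separator stops at the first inner quote.
awk_out=$(printf '%s\n' "$sample" | awk -F'"' '/kMDItemTextContent/{print $2}')

# sed keeps everything between the first double quote and the closing ";
sed_out=$(printf '%s\n' "$sample" | sed -En '/kMDItemTextContent/ s/^[^"]+"(.+)";$/\1/p')

printf 'awk: %s\n' "$awk_out"
printf 'sed: %s\n' "$sed_out"
```

The awk result is cut off at the first inner quote, while the sed result keeps the whole value, backslashes included.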

The field separator doesn’t have to be quoted, and it appears malformed with the trailing backslash. This is okay:

(do shell script "file=" & my (choose file)'s POSIX path's quoted form & " ; mdimport -t -d3 \"$file\" | awk -F kMDItemTextContent  '{print $2}' ")'s paragraphs as text

Alternatively and perhaps more reliably:

do shell script "file=" & my (choose file)'s POSIX path's quoted form & " ; mdimport -t -d3 \"$file\" | awk 'BEGIN{ RS = \"\\r\"  ; FS = \"kMDItemTextContent\" }  {print $2}' "

Thanks Marc. I have no working knowledge of awk at all. But as far as I can tell, the delimiter in the OP’s code is the double quote, which is backslashed to go in the shell script text. The code finds the line containing “kMDItemTextContent” and returns the second double-quote-delimited field from that.

Your scripts appear to use “kMDItemTextContent” itself as the delimiter. The first returns everything that comes after “kMDItemTextContent” in each line — which means nothing in all the lines but one — and then uses delimiter-ignoring AppleScript code to lose the empty lines. The second returns everything that comes after “kMDItemTextContent” in the entire text, including any following lines.
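The behaviour described above can be checked with a short sketch. The three-line `sample` is hypothetical, not real mdimport output, and it assumes the dump contains no carriage returns, so that setting RS to "\r" makes the whole input a single record.

```shell
# Hypothetical three-line stand-in for the mdimport dump.
sample='kMDItemPageWidth = 595;
kMDItemTextContent = "body text";
kMDItemTitle = "A title";'

# Variant 1: one output line per input line; only the matching line is non-empty.
first_count=$(printf '%s\n' "$sample" | awk -F kMDItemTextContent '{print $2}' | awk 'END{print NR}')

# Variant 2: the whole input is one record, so $2 is everything after the
# attribute name, including the lines that follow it.
second=$(printf '%s\n' "$sample" | awk 'BEGIN{ RS = "\r"; FS = "kMDItemTextContent" } {print $2}')

printf 'lines from variant 1: %s\n' "$first_count"
printf 'variant 2 output:\n%s\n' "$second"
```

In both variants the extracted text still carries the leading ` = "` and the trailing `";`, so some further trimming would be needed.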

My sed effort above finds the line containing “kMDItemTextContent” and returns only the text between the outermost double quotes in that line. Ideally, I suppose, it should also strip the backslashes from any escaped quotes within that text.
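A rough sketch of that last step, as plain shell rather than a do shell script string: a second sed pass turns each `\"` back into a plain double quote. The `sample` line is hypothetical, and the assumption that mdimport writes inner quotes as `\"` is mine, so check it against real output first.

```shell
# Hypothetical dump line; assumes inner quotes are written as \" in the dump.
sample='kMDItemTextContent = "He said \"hello\" twice.";'

# Pass 1: keep the text between the outermost double quotes.
# Pass 2: replace every backslash-quote pair with a bare double quote.
clean=$(printf '%s\n' "$sample" \
  | sed -En '/kMDItemTextContent/ s/^[^"]+"(.+)";$/\1/p' \
  | sed 's/\\"/"/g')

printf '%s\n' "$clean"
```

Inside an AppleScript string, every backslash and double quote in these commands would need escaping again, as in the do shell script lines earlier in the thread.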