Use grep, sed, awk to get content of kMDItemTextContent?

johncatalano · November 8, 2020, 4:22am

I need to parse the output of mdimport.

[format]mdimport -t -d3 $file [/format]

…dumps many attributes and values to stdout, including kMDItemTextContent.

How can I use grep/awk/sed/etc. to leave me with just the value of kMDItemTextContent? You can use this partial output of mdimport -t -d3 on one of my files as an example.

[format] kMDItemPageHeight = 842;
kMDItemPageWidth = 595;
kMDItemPhysicalSize = 12288;
kMDItemSecurityMethod = None;
kMDItemTextContent = "11/8 - Hash Tag Test Document #HashTag1 this is the first hash tag. #HashTag2 this is the second hash tag. The following hash tag is inside and at the end of a paragraph: #HashTag3 The next hash tag #HashTag4 is in the middle of a paragraph.";
kMDItemTitle = "11/8 - Hash Tag Test Document";
kMDItemVersion = "1.3";
}[/format]

Note - I have seen explicit \n\n characters within the text block for some of my files.

I guess my more specific question is: how can I get all of the text between these two delimiters?

[format]kMDItemTextContent = "
[/format]
AND

[format]";[/format]

Shane_Stanley · November 8, 2020, 5:09am

From Apple’s documentation of kMDItemTextContent:

johncatalano · November 8, 2020, 6:21am

Hi Shane,

Thanks. Then I’ll need to parse the output of mdimport.

I’ll revise my original post.

John

johncatalano · November 8, 2020, 11:05pm

FYI - I found my solution via external help.

[format]mdimport -t -d3 $file | awk -F\" '/kMDItemTextContent/{print $2}'[/format]
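For readers who want to try the field-splitting idea without running mdimport, here is a minimal sketch. The function name `extract_text_content` and the sample text are hypothetical; the sample is only modeled on the output quoted in the question.

```shell
# Hypothetical helper: reads mdimport-style output on stdin and prints the
# second double-quote-delimited field of the kMDItemTextContent line.
extract_text_content() {
    awk -F'"' '/kMDItemTextContent/ {print $2}'
}

# Sample input modeled on the output quoted in the question (not real output).
sample='    kMDItemPhysicalSize = 12288;
    kMDItemTextContent = "Hash Tag Test Document #HashTag1";
    kMDItemTitle = "Hash Tag Test Document";'

# Only the matching line is printed, reduced to the text between the quotes.
printf '%s\n' "$sample" | extract_text_content
```

On a Mac the same function could presumably sit at the end of the pipeline from the post above, as in `mdimport -t -d3 "$file" | extract_text_content`.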

Nigel_Garvey · November 10, 2020, 1:56pm

In a do shell script context, I’m finding that the awk delimiter needs to be single-quoted too, otherwise the script throws an error:

do shell script "file=" & quoted form of POSIX path of (choose file) & " ; mdimport -t -d3 \"$file\" | awk -F'\"' '/kMDItemTextContent/{print $2}'"

However, the delimiter approach is no good if the text you’re trying to extract itself contains double-quotes. This sed alternative seems to handle things well:

do shell script "file=" & quoted form of POSIX path of (choose file) & " ; mdimport -t -d3 \"$file\" | sed -En '/kMDItemTextContent/ s/^[^\"]+\"(.+)\";$/\\1/p'"
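The difference between the two approaches can be seen on a single test line, outside AppleScript. This is a sketch under one assumption: that embedded double quotes show up backslash-escaped in the mdimport output. The helper names `by_awk` and `by_sed` are made up for the comparison.

```shell
# One line modeled on mdimport output; the backslash-escaped inner quotes are
# an assumption about how embedded double quotes are printed.
line='    kMDItemTextContent = "He said \"hi\" twice";'

# Delimiter approach: field 2 ends at the first inner double quote.
by_awk() { awk -F'"' '/kMDItemTextContent/ {print $2}'; }

# Greedy sed approach: captures everything between the outermost quotes.
by_sed() { sed -En '/kMDItemTextContent/ s/^[^"]+"(.+)";$/\1/p'; }

printf '%s\n' "$line" | by_awk
printf '%s\n' "$line" | by_sed
```

The awk version stops at the first inner quote, while the sed version keeps the whole value, backslashes included.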

Marc_Anthony · November 11, 2020, 12:22am

The field separator doesn’t have to be quoted, and it appears malformed with the trailing backslash. This is okay:

(do shell script "file=" & my (choose file)'s POSIX path's quoted form & " ; mdimport -t -d3 \"$file\" | awk -F kMDItemTextContent  '{print $2}' ")'s paragraphs as text

Alternatively and perhaps more reliably:

do shell script "file=" & my (choose file)'s POSIX path's quoted form & " ; mdimport -t -d3 \"$file\" | awk 'BEGIN{ RS = \"\\r\"  ; FS = \"kMDItemTextContent\" }  {print $2}' "

Nigel_Garvey · November 11, 2020, 9:41am

Thanks Marc. I have no working knowledge of awk at all. But as far as I can tell, the delimiter in the OP’s code is the double quote, which is backslashed to go in the shell script text. The code finds the line containing “kMDItemTextContent” and returns the second double-quote-delimited field from that.

Your scripts appear to use “kMDItemTextContent” itself as the delimiter. The first returns everything that comes after “kMDItemTextContent” in each line — which means nothing in all the lines but one — and then uses delimiter-ignoring AppleScript code to lose the empty lines. The second returns everything that comes after “kMDItemTextContent” in the entire text, including any following lines.
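The per-line behaviour described here can be reproduced on a small sample. The sketch below uses hypothetical function names and made-up sample text. The `NF > 1` variant is one possible way to drop the empty lines inside awk rather than in AppleScript; it is not something proposed in the posts above.

```shell
# Sample lines modeled on mdimport output (not real output).
sample='    kMDItemPhysicalSize = 12288;
    kMDItemTextContent = "abc";
    kMDItemTitle = "abc";'

# Key name as field separator: prints one line per input line, and lines
# without the key have no second field, so they come out empty.
split_on_key() { awk -F kMDItemTextContent '{print $2}'; }

# Same split, but only for lines where the separator actually occurred.
split_on_key_matching() { awk -F kMDItemTextContent 'NF > 1 {print $2}'; }

printf '%s\n' "$sample" | split_on_key
printf '%s\n' "$sample" | split_on_key_matching
```

Either way, the result still carries the leading ` = "` and the trailing `";`, since only the key name is removed.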

My sed effort above finds the line containing “kMDItemTextContent” and returns only the text between the outermost double quotes in that line. Ideally, I suppose, it should also make some attempt to reduce the amount of backslashing with any quotes within that text. :rolleyes:
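One possible way to reduce the backslashing is a second sed pass that turns each `\"` back into a plain double quote. This is only a sketch: it assumes embedded quotes really are printed as `\"`, and the function name `extract_unescaped` is hypothetical.

```shell
# Hypothetical helper: extract the text between the outermost quotes, then
# replace every backslash-escaped double quote with a plain one.
extract_unescaped() {
    sed -En '/kMDItemTextContent/ s/^[^"]+"(.+)";$/\1/p' | sed 's/\\"/"/g'
}

# Test line with assumed backslash-escaped inner quotes.
line='    kMDItemTextContent = "He said \"hi\" twice";'

printf '%s\n' "$line" | extract_unescaped
```

Other escape sequences, such as the explicit \n mentioned in the first post, are left untouched by this pass.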