Unable to parse PDF document using Tika parser in AEM 6.0

January 8, 2016

Hi Team,

I am unable to parse or read the text of a PDF file using the Tika parser.

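import java.io.InputStream;
import org.apache.sling.api.resource.Resource;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import com.day.cq.dam.api.Asset;
import com.day.cq.dam.commons.util.DamUtil;

// Resolve the DAM asset and get its original rendition
Asset asset = DamUtil.resolveToAsset(dataResource);
Resource original = asset.getOriginal();

// Cap the extracted text at 10 * 1024 * 1024 characters
ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Register under Parser.class, the key Tika looks up for embedded documents
context.set(Parser.class, parser);

// try-with-resources closes the stream even if parsing fails
try (InputStream is = original.adaptTo(InputStream.class)) {
    parser.parse(is, handler, metadata, context);
} catch (Exception e) {
    throw new Exception("Error parsing file " + asset.getPath(), e);
}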

I am getting a Tika parse exception.

Please help me resolve this issue, or share a link I can read through.

Thanks a lot


Best answer by kautuk_sahni

Hi

Parsing large, broken, or malicious input can cause excessive memory or CPU use during indexing, and it may even crash the JVM.

Link: https://helpx.adobe.com/experience-manager/kb/outOfProcessTextExtraction.html

I am not sure whether this is the problem in your case.

Please share the complete error log you are encountering.
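
One in-process way to bound how long a single document may take is to run the extraction on a worker thread with a time limit. The sketch below is only an illustration: `TimedExtraction` and `extractWithTimeout` are hypothetical names, the `Callable` stands in for the Tika call, and this is not the out-of-process setup the link refers to. A timeout also does not protect against out-of-memory errors.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs one extraction task with a time limit. */
final class TimedExtraction {

    /**
     * Runs the task on a worker thread and returns its result,
     * or null if it does not finish within timeoutMillis.
     * Exceptions thrown by the task surface as ExecutionException.
     */
    static String extractWithTimeout(Callable<String> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(task);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupts the worker; a parser that ignores interrupts may keep running
                future.cancel(true);
                return null;
            }
        } finally {
            // Always release the worker thread
            executor.shutdownNow();
        }
    }
}
```

A caller could wrap the `parser.parse(...)` call in the `Callable` and treat a `null` result as "skip this document and log its path".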

Thanks and Regards

Kautuk Sahni

Jitendra_S_Toma

New Participant

@Sumit,

Try reducing the number of documents processed in one go. Also check the RAM, thread pool configuration, and so on in your instance. Some of the techniques can be found here:

https://docs.adobe.com/docs/en/aem/6-1/deploy/configuring/performance.html

Jitendra

_SumitSinghal (Author)

New Participant

Hi,

I have around 7K documents that I am parsing with the Tika parser in batches of 1K documents at a time. After the first 1K, the workflow process goes into a stale state and never resumes parsing the remaining documents. The JVM might be crashing or running out of memory. Please advise how to handle this scenario.

PDF files are not encrypted.
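
For the batch scenario described here, one option is to isolate failures per document so that a single bad file does not abort the rest of the run. This is a minimal sketch, not the actual workflow code: `BatchParser`, `DocumentParser`, and `Result` are hypothetical names, and the callback stands in for the Tika call. It only handles exceptions; it cannot recover from a JVM crash or an `OutOfMemoryError`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses documents in batches and records failures. */
final class BatchParser {

    /** Hypothetical callback standing in for the Tika call. */
    interface DocumentParser {
        String parse(String path) throws Exception;
    }

    /** Outcome of a run: how many documents parsed and which paths failed. */
    static final class Result {
        int succeeded = 0;
        final List<String> failedPaths = new ArrayList<>();
    }

    static Result parseInBatches(List<String> paths, int batchSize, DocumentParser parser) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Result result = new Result();
        for (int start = 0; start < paths.size(); start += batchSize) {
            int end = Math.min(start + batchSize, paths.size());
            for (String path : paths.subList(start, end)) {
                try {
                    parser.parse(path);
                    result.succeeded++;
                } catch (Exception e) {
                    // Record the failure and continue with the next document
                    result.failedPaths.add(path);
                }
            }
            // End of a batch: a natural point to log progress or commit work
        }
        return result;
    }
}
```

The list of failed paths can then be logged and retried or inspected separately, instead of the whole job stopping at the first problem file.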

Thank you

joerghoh

Employee

It looks like you're dealing with encrypted PDF documents. I am not a PDF expert, but please reach out to Daycare support and ask how to make it work.
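
To triage which files might be encrypted, a rough heuristic is to look for the `/Encrypt` key, which the PDF format places in the trailer of an encrypted document. The sketch below is an assumption-laden illustration: `PdfEncryptionHint` is a hypothetical helper, it reads the whole file into memory, and it can report false positives if the same text happens to appear elsewhere in the file.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: crude check for an /Encrypt entry in raw PDF bytes. */
final class PdfEncryptionHint {

    /** Returns true if the raw bytes contain the token "/Encrypt". */
    static boolean mentionsEncryptEntry(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }
}
```

A positive result is only a hint that the file deserves a closer look, not proof that it is encrypted.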

Jörg

_SumitSinghal (Author)

New Participant

Hi,

Please find the full logs:

Caused by: org.apache.tika.exception.TikaException: PDF parse error
   at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:252)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at com.leggmason.gd.webservices.utils.SolrIndex.getFileContent(SolrIndex.java:1079)
   ... 12 common frames omitted
Caused by: com.adobe.internal.pdftoolkit.core.exceptions.PDFSecurityAuthorizationException: Security Manager for decryption is not set
   at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamEncryption(EncryptionImpl.java:223)
   at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamDecryptionHandler(EncryptionImpl.java:290)
   at com.adobe.internal.pdftoolkit.core.cos.CosEncryption.getStreamDecryptionStateHandler(CosEncryption.java:674)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamForCopying(CosStream.java:422)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.copyStream(CosStream.java:367)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStream(CosStream.java:468)
   at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamDecoded(CosStream.java:293)
   at com.adobe.internal.pdftoolkit.pdf.document.PDFContents.getContents(PDFContents.java:141)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:94)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:81)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.<init>(ContentReader.java:54)
   at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.newInstance(ContentReader.java:83)
   at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:302)
   at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:193)
   at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.extractROTEWords(TextExtractor.java:348)
   at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.getROTEWordsIterator(TextExtractor.java:505)
   at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getReadingOrderedTextFromPDF(ReadingOrderTextExtractor.java:275)
   at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.extractParagraphs(ReadingOrderTextExtractor.java:566)
   at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getParagraphIterator(ReadingOrderTextExtractor.java:465)
   at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:194)
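
The innermost cause in this trace is what identifies the failure, so a batch job may want to inspect the cause chain rather than only the outer `TikaException`. A minimal sketch, assuming that matching on the class name and message text taken from the log above is acceptable; `ParseFailureClassifier` is a hypothetical helper, not part of Tika or AEM.

```java
/** Hypothetical helper: classifies a parse failure by walking its cause chain. */
final class ParseFailureClassifier {

    // Simple class name and message copied from the stack trace above; matched as
    // strings so this sketch does not need the Adobe PDF toolkit on the classpath.
    static final String ENCRYPTION_CLASS = "PDFSecurityAuthorizationException";
    static final String ENCRYPTION_MESSAGE = "Security Manager for decryption is not set";

    /** Returns true if any throwable in the chain looks like the decryption failure. */
    static boolean isEncryptionFailure(Throwable t) {
        int depth = 0;
        // The depth limit guards against a cyclic cause chain
        for (Throwable cur = t; cur != null && depth < 50; cur = cur.getCause(), depth++) {
            if (cur.getClass().getSimpleName().equals(ENCRYPTION_CLASS)) {
                return true;
            }
            String message = cur.getMessage();
            if (message != null && message.contains(ENCRYPTION_MESSAGE)) {
                return true;
            }
        }
        return false;
    }
}
```

Documents flagged this way could be collected in a separate list for follow-up, while other failures are handled as ordinary parse errors.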