Unable to parse pdf document using Tika Parser in AEM6.0 | Community
New Participant
January 8, 2016
Solved

Unable to parse pdf document using Tika Parser in AEM6.0

  • January 8, 2016
  • 10 replies
  • 7608 views

Hi Team,

I am unable to parse or read the text of a PDF file using the Tika parser.

 

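import java.io.InputStream;
import org.apache.sling.api.resource.Resource;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import com.day.cq.dam.api.Asset;
import com.day.cq.dam.commons.util.DamUtil;

        Asset asset = DamUtil.resolveToAsset(dataResource);
        Resource original = asset.getOriginal();
        // Cap the extracted text at 10 * 1024 * 1024 characters to bound memory use.
        ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Register under Parser.class so that embedded documents are parsed recursively.
        context.set(Parser.class, parser);
        // try-with-resources closes the stream even when parsing fails.
        try (InputStream is = original.adaptTo(InputStream.class)) {
            parser.parse(is, handler, metadata, context);
        } catch (Exception e) {
            throw new Exception("Error parsing file " + asset.getPath(), e);
        }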

I am getting a Tika parse exception.

Please help me resolve this issue, or share a link to documentation that I can go through.

 

Thanks a lot


10 replies

Jitendra_S_Toma
New Participant
January 15, 2016

@Sumit,

Try reducing the number of documents processed in one go. Also check the RAM, thread pool configuration, etc. on your instance. Some of the techniques can be found here:

https://docs.adobe.com/docs/en/aem/6-1/deploy/configuring/performance.html

 

Jitendra
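The suggestion to process fewer documents in one go could be sketched roughly as follows. This is only an illustration under assumptions: `BatchRunner`, `partition`, and `processAll` are hypothetical names, the `worker` stands in for whatever Tika parsing call you use, and the sketch needs Java 8 or later. It splits the work into small batches and isolates per-document failures so that one bad file does not stop the rest. It does not protect against `OutOfMemoryError` or a JVM crash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class BatchRunner {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Runs the worker on every item, batch by batch, and returns the items
     * whose processing threw a RuntimeException. Errors such as
     * OutOfMemoryError are deliberately not caught here.
     */
    static <T> List<T> processAll(List<T> items, int batchSize, Consumer<T> worker) {
        List<T> failed = new ArrayList<>();
        for (List<T> batch : partition(items, batchSize)) {
            for (T item : batch) {
                try {
                    worker.accept(item);
                } catch (RuntimeException e) {
                    // Record the failure and continue with the next document.
                    failed.add(item);
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Hypothetical asset paths; the worker fails on one of them.
        List<String> paths = Arrays.asList("/a.pdf", "/b.pdf", "/c.pdf");
        List<String> failed = processAll(paths, 2, path -> {
            if (path.equals("/b.pdf")) {
                throw new IllegalStateException("parse error");
            }
        });
        System.out.println("Failed: " + failed);
    }
}
```

The list of failed items can then be logged or retried separately, for example with a smaller batch size.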

New Participant
January 15, 2016

Hi,

I have around 7K documents that I am parsing with the Tika parser in batches of 1K documents at a time, but after the first 1K the workflow process goes into a stale state and never resumes parsing the remaining documents. The cause might be a JVM crash or an out-of-memory issue. Please help me handle this scenario.

PDF files are not encrypted.

Thank you

joerghoh
Employee
January 14, 2016

Looks like you're dealing with encrypted PDF documents. I am not a PDF expert, but please reach out to Daycare support and ask how to make it work.

Jörg
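One rough way to check whether a PDF declares encryption, before handing it to the parser, is to look for the `/Encrypt` key that an encrypted PDF carries in its trailer dictionary. The sketch below is a heuristic only, not a real PDF parser: `PdfEncryptionCheck` and `looksEncrypted` are hypothetical names, the whole file is assumed to fit in memory as a byte array, and a false positive is possible if the same byte sequence happens to appear elsewhere in the file.

```java
import java.nio.charset.StandardCharsets;

final class PdfEncryptionCheck {

    /**
     * Heuristic: returns true if the raw PDF bytes contain the "/Encrypt" key.
     * This does not validate the PDF structure in any way.
     */
    static boolean looksEncrypted(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary content
        // survives the conversion and a plain substring search is possible.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) {
        // Made-up trailer fragment, just to exercise the check.
        byte[] sample = "%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksEncrypted(sample)); // prints "true"
    }
}
```

Files flagged this way could be skipped or routed to separate handling instead of failing the whole batch.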

New Participant
January 11, 2016

Hi,

 

Please find the full logs:

Caused by: org.apache.tika.exception.TikaException: PDF parse error
    at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:252)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.leggmason.gd.webservices.utils.SolrIndex.getFileContent(SolrIndex.java:1079)
    ... 12 common frames omitted
Caused by: com.adobe.internal.pdftoolkit.core.exceptions.PDFSecurityAuthorizationException: Security Manager for decryption is not set
    at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamEncryption(EncryptionImpl.java:223)
    at com.adobe.internal.pdftoolkit.core.encryption.EncryptionImpl.getStreamDecryptionHandler(EncryptionImpl.java:290)
    at com.adobe.internal.pdftoolkit.core.cos.CosEncryption.getStreamDecryptionStateHandler(CosEncryption.java:674)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamForCopying(CosStream.java:422)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.copyStream(CosStream.java:367)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStream(CosStream.java:468)
    at com.adobe.internal.pdftoolkit.core.cos.CosStream.getStreamDecoded(CosStream.java:293)
    at com.adobe.internal.pdftoolkit.pdf.document.PDFContents.getContents(PDFContents.java:141)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:94)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentParser.<init>(ContentParser.java:81)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.<init>(ContentReader.java:54)
    at com.adobe.internal.pdftoolkit.pdf.content.ContentReader.newInstance(ContentReader.java:83)
    at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:302)
    at com.adobe.internal.pdftoolkit.services.textextraction.impl.TEContentStreamHandler.extractTextObjects(TEContentStreamHandler.java:193)
    at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.extractROTEWords(TextExtractor.java:348)
    at com.adobe.internal.pdftoolkit.services.textextraction.TextExtractor.getROTEWordsIterator(TextExtractor.java:505)
    at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getReadingOrderedTextFromPDF(ReadingOrderTextExtractor.java:275)
    at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.extractParagraphs(ReadingOrderTextExtractor.java:566)
    at com.adobe.internal.pdftoolkit.services.readingorder.ReadingOrderTextExtractor.getParagraphIterator(ReadingOrderTextExtractor.java:465)
    at com.adobe.internal.pdf.tika.GibsonParser.parse(GibsonParser.java:194)

 

Thanks
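In the trace above, the generic "PDF parse error" wraps a more specific cause about decryption. A small helper that walks the cause chain can surface that inner reason in logs. The following is a minimal sketch: `RootCause`, `of`, and `hintsAtEncryption` are hypothetical names, and the keyword match on "encrypt" and "decrypt" is an assumed heuristic, not an official way to classify these errors.

```java
import java.util.Locale;

final class RootCause {

    /** Follows getCause() until the innermost throwable is reached. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        // The second condition guards against a self-referencing cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Heuristic: true if any throwable in the chain mentions encryption or
     * decryption in its class name or message.
     */
    static boolean hintsAtEncryption(Throwable t) {
        Throwable current = t;
        while (current != null) {
            String text = (current.getClass().getName() + " " + current.getMessage())
                    .toLowerCase(Locale.ROOT);
            if (text.contains("encrypt") || text.contains("decrypt")) {
                return true;
            }
            Throwable next = current.getCause();
            current = (next == current) ? null : next;
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mimic the shape of the trace above.
        Exception inner = new IllegalStateException("Security Manager for decryption is not set");
        Exception outer = new Exception("PDF parse error", inner);
        System.out.println(of(outer).getMessage());
        System.out.println(hintsAtEncryption(outer)); // prints "true"
    }
}
```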

New Participant
January 11, 2016

Hi Kautuk,

Thank you for the response.

You are right: I get the Tika parsing exception only for large PDF files, which may be larger than 1 MB.

Please help me understand how to parse large files using Tika in AEM 6.0.

Thank you

Sumit

kautuk_sahni
Accepted solution
Employee
January 11, 2016

Hi 

Parsing large, broken, or malicious input can cause excessive memory or CPU use during indexing, and it may result in JVM crashes.

Link: https://helpx.adobe.com/experience-manager/kb/outOfProcessTextExtraction.html

I am not sure whether this is the problem in your case.

Please share the complete error log that you are encountering.

 

Thanks and Regards

Kautuk Sahni
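For custom code that calls a parser directly, one way to limit the damage from a slow or stuck parse is to run it on a worker thread with a timeout. Note that this is a different, in-process technique from the out-of-process extraction described in the linked article, and it does not protect against an out-of-memory condition or a JVM crash. `TimedTask` and `runWithTimeout` are hypothetical names, and the `Callable` stands in for the actual parsing call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedTask {

    /**
     * Runs the task on a worker thread and returns its result, or the
     * fallback value if the task fails or exceeds the timeout.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw; treat it as "no text extracted".
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A stand-in task that sleeps longer than the allowed time.
        String text = runWithTimeout(() -> {
            Thread.sleep(2000);
            return "extracted text";
        }, 200, "");
        System.out.println(text.isEmpty() ? "timed out" : text); // prints "timed out"
    }
}
```

Creating one executor per call keeps the sketch short; real code would more likely reuse a shared pool.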
New Participant
January 8, 2016

I got the tika exception:

org.apache.tika.exception.TikaException: PDF parse error

On line     parser.parse(is, handler, metadata, context);

Thanks

New Participant
January 8, 2016

Hi,

Thank you for the quick reply.

The above code works fine in a standalone Java application to read the PDF content, but the same code does not work in AEM 6.0.

I got the tika exception:

org.apache.tika.exception.TikaException: PDF parse error

On line     parser.parse(is, handler, metadata, context);

Thanks

Jitendra_S_Toma
New Participant
January 8, 2016

Would you mind sharing the exception details? And by the way, what kind of issue are you facing?

Jitendra

smacdonald2008
New Participant
January 8, 2016

Are you following online documentation to guide you on this use case, or is this a custom implementation?