
Convert DOC to PDF using iText in Java

Converting a Microsoft Word document (DOC) to PDF is a common requirement in document processing. iText is a widely used Java library for creating PDF documents, but it cannot read Word files itself, so this blog post pairs it with Apache POI: POI extracts the text from the DOC file, and iText writes that text to a PDF. We'll cover the core concepts, typical usage scenarios, common pitfalls, and best practices to help you use this technique effectively in real-world applications.

Table of Contents#

  1. Core Concepts
  2. Typical Usage Scenarios
  3. Prerequisites
  4. Code Example
  5. Common Pitfalls
  6. Best Practices
  7. Conclusion
  8. FAQ
  9. References

Core Concepts#

iText#

iText is an open-source Java library for creating and manipulating PDF files. It provides classes for adding text, images, tables, and other elements to a PDF. iText has no support for reading DOC files, however, so converting a DOC to a PDF takes two steps: another library extracts the content from the DOC file, and iText writes that content to a new PDF.

DOC File Format#

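A DOC file is the binary file format that Microsoft Word used by default up to Word 2003. It is not the same as the newer, XML-based DOCX format, which this post does not cover. A DOC file contains text, formatting information, images, and other elements. To convert it to a PDF, we need to extract the relevant content and format it appropriately for the PDF output.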
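Because the binary DOC format and the ZIP-based DOCX format are easy to confuse (a file extension can be wrong), it can help to look at the first bytes of a file before handing it to a DOC reader. The sketch below is a hypothetical helper, not part of iText or Apache POI: `WordFormatSniffer` and its method names are made up for illustration. It relies on the fact that binary DOC files are stored in an OLE2 compound file container, which starts with a fixed 8-byte signature, while DOCX files are ZIP archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from the first bytes of a file. */
final class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature of the OLE2 compound file container used by binary .doc files.
    // Other legacy Office formats (.xls, .ppt) share it, so a match means
    // "OLE2 container", not "definitely a Word document".
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header; .docx files are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.ZIP;
        return Format.UNKNOWN;
    }

    static Format sniff(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) break;
                read += n;
            }
        }
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller could reject anything that is not `Format.OLE2` before attempting the conversion, and report a DOCX file as unsupported rather than failing deep inside the reader.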

PDF File Format#

PDF (Portable Document Format) is a widely used file format that preserves the layout and formatting of a document across different platforms and devices. Ideally, a DOC-to-PDF conversion keeps the original document's appearance. The text-based approach in this post carries over only the text, so treat it as a starting point rather than a full-fidelity conversion.

Typical Usage Scenarios#

  1. Document Sharing: You may want to share a document with others in a format that can be easily viewed and printed without the need for specific software. PDF is a great choice for this purpose.
  2. Archiving: Storing documents in PDF format ensures that their content and formatting are preserved over time.
  3. Legal and Regulatory Requirements: Some industries have legal or regulatory requirements to submit documents in PDF format.

Prerequisites#

  • Java Development Kit (JDK): You need Java 8 or higher installed on your system.
  • iText Library: This post uses iText 5 (the com.itextpdf:itextpdf artifact). Newer iText versions use different package names, so the code below will not compile against them unchanged. Download the library from the official website and add it to your Java project's classpath, or use a build tool like Maven or Gradle to manage the dependency.
  • Apache POI: Since iText does not directly support DOC files, we'll use Apache POI to read the DOC file. The DOC support (HWPF) lives in the poi-scratchpad artifact, so add both poi and poi-scratchpad to your project.

Maven Dependencies#

<dependencies>
    <!-- iText -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.13.2</version>
    </dependency>
    <!-- Apache POI for DOC processing -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.2.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>5.2.3</version>
    </dependency>
</dependencies>

Code Example#

import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
 
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
 
public class DocToPdfConverter {
 
    public static void convertDocToPdf(String docFilePath, String pdfFilePath) {
        // try-with-resources closes the streams and the POI document
        // even if the conversion fails partway through
        try (FileInputStream fis = new FileInputStream(docFilePath);
             HWPFDocument doc = new HWPFDocument(fis);
             FileOutputStream fos = new FileOutputStream(pdfFilePath)) {

            // Step 1: Read the DOC file using Apache POI
            // (the extractor only wraps doc, which is closed above)
            WordExtractor extractor = new WordExtractor(doc);
            String[] paragraphs = extractor.getParagraphText();

            // Step 2: Create a new PDF document using iText
            Document pdfDocument = new Document();
            PdfWriter.getInstance(pdfDocument, fos);
            pdfDocument.open();

            // Step 3: Add the extracted text to the PDF document
            for (String paragraph : paragraphs) {
                pdfDocument.add(new Paragraph(paragraph));
            }

            // Step 4: Close the PDF document; this writes the final PDF structure
            pdfDocument.close();
            System.out.println("Conversion successful!");
        } catch (IOException | com.itextpdf.text.DocumentException e) {
            // Demo only: real code should log the error or rethrow it
            e.printStackTrace();
        }
    }
 
    public static void main(String[] args) {
        String docFilePath = "input.doc";
        String pdfFilePath = "output.pdf";
        convertDocToPdf(docFilePath, pdfFilePath);
    }
}

Code Explanation#

  1. Reading the DOC file: We use Apache POI's HWPFDocument and WordExtractor classes to read the DOC file and extract the text paragraphs.
  2. Creating a PDF document: We create a new Document object using iText and a PdfWriter to write the content to the output PDF file.
  3. Adding text to the PDF: We iterate over the extracted paragraphs and add them to the PDF document using the add method.
  4. Closing the document: Finally, we close the PDF document and the input file stream.
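The text that comes out of step 1 is raw: in typical input, each extracted paragraph ends with a line break, and the text can contain control characters that Word uses internally (for example as table cell or field markers). A minimal sketch of a cleanup step that could run between steps 1 and 3 is shown below. `ParagraphCleaner` is a hypothetical helper written for this post, and the choice of which characters to drop is an assumption that you may need to adjust for your documents.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tidies raw paragraph strings before they are added to the PDF. */
final class ParagraphCleaner {

    /** Removes control characters and surrounding whitespace from one paragraph. */
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == 0x0B) {
                // Vertical tab: assumed here to be a manual line break inside a paragraph
                sb.append('\n');
            } else if (c == '\t' || c >= 0x20) {
                // Keep tabs and all non-control characters
                sb.append(c);
            }
            // Any other control character (CR, LF, cell marks, ...) is dropped
        }
        return sb.toString().trim();
    }

    /** Cleans every paragraph and skips those that end up empty. */
    static List<String> cleanAll(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String p : paragraphs) {
            String cleaned = clean(p);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }
}
```

With this in place, the loop in step 3 would iterate over `ParagraphCleaner.cleanAll(paragraphs)` instead of the raw array. Note that skipping empty paragraphs also removes blank lines the author used for spacing, so keep them if vertical spacing matters to you.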

Common Pitfalls#

  1. Formatting Loss: The conversion extracts plain text from the DOC file and writes it to a new PDF, so formatting is not carried over. Fonts, custom styles, tables, and graphics in the original do not appear in the output unless you read them from the DOC file and recreate them with iText yourself.
  2. Encoding Issues: iText's default font is Helvetica with a Western European (WinAnsi) encoding, so characters outside that range, such as Cyrillic or CJK text, may be missing from the PDF output. For such documents, create each Paragraph with a font that covers the characters and is embedded with a Unicode (Identity-H) encoding.
  3. Memory Consumption: HWPFDocument loads the whole DOC file into memory, so large files, or many files converted in parallel, can consume a significant amount of memory. Size the JVM heap accordingly and limit how many conversions run at once.
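To make the encoding pitfall concrete, the sketch below checks whether a piece of extracted text fits into windows-1252, the code page behind the WinAnsi encoding that iText 5 uses for its default Helvetica font. It uses only the JDK; `WinAnsiCheck` is a hypothetical helper name. If a paragraph fails the check, that is a hint to create it with an embedded Unicode font (in iText 5, typically a `BaseFont` created with `BaseFont.IDENTITY_H`) instead of the default one.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical helper: detects text that a WinAnsi (windows-1252) font cannot represent. */
final class WinAnsiCheck {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if every character of the text can be encoded in windows-1252. */
    static boolean fitsWinAnsi(String text) {
        // Encoders are not thread-safe, so create a fresh one per call
        return CP1252.newEncoder().canEncode(text);
    }

    /** Returns the index of the first character that cannot be encoded, or -1 if all fit. */
    static int firstUnsupportedIndex(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

Running this check over the extracted paragraphs before writing the PDF lets you log which documents need a different font, rather than discovering missing characters in the output later.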

Best Practices#

  1. Test with Different Documents: Test the conversion process with a variety of DOC files, including those with different formatting and content, to ensure that the output PDF meets your requirements.
  2. Handle Exceptions Properly: Catch IOException and DocumentException as in the code example, but in production code log the failure or rethrow it to the caller instead of only printing the stack trace, so that a failed conversion is not mistaken for a successful one.
  3. Optimize Memory Usage: Close streams and documents as soon as a conversion finishes. When converting many large DOC files, process them one at a time or with a bounded thread pool rather than all at once.
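One way to combine the first two practices is a small batch runner that converts a set of test documents and records which ones failed instead of stopping at the first error. The sketch below is illustrative only: `BatchConverter` and its `Converter` interface are hypothetical names, and the actual conversion is passed in as a parameter so that the runner does not depend on iText or Apache POI.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: runs a conversion over many files and collects the failures. */
final class BatchConverter {

    /** The conversion step; it is expected to throw when a file cannot be converted. */
    @FunctionalInterface
    interface Converter {
        void convert(Path doc, Path pdf) throws Exception;
    }

    /** Derives the output path by replacing the file extension with ".pdf". */
    static Path pdfPathFor(Path doc) {
        String name = doc.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return doc.resolveSibling(base + ".pdf");
    }

    /** Converts each file in turn; returns the files that failed with their exceptions. */
    static Map<Path, Exception> convertAll(List<Path> docs, Converter converter) {
        Map<Path, Exception> failures = new LinkedHashMap<>();
        for (Path doc : docs) {
            try {
                converter.convert(doc, pdfPathFor(doc));
            } catch (Exception e) {
                // Record the failure and continue with the next file
                failures.put(doc, e);
            }
        }
        return failures;
    }
}
```

For the failures to show up in the returned map, the converter passed in has to let its exceptions propagate. A method that catches them internally and only prints the stack trace would make every file look successful.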

Conclusion#

iText cannot read DOC files on its own, but combined with Apache POI it can turn the text of a DOC file into a PDF: POI reads the document and extracts its paragraphs, and iText writes them to the output file. The result contains the text only, so be aware of the pitfalls around formatting, character encoding, and memory, and follow the best practices above. With the concepts and code example in this blog post, you should be able to implement this functionality in your Java applications.

FAQ#

Can iText directly convert a DOC file to a PDF?#

No, iText does not have native support for reading DOC files. You need to use other libraries like Apache POI to extract text from the DOC file and then use iText to create a PDF.

How can I preserve the formatting of the DOC file in the PDF?#

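Preserving all formatting is difficult with this approach, because the example copies only plain text. Apache POI exposes more than text: a document's Range gives access to paragraphs, character runs (with properties such as bold, italic, and font size), and tables, which you can map to iText fonts and tables yourself. Even then, some loss of formatting is likely for complex documents.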

What if the DOC file is very large?#

If the DOC file is large, you may encounter memory issues, because HWPFDocument loads the entire file into memory. Give the JVM enough heap (for example with the -Xmx option) and avoid converting several large files at the same time.

References#