Class ImagePdfToTextExtensions

java.lang.Object
io.github.astrapisixtynine.pdf.to.text.tess4j.ImagePdfToTextExtensions

public final class ImagePdfToTextExtensions extends Object
The class ImagePdfToTextExtensions converts a PDF file into a text file, using either direct text extraction or optical character recognition (OCR) through Tesseract.
  • Method Details

    • convertPdfToTextfile

      public static ConversionResult convertPdfToTextfile(File pdfFile, File outputDir, String datapath, String language) throws IOException, net.sourceforge.tess4j.TesseractException
      Converts a text-based or image-based PDF file to text using image processing and OCR.
      Parameters:
      pdfFile - the input PDF file
      outputDir - the directory where the output files will be stored
      datapath - the path to Tesseract data files
      language - the language to use for OCR
      Returns:
      the result of the conversion, containing the image files, the text files, and the final result text file
      Throws:
      IOException - if an I/O error occurs
      net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
    • getTextFiles

      public static List<File> getTextFiles(List<File> imageFiles, File resultDir, String datapath, String language) throws IOException, net.sourceforge.tess4j.TesseractException
      Converts image files into text files using Tesseract OCR.
      Parameters:
      imageFiles - the list of image files to be processed
      resultDir - the directory where the text files will be stored
      datapath - the path to Tesseract data files
      language - the language to use for OCR
      Returns:
      the list of generated text files
      Throws:
      IOException - if an I/O error occurs
      net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
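The mapping from image files to text files described above can be pictured with a minimal sketch. It is not the library's implementation: the `TextFilesSketch` class, the `Ocr` interface (a stand-in for the OCR step, for example a call to `extractTextFromImage`), the `.txt` naming scheme, and the UTF-8 encoding are all assumptions made for illustration.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

final class TextFilesSketch {

    /** Hypothetical stand-in for the OCR step; not part of the documented API. */
    @FunctionalInterface
    interface Ocr {
        String recognize(File imageFile) throws Exception;
    }

    /** Writes one text file per image into resultDir and returns the files in input order. */
    static List<File> writeTextFiles(List<File> imageFiles, File resultDir, Ocr ocr)
            throws Exception {
        // Create the result directory if it does not exist yet
        Files.createDirectories(resultDir.toPath());
        List<File> textFiles = new ArrayList<>();
        for (File imageFile : imageFiles) {
            // Assumed naming scheme: replace the image extension with ".txt"
            String name = imageFile.getName();
            int dot = name.lastIndexOf('.');
            String baseName = dot > 0 ? name.substring(0, dot) : name;
            File textFile = new File(resultDir, baseName + ".txt");
            // Assumed encoding: UTF-8
            Files.writeString(textFile.toPath(), ocr.recognize(imageFile), StandardCharsets.UTF_8);
            textFiles.add(textFile);
        }
        return textFiles;
    }
}
```

Passing the OCR step in as a function keeps the sketch runnable without a Tesseract installation; in real use the lambda would delegate to the OCR call with the chosen `datapath` and `language`.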
    • getTextContent

      public static String getTextContent(List<File> imageFiles, String datapath, String language) throws net.sourceforge.tess4j.TesseractException
      Converts image files into text using Tesseract OCR.
      Parameters:
      imageFiles - the list of image files to be processed
      datapath - the path to Tesseract data files
      language - the language to use for OCR
      Returns:
      a String containing the text recognized from the image files
      Throws:
      net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
    • extractTextFromImage

      public static String extractTextFromImage(File imageFile, String datapath, String language) throws net.sourceforge.tess4j.TesseractException
      Extracts text from a single image file using Tesseract OCR.
      Parameters:
      imageFile - the image file to process
      datapath - the path to Tesseract data files
      language - the language to use for OCR
      Returns:
      the extracted text
      Throws:
      net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
    • isTesseractInstalled

      public static boolean isTesseractInstalled()
      Checks whether Tesseract OCR is installed on the system by executing the "tesseract --version" command.
      Returns:
      true if Tesseract is installed and available in the system's PATH; false otherwise
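A check of this kind can be sketched with the standard `ProcessBuilder` API. This is an illustration under assumptions, not the library's actual code: the `CommandCheck` class, the `isCommandAvailable` helper, and the 10-second timeout are hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

final class CommandCheck {

    /** Returns true if the command starts and exits with code 0 within the timeout. */
    static boolean isCommandAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // discard output so the process cannot block
                    .start();
            // Assumed timeout: give up if the command has not finished after 10 seconds
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // The executable was not found on the PATH or could not be started
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the command as unavailable
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this helper, `CommandCheck.isCommandAvailable("tesseract", "--version")` mirrors the documented behavior: a missing executable surfaces as an `IOException` from `start()` and is reported as `false`.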