java.lang.Object

io.github.astrapisixtynine.pdf.to.text.tess4j.ImagePdfToTextExtensions

public final class ImagePdfToTextExtensions extends Object

The class ImagePdfToTextExtensions converts a PDF file into a text file, using either direct text extraction or Optical Character Recognition (OCR) with Tesseract.
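
This page does not say how the class chooses between direct extraction and OCR. A minimal sketch of one plausible arrangement, assuming the choice is a simple fallback when direct extraction yields no text, could look like the following. The class `OcrFallbackSketch`, the method `extractText`, and both extractor functions are hypothetical stand-ins, not part of this library.

```java
import java.io.File;
import java.util.function.Function;

// Hypothetical sketch only: this is NOT the library's implementation.
final class OcrFallbackSketch {

    private OcrFallbackSketch() {
    }

    /**
     * Tries direct text extraction first and falls back to OCR when the
     * directly extracted text is null or blank (assumed policy).
     */
    static String extractText(File pdfFile,
            Function<File, String> directExtractor,
            Function<File, String> ocrExtractor) {
        String direct = directExtractor.apply(pdfFile);
        if (direct != null && !direct.isBlank()) {
            return direct;
        }
        return ocrExtractor.apply(pdfFile);
    }

    public static void main(String[] args) {
        // The file is only passed through to the stand-in functions; it is never opened.
        File pdf = new File("example.pdf");
        String text = extractText(pdf, f -> "", f -> "text recovered by OCR");
        System.out.println(text);
    }
}
```

Passing the two strategies in as functions keeps the sketch free of PDF and OCR dependencies. In real use they would wrap a PDF text extractor and an OCR call such as extractTextFromImage.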

Method Summary

Modifier and Type

Method

Description

static ConversionResult

convertPdfToTextfile(File pdfFile, File outputDir, String datapath, String language)

Converts a text or image PDF file to text using image processing and OCR

static String

extractTextFromImage(File imageFile, String datapath, String language)

Extracts text from a single image file using Tesseract OCR

static String

getTextContent(List<File> imageFiles, String datapath, String language)

Converts image files into text using Tesseract OCR

static List<File>

getTextFiles(List<File> imageFiles, File resultDir, String datapath, String language)

Converts the given image files into text files using Tesseract OCR

static boolean

isTesseractInstalled()

Checks if Tesseract OCR is installed on the system by executing the "tesseract --version" command

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- convertPdfToTextfile
  
  public static ConversionResult convertPdfToTextfile(File pdfFile, File outputDir, String datapath, String language) throws IOException, net.sourceforge.tess4j.TesseractException
  
  Converts a text or image PDF file to text using image processing and OCR
  
  Parameters:
  
  pdfFile - the input PDF file
  
  outputDir - the directory where the output files will be stored
  
  datapath - the path to Tesseract data files
  
  language - the language to use for OCR
  
  Returns:
  
  the result of the conversion containing image files, text files, and the final result text file
  
  Throws:
  
  IOException - if an I/O error occurs
  
  net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
- getTextFiles
  
  public static List<File> getTextFiles(List<File> imageFiles, File resultDir, String datapath, String language) throws IOException, net.sourceforge.tess4j.TesseractException
  
  Converts the given image files into text files using Tesseract OCR
  
  Parameters:
  
  imageFiles - the list of image files to be processed
  
  resultDir - the directory where the text files will be stored
  
  datapath - the path to Tesseract data files
  
  language - the language to use for OCR
  
  Returns:
  
  the list of generated text files
  
  Throws:
  
  IOException - if an I/O error occurs
  
  net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
- getTextContent
  
  public static String getTextContent(List<File> imageFiles, String datapath, String language) throws net.sourceforge.tess4j.TesseractException
  
  Converts image files into text using Tesseract OCR
  
  Parameters:
  
  imageFiles - the list of image files to be processed
  
  datapath - the path to Tesseract data files
  
  language - the language to use for OCR
  
  Returns:
  
  the text recognized from the image files
  
  Throws:
  
  net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
- extractTextFromImage
  
  public static String extractTextFromImage(File imageFile, String datapath, String language) throws net.sourceforge.tess4j.TesseractException
  
  Extracts text from a single image file using Tesseract OCR
  
  Parameters:
  
  imageFile - the image file to process
  
  datapath - the path to Tesseract data files
  
  language - the language to use for OCR
  
  Returns:
  
  the extracted text
  
  Throws:
  
  net.sourceforge.tess4j.TesseractException - if an error occurs during OCR
- isTesseractInstalled
  
  public static boolean isTesseractInstalled()
  
  Checks if Tesseract OCR is installed on the system by executing the "tesseract --version" command
  
  Returns:
  
  true if Tesseract is installed and available in the system's PATH; false otherwise
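
The check described for isTesseractInstalled can be approximated with the standard `ProcessBuilder` API. The sketch below is an assumption about how such a check might be written, not the library's actual code. The class `TesseractCheckSketch`, its method names, the 10-second timeout, and the rule that only exit code 0 counts as success are all illustrative choices.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: this is NOT the library's implementation.
final class TesseractCheckSketch {

    private TesseractCheckSketch() {
    }

    /**
     * Runs the given command and reports whether it started and exited
     * with code 0 within the (assumed) 10-second timeout.
     */
    static boolean isCommandAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // output is not needed
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically thrown when the executable is not found on the PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Executes "tesseract --version", as the method description states. */
    static boolean isTesseractAvailable() {
        return isCommandAvailable("tesseract", "--version");
    }

    public static void main(String[] args) {
        System.out.println("Tesseract available: " + isTesseractAvailable());
    }
}
```

A check of this kind is a reasonable guard before calling the OCR methods, since a missing executable is then reported as `false` rather than as an exception.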
