Class ImagePdfToTextExtensions
java.lang.Object
io.github.astrapisixtynine.pdf.to.text.tess4j.ImagePdfToTextExtensions
The class ImagePdfToTextExtensions provides functionality to convert a PDF file into a text file using either direct text extraction or Optical Character Recognition (OCR) with Tesseract.
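The description above names two routes to text: direct extraction and OCR. This page does not say how the library chooses between them, so the following is only a minimal sketch of one plausible strategy: use the directly extracted text when it is non-blank, otherwise fall back to OCR. The class `ExtractionStrategySketch`, the method `extractText`, and both suppliers are hypothetical and are not part of this API.

```java
import java.util.function.Supplier;

final class ExtractionStrategySketch {

    /**
     * Hypothetical helper: prefers directly extracted text and only
     * invokes the (usually slower) OCR supplier when that text is blank.
     */
    static String extractText(Supplier<String> directExtraction, Supplier<String> ocr) {
        String direct = directExtraction.get();
        if (direct != null && !direct.isBlank()) {
            return direct;
        }
        // Scanned PDFs typically carry no embedded text layer, so fall back to OCR.
        return ocr.get();
    }

    public static void main(String[] args) {
        // Stub suppliers stand in for a real PDF text extractor and a real OCR call.
        System.out.println(extractText(() -> "", () -> "text recognized by OCR"));
    }
}
```

Passing the two steps as suppliers keeps the OCR call lazy, so it only runs when direct extraction yields nothing usable.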
Method Summary
Modifier and Type | Method | Description
static ConversionResult | convertPdfToTextfile(File pdfFile, File outputDir, String datapath, String language) | Converts a text or image PDF file to text using image processing and OCR
static String | extractTextFromImage(File imageFile, String datapath, String language) | Extracts text from a single image file using Tesseract OCR
static String | getTextContent(List<File> imageFiles, String datapath, String language) | Converts image files into text using Tesseract OCR
static List<File> | getTextFiles(List<File> imageFiles, File resultDir, String datapath, String language) | Converts the given image files into text files using Tesseract OCR
static boolean | isTesseractInstalled() | Checks if Tesseract OCR is installed on the system by executing the "tesseract --version" command
Method Details
convertPdfToTextfile
public static ConversionResult convertPdfToTextfile(File pdfFile, File outputDir, String datapath, String language) throws IOException, net.sourceforge.tess4j.TesseractException
Converts a text or image PDF file to text using image processing and OCR.
Parameters:
pdfFile - the input PDF file
outputDir - the directory where the output files will be stored
datapath - the path to Tesseract data files
language - the language to use for OCR
Returns:
the result of the conversion, containing the image files, the text files, and the final result text file
Throws:
IOException - if an I/O error occurs
net.sourceforge.tess4j.TesseractException - if an error occurs during OCR

getTextFiles
public static List<File> getTextFiles(List<File> imageFiles, File resultDir, String datapath, String language) throws IOException, net.sourceforge.tess4j.TesseractException
Converts the given image files into text files using Tesseract OCR.
Parameters:
imageFiles - the list of image files to be processed
resultDir - the directory where the text files will be stored
datapath - the path to Tesseract data files
language - the language to use for OCR
Returns:
the list of generated text files
Throws:
IOException - if an I/O error occurs
net.sourceforge.tess4j.TesseractException - if an error occurs during OCR

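To illustrate the shape of this operation (a list of images in, one text file per image out), here is a minimal sketch that uses only the JDK. It is not the library's implementation: the class `TextFilesSketch`, the method `writeTextFiles`, the `ocr` function parameter, and the "same base name plus .txt" naming scheme are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class TextFilesSketch {

    /**
     * Hypothetical helper: writes one UTF-8 text file per image into resultDir.
     * The ocr function stands in for the real per-image OCR call.
     */
    static List<File> writeTextFiles(List<File> imageFiles, File resultDir,
            Function<File, String> ocr) throws IOException {
        // Create the result directory (and any missing parents) if needed.
        Files.createDirectories(resultDir.toPath());
        List<File> textFiles = new ArrayList<>();
        for (File image : imageFiles) {
            // Assumed naming scheme: strip the image extension and append ".txt".
            String name = image.getName();
            int dot = name.lastIndexOf('.');
            String baseName = dot > 0 ? name.substring(0, dot) : name;
            File textFile = new File(resultDir, baseName + ".txt");
            Files.write(textFile.toPath(), ocr.apply(image).getBytes(StandardCharsets.UTF_8));
            textFiles.add(textFile);
        }
        return textFiles;
    }
}
```

A caller could pass a stub such as `image -> "text of " + image.getName()` as the `ocr` argument to try the file handling without Tesseract installed.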
getTextContent
public static String getTextContent(List<File> imageFiles, String datapath, String language) throws net.sourceforge.tess4j.TesseractException
Converts image files into text using Tesseract OCR.
Parameters:
imageFiles - the list of image files to be processed
datapath - the path to Tesseract data files
language - the language to use for OCR
Returns:
the String that contains the recognized text
Throws:
net.sourceforge.tess4j.TesseractException - if an error occurs during OCR

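As a rough illustration of combining per-image OCR output into a single String, the sketch below appends each image's text in list order. Whether the library separates pages with a line break is not stated on this page, so the separator is an assumption, as are the names `TextContentSketch`, `joinText`, and the `ocr` function parameter.

```java
import java.io.File;
import java.util.List;
import java.util.function.Function;

final class TextContentSketch {

    /**
     * Hypothetical helper: runs the given OCR function on each image in order
     * and appends each result followed by the platform line separator.
     */
    static String joinText(List<File> imageFiles, Function<File, String> ocr) {
        StringBuilder content = new StringBuilder();
        for (File image : imageFiles) {
            content.append(ocr.apply(image)).append(System.lineSeparator());
        }
        return content.toString();
    }
}
```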
extractTextFromImage
public static String extractTextFromImage(File imageFile, String datapath, String language) throws net.sourceforge.tess4j.TesseractException
Extracts text from a single image file using Tesseract OCR.
Parameters:
imageFile - the image file to process
datapath - the path to Tesseract data files
language - the language to use for OCR
Returns:
the extracted text
Throws:
net.sourceforge.tess4j.TesseractException - if an error occurs during OCR

isTesseractInstalled
public static boolean isTesseractInstalled()
Checks if Tesseract OCR is installed on the system by executing the "tesseract --version" command.
Returns:
true if Tesseract is installed and available on the system's PATH; false otherwise
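
A check of this kind can be sketched with the JDK's `ProcessBuilder`. The following is an illustration only, not the library's code: the class `TesseractCheckSketch`, the method `isCommandAvailable`, and the 10-second timeout are assumptions. It treats the command as available when `<command> --version` starts and exits with code 0.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

final class TesseractCheckSketch {

    /**
     * Hypothetical helper: returns true if "<command> --version" can be
     * started and exits with code 0 within the (assumed) 10-second timeout.
     */
    static boolean isCommandAvailable(String command) {
        ProcessBuilder builder = new ProcessBuilder(command, "--version");
        // Merge stderr into stdout and discard both so the child cannot block on a full pipe.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        try {
            Process process = builder.start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically thrown when the executable cannot be found on the PATH.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isCommandAvailable("tesseract"));
    }
}
```

Running such a check before any OCR call lets an application report a missing Tesseract installation up front instead of failing partway through a conversion.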