Task Explanation
This task is designed to demonstrate how we can use programming to solve a real-world problem.
We will extract images from PDF files and classify them into three categories:
- Text
- Diagram
- Image
Some sample PDF files can be downloaded here.
Examples of the image types are:
Text:
Diagram:
Image:
For this task we will use the OpenCV library.
The environment setup is described here and here.
For the Image class, we can further split each image into its individual photos, for example:
This is an individual task but we will collaborate on this during the session.
You can ask questions at any time via email or Google Hangouts, and also during the training.
The expected output is as follows:
- One directory per PDF file
- Inside the directory, the images extracted from the PDF file
- The directory should also contain a text file listing each image name and its image type
- Photos extracted from images of type Image should be prefixed with "extracted", for example: extracted_###.jpg
The text file should look like this:
img_1.jpg Text
img_2.jpg Text
img_3.jpg Diagram
img_4.jpg Diagram
img_5.jpg Image
extracted_001 Image
extracted_002 Image
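A minimal sketch of writing such a text file is shown below. It assumes one image name and one image type per line, separated by a space. The file name `labels.txt` and the function name `write_labels` are hypothetical, since the task does not prescribe them.

```python
import os

# The three categories named in the task.
VALID_TYPES = ("Text", "Diagram", "Image")


def write_labels(output_dir, labels, filename="labels.txt"):
    """Write one '<image name> <image type>' line per entry and return the file path."""
    # Validate everything first so a bad entry does not leave a half-written file.
    for name, image_type in labels:
        if image_type not in VALID_TYPES:
            raise ValueError(f"unknown image type for {name}: {image_type}")
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for name, image_type in labels:
            handle.write(f"{name} {image_type}\n")
    return path
```

For example, `write_labels("report", [("img_1.jpg", "Text"), ("img_5.jpg", "Image")])` would create `report/labels.txt` with two lines, provided the `report` directory already exists.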
Some hints:
To extract images from PDF files, have a look at the Linux command pdfimages.
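As a rough sketch, pdfimages can be called from Python with the `subprocess` module. The helper names, the `img` prefix, and the directory layout below are assumptions made for illustration. Note that pdfimages names its output files itself (for example `img-000.jpg`), so the files may need renaming to match the expected output.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_dir, prefix="img"):
    """Return the argument list for one pdfimages call (nothing is executed here)."""
    # -j asks pdfimages to write JPEG-encoded images as .jpg files;
    # images stored in other encodings are written in its default PPM/PBM format.
    return ["pdfimages", "-j", str(pdf_path), str(Path(output_dir) / prefix)]


def extract_images(pdf_path, output_root):
    """Extract the images of one PDF into <output_root>/<PDF name without extension>/."""
    if shutil.which("pdfimages") is None:
        # On many Linux distributions the tool is installed with the poppler-utils package.
        raise RuntimeError("pdfimages was not found on the PATH")
    output_dir = Path(output_root) / Path(pdf_path).stem
    output_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdfimages_command(pdf_path, output_dir), check=True)
    return sorted(output_dir.iterdir())
```

Calling `extract_images("samples/report.pdf", "output")` would create `output/report/` and return the extracted files, assuming the PDF exists and pdfimages is installed.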
Then we need to loop through the files in a directory, for example:
for directory, subdirectories, files in os.walk(source_folder):
Loops are a common programming structure; see Python For Loops.
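The `os.walk` loop above can be fleshed out as in the following sketch, which collects every PDF file under a folder and derives the per-PDF output directory. The function names and the `output_root` parameter are hypothetical. Only `os.walk` and `source_folder` come from the hint.

```python
import os


def find_pdfs(source_folder):
    """Return the paths of all PDF files under source_folder, sorted."""
    pdf_paths = []
    # os.walk visits source_folder and every folder below it.
    for directory, subdirectories, files in os.walk(source_folder):
        for name in files:
            # Compare in lower case so that .PDF is found as well as .pdf.
            if name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(directory, name))
    return sorted(pdf_paths)


def output_directory_for(pdf_path, output_root):
    """Return the directory for one PDF: <output_root>/<PDF name without extension>."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_root, stem)
```

A driver script could call `find_pdfs(source_folder)` and create `output_directory_for(path, "output")` for each result with `os.makedirs(..., exist_ok=True)` before extracting the images.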