Image Classification Task

This task is designed to demonstrate how we can use programming to solve a real-world problem.

We will extract images from PDF files and classify them into three categories: Text, Diagram, and Image.

Some sample PDF files can be downloaded here.

Examples of the image types are:

Text:

Diagram:

Image:

For this task we will use the OpenCV library.
The environment setup is described here and here.

For the Image class, we can further split each image into its individual photos.

This is an individual task, but we will work through it together during the session.

You can ask questions at any time by email or Google Hangouts, as well as during the training.

The expected output is as follows:

One directory per PDF file
Inside each directory, the images extracted from that PDF file
A text file in the same directory listing each image name and its type
Photos extracted from images of type Image, named with the prefix extracted_, for example extracted_###.jpg

The text file should look like this:

img_1.jpg Text

img_2.jpg Text

img_3.jpg Diagram

img_4.jpg Diagram

img_5.jpg Image

extracted_001.jpg Image

extracted_002.jpg Image
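As a minimal sketch, a file in this layout could be written as shown below. The function name write_labels and the file name labels.txt are hypothetical, since the task does not say what the text file should be called.

```python
import os
import tempfile


def write_labels(output_dir, labels, filename="labels.txt"):
    """Write one 'image_name image_type' line per image."""
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for image_name, image_type in labels:
            handle.write(f"{image_name} {image_type}\n")
    return path


# Illustrative entries only; real labels come from the classification step.
example = [("img_1.jpg", "Text"), ("img_3.jpg", "Diagram"), ("img_5.jpg", "Image")]
with tempfile.TemporaryDirectory() as folder:
    with open(write_labels(folder, example), encoding="utf-8") as handle:
        written = handle.read()
```

Keeping the name and type separated by a single space makes the file easy to read back with str.split.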

Some hints:

To extract images from PDF files, see the Linux command pdfimages

Then we need to loop through the files in a directory, for example:
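The sketch below shows one way pdfimages might be called from Python so that each PDF gets its own output directory. It assumes the tool is installed (it ships with poppler-utils). The function names, the img prefix, and the paths samples/report.pdf and output are hypothetical.

```python
import os
import subprocess


def pdfimages_command(pdf_path, output_root):
    """Build the pdfimages call and the per-PDF output directory path."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    output_dir = os.path.join(output_root, stem)
    # pdfimages takes an output prefix and appends a number and extension;
    # -j asks it to save JPEG-encoded images as .jpg files.
    command = ["pdfimages", "-j", pdf_path, os.path.join(output_dir, "img")]
    return command, output_dir


def extract_images(pdf_path, output_root):
    """Run pdfimages; requires the tool to be installed and on the PATH."""
    command, output_dir = pdfimages_command(pdf_path, output_root)
    os.makedirs(output_dir, exist_ok=True)
    subprocess.run(command, check=True)
    return output_dir


# Building the command does not need pdfimages, so it is safe to inspect.
command, output_dir = pdfimages_command("samples/report.pdf", "output")
```

With -j, images that are not stored as JPEG inside the PDF are still written as PPM or PBM files, and pdfimages numbers its output as img-000, img-001, and so on, so a conversion or renaming step may be needed to match the expected file names.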

for directory, subdirectories, files in os.walk(source_folder):

Loops are a common programming structure; see Python For Loops.
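To make the os.walk hint concrete, here is a small sketch that collects the image files under a folder. The function name find_images and the list of extensions are assumptions; adjust them to whatever the extraction step actually produces.

```python
import os
import tempfile

# Assumed set of extensions; pdfimages may write .ppm or .pbm as well as .jpg.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".pbm")


def find_images(source_folder):
    """Return sorted paths of image files under source_folder."""
    found = []
    for directory, subdirectories, files in os.walk(source_folder):
        for name in files:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                found.append(os.path.join(directory, name))
    return sorted(found)


# Demonstration on a temporary folder with two images and one text file.
with tempfile.TemporaryDirectory() as source_folder:
    os.makedirs(os.path.join(source_folder, "report"))
    for name in ("img_1.jpg", "img_2.jpg", "labels.txt"):
        open(os.path.join(source_folder, "report", name), "w").close()
    names = [os.path.basename(path) for path in find_images(source_folder)]
```

os.walk visits every subdirectory, so the same function works whether it is pointed at a single PDF's directory or at the folder that holds all of them.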