In computer vision, we normally solve one of the following problems:

- **Image Classification** is the simplest task, in which we need to classify an image into one of several predefined categories, for example, distinguish a cat from a dog in a photograph, or recognize a handwritten digit.
- **Object Detection** is a somewhat harder task, in which we need to find known objects in the picture and localize them, that is, return the **bounding box** for each recognized object.
- **Segmentation** is similar to object detection, but instead of giving a bounding box we need to return an exact pixel mask outlining each recognized object.


*Image from [CS231n Stanford Course](https://cs231n.stanford.edu/)*

## Images as tensors

Computer vision works with images. As you probably know, images consist of pixels, so they can be thought of as a rectangular collection (array) of pixels.

In the first part of this module, we'll deal with handwritten digit recognition. We'll use the MNIST dataset, which consists of grayscale images of handwritten digits, 28×28 pixels each. Each image can be represented as a 28×28 array, whose elements denote the intensity of the corresponding pixel, either on a scale from 0 to 1 (in which case floating-point numbers are used) or from 0 to 255 (integers). A popular Python library called `numpy` is often used for computer vision tasks, because it lets you operate on multidimensional arrays efficiently.

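As a small sketch (using a synthetic array rather than the real MNIST data), a grayscale image is just a 2-D `numpy` array of intensities:

```python
import numpy as np

# A synthetic 28x28 grayscale "image" (not real MNIST data):
# one intensity value per pixel, stored as 8-bit integers in 0..255
image = np.zeros((28, 28), dtype=np.uint8)
image[10:18, 10:18] = 255  # a bright square in the middle

print(image.shape)  # (28, 28)
print(image.dtype)  # uint8

# The same image rescaled to floating-point values in [0, 1]
image_float = image / 255.0
print(image_float.max())  # 1.0
```
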
To deal with color images, we need some way to represent colors. In most cases, we represent each pixel by 3 intensity values, corresponding to the Red (R), Green (G), and Blue (B) components. This color encoding is called RGB, and thus a color image of size W×H will be represented as an array of size H×W×3 (sometimes the order of components might be different, but the idea is the same). In array representation, the height (number of rows) comes before the width (number of columns), which is the opposite of the common W×H image convention.

Multi-dimensional arrays are also called **tensors**. Using tensors to represent images has another advantage: we can use an extra dimension to store a sequence of images. For example, to represent a video fragment consisting of 200 frames of 800×600 pixels (width × height), we may use a tensor of size 200×600×800×3, that is, frames × height × width × channels. Note that, as above, the tensor uses H×W (row-major) order, not the W×H convention common in image editors. This ordering is known as `channels_last` and is the default in TensorFlow; some other frameworks place channels before height and width (`channels_first`).

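To see the two layouts side by side, here is a sketch using an all-zero placeholder tensor rather than real video data:

```python
import numpy as np

frames, height, width, channels = 200, 600, 800, 3

# channels_last (TensorFlow default): frames x height x width x channels
video = np.zeros((frames, height, width, channels), dtype=np.uint8)
print(video.shape)  # (200, 600, 800, 3)

# channels_first: move the channel axis in front of height and width
video_cf = np.moveaxis(video, -1, 1)
print(video_cf.shape)  # (200, 3, 600, 800)
```

`np.moveaxis` returns a view, so converting between the two layouts does not copy the (potentially large) underlying data.
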
```python
import tensorflow as tf
import keras
import matplotlib.pyplot as plt
import numpy as np

# Print the installed TensorFlow version
print(tf.__version__)
```

We're going to use the [Keras](https://keras.io/) framework for our experiments. Throughout this module, we use `import keras` (the standalone Keras 3 import style), which requires **TensorFlow 2.16 or later** (or a standalone installation via `pip install keras>=3.0`). If you're using an older TensorFlow 2.x version, replace `import keras` with `from tensorflow import keras`.

```python
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Output: (60000, 28, 28) (60000,)
print(x_train.shape, y_train.shape)
# Output: (10000, 28, 28) (10000,)
print(x_test.shape, y_test.shape)
```

## Visualize the digits dataset

Now that we have downloaded the dataset, we can visualize some of the digits:

```python
fig, ax = plt.subplots(1, 7)
for i in range(7):
    ax[i].imshow(x_train[i])
    ax[i].set_title(y_train[i])
    ax[i].axis('off')
# Display a row of seven handwritten digit images with their labels
plt.show()
```

## Dataset structure

We have a total of 60,000 training images and 10,000 test images, and each image has a size of 28×28 pixels:

```python
print('Training samples:', len(x_train))
print('Test samples:', len(x_test))
print('Tensor size:', x_train[0].shape)
print('First 10 digits are:', y_train[:10])
print('Type of data is', type(x_train))
# Output:
# Training samples: 60000
# Test samples: 10000
# Tensor size: (28, 28)
# First 10 digits are: [5 0 4 1 9 2 1 3 1 4]
# Type of data is <class 'numpy.ndarray'>
```

As you can see, the data is stored in `numpy` arrays. Each pixel intensity is represented by an integer value between 0 and 255:

```python
print('Min intensity value:', x_train.min())
print('Max intensity value:', x_train.max())
# Output:
# Min intensity value: 0
# Max intensity value: 255
```

The values lie between 0 and 255 because each pixel is stored as an 8-bit integer. In many cases, especially when working with neural networks, it's more convenient to scale all values to the range [0, 1] by dividing by 255. This process is called **normalization**:

```python
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0
# Pixel values are now floating-point numbers in the range [0, 1]
```

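As a quick sanity check, here is a sketch with a small synthetic array (not the real dataset) showing that the division maps 8-bit intensities into [0, 1]:

```python
import numpy as np

pixels = np.array([0, 128, 255], dtype=np.uint8)  # synthetic intensities
pixels_norm = pixels.astype(np.float32) / 255.0

print(pixels_norm.min(), pixels_norm.max())  # 0.0 1.0
```
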
Now we have the data, and we're ready to start training our first neural network!