Use In-Memory Buffers to Avoid Disk IO - Develop Freedom: Blog by Shubham Chaudhary

At Zomato, we handle a vast number of images, with close to a hundred thousand new images daily. Often, we need to download, process, and then pass these images to our models. The traditional workflow involves fetching an image from a URL, saving it to a file, and then passing that file path for further processing.

Traditional Workflow

import logging
import os
import tempfile

import cv2
import requests

def download_image(url):
    logging.info('Downloading image from url: %s', url[:100])
    response_object = requests.get(url)
    file_descriptor, filename = tempfile.mkstemp(prefix='image-', suffix='.jpg')
    logging.info('Saving file: %s', filename)
    with open(file_descriptor, mode='wb') as f:
        f.write(response_object.content)
    return filename

url = 'https://chaudhary.page.link/test-zomato-img'
image_path = download_image(url)

img = cv2.imread(image_path)
resized_img = cv2.resize(img, (299, 299))
# preprocess(resized_img)
# prediction_score = model.predict(resized_img)
os.remove(image_path)

While this approach works for a few images, it creates significant unnecessary disk IO when processing millions of images at Zomato’s scale. Additionally, in a dockerized environment, it results in numerous temporary files.

Optimized Workflow with In-Memory Buffers

To eliminate unnecessary disk IO, we can use in-memory buffers. In Python, io.BytesIO allows you to create a buffer in RAM, which can be used like a file pointer and is automatically deleted when closed or goes out of context when using context manager.

from io import BytesIO

import cv2
import numpy as np
import requests


url = 'https://chaudhary.page.link/test-zomato-img'
response_object = requests.get(url)
image_data = BytesIO(response_object.content)
file_bytes = np.asarray(bytearray(image_data.read()), dtype=np.uint8)
img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
image_data.close()
resized_img = cv2.resize(img, (299, 299))
# preprocess(resized_img)
# prediction_score = model.predict(resized_img)

Using imdecode, we can simplify the process further, eliminating the need for a bytes IO buffer.

import cv2
import numpy as np
import requests

url = 'https://chaudhary.page.link/test-zomato-img'
response_object = requests.get(url)
file_bytes = np.asarray(bytearray(response_object.content), dtype=np.uint8)
img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
resized_img = cv2.resize(img, (299, 299))
# preprocess(resized_img)
# prediction_score = model.predict(resized_img)

Performance Analysis

To analyze the performance of these methods, I conducted a simple test. Here are the results on my system:

With File IO: 35.4 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
With Bytes IO: 35.1 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
With Direct Decode: 34.6 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The Bytes IO reduce unnecessary disk IO, which isn’t measured in this test, even though the performance difference is minimal. Splitting the process into multiple scripts and adding strace can help see the number of OPEN calls, which will be lower in the in-memory methods.

You can find the code to generate these performance numbers here. Let me know if you achieve similar results.

Conclusion

Using in-memory buffers can significantly optimize image processing workflows by reducing disk IO. This approach is especially beneficial at large scales, such as at Zomato, where it can lead to considerable performance improvements and resource savings.