This is a summary of the research paper End-to-End Interpretation of the French Street Name Signs Dataset.
This is a model that takes multiple shots of street-view signs as input and outputs the street name in the format that will be shown directly in Google Maps, a fully end-to-end model. That covers reading the image, parsing the text, converting the text to the Google Maps standard, and combining the text from multiple images into the most accurate version. Pretty interesting problem and solution. This is one of the inspiration papers for the Tesseract LSTM model.
First of all, they broke the street sign transcription (image to text) into a simpler problem for their human moderators. A neural network detected the street signs and produced their bounding boxes. Multiple views of the same sign were then collected using the geo coordinates of the image captures. Each image was transcribed using OCR, reCAPTCHA, and humans in turn: the OCR output gave reCAPTCHA its basic data, humans verified the reCAPTCHA input, and incorrect transcriptions were forwarded to human operators. They never transcribed the text exactly as it appeared in the image, but the way they wanted it to be shown in Google Maps.
Recurrent Model - STREET
Then, using this dataset, they trained the STREET model (StreetView Tensorflow Recurrent End-to-End Transcription) on the end-to-end problem: a set of 4 views of a street sign as input, and the street name to be used in Maps as output.
CNN
Each input image is detiled into its 4 individual views, and two convolutions with max pooling reduce each view from 150x150 to 25x25.
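Here is a minimal sketch of that front end in TensorFlow/Keras. The filter counts, kernel sizes, and the assumption that the four 150x150 views arrive tiled side by side in one 600x150 strip are mine; only the 150x150 to 25x25 reduction comes from the description above.

```python
import tensorflow as tf

def detile(batch):
    # Assumes each example is a 600x150 strip holding four 150x150 views side
    # by side; split the width into the four individual views.
    return tf.split(batch, num_or_size_splits=4, axis=2)

# Two convolutions with max pooling; filter counts and kernel sizes are
# illustrative assumptions, not the paper's exact configuration.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=2),   # 150x150 -> 75x75
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=3),   # 75x75 -> 25x25
])

views = detile(tf.zeros([8, 150, 600, 3]))    # a batch of 8 tiled inputs
features = [cnn(v) for v in views]            # four [8, 25, 25, 64] feature maps
```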
Text Finding & Reading
Vertically summarizing Long Short-Term Memory (LSTM) cells are used to find text lines. A vertically summarizing LSTM is a summarizing LSTM that scans the input vertically. It is thus expected to compute a vertical summary of its input, which will be taken from the last vertical timestep.
Three different vertical summarizations are done and then combined later:
- Upward to find the top textline.
- Separate upward and downward LSTMs, with depth-concatenated outputs, to find the middle textline.
- Downward to find the bottom textline.
Although each vertically summarizing LSTM sees the same input, and could theoretically summarize the entirety of what it sees, they are organized this way so that they only have to produce a summary of the most recently seen information.
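As a rough sketch of what a vertically summarizing LSTM could look like, assuming feature maps of shape [batch, height, width, channels]; the unit count and the column-reshaping trick are my assumptions, not the paper's exact layers.

```python
import tensorflow as tf

def vertical_summary(features, units=32, upward=False):
    """Scan every column of the feature map along y and keep only the last
    vertical timestep, giving one summary vector per x position."""
    h, w, c = features.shape[1], features.shape[2], features.shape[3]
    columns = tf.transpose(features, [0, 2, 1, 3])   # [batch, w, h, c]
    columns = tf.reshape(columns, [-1, h, c])        # one sequence per column
    # With row 0 at the top, go_backwards=True scans bottom-to-top (upward),
    # so the kept last timestep sits near the top of the image.
    summary = tf.keras.layers.LSTM(units, go_backwards=upward)(columns)
    return tf.reshape(summary, [-1, w, units])       # [batch, w, units]

feats = tf.zeros([8, 25, 25, 64])                    # CNN output from the sketch above
top_line = vertical_summary(feats, upward=True)      # upward summary
bottom_line = vertical_summary(feats)                # downward summary
middle_line = tf.concat(                             # separate upward and downward
    [vertical_summary(feats, upward=True),           # LSTMs, depth-concatenated
     vertical_summary(feats)], axis=-1)
```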
Since the middle line is harder to find, it gets two LSTMs working in opposite directions. Each output from the CNN layers is passed to a separate bidirectional horizontal LSTM to recognize the text. Bidirectional LSTMs have been shown to read text with high accuracy. The outputs of the bidirectional LSTMs are concatenated in the x-dimension to string the text lines out in reading order.
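And a sketch of the reading step, assuming each found text line is now a sequence of per-x-position feature vectors; the unit count and placeholder shapes are assumptions.

```python
import tensorflow as tf

def read_line(line_features, units=64):
    """Read one text line with a bidirectional horizontal LSTM."""
    return tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True))(line_features)

# Placeholder line features, shaped like the vertical summaries above.
top_line = tf.zeros([8, 25, 32])
middle_line = tf.zeros([8, 25, 64])
bottom_line = tf.zeros([8, 25, 32])

# Each line gets its own bidirectional reader...
read = [read_line(line) for line in (top_line, middle_line, bottom_line)]

# ...and the read lines are concatenated along x, stringing them out in
# reading order (top, middle, bottom).
text_sequence = tf.concat(read, axis=1)   # [batch, 3 * width, 2 * units]
```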
Character Position Normalization & Combination of Individual Outputs
Since all four input images may have the text positioned differently, the network is given the ability to shuffle data in the x dimension by adding two more LSTM layers, one scanning left to right and one right to left.
After this, a unidirectional LSTM is used to combine the four views of each input image to produce the most accurate text. This is also the layer that learns the Title Case normalization. A 50% dropout is added between the reshapes for regularization.
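A sketch of these two stages, assuming each of the four views has already been read into a [batch, width, depth] sequence; how the views are merged before the combining LSTM (depth concatenation here) and the exact dropout placement are my assumptions.

```python
import tensorflow as tf

def normalize_positions(seq, units=64):
    """Let the network shuffle information along x: one LSTM scanning left to
    right and one right to left, with their outputs concatenated in depth."""
    ltr = tf.keras.layers.LSTM(units, return_sequences=True)(seq)
    rtl = tf.keras.layers.LSTM(units, return_sequences=True, go_backwards=True)(seq)
    rtl = tf.reverse(rtl, axis=[1])          # re-align the right-to-left output
    return tf.concat([ltr, rtl], axis=-1)

# Placeholder read-out sequences for the four views of the same sign.
views = [tf.zeros([8, 75, 128]) for _ in range(4)]
normalized = [normalize_positions(v) for v in views]

# Merge the four views in depth, apply 50% dropout for regularization, and let
# a single unidirectional LSTM combine them into the final text features
# (this is also where the Title Case normalization is learned).
merged = tf.keras.layers.Dropout(0.5)(tf.concat(normalized, axis=-1))
combined = tf.keras.layers.LSTM(128, return_sequences=True)(merged)
```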
Final Network
Paper
Here’s the full research paper with the important parts highlighted by me.
You can download the pdf for free from Scribd.