ML Solution to Image-to-Text Conversion
How Tensorway created a model for generating accurate image descriptions with Machine Learning
Cold blue neon living room with blue sofa, black stone coffee table, metal lamp in front of stone wall
Story behind
Large real estate websites like our client's require descriptions for multiple pictures of interiors, and creating them manually and moderating them afterward takes much time and effort. A solution to this challenge is a trained model able to generate accurate image captions with no humans involved. However, descriptions created by the models available at that point were vague and primitive, so Tensorway's team specializing in Deep Learning took on developing one that met our client's needs.
Generating accurate text descriptions for images with Machine Learning poses certain challenges. Top tech companies - Microsoft, Google, Meta - aim to extract as much information as they can from images, videos, and other sources, but there is still no perfect algorithm. Developing one that comes closer to perfect is what our DL team has been doing.
Previous solution:
a room with a bed a chair and a window
Our solution:
a bedroom with a bed with white sheets and a desk and chair in front of brick wall
Previous solution:
a bedroom with a bed and a large window
Our solution:
a bedroom with a white bed a pink curtains on the window and blue rug on the floor
Goal
Based on one of the available solutions, develop an advanced image caption generation model that distinguishes the color, texture, and materials of objects and surfaces, and recognizes smaller objects in a picture. Such a model would be especially useful for visually impaired people, people with a poor Internet connection, and companies using it - for the latter, the model would save much time and money.
Challenges
The existing image captioning solutions generated short texts that lacked detail. Similar captions were generated for different images, which hurt their descriptiveness. These solutions also had issues with object detection and with spatial relations between objects. All of the above was due to the following challenges:
Lack of labeled data: no interior captioning databases were available.
Existing data sets had short and too generalized descriptions.
The object detector made mistakes in classifying objects and detected non-existent objects.
Training the algorithm required a lot of computational resources and time: reproducing it with state-of-the-art models like Microsoft's Oscar could take months.
Tensorway’s solution
Step 1
To eliminate the problems above, the DL team first explored the existing image captioning data sets the algorithm was trained on. It turned out they all had descriptions with an average length of nine words, which is too short for detailed captions.
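The length analysis above amounts to a simple word-count statistic over a data set's captions. A minimal sketch, with hypothetical sample captions modeled on the outputs shown earlier:

```python
def average_caption_length(captions):
    """Mean caption length in words across a data set."""
    if not captions:
        return 0.0
    return sum(len(c.split()) for c in captions) / len(captions)

# Hypothetical COCO-style captions; the data sets the team
# examined averaged about nine words per description.
sample = [
    "a room with a bed a chair and a window",
    "a bedroom with a bed and a large window",
]
print(average_caption_length(sample))  # 9.5
```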
To collect custom data sets, the team fine-tuned the models with additional images and descriptions: developers took interior pictures from a photo bank, mocked up their own data set, and added image-text pairs.
Gray living room with sofa, coffee table, balcony and kitchenette
Spacious living room with black and white wooden furniture chairs, table and sofa
However, this experiment did not bring improvements, probably because the team had too few custom images. The conclusion was that fine-tuning alone was not enough: notable results also required quality descriptions for the images used, which meant retraining the language model.
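The image-text pairs from this step can be represented as a small structure merged into a base captioning set. A minimal sketch, assuming the data set is a flat list of pairs and using a hypothetical word-count filter to keep only sufficiently detailed custom captions:

```python
from dataclasses import dataclass

@dataclass
class ImageTextPair:
    image_path: str   # path to the interior photo
    caption: str      # detailed human-written description

def merge_datasets(base, custom, min_words=10):
    """Combine a base captioning set with custom pairs,
    keeping only custom captions detailed enough to help."""
    detailed = [p for p in custom if len(p.caption.split()) >= min_words]
    return base + detailed

# Hypothetical pair, using one of the mocked-up captions above.
custom = [
    ImageTextPair("interiors/001.jpg",
                  "gray living room with sofa, coffee table, balcony and kitchenette"),
]
merged = merge_datasets([], custom)
print(len(merged))  # 1
```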
Step 2
Next, to address the object detection issues, the team adopted a scene-graph object detection model. It not only fixed the object classification problem but also extracted more information about objects, such as their color, texture, and size, allowing for longer and more diverse descriptions. Beyond that, the detector was trained to detect previously unseen objects.
Previously used model
Scene-graph object detection model
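A scene graph represents detections as attributed objects connected by spatial relations, which is what makes attribute-rich caption fragments possible. A minimal sketch of that data structure, with hypothetical class and function names:

```python
from dataclasses import dataclass, field

@dataclass
class DetectedObject:
    label: str
    attributes: list = field(default_factory=list)  # e.g. color, texture, size

@dataclass
class Relation:
    subject: DetectedObject
    predicate: str          # spatial relation, e.g. "in front of"
    obj: DetectedObject

def describe(relation):
    """Turn one scene-graph edge into a caption fragment."""
    def phrase(o):
        return " ".join(o.attributes + [o.label])
    return f"{phrase(relation.subject)} {relation.predicate} {phrase(relation.obj)}"

bed = DetectedObject("bed", ["white"])
wall = DetectedObject("wall", ["brick"])
print(describe(Relation(bed, "in front of", wall)))  # white bed in front of brick wall
```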
Step 3
Then the team decided to use VIVO (VIsual VOcabulary pre-training) from Microsoft. Its advantage is that it stores images and texts of the most important objects together and uses previously learned captions to describe items it has never seen before. Modifying VIVO to generate longer captions was a great opportunity to create the strongest possible solution.
Previous solution
...a refrigerator
Our solution
...a couple of washing machines...
Previous solution
a table and chairs sitting under an umbrella
Our solution
a patio with a couple of lawn chairs and a red umbrella ... in front of a white building
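The key idea behind VIVO's generalization can be illustrated with a toy lookup: detector tags map into a visual vocabulary, and learned caption patterns are reused for tags the captioner never saw paired with text. Everything here is hypothetical and greatly simplified (VIVO actually learns joint embeddings, not lookup tables):

```python
# Hypothetical caption patterns learned from objects seen during training.
templates = {
    "appliance": "a couple of {tag}s in a laundry room",
    "furniture": "a room with a {tag}",
}

# Visual vocabulary: detector tags mapped to broad categories, including
# tags never seen in caption training (e.g. "washing machine").
visual_vocab = {
    "refrigerator": "appliance",
    "washing machine": "appliance",
    "sofa": "furniture",
}

def caption_from_tag(tag):
    """Name an object via its category's learned pattern, even if
    the captioner never saw this tag paired with text."""
    category = visual_vocab.get(tag)
    if category is None:
        return "an unidentified object"
    return templates[category].format(tag=tag)

print(caption_from_tag("washing machine"))  # a couple of washing machines in a laundry room
```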
As a result...
More elaborate descriptions were obtained by developing a customized caption generation algorithm that recognizes, for example, wall facing types and the colors of rugs. The model is now capable of delivering on its main function - generating accurate image descriptions that capture many details. For even more accuracy, our DL team enabled detection of objects from predefined lists: the detector checks whether a particular object from the list is present in a picture.
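The predefined-list check reduces to comparing detector output against a required-object list. A minimal sketch under that assumption, with hypothetical labels:

```python
def check_required_objects(detected_labels, required):
    """Report which objects from a predefined list appear
    among the detector's output labels."""
    found = set(detected_labels)
    return {obj: obj in found for obj in required}

# Hypothetical detector output for a bedroom photo.
detections = ["bed", "rug", "desk", "chair"]
required = ["bed", "rug", "window"]
print(check_required_objects(detections, required))
# {'bed': True, 'rug': True, 'window': False}
```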
Other possible applications
Although the model was developed primarily for the real estate industry, it can be applied far beyond it. For accessibility control, it can analyze photos of public spaces, detect the presence of ramps, handrails, etc., and filter buildings by how friendly they are to disabled people. For all kinds of online stores, the model can simplify SEO by auto-generating alt texts and product descriptions, thus improving listings' visibility on search engines.
Need qualified professionals?
What you've seen is just the tip of what Tensorway's Deep Learning team is capable of. The team consists of dedicated professionals with strong mathematical backgrounds, scientific publications, multiple Kaggle competition medals, and open-source contributions. Their areas of expertise include: