ML Solution to Image-to-Text Conversion

How Tensorway built a Machine Learning model that generates accurate descriptions of images

Cold blue neon living room with blue sofa, black stone coffee table, metal lamp in front of stone wall

Generates accurate alt texts for real estate listings
Detects accessibility issues in public spaces
Improves SEO for product catalogs
Developed by a team of DL experts

Story behind

Large real estate websites, like our client’s, need descriptions for many interior photos, and writing them manually and then moderating them takes a lot of time and effort. One solution is a trained model that generates accurate image captions without human involvement. However, the captions produced by the models available at the time were vague and primitive, so Tensorway’s Deep Learning team set out to develop a model that would meet our client’s needs.

Generating accurate text descriptions for images with Machine Learning is challenging. Top tech companies such as Microsoft, Google, and Meta aim to extract as much information as they can from images, videos, and other sources, but there is still no perfect algorithm. Our DL team has been working on bringing one closer to perfect.

Previous solution:

a room with a bed a chair and a window

Our solution:

a bedroom with a bed with white sheets and a desk and chair in front of brick wall

Previous solution:

a bedroom with a bed and a large window

Our solution:

a bedroom with a white bed a pink curtains on the window and blue rug on the floor

Goal

Based on one of the available solutions, develop an advanced image captioning model that distinguishes the color, texture, and materials of objects and surfaces and recognizes smaller objects in a picture. Such a model would be especially useful for visually impaired people, for people with a poor Internet connection, and for the companies using it, which would save considerable time and money.

Challenges

Existing image captioning solutions generated short texts that lacked detail. Different images often received similar captions, which made the captions less descriptive. These solutions also struggled with object detection and with spatial relations between objects. These shortcomings stemmed from the following challenges:

Lack of labeled data: no interior captioning databases were available.

Existing data sets had short and overly generalized descriptions.

The object detector misclassified objects and detected non-existent ones.

Training the algorithm required a lot of computational resources and time: reproducing it with state-of-the-art models like Oscar from Microsoft could take months.

Tensorway’s solution

Step 1

To address the problems above, the DL team first explored the existing image captioning data sets on which the algorithm had been trained. All of them had descriptions with an average length of nine words, which turned out to be too short for detailed captions.
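As a rough illustration of this kind of data set check, average caption length can be measured in a few lines of Python. This is a minimal sketch and not the team's actual tooling: the function name `average_caption_length` is hypothetical, the sample simply reuses two "previous solution" captions from this page, and words are counted by plain whitespace splitting.

```python
from statistics import mean


def average_caption_length(captions):
    """Return the mean number of whitespace-separated words per caption."""
    if not captions:
        raise ValueError("captions must not be empty")
    return mean(len(caption.split()) for caption in captions)


# Illustrative sample only: two short captions quoted on this page.
sample_captions = [
    "a room with a bed a chair and a window",
    "a bedroom with a bed and a large window",
]

average = average_caption_length(sample_captions)
print(f"Average caption length: {average} words")
```

Running the same function over a full captioning data set would show whether its descriptions are long enough to carry details such as color and material.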

The team then fine-tuned the models on additional images and descriptions. To build a custom data set, developers took pictures of interiors from a photo bank and paired each image with a text description.

Gray living room with sofa, coffee table, balcony and kitchenette

Spacious living room with black and white wooden furniture chairs, table and sofa

However, this experiment did not improve the results, probably because the team had too few custom images. The team concluded that fine-tuning alone was not enough: notable results also required high-quality descriptions for the images used, which in turn meant retraining the language model.

Step 2

Next, to address the object detection issues, the team switched to a scene-graph object detection model. It not only fixed the object classification problem but also extracted more information about objects, such as their color, texture, and size, which made it possible to generate longer and more diverse descriptions. In addition, the detector was trained to detect previously unseen objects.

Previously used model

Scene-graph object detection model
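To make the difference concrete, the sketch below contrasts a label-only detection with one that also carries attributes. It is a simplified illustration under stated assumptions, not the output format of the scene-graph model the team used: the `DetectedObject` class and the `describe` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DetectedObject:
    """Hypothetical detection record: a class label plus optional attributes."""
    label: str
    attributes: list = field(default_factory=list)


def describe(obj):
    """Join attributes and label into a noun phrase, e.g. 'blue rug'."""
    return " ".join([*obj.attributes, obj.label])


# A label-only detection yields a generic phrase.
plain = DetectedObject("rug")
# A detection with attributes yields a more specific one.
rich = DetectedObject("rug", ["blue"])

print(describe(plain))
print(describe(rich))
```

The more attributes a detector attaches to each object, the more material a caption generator has for producing detailed text.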

Step 3

Then, the team decided to use VIVO (VIsual VOcabulary pre-training) from Microsoft. Its advantage is that it stores images and texts of the most important objects together and uses previously learned captions to describe items it has not seen before. Modifying VIVO to generate longer captions was a great opportunity to create the strongest possible solution.

Previous solution

...a refrigerator

Our solution

...a couple of washing machines...

Previous solution

a table and chairs sitting under an umbrella

Our solution

a patio with a couple of lawn chairs and a red umbrella ... in front of a white building

As a result...

The customized caption generation algorithm produces more elaborate descriptions and recognizes, for example, wall-facing types and the color of rugs. The model now delivers its main function: generating accurate image descriptions that capture many details. For even greater accuracy, our DL team enabled detection of objects from predefined lists: the detector checks whether a particular object from the list is present in a picture.
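The predefined-list check can be pictured with a short Python sketch. It assumes the detector's output is already available as a list of text labels; the function `find_listed_objects` and the sample labels are hypothetical and only show the idea of matching detections against a checklist, not the team's implementation.

```python
def find_listed_objects(detected_labels, predefined_list):
    """Map each name in the predefined list to True if it was detected.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    detected = {label.strip().lower() for label in detected_labels}
    return {name: name.strip().lower() in detected for name in predefined_list}


# Hypothetical detector output for a single picture.
detections = ["Sofa", "coffee table", "lamp"]
# Hypothetical predefined list of objects to look for.
checklist = ["sofa", "fireplace", "lamp"]

result = find_listed_objects(detections, checklist)
for name, present in result.items():
    print(f"{name}: {'present' if present else 'not found'}")
```

The same pattern would apply to an accessibility checklist, for example with entries such as "ramp" and "handrail".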

Other possible applications

Although the model was developed primarily for the real estate industry, it can be applied far beyond it. For accessibility control, it can analyze photos of public spaces, detect the presence of ramps, handrails, etc., and filter buildings based on how accessible they are to people with disabilities. For online stores of all kinds, the model can simplify SEO by auto-generating alt texts and product descriptions, thus improving the visibility of listings in search engines.

Need qualified professionals?

What you’ve seen is just the tip of the iceberg of what Tensorway’s Deep Learning team is capable of. The team consists of dedicated professionals with a strong mathematical background, scientific publications, multiple Kaggle competition medals, and open-source contributions. Their areas of expertise include:

Natural Language Processing (NLP)

Audio

Tabular Data

Recommendation Systems

Reinforcement Learning

Computer Vision

Contact Us