Cognitive Services: Smart image tagging with AI (part 1)

The challenge sounds simple: How to generate a collection of keywords describing an image?

By description I don't mean its metadata like width or MIME type, but rather the actual content of it. Of course, solutions like asking your mom to spend her afternoon doing that manually for you are not considered good practice - especially when talking about thousands of images and you hoping for any presents next Christmas.

Talking seriously, this task sounds ideal for usage of AI/ML which have become very hot topics these days and are trending to be applicable in a rapidly growing number of areas. I have already partially covered this topic providing a good example of Microsoft Cognitive Services utilisation in my previous post. I encourage you to read it if you'd like to broaden your knowledge around usage of it with Sitecore and SXA.

To get to the point and answer the question asked in the first caption: let's make use of Tag Image endpoint to get a collection of relevant tags. In order to have some flexibility with applying the tagging solution let's prepare 2 implementations: one accepting a serialized image while the other a URL to an image available online:

using Sitecore91.Foundation.CognitiveServices.Models;

namespace Sitecore91.Foundation.CognitiveServices.ComputerVision
{
    public interface IComputerVisionService
    {
        TagModel Tag(byte[] image, string language);
        TagModel Tag(string imageUrl, string language);
    }
}

The TagModel class is a C# representation of the API JSON result:

using System.Collections.Generic;

namespace Sitecore91.Foundation.CognitiveServices.Models
{
    public class TagModel
    {
        public List<Tag> tags { get; set; }
        public string requestId { get; set; }
    }

    public class Tag
    {
        public string name { get; set; }
        public double confidence { get; set; }
    }
}

TIP: This model is fairy simple, but for more complex JSON structures you can simply generate it yourself by processing a sample success JSON result from Image Tag docs with json2sharp. Just take a look at different endpoints like Analyze Image where the JSON result maps to a structure of almost 20 C# classes.

Now, here's the service:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using Sitecore.Diagnostics;
 
namespace Sitecore91.Foundation.CognitiveServices.ComputerVision
{
    public class ComputerVisionService : IComputerVisionService
    {
        private readonly string _subscriptionKey = "<-SUBSCRIPTION KEY->";
        private readonly string _serviceUrl = "https://<-AZURE REGION->.api.cognitive.microsoft.com/vision/v2.0/";
 
        public TagModel Tag(byte[] image, string language)
        {
            var apiMethod = "tag";
            var requestUri = _serviceUrl + apiMethod + $"?language={language}";

            try
            {
                using (var client = new HttpClient())
                {
                    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

                    using (var content = new ByteArrayContent(image))
                    {
                        content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
                        using (var response = client.PostAsync(requestUri, content).GetAwaiter().GetResult())
                        {
                            if (response.IsSuccessStatusCode)
                            {
                                var result = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                                return JsonConvert.DeserializeObject<TagModel>(result);
                            }
                            var errorMessage = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                            Log.Error(errorMessage, this);
                            return null;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Log.Error(e.Message, this);
                return null;
            }
        }

        public TagModel Tag(string imageUrl, string language)
        {
            var apiMethod = "tag";
            var requestUri = _serviceUrl + apiMethod + $"?language={language}";

            try
            {
                using (var client = new HttpClient())
                {
                    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

                    using (var content = new StringContent($"{{url:\"{imageUrl}\"}}"))
                    {
                        content.Headers.ContentType = new MediaTypeHeaderValue("application/json");
                        using (var response = client.PostAsync(requestUri, content).GetAwaiter().GetResult())
                        {
                            if (response.IsSuccessStatusCode)
                            {
                                var result = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                                return JsonConvert.DeserializeObject<TagModel>(result);
                            }
                            var errorMessage = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
                            Log.Error(errorMessage, this);
                            return null;
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Log.Error(e.Message, this);
                return null;
            }
        }
    }
}

If you took a look at my previous post mentioned before, you'll find this piece of code looking almost identical. We'll, in fact all we do is just another API call with some JSON result mapping afterwards. To test if it works fine I prepared a sample controller action registered with custom MVC routing:

public ActionResult TestTagging()
{
    var computerVisionService = new ComputerVisionService();
    var imageUrl = "https://images.pexels.com/photos/849835/pexels-photo-849835.jpeg";
    var model = computerVisionService.Tag(imageUrl, "en");

    return new ContentResult { Content = string.Join("<br />", model.tags.Select(x => $"{x.name}:{x.confidence}")) };
}

 

Now, for the image below:

We get the following collection of tags returned by the service:

sky:0.996140539646149
truck:0.954374551773071
car:0.951652050018311
outdoor:0.936659157276154
snow:0.905518352985382
blue:0.871806204319
transport:0.659421741962433
winter:0.175758534148281

The float value from 0.0-1.0 range next to each tag is its 'confidence', representing how 'sure' the AI is about the relevance of each assigned tag. It's very useful for accurate tagging, as finding a correct threshold allows you to tune the tagging relevance for your own needs.

Comments are closed