In a conventional art gallery, it’s easy to label individual pieces, guide people towards the ones they are going to like, and help them find what they might be looking for. At Zedge we have an enormous inventory of wildly diverse cultural artifacts from all over the world, collected over the last ten years and growing every day, which we use to populate our marketplace and help artists bring their work to people who will love what they have made.
On the back end, this means that our inventory of images and audio files is huge: well north of 20 million wallpapers, for example, and it's growing ever more quickly because everyone can make art using AI now.
Big Media Files, at Big Scale
With so many files, and each one so large, we are in a real pickle. There are many reasons we need to be able to describe – at scale – what's in those pictures:
- We need to know if they are NSFW, to keep our platform safe.
- We need to know if any are duplicates, so that we don't annoy our users or our artists.
- We need to know if a given one shows a shiny motorcycle or a zen teapot, so that we can serve good search results.
- Most challenging of all, we need to know the general vibe of each one, so that when someone is looking at one picture or listening to a ringtone, we can show them other items they might also enjoy.
Human Vision, Computer Vision, LLMs, and What It All Costs
Humans are very good at judging an image, but it would be madness to try employing real people to do this work at this scale and speed. So for many years we have been using computer vision – the kind of object recognition algorithm that has been widely available since the 2010s – which tends to interpret things very literally. Whether you know it or not, you have helped train computer vision anytime you've been challenged by a website to prove you're human by selecting every image in a collection of nine that contains a crosswalk. This kind of algorithm is pretty good at identifying physical objects (for example, a teapot), but it cannot tell you that it's a japanimé-style illustration of a teapot in a peaceful setting in the style of OnePiece. Art is not its strong suit.
More recent AI systems known as Large Language Models, or LLMs, on the other hand, are proving very good at describing not only the literal but also the aesthetic attributes of an image. But we have to use them carefully. Right now, consumers are able to play with these tools largely for free, but be assured they are VERY NOT FREE for use at industrial scale. Deploying a consistent LLM implementation for millions of prompts is astronomically expensive and wildly computationally heavy. With our massive user base constantly uploading images, Zedge can't afford something too expensive running 24/7 (more on LLMs later).
Happily, there is an older, far cheaper tool out there that scales the way we need and does a magical thing where it crushes the image into a string of numbers, and (trust me on this) that string of numbers contains all its attributes. The teapot, the blue lighting, the raindrop on the windowpane, the color balance, the composition, the illustration style. It does so in a completely blind and computery way – this tool has no idea what "japanimé" means but it does know that this string of numbers is pretty similar to THAT string of numbers. In fact, comparing strings of numbers is one of the cheapest and fastest things you can do with a computer, because THAT is what computers are actually FOR.
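To make that concrete, here is a minimal sketch of the idea, assuming a CLIP-style image embedding model loaded through the sentence-transformers library (the model choice and filenames are illustrative, not our production pipeline): each image is crushed into a fixed-length vector, and cosine similarity between two vectors gives a cheap "how alike are these?" score.

```python
# Minimal sketch: turn two images into embedding vectors and compare them.
# Assumes a CLIP-style model via sentence-transformers; not Zedge's actual setup.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # any image-embedding model would do

def embed(path: str) -> np.ndarray:
    """Crush an image into its 'string of numbers' (a fixed-length vector)."""
    return model.encode(Image.open(path), convert_to_numpy=True)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score how alike two embeddings are; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical filenames, purely for illustration.
score = cosine_similarity(embed("teapot_anime.jpg"), embed("teapot_photo.jpg"))
print(f"similarity: {score:.3f}")
```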
This process is called “embedding,” and it has proven to be our most scalable and successful image data intervention to date. Embedding enables us to score very cheaply how alike two images are, letting us flag items that are virtually identical (likely clones) and boot them off our platform. But it also lets us know (if they have a still-high-but-not-too-high score) that they are ... pleasantly related. This distinction lets us do a lot for users, such as serving them more images in the same vein, and boosting images that have similar (blind, mathematical) attributes.
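As a rough illustration of how those scores get used, here is a sketch of a thresholding policy. The cutoff values below are made-up placeholders, not the ones we actually run with; the point is simply that one range flags likely clones and a slightly lower range flags "pleasantly related" items.

```python
# Sketch of how similarity scores might be bucketed.
# Thresholds are illustrative placeholders, not Zedge's real tuning.
DUPLICATE_THRESHOLD = 0.97  # virtually identical: likely a clone
RELATED_THRESHOLD = 0.80    # high-but-not-too-high: "pleasantly related"

def classify_pair(similarity: float) -> str:
    if similarity >= DUPLICATE_THRESHOLD:
        return "duplicate"  # candidate for removal from the platform
    if similarity >= RELATED_THRESHOLD:
        return "related"    # candidate for "more like this" recommendations
    return "unrelated"

print(classify_pair(0.99))  # duplicate
print(classify_pair(0.86))  # related
```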
Moving Towards LLMs for Tagging
Zedge RTWP is moving toward using LLMs now, as they show much more promise in NSFW content detection and tagging. Image tagging is critical for our business. Without such tags, we cannot grasp, quantify, segment and report on what content works for our users and for our bottom line. It’s also how we can create good collections and return good search results.
Our uploading users usually add tags to each item they upload, BUT their goals are not always our goals, and their tags aren't always as complete and careful as we would prefer them to be.
Computer vision algorithms have so far done a good but very literal job of suggesting what's in the image, but even a mid-market LLM can review an image and retrieve the mood, art style, coloration, aesthetics AND literal objects, with a single prompt. Zedge’s Data Team is now figuring out the best, but still economically sane, LLM setup that we can use at our scale and speed, with consistent results. We’re incredibly excited for these enhanced tags to help us give users more exciting, nuanced and relevant art.
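To give a flavor of what "a single prompt" can look like, here is a sketch using the OpenAI Python SDK with a multimodal model. The model name, prompt wording, tag schema, and filename are all illustrative assumptions, not our production configuration.

```python
# Illustrative sketch of single-prompt image tagging with a multimodal LLM.
# Model, prompt, and output schema are assumptions, not Zedge's actual setup.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def tag_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # a hypothetical "mid-market" model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this wallpaper as JSON with keys: "
                    "objects, mood, art_style, coloration, nsfw (true/false)."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(tag_image("zen_teapot.jpg"))  # hypothetical filename
```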
Wish us luck.