Some say that the most peaceful period in history was the one following the release of Pokémon Go, the GPS-enabled augmented-reality smartphone game. If you walked outside in those days, in large parts of the world you were likely to see people playing it. In this post, I'll investigate how to identify Pokémon cards from the original card game, using the image processing functions in Wolfram Language.
This is the card that we will be using as an example:
The first step in detecting cards like this, just as in my last post about detecting Aruco markers, is to detect rectangular objects in the image. However, the Douglas-Peucker algorithm that I used in that article does not work well with the rounded corners of Pokémon cards. What I will do instead is similar to what Jordan Rabet described in Real-time augmentation of a children's card game: use a Canny edge detector to find the contours of the cards, use morphological component analysis to filter out unwanted contours, and use the Hough transform to detect the corners. Once we have the corners, we will use them to find the homography of the image, correct the perspective, extract the image of the Pokémon on the card, and classify it using a so-called perceptual hash.
A Canny edge detector is well suited for edge detection in this application. Take the picture of Vulpix, for example:
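In sketch form, this step is a single call; here img stands for the photo, and the second argument (the pixel range of the filter) is just a starting value that typically needs tuning:

```
(* Canny edge detection; the range parameter controls how coarse the
   detected edges are and may need adjusting per image type *)
edges = EdgeDetect[img, 2]
```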
This is a good start. To find the edge of the card, we can find the components (contiguous regions of white) and filter out those that do not have the elongation we expect of a Pokémon card. We can also require that the component is not enclosed by any other component, which rules out the components that appear inside the card.
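A minimal sketch of the filtering, assuming the edge image from the previous step is called edges; the elongation bounds here are illustrative guesses, not measured values:

```
(* Keep components with roughly card-like elongation that are not
   enclosed by any other component *)
filteredEdges = SelectComponents[edges,
  0.2 < #Elongation < 0.5 && #EnclosingComponentCount == 0 &]
```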
Depending on the background and other circumstances, we may have to add additional criteria, such as a threshold on the area of a component's bounding box, to filter out small objects. We may also have to use, for example, Closing to repair broken boundaries if EdgeDetect has trouble cleanly separating the boundary from the background; this can sometimes also be fixed by fine-tuning the second parameter of EdgeDetect. If you need to repair the boundary, it is a good idea to remove as many components as you can before applying Closing, because otherwise the boundary might connect to other components, which then become part of the boundary. Then, after applying Closing, run the filtering a second time, because the enclosure criterion only works if the boundary is unbroken. Generally, this kind of image processing involves a lot of fiddling, but if you have a specific type of image, such as a specific type of background, you can usually work it out. And as this article shows, if your background is a table and the lighting conditions are normal, then you are probably good to go with very little work.
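The repair-then-refilter sequence might look like this sketch (the kernel size is illustrative, and filteredEdges is the filtered edge image from before):

```
(* Bridge small gaps in the boundary, then apply the enclosure
   criterion again now that the boundary is closed *)
repaired = Closing[filteredEdges, DiskMatrix[2]];
filteredEdges = SelectComponents[repaired, #EnclosingComponentCount == 0 &]
```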
The next step is to detect the corners of the cards. These will let us determine the homography, which will allow us to correct the perspective and ultimately help us identify the Pokémon.
ImageLines uses the Hough transform to find lines in the image filteredEdges, which we defined earlier in this article. With rounded corners, it is not clear what constitutes a corner on the card itself; this is why Douglas-Peucker from my article on detecting Aruco markers fails. Finding lines along the boundaries, however, allows us to define the corners as the intersections of those lines.
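A sketch of the idea, assuming a single card in the image (img is the original photo). Four lines produce six pairwise intersections; the two pairs of near-parallel opposite edges either do not intersect or intersect far outside the image, so we keep only the intersections inside it:

```
(* Fit the four dominant boundary lines with the Hough transform and
   define the corners as their pairwise intersections *)
lines = InfiniteLine /@ ImageLines[filteredEdges, MaxFeatures -> 4];
intersections = Cases[
   RegionIntersection @@@ Subsets[lines, {2}],
   Point[p_] :> p];

(* Discard spurious intersections that fall outside the image *)
corners = Select[intersections,
   0 <= #[[1]] <= First[ImageDimensions[img]] &&
     0 <= #[[2]] <= Last[ImageDimensions[img]] &]
```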
Extracting the image
We will now extract the image of the Pokémon on the card; we shall call these images "portraits." Let's start by defining two functions:
The first function takes the list of corner points and reorders them so that the lower left corner comes first, followed by the upper left, the upper right, and the lower right. The second function takes the reordered points and finds a geometric transformation that maps those corner points into a new coordinate system. Here is an example:
This transform corrects the perspective by projecting the image onto a new plot range (the plot range specifies which part of a coordinate system is displayed; any points that fall outside it are not shown). The transform is designed to place the lower left corner of the Pokémon card at the lower left corner of the new plot range, the upper left corner of the card at the upper left corner of the plot range, and correspondingly for the other corners.
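The two functions could be sketched as follows. The target rectangle uses the card's physical dimensions, 63 mm × 88 mm, as the new coordinate system; the corner classification by x+y and x−y is a simple heuristic that works for modestly rotated cards:

```
(* Reorder corners: lower left, upper left, upper right, lower right *)
orderCorners[pts_] := {
   First @ SortBy[pts, Total],              (* lower left: min x+y *)
   First @ SortBy[pts, #[[1]] - #[[2]] &],  (* upper left: min x-y *)
   Last @ SortBy[pts, Total],               (* upper right: max x+y *)
   Last @ SortBy[pts, #[[1]] - #[[2]] &]};  (* lower right: max x-y *)

(* Perspective transform mapping the corners onto a 63 x 88 rectangle *)
cardTransform[pts_] := Last @ FindGeometricTransform[
    {{0, 0}, {0, 88}, {63, 88}, {63, 0}}, orderCorners[pts],
    TransformationClass -> "Perspective"];

(* Apply it, showing exactly the card in the new plot range *)
warped = ImagePerspectiveTransformation[img, cardTransform[corners],
   DataRange -> Full, PlotRange -> {{0, 63}, {0, 88}}]
```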
This establishes a new coordinate system that will be common to all Pokémon cards thus projected, regardless of how they were distorted by perspective in the original photo. The magic of this is that we can now extract features by simply specifying coordinates in this coordinate system. Thus, to extract the image of the Pokémon, we can simply do this:
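As a sketch, assuming warped is the perspective-corrected card from the previous step; the rectangle coordinates are rough guesses at where the artwork sits, in the same millimeter units:

```
(* The artwork occupies (roughly) the same rectangle on every card *)
portrait = ImageTrim[warped, {{7, 40}, {56, 77}}]
```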
And this will work regardless of what position or perspective distortion the card had in the original photo. This portrait will later be used to identify the Pokémon.
We are now going to write code for identifying which Pokémon card it is we're detecting. First, however, let's run our code on more Pokémon so that we have a couple of portraits to work with. We can do this by running the code on a photo with several cards in it. It works the same as before: we simply go through the same process for every component in the image. Let's say that we have the following photo:
As we go through the steps, we get the following intermediate and final results:
In the third image, I only drew one set of lines to make the picture clearer.
The portraits uniquely identify the cards. To use this fact, we first characterize the images in a suitable way, such as a hash or a pixel array, and create a dictionary of known portraits. Identifying a Pokémon then becomes a matter of extracting the portrait, as we have done, and comparing it against the dictionary to find the best match.
Perhaps the easiest way to compare images with the dictionary is to save the portraits themselves in the dictionary and then perform a pixel-wise distance measurement between the images. However, in this article we will use a more sophisticated method, called a perceptual hash, that I have previously written about on Mathematica Stack Exchange. Using it is easy: we simply run the function on an image and get a hash:
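My actual hash function is the one from the Stack Exchange post; as a simple stand-in, an average hash captures the idea:

```
(* Average hash: downsample to 32x32 grayscale and threshold each
   pixel at the mean brightness, giving a 1024-bit fingerprint *)
imageHash[img_] := Flatten @ ImageData @ Binarize[
    ImageResize[ColorConvert[img, "Grayscale"], {32, 32}],
    Method -> "Mean"]
```

Two hashes are then compared with HammingDistance, which counts the number of bits in which they differ.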
Let's compare the hashes of all the portraits in the image above:
The two Machop pictures, although virtually identical, have a Hamming distance of 123. However, the distances to the other pictures are even larger. I have tried this with a dictionary of 16 different Pokémon, and it worked well, indicating that the distance measurement does its job.
Right now you may be thinking: isn't this a job for neural networks? That's not wrong. We could make it easy for ourselves by using Classify to identify the Pokémon, for example. Or, at the very least, we could replace the perceptual hash with the new neural-network-based feature extractor for images that is built into Wolfram Language. But I'll save that for, perhaps, another article. In this article, I stick to more traditional methods.
Finally, we have to actually create the dictionary and a function that picks out the best match:
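In sketch form (names and portraits stand for the assumed lists of card names and known portrait images, and imageHash for the perceptual hash function):

```
(* Dictionary of known hashes, keyed by Pokemon name *)
dictionary = AssociationThread[names, imageHash /@ portraits];

(* Best match: the dictionary entry with the smallest Hamming distance *)
bestMatch[portrait_] := First @ Keys @ TakeSmallest[
    HammingDistance[imageHash[portrait], #] & /@ dictionary, 1]
```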
I will now present a couple of reusable functions encapsulating the things we've done in this post. The final result is a function that takes an image and says which Pokémon are present in it, as well as a function that prints onto each Pokémon card in the image the name of that Pokémon.
Finally, we demonstrate these two functions:
I conceived this article as a follow-up to my previous article, Aruco marker detection. In that article, I used the Douglas-Peucker algorithm together with some other techniques to detect and identify square markers in photos. Douglas-Peucker does not work well for detecting Pokémon cards, and therefore, in this article, we turned instead to the Hough transform. The algorithms in these two articles both have a lot of parameters that have to be fine-tuned. However, we have seen that they can quite easily be made to work on "nice" photos, where the objects are large enough, the background is not too confusing, et cetera. In the future I hope to explore similar problems, or the same ones, using neural networks. Then I will be able to show how they allow us to replace all the heuristics used in this and my previous article with just a dataset of example images.
Appendix: Another take on edge detection
For those interested, I will also describe how I tried to detect the edges using color analysis. I started my analysis of the border color by annotating 36 photos, selecting regions of border pixels in each photo:
We're going to extract the pixels, characterize them in the HSB color space, and find thresholds that we can use to build a border detector. The photos were taken in different lighting conditions so that our characterization of the border is not representative of just one particular lighting. One could also use histogram equalization (HistogramTransform in Wolfram Language) to deal with different lighting conditions, but I will not do that in this article.
The reason for using the HSB color space is that the yellow color occupies a very narrow range in the hue channel. This is what the distributions of the channel values look like for the pixels on the yellow border of a single picture:
In the following, hue is a list of hue values from all the selected pixels in the 36 images. Finding the 5% and 95% quantiles gives us something that we can use as thresholds:
Now, we take our image and extract all pixels that fall in between these thresholds for the hue channel:
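A sketch of this thresholding, assuming hue is the list of border-pixel hue values and img is the photo:

```
(* Thresholds from the annotated border pixels *)
{lo, hi} = Quantile[hue, {0.05, 0.95}];

(* Hue channel of the photo, and a mask of pixels within the thresholds *)
h = First @ ColorSeparate[ColorConvert[img, "HSB"]];
borderMask = Binarize[h, {lo, hi}]
```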
We see here that it does a good job of finding the border, but it also finds some other pixels inside the card. The article by Rabet used this color-based technique in conjunction with the Canny edge detector. He took the binarized image and applied a Gaussian filter (Blur in Wolfram Language) to it, and then multiplied that with the result of the Canny edge detector, which he had also post-processed by convolution with a flat kernel (in Wolfram Language: ImageConvolve with a BoxMatrix kernel). The combined grayscale map indicates which regions both methods agree belong to the border. I mention this idea because I think it's interesting and potentially useful for other computer vision problems; however, in my experiments the added value from using the border color in this way was not enough to justify the added complexity, so I will not use it in the rest of the article. It would also add a number of parameters, such as which quantiles to use and the sizes of the filters, each of which would need to be fine-tuned depending on how the photos were shot, at what resolution, and in what environment.
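For completeness, that combination could be sketched like this, where borderMask is the hue-thresholded image, edges is the Canny result, and the filter sizes are illustrative:

```
(* Soften the color mask, thicken the Canny edges with a normalized
   5x5 box kernel, and multiply: bright regions are where both
   methods agree on the border *)
combined = ImageMultiply[
   Blur[borderMask, 5],
   ImageConvolve[edges, BoxMatrix[2]/25]]
```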