Leif M. Wright
Author, Journalist, Musician, Programmer, Daddy

Leif M. Wright's Blog

Spreading it for Google and Facebook (otherwise known as creating structured data for images on a busy site)

Filed under Facebook, I Fucking Work Hard, I am the Media, PHP, Programming, schemaless database
If you're bored by technical stuff, move on to the next entry.
My news site gets almost 50,000 viewers every day. That means my server, a dedicated Linux machine somewhere in the desert, must serve up a minimum of that many pages every day. But the story doesn't stop there.
Because my site only counts each IP address once per 24-hour period no matter how many times it sees that address, my hit count is way higher than 50k. It's closer to 200k per day, for just that one site (I have other sites that get even more viewers per day, but they don't suffer from the problem I'm describing here).
It's a lot. And the server handles it swimmingly.
Until something big breaks and tons more people hit the site. And it's all Facebook and Google's fault. 
You see, in order to properly index the site, both Facebook and Google want structured data, which I won't get into here except to say EVERYTHING needs to be described, and images even more so; they both want to know where the image that describes the story is, how wide it is, how high it is and what kind of image it is.
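To make "structured data for an image" concrete, here is a minimal sketch of the four facts described above (location, width, height, type) written out as Open Graph meta tags, the format Facebook reads. The post doesn't say which markup the site actually emits, and the post is filed under PHP, so this Python version is only an illustration; the function name and the sample URL are made up.

```python
from html import escape


def image_meta_tags(url, width, height, mime_type):
    """Build Open Graph <meta> tags describing a story's lead image.

    Hypothetical helper: the name and signature are assumptions,
    not anything taken from the site's real code.
    """
    properties = {
        "og:image": url,
        "og:image:width": str(width),
        "og:image:height": str(height),
        "og:image:type": mime_type,
    }
    # Escape the values so a stray quote in a URL can't break the markup.
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}">'
        for name, value in properties.items()
    )


# Made-up sample values, for illustration only.
tags = image_meta_tags("https://example.com/images/lead.jpg", 1200, 630, "image/jpeg")
print(tags)
```

Google's crawler generally looks for schema.org markup instead, but it asks for the same kind of information about the image.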
Because all the images on my stories are HTML links inside the story itself, I can't just point FB and Google to a file somewhere where they can find the image. Instead, I used a nifty bit of HTML searching to identify the first image, create it in memory, analyze it and send the data to Google and Facebook every time they try to access the story. And it works like a charm.
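Roughly, that lookup has two steps: find the first image tag in the story's HTML, then read the image's dimensions. The sketch below shows both in Python using only the standard library. It is an illustration under assumptions, not the site's code: the class and function names are invented, and the size reader only understands PNG files, where the width and height sit at a fixed spot in the header. Other formats need more work or an image library.

```python
import struct
from html.parser import HTMLParser

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class FirstImageFinder(HTMLParser):
    """Remembers the src of the first <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_src(story_html):
    """Return the src of the first image in the story, or None."""
    finder = FirstImageFinder()
    finder.feed(story_html)
    return finder.src


def png_size(data):
    """Return (width, height) read from the start of a PNG file.

    A PNG begins with an 8-byte signature followed by the IHDR chunk,
    whose first two fields are big-endian 32-bit width and height.
    """
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


# Made-up story markup and a hand-built PNG header, for illustration only.
story = '<p>Intro</p><img src="/img/a.png" alt=""><img src="/img/b.png">'
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)

found_src = first_image_src(story)
found_size = png_size(header)
```

Reading just the header is cheap. Decoding the whole image into memory on every request, as the paragraph above describes, is where the cost piles up.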
Until 200,000 people try to see the story at once. All that processing (and analyzing images is pretty overhead-intensive) takes a lot of the computer's memory, and eventually, it runs out of memory and crashes the server. 
That was especially true because it was doing this for all 25 stories on the front page, in addition to whatever single-story page the user was accessing.
So I came up with a multifaceted solution: 
I cached all the stories, so the server doesn't have to access the database 30 times every time someone hits the front page. My site has ads that rotate randomly, so I couldn't just create a static HTML page and call it done. Instead, I had to create a cache system that allows the ads to rotate every time someone accesses the page. To do that, I cached the HTML for each story, saved it to a file and access it whenever needed. That cut down on a lot of server load.
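One way to cache story HTML while still rotating ads is to store the story with a placeholder where the ad goes and fill the placeholder in on every request. The post doesn't describe how its cache does this, so the sketch below is a guess at the general shape, written in Python rather than the site's PHP; the placeholder string, function names and sample ads are all invented.

```python
import random
import tempfile
from pathlib import Path

AD_SLOT = "<!--AD_SLOT-->"  # hypothetical placeholder left in the cached HTML


def cached_story_html(cache_dir, story_id, render):
    """Return the story's HTML, rendering it only if no cache file exists."""
    path = Path(cache_dir) / f"{story_id}.html"
    if not path.exists():
        # The expensive part (database access, image analysis) runs once.
        path.write_text(render(story_id), encoding="utf-8")
    return path.read_text(encoding="utf-8")


def serve_story(cache_dir, story_id, render, ads):
    """Serve the cached story with a freshly chosen ad in the slot."""
    html = cached_story_html(cache_dir, story_id, render)
    return html.replace(AD_SLOT, random.choice(ads))


# Demonstration with a stand-in renderer that records how often it runs.
render_calls = []


def render(story_id):
    render_calls.append(story_id)
    return f"<article>Story {story_id}</article>{AD_SLOT}"


sample_ads = ["<div>Ad A</div>", "<div>Ad B</div>"]
with tempfile.TemporaryDirectory() as cache_dir:
    first = serve_story(cache_dir, 42, render, sample_ads)
    second = serve_story(cache_dir, 42, render, sample_ads)
```

The second request never touches the renderer. It reads one file and does one string replacement.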
I created a schema-less database system, because some stories will have images, some will not, and I didn't want to crawl through all 4,000 previous stories to figure out which was which. "Schemaless" means the database has no idea what data it's looking for when it opens the story and has no idea what structure the other stories have when it goes to write a story. A "schema" is a map of sorts, telling the database what to look for and where. Going without a schema makes a very flexible kind of database, and I think I'll be migrating everything over to it. My site, which formerly worked on XML, and before that on JSON-based flat files, is now operating on what I'm calling the Ineffable Schemaless Database system, which I wrote this morning and have now migrated the entire site over to. 
The advantage of moving to this kind of database is that I can now store image data related to each story inside the database. When Google or Facebook come looking, instead of re-analyzing the images over and over, my system will now feed them the data stored about the images. And if it doesn't find any data, it will send them the logo you see above, with the data stored about it.
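The lookup described here comes down to "use the stored image data if the story has any, otherwise use the logo's." A minimal Python sketch of that idea follows, with plain dictionaries standing in for the database; the field names, sample stories and logo dimensions are invented for illustration and are not taken from the real system.

```python
# Stored data about the site logo, used when a story has no image of its own.
# All values here are made up.
DEFAULT_IMAGE = {
    "url": "/images/logo.png",
    "width": 600,
    "height": 200,
    "type": "image/png",
}

# Stand-in for the story database: records need not share the same fields.
stories = {
    101: {
        "headline": "Storm closes highway",
        "image": {
            "url": "/img/storm.jpg",
            "width": 1200,
            "height": 800,
            "type": "image/jpeg",
        },
    },
    102: {"headline": "Council meets Tuesday"},  # no image data stored
}


def image_for(story_id):
    """Return stored image data for a story, falling back to the logo."""
    return stories.get(story_id, {}).get("image", DEFAULT_IMAGE)
```

Nothing in this lookup opens or analyzes an image file; the measuring happens once, when the data is stored.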
Ultimately, what this all means is big-hit days won't crash the server anymore, and as a bonus, because of the caching, the site loads about twice as fast as it did before.
The advantage to making my database schema-less is that in the future, if I find some other data point I want to add to stories, I can just do it without having to worry about whether older stories have that same data point and then having to rework the entire database to include it. For instance, if I later want to add, oh, I dunno, comments (spoiler, I won't [remind me to tell you how comments have turned the Internet into a place I hate]), I can actually add them without having to change the structure of my database at all.
Anyway, to the none of you who read this far, I'm going to publish the Ineffable Schemaless Database system as soon as I'm sure it's secure enough, and I'll probably make it open source so others can improve on my work.
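As a small illustration of that flexibility, here is a Python sketch in which a new story carries a comments field and an old story simply doesn't, with no migration in between. The records are made-up JSON documents and the field names are assumptions; the real system's storage format isn't described in the post.

```python
import json

# Two made-up story records; only the newer one has a "comments" field.
old_story = json.loads('{"headline": "Old story", "body": "Written years ago."}')
new_story = json.loads(
    '{"headline": "New story", "body": "Written today.", "comments": ["First!"]}'
)


def comments_for(story):
    """Return a story's comments, treating a missing field as no comments."""
    return story.get("comments", [])
```

The reading code supplies the default, so the thousands of older records never have to be rewritten.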