Data Manipulation on Client Machines

It’s my third week of discussing the scalability of the real-time web. So far we’ve only talked about text input, realtime search, trend extraction and so on. I’ve grown to love this instant-reply world; it makes me feel more connected to real people (a sort of egoism). It’s convenient that we deal with text, and as it turns out, none of the realtime providers are working on much more than indexing that “text” for search. But text is only one medium of communication, and there are plenty more: images, audio, video and so on. I guess you’re not really interested in multimedia, because text is cool: you can skim it, select it, process it easily. But the world is not man-made, and I can’t even imagine maps being served as text, for example. Typing is a human’s built-in analog-to-digital converter. Love it or not, nature forces us, rather ungracefully, to talk in multimedia whenever text alone is not efficient enough.

Realtime Data Processing

A realtime environment has to play with the data to make connections and to provide smart search that doesn’t depend only on full-text comparison. Imagine it has to tokenize the text input to post-process what’s going on in each message: that’s roughly O(n) work on the CPU per message. Twitter has about 1 million users; let’s assume on average every user posts a new tweet once every 3 days and the average entry is 100 characters long (it doesn’t sound realistic, but let’s be optimistic). That works out to roughly 333k posts a day; call it 350k. Tokenizing a 100-character input 350k times means about 35 million characters processed per day. I tried tokenizing a 100-character string 350k times and it took only 41 seconds, but since I was using the same string over and over again, caching let the CPU minimize memory I/O, and that made a huge difference, so my 41 seconds are far from accurate. Besides tokenizing there are other operations you have to run, but once the data has been fetched into memory, the expensive part is done. Therefore, I believe tokenizing the input on the server side is not really an extra load.
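For what it’s worth, here is a rough sketch of the kind of benchmark I ran; the naive whitespace tokenizer and the made-up 100-character message are my own stand-ins, not what a real realtime pipeline would use:

```python
import time

def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also handle
    # punctuation, casing, hashtags, mentions and so on.
    return text.split()

# A stand-in 100-character message. Reusing the exact same string for every
# iteration keeps the data hot in the CPU cache, which is exactly why the
# timing above was optimistic.
message = ("just another realtime post with a few words in it " * 3)[:100]

iterations = 350_000  # roughly one day's worth of posts in the estimate above

start = time.perf_counter()
for _ in range(iterations):
    tokenize(message)
elapsed = time.perf_counter() - start

print(f"tokenized {iterations:,} x {len(message)}-char messages "
      f"({iterations * len(message):,} chars) in {elapsed:.2f} s")
```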

But what would you do if you had to post-process terabytes of imagery? I’m not sure if you are aware of Microsoft’s Virtual Earth 3D, but it is more or less Google Earth running in your browser.

A very long time ago, I was giving a demo to my friends and showing Mt. Rainier in WA in 3D mode. Virtual Earth 3D fetches higher-quality imagery for the foreground. Since there is no colour-balance adjustment on VE imagery, many people thought the level differences across the scene were some sort of corruption and didn’t look good. I suggested to the engineers that we could solve this problem with relative histogram equalization (a fancy name for an easy method). I probably didn’t sound perfectly realistic, because our imagery ran to tens of terabytes and it was very risky to process it all for such a tiny improvement.
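To give a flavour of what I meant, here is a minimal sketch of plain histogram matching between two greyscale tiles using only numpy; the tile names and the toy data are made up for illustration, and the real imagery pipeline was nothing this simple:

```python
import numpy as np

def match_histogram(source, reference):
    """Remap source pixel values so their distribution follows the reference tile.

    Both inputs are 2-D uint8 arrays (greyscale tiles); a real pipeline would
    work per colour channel and blend across tile seams.
    """
    src_values, src_counts = np.unique(source.ravel(), return_counts=True)
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Normalized cumulative histograms (CDFs) of both tiles.
    src_cdf = np.cumsum(src_counts).astype(np.float64) / source.size
    ref_cdf = np.cumsum(ref_counts).astype(np.float64) / reference.size

    # For each source grey level, find the reference level with the closest CDF.
    matched = np.interp(src_cdf, ref_cdf, ref_values)
    return matched[np.searchsorted(src_values, source)].astype(np.uint8)

# Two toy tiles: the same scene, one exposed noticeably darker than the other.
rng = np.random.default_rng(0)
reference = rng.integers(80, 200, size=(256, 256), dtype=np.uint8)
dark_tile = (reference * 0.6).astype(np.uint8)

balanced = match_histogram(dark_tile, reference)
print(dark_tile.mean(), reference.mean(), balanced.mean())
```

In practice you would pick a reference histogram per region and match neighbouring imagery levels to it, so the seams between them don’t jump in brightness the way they did in that demo.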