Would You Like to Throw Exceptions from a Constructor?

Throwing an exception from a constructor has always been a debatable topic among my friends after a few joyful beers. It first came to me on an exam, where a question asked me to design a class that allows no more than n instances to be constructed.

Instead of using the most straightforward trick, a private constructor, I decided to throw an exception from the constructor if more than n instances of that particular class already existed. Luckily, I was coding in Java and had the comfort of a managed environment under my belt. Thankfully my answer wasn't graded, but the case became a major concern of mine and I ran several experiments back in those days.

From what I have seen while debugging, Java simply assigns nulls or zeros to all of the fields of the class until the constructor executes successfully. Moreover, right after an exception is thrown from a constructor, the reference you were assigning to is left null and the newly created, invalid instance just waits for the garbage collector to pick it up and remove the useless object from memory permanently. In short, a new instance is created, although the platform tends to murder it quickly and silently.

What about Unmanaged Platforms?

It was nice to see that somebody was taking care of such exceptional situations, but what would you do if you were on your own? Fault tolerance isn't just about catching all exceptions (from Bjarne Stroustrup's slides):

  1. The program must be in a valid state after an exception is caught
  2. Resource leaks must be avoided

This becomes a problem when memory management is entirely up to you. You have to release and clean up resources after an exception has been thrown; otherwise it causes a memory leak and you will have created a zombie object hanging around in memory, attached to the mother thread.

Cleaning Up the Mess of C++

An object is not even an object until its constructor has finished successfully. So you had better release resources inside the constructor before you throw an exception. For this case, I use a double-exception-throwing pattern that I made up and passionately apply in similar situations.

Let’s ask the compiler! Two lines of code are always more expressive than ten lines of words. Below, I’ve written a class called “Foo”, not functional but dangerous, and a StreamInitException is thrown if a problem occurs while longCharStream is being initialized. Right after we allocate 10k bytes for longCharStream, BOOM! Someone pushed the wrong button, somebody shut down the database, or the troublemakers’ spirits are near. Don’t catch it inside the constructor and see what happens: many orphan kids, the ones we once named longCharStream, are lying around and we can’t even access them.

#include <iostream>

struct StreamInitException {
    StreamInitException(const char * m){ message = m; }
    const char * message;
};

class Foo {
    public:
        Foo() : longCharStream(NULL) {
            try {
                this->initCharStream();
            } catch(StreamInitException * ex) {
                this->~Foo(); //release resources
                throw ex; // exception originally thrown by initCharStream
            }
        }

        ~Foo(){
            if(this->longCharStream)
                delete[] this->longCharStream; // matches the new[] in initCharStream
        }

        void initCharStream(){
            // create a long char array of 10k
            this->longCharStream = new char[10000];
            throw new StreamInitException("Error occurred during init.");
        }
    private:
        char * longCharStream;
};

int main(){

    Foo * foo = NULL;

    try{
        foo = new Foo;
    } catch(StreamInitException * ex){
        std::cout << ex->message << std::endl;
        delete ex; // the exception object was allocated with new, so free it here
    }

    return 0;

}

So I handle the object’s init-related bad moments inside the constructor, deallocate memory and so on, and finally re-throw the same exception once I have hold of the control flow. Seems acceptable!

Consequently, I have a final question: do you prefer to throw exceptions from constructors, or do you look for other workarounds such as moving the workload to a separate method called Init()?
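For comparison, here is a minimal sketch of that Init() workaround; the names and the nothrow allocation are just illustrative assumptions. The constructor stays trivial, and the caller checks a return value instead of catching an exception.

#include <iostream>
#include <new>

class Foo {
    public:
        Foo() : longCharStream(NULL) {} // never throws

        ~Foo(){
            delete[] longCharStream; // deleting NULL is safe
        }

        // two-phase initialization: report failure via a return value
        bool init(){
            longCharStream = new (std::nothrow) char[10000];
            return longCharStream != NULL;
        }
    private:
        char * longCharStream;
};

int main(){
    Foo foo;
    if(!foo.init()){
        std::cout << "Error occurred during init." << std::endl;
        return 1;
    }
    return 0;
}

The trade-off is that nothing stops a caller from forgetting to call init(), which is exactly the kind of half-constructed state the constructor approach tries to rule out.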

Domain Name Server Fight is Back!

This is the second time in my 12 years of active Internet adventures that I'm facing this horrible situation. One or a few DNS servers ignore my change requests and keep pointing to the old IP, although two weeks have passed since my edits. Those rebel servers cause half of the resolvers to use the old IP address and keep my old endpoint alive recursively. I'm not sure whether it's a TTL issue with an A record or a hacking attempt via DNS injection. I don't know the answer yet, because I'm not able to access every name server (naturally), although I can traverse my path and know which ones are the evil ones.

Does anybody have a clue what is happening behind the scenes? Straightforward actions like switching to my own name servers may not help with the issue at this point; the new changes might just hit the same deadlock in the same cycle. Is there anybody out there who knows what's happening?

An Introduction to Fundamental Web Crawling Strategies

Generally, any search engine architecture consists of four core elements: a crawler, an indexer, a retrieval engine and a user interface that interacts with end users. In this post, I'll give an introduction to crawlers, crawling strategies and the main challenges search engines face with the growth of the Web.

What is a Web Crawler?

A web crawler is an automated web page collector that builds a cache/base of local copies of the pages it finds. A crawler starts with an initial set of known URLs, also called seed URLs. Then the crawler extracts the links inside the known documents and is responsible for downloading these newly found pages in some order (a minimal sketch of this loop follows the list below). In the crawling field, there are two major crawler types:

  1. Periodic or Snapshot Crawling: The crawler keeps finding new pages until the collection hits a desirable size. Then it periodically runs the same process and replaces the existing collection with the new one. There are typically very large intervals between two crawls. This method doesn't use any knowledge gained from previous crawls.
  2. Incremental Crawling: These crawlers keep searching for new links even after the collection becomes as large as desired. Old pages are revisited on a schedule to keep the collection up to date. It's very hard to crawl a large source (such as the Web) this way: documents need to be refreshed one by one, and new links also have to be explored and folded into the local collection. According to the statistics, the Web's growth is exponential, which means there are even more newly added pages than updates to existing documents. Since we have limited processing power, it becomes critical to decide which page to (re)visit next from the queue.
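To make the mechanics concrete, here is a minimal single-threaded sketch of that download loop in C++. fetchPage() and extractLinks() are hypothetical stand-ins (a real crawler would plug in an HTTP client and an HTML parser), and the seed URL is just an example.

#include <iostream>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for a real HTTP client and HTML parser.
std::string fetchPage(const std::string &url){ return ""; }
std::vector<std::string> extractLinks(const std::string &html){ return std::vector<std::string>(); }

void crawl(const std::vector<std::string> &seeds, size_t limit){
    std::queue<std::string> frontier;   // URLs waiting to be downloaded
    std::set<std::string> seen;         // URLs already queued or visited

    for(size_t i = 0; i < seeds.size(); ++i){
        frontier.push(seeds[i]);
        seen.insert(seeds[i]);
    }

    size_t downloaded = 0;
    while(!frontier.empty() && downloaded < limit){
        std::string url = frontier.front();
        frontier.pop();

        std::cout << "fetching " << url << std::endl;
        std::string html = fetchPage(url); // a real crawler stores a local copy here
        ++downloaded;

        std::vector<std::string> links = extractLinks(html);
        for(size_t i = 0; i < links.size(); ++i){
            if(seen.insert(links[i]).second) // enqueue only unseen URLs
                frontier.push(links[i]);
        }
    }
}

int main(){
    std::vector<std::string> seeds;
    seeds.push_back("http://example.com/");
    crawl(seeds, 100);
    return 0;
}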

The figure above shows that it is more practical to choose snapshot strategies for large-scale search engines. On the other hand, if documents are guaranteed to change rapidly in a small-scale corpus, it's more effective to use an incremental crawler.

Scheduling the Download Queue

Although I stated that snapshot crawling is the likelier choice for large document collections, the large gaps between updates make it an absolutely impractical solution for Web search engines while competitors such as Google pick up even the least significant change within hours. Incremental crawling looks (and actually is) very costly if we cannot predict when a page is updated or removed, and re-downloading the whole Web at short intervals is also impossible. But what if we can guess how often a page changes? What if we can prioritize each URL and schedule our download queue?
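As a rough illustration of that last question, here is a small sketch of a download queue ordered by when each URL is next worth visiting; the intervals and the example URLs are made-up placeholders rather than any real engine's policy.

#include <iostream>
#include <queue>
#include <string>
#include <vector>

// A URL plus the next time (in seconds from now) we think it is worth revisiting.
struct ScheduledUrl {
    std::string url;
    double nextVisit; // estimated from how often the page changed in the past
};

// Order the priority queue so the smallest nextVisit comes out first.
struct LaterFirst {
    bool operator()(const ScheduledUrl &a, const ScheduledUrl &b) const {
        return a.nextVisit > b.nextVisit;
    }
};

int main(){
    std::priority_queue<ScheduledUrl, std::vector<ScheduledUrl>, LaterFirst> queue;

    // Made-up estimates: a news front page changes often, an archive page rarely.
    ScheduledUrl a = { "http://example.com/news", 600.0 };            // ~10 minutes
    ScheduledUrl b = { "http://example.com/archive/2007", 604800.0 }; // ~1 week
    ScheduledUrl c = { "http://example.com/blog", 86400.0 };          // ~1 day
    queue.push(a); queue.push(b); queue.push(c);

    // The crawler always downloads the URL that is due soonest.
    while(!queue.empty()){
        ScheduledUrl next = queue.top();
        queue.pop();
        std::cout << "fetch " << next.url << " (due in " << next.nextVisit << "s)" << std::endl;
    }
    return 0;
}

In a real scheduler, nextVisit would be re-estimated after every download based on whether the page actually changed.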

Functional Programming for Beginners

Recently, I’ve been getting many questions about functional programming. Instead of answering everybody one by one, I decided to write a blog post about it. In this article, I’ll try to introduce you to the FP concept. If you are interested, I advise you to get some hands-on experience; there are many widely used functional languages available today: LISP, Haskell, Erlang and F# (new but promising), to name a few.

Firstly, a Brief History…

A long time ago, in the 1930s, when the world was stuck in another economic recession, the lives of four extraordinary mathematicians crossed in Princeton, NJ. These men were not interested in the physical world they were living in, but were trying to create their own universe to find answers about the limits of computation – a phrase not yet heard by many. The area they were interested in was called formal systems, and their main problem was to answer which problems are solvable if processing power and memory were infinite. One of them was a true materialist, a slave of questioning and curiosity, a British guy who decided to move to the new world after graduating from Trinity College. The second was a super brain whose Ph.D. dissertation was accepted when he was just 23 years old, nicknamed “Mr. Why”, a close friend of Albert Einstein. The other two were recent Princeton graduates who decided to go on to graduate school. Respectively, these men were Alan Turing, Kurt Gödel, Alonzo Church and Stephen Kleene. In 1936, Turing extended Gödel’s study on the limits of proof and computation, replacing Gödel’s universal arithmetic-based formal language with formal devices called Turing machines. At the same time, the two young grad students Church and Kleene were designing a universal model of computation that was identical to Turing machines in power. Their formal system was called lambda calculus. Let’s say it in a clearer and less scientific-BS way: they invented a language, lambda calculus, that was capable of being the smallest universal programming language of the whole world.

Lambda Calculus

Lambda calculus is the common programming language of the world. The main aim of its inventors was to prove that any computable function can be expressed and evaluated using this formalization. In the universe of lambda calculus, the key elements are <name>, <expression> and <application>, where:


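Roughly, following the notation of Rojas’ tutorial cited below, these elements relate as follows (a sketch rather than the full formal definition):

  <name>        := an identifier, e.g. x, y, z
  <expression>  := <name> | λ<name>.<expression> | <application>
  <application> := <expression> <expression>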
A <name> in lambda calculus cannot be associated with different values over time, which is why it is not called a “variable.” Imagine that your favourite imperative programming language didn’t let you change the values of variables by default. Yes, it sounds like a headache at first, but the whole concept stands on these rules. Now, let’s move on to a more practical example: a function that multiplies its input by 2.


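As a sketch, writing ordinary arithmetic on the right-hand side as shorthand (pure lambda calculus would encode the numbers and multiplication themselves as lambda terms), the doubling function and one application of it look like this:

  double := λx. (2 * x)
  (λx. (2 * x)) 5  →  2 * 5  →  10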
For great examples, I suggest you read “A Tutorial Introduction to the Lambda Calculus” by Rojas.

Continuous Scrolling from a User POV

It was a few months ago when I first saw the continuous scrolling template in Yahoo’s UI Design Patterns library. It is also called infinite scrolling or infinite page, but no matter how fancy the name, the rating in the library alone explained what a user-hostile navigation method it is. At that moment, the page gave me the instant feeling that some random guy had posted a pattern without sense or taste, because that behaviour was absolutely unexpected and against common user experience practices. Unfortunately, I was wrong! A few weeks ago, I came across a Live Search feature showcase on Channel 9 discussing the new infinite scrolling on image search. Wow! Totally without prejudice, I was curious and wanted to experience it. (Comforting myself with an in-mind quote: anyway, who’s using Live Search? Give the guys the freedom to try things.)

I searched for “Jef Raskin”. The result set contained more than 40 pictures, I suppose, so the initial screen was already very crowded, and before I hit the bottom another 40 or so loaded. Everything was starting to get a little confusing and the page looked messy. There were no layers or separators in between, which made it harder to scroll back to an image I had previously favoured. After 5-6 loads, it became impossible to navigate between the older images and the newly loaded sets. At least the human brain is quite sensitive to image data; hashtag feeds use the same approach, and with free-form text it is absolutely scarier.

Additionally, how am I going to point friends to the right page? What about sharing?

Me: Hey, look, your old pictures are popping up under a totally irrelevant topic. Type in [phrase here], scroll down 3 times, then scroll up approximately 200 pixels, stop, no sorry, scroll down a little bit, scan the heap. There you are. (Instead of just sending the link to page 4.)
My friend: Okeeeeeyyyyyyy.

Finally, I’m coming to the point. If you can’t find anything new to improve the navigation experience, it doesn’t mean you have to mess it up like this. I have seen 4-5 other examples lately; is this becoming the new trend? NO!! I want the good old paging back :’) What’s your feedback?

Edit: I noticed the 8px note above that informs you about the current range; what a tricky improvement, huh? Also, images stopped coming after hitting 1000. Did the guys think nobody would insist on scrolling down that far?

BigTable Concept: Why Do the World’s Smartest People Ignore Relational DBs?

In the era of the Internet, the key problem is scalability. As the cloud’s popularity climbs, we are hearing more about its constraints. So far, I have only had time to play with Google’s App Engine and Microsoft’s Azure Services Platform. Cloud developers are mostly shocked by the new non-relational databases that cloud services offer as the only alternative. Google calls it BigTable, and Microsoft finds a new place in its own terminology dictionary for BLOB. Many start to wonder what the hype about relational databases over the past 30 years was about. Foremost, let’s be clear that this is not a replacement, but a more efficient way to store data by eliminating the not-that-fundamental, over-engineered functionality layers of current relational database management systems. Yes, good news for people who make a living designing super-large and highly normalized databases to ensure data integrity.

In a relational database, everything is under control; you can add constraints to ensure nobody will be able to enter a duplicate row, or on deletion you can program the DBMS to take care of useless orphan rows. Best of all, a relational DBMS will pre-process your SQL query before executing it to avoid silly performance mistakes you might make. Now think of the environment: constraints over constraints, query execution strategies, a high level of interdependence and complex indexing methods. This package works great unless you want to distribute the tables across different machines. Can you imagine joining two tables that are distributed over 100,000 nodes? For Google, this is an everyday problem (or better, call it an every-millisecond issue). Luckily, Google’s data has its own characteristics; according to Jeffrey Dean, they are able to handle at the application level the constraints a DBMS would normally enforce while processing data. Consequently, Google keeps data in a very basic form, as <key, value, timestamp> tuples.

BigTable looks like a very large B+ tree with three levels of hierarchy. All tables are sorted, and those tables are split into pieces called tablets. The first two levels are metadata tables that locate the right tablet for you. The root tablet is not distributed, but with the help of prefetching and aggressive caching it is not actually the bottleneck of the system. The final-level tablets point to physical files (managed by the Google File System). GFS keeps 3 copies of each file in the system, so even if a machine goes down, there are still 2 other copies somewhere else. In the second figure, a row of a tablet is illustrated: com.cnn.www is the key in this case, and the value has three different columns: contents, anchor:cnnsi.com and anchor:my.look.ca. Notice the timestamps; these fields may contain more than one version of an entry. As the Google crawler finds updated content on www.cnn.com, a new layer is added. This is what leads BigTable to provide a three-dimensional presentation of data.
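To make that three-dimensional idea concrete, here is a minimal sketch of the <key, value, timestamp> shape using nested standard-library maps; the column names mirror the CNN example above, but everything else (the type names, the in-memory layout, the timestamps) is just an illustration, not how BigTable is actually implemented.

#include <functional>
#include <iostream>
#include <map>
#include <string>

// row key -> column name -> timestamp -> value
// (timestamps sorted descending so begin() is the newest version)
typedef std::map<long, std::string, std::greater<long> > VersionMap;
typedef std::map<std::string, VersionMap> ColumnMap;
typedef std::map<std::string, ColumnMap> Table;

int main(){
    Table table;

    // Two crawls of the same page add two timestamped versions of "contents".
    table["com.cnn.www"]["contents"][1180000000L] = "<html>old snapshot</html>";
    table["com.cnn.www"]["contents"][1180500000L] = "<html>new snapshot</html>";
    table["com.cnn.www"]["anchor:cnnsi.com"][1180000000L] = "CNN";
    table["com.cnn.www"]["anchor:my.look.ca"][1180000000L] = "CNN.com";

    // Reading the latest version of a cell: row key + column + newest timestamp.
    const VersionMap &versions = table["com.cnn.www"]["contents"];
    std::cout << "latest contents: " << versions.begin()->second << std::endl;

    return 0;
}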

At the end of the day, BigTable is not rocket science. It is compact, easy to adopt and very straightforward. Many friends know that I came up with a very similar concept while designing Rootapi two years ago, back when I hadn’t yet heard of BigTable. Additionally, I was saving values as JSON (an equality operation was enough for querying) in blocks that were multiples of the sector size of my physical hard drives. IO operations were super fast, the JSON-based web services were super fast and it was highly distributable, although I never found a great environment in which to explore the severe cases deeply.

As we move onto the cloud, this is the way we are going to look at data storage. If you need more technical details, I highly recommend taking a look at the following references:

  1. BigTable: A Distributed Structured Storage System
  2. Bigtable: A Distributed Storage System for Structured Data – the original publication on BigTable, which appeared at OSDI ’06.
  3. Google File System – an introduction to GFS by Aaron Kimball.