Why do Code Reviews Matter?

We build software together. Team sizes vary a lot, but it's usually not 2 or 3. Team members leave, new members join, and at the end of the day codebases are shared among a large number of different and diverse people across development, testing and deployment.
Code review is the process you go through every time you modify the code itself. Even if you change only a single line, a peer reviews it and confirms it can be submitted before you commit it to the repository. Reviews help decrease the number of bugs, vulnerable code paths, violations of coding standards, and so on. In an environment with no review practice, the actual code is not directly visible to other developers until another member opens the file(s) for editing. Without reviews, even if the build passes QA tests,

  1. Readability of the code,
  2. Compatibility with coding standards,
  3. Organization of the codebase,
  4. Documentation inside the code /* comments */,

will stay surprises. These can lead to a dirty, badly organized codebase after a few years of continuous development. Asking a peer before you integrate new code into your product is the better approach, since you will most probably have more time to fix, reshape and enhance your code if you act early. Usually only one other team member will have time to review your code, but reviewing is ideally done by two people. The first should be a master: someone who knows the existing system well, sees the big picture and can predict the impact of your changes across the codebase. The other peer should ideally be someone who is not very familiar with the code, so you can test how easy it is to get into your code if you ever have to assign a bug to a new member, how well it fits established software patterns, and how simple and obvious your solution is.

Consequently, whatever you are working on is shared with other people. Asking for a review is always better than not asking and keeping your contribution a secret until it needs to be changed.

Let's modify our representation of addresses in the adr microformat

Microformats define a representation spec for addresses, called adr. This year, I made two distinct proposals to modify the current draft, but I was turned down each time. In this post, I'm going to lay out the current problems and how tiny enhancements could open new horizons in the retrieval of location-based information.

The problem

The current spec does not serve as a latitude/longitude carrier. Its properties only include post-office-box, extended-address, street-address, locality, region, postal-code and country-name, which are text fields that form an address. This schema was defined in vCard and migrated to the hCard microformat in 2004. Then the need for address-based extraction led to copying this format and calling it adr. vCard's final design predates online maps. Nowadays, we have addresses all over the Web. Automatically linking these text addresses to locations on maps, or providing a preview on hover, would be the first basic attempts to improve our data representations. Unfortunately, maps speak mathematics more than text addresses. In practice, there is a process that takes an address, transforms it into a latitude/longitude pair and pans the map to that location. This process is called geocoding, and it is far from perfect at today's scale. Instead of depending on a geocoder to transform addresses into mathematical locations, I suggest that microformats enable built-in (lat, long) arrays in adr.

Extending adr with a set of latitudes and longitudes

What I'm going for is to extend adr with an optional list of (lat, long) values, so that when coordinates are given we can move directly to the location instead of asking a geocoder to land us there. But why use a list of coordinates and not a single point? Because in the spatial domain, different geographic features are represented by different shapes. Examples are below.

  1. If you are talking about a city centre, it’s most likely to be a Point.
  2. The Mississippi River is a long, long Line.
  3. And a university campus is obviously a Polygon.

In the image above, the British Museum is represented by 12 latitude/longitude pairs forming the area these points enclose. Alternatively, it could be represented by the museum's centre point, (51.529038, -0.128403). Formally speaking, the museum is located at "British Museum, Great Russell Street, London, WC1B 3DG, UK", and this translates to the coordinate I gave. What about using them together to form:

<div class="adr">

<div class="street-address">Great Russell Street</div>
<span class="locality">London</span>,
<span class="postal-code">WC1B 3DG</span>,
<div class="country-name">UK</div>

<div class="geo"> <!-- optional coordinates attribute from geo -->
<span class="latitude">51.529038</span>,
<span class="longitude">-0.128403</span>
</div>

</div>

In the example above, I've used geo to optionally include the single point (lat, long) that maps the address to a physical location. More useful structures could be defined within the standard to allow multiple point entries that form polygons, such as the 12-point representation of the British Museum in the image above. Or, more simply, multiple geo entries inside an adr might work.
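
For illustration only (none of this is part of the accepted draft, and the coordinates are rough), a polygon could be expressed as an ordered list of geo entries inside the adr:

<div class="adr">

<div class="street-address">Great Russell Street</div>
<span class="locality">London</span>,
<div class="country-name">UK</div>

<!-- hypothetical: one geo entry per vertex of the boundary, in order -->
<div class="geo"><span class="latitude">51.5195</span>, <span class="longitude">-0.1291</span></div>
<div class="geo"><span class="latitude">51.5198</span>, <span class="longitude">-0.1269</span></div>
<div class="geo"><span class="latitude">51.5186</span>, <span class="longitude">-0.1262</span></div>
<!-- ... remaining vertices of the polygon ... -->

</div>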

TODO: Write about the impact this usage can bring.

Testing for Accuracy and Precision

Software testing has no boundaries at all. The discipline is so varied that systematic approaches are uncommon, given the range of material and the shifting tradeoffs. A few weeks ago, I came across a decent software testing article by a Microsoft engineer, published on Live Spaces. Unfortunately, it was followed by two spam comments; it was ironic to see such an assertive article ruined by a couple of run-of-the-mill Russian spammers.

I love machine learning and classification. My whole life is spent between two parameters: accuracy and precision. These are the common statistical measures of how successful your system is. If you have a search engine, accuracy may tell you what percentage of the retrieved documents are really relevant, and precision how well your results cover all the relevant documents available.
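
Just to make the two ratios concrete, here is a tiny sketch of how they could be computed for a search engine, given the set of retrieved documents and a hand-labelled set of relevant ones (in the information-retrieval literature these ratios usually go by precision and recall; the document ids are made up):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

// relevantShare: fraction of retrieved documents that are relevant.
// coverage: fraction of all relevant documents that were retrieved.
struct Quality { double relevantShare; double coverage; };

Quality evaluate(const std::set<std::string>& retrieved,
                 const std::set<std::string>& relevant) {
    std::set<std::string> hits;
    std::set_intersection(retrieved.begin(), retrieved.end(),
                          relevant.begin(), relevant.end(),
                          std::inserter(hits, hits.begin()));
    return { retrieved.empty() ? 0.0 : double(hits.size()) / retrieved.size(),
             relevant.empty()  ? 0.0 : double(hits.size()) / relevant.size() };
}

int main() {
    // 3 of the 4 retrieved documents are relevant, but only 3 of the
    // 6 relevant documents were found.
    Quality q = evaluate({"d1", "d2", "d3", "d9"},
                         {"d1", "d2", "d3", "d4", "d5", "d6"});
    std::cout << q.relevantShare << " " << q.coverage << "\n"; // 0.75 0.5
    return 0;
}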

Surprisingly, a few days ago I was asked to break a machine learning system during a job interview, coming up with possible failure cases. According to my own philosophy, accuracy and precision are part of the system requirements; they relate to the quality of the overall product. But how are you going to collect the information to come up with these numbers? Imagine you are working on a search engine. Is it manageable to find n people and ask them manually whether they like the results or not? Will your sample of n people reflect your user base? How costly will it be, and how objective? Is it really scalable? Is it even possible for a human to read all of the documents on the Web and decide which are really related to a search phrase? These are a few introductory-level problems in the analysis of accuracy and precision.

Post-Processing and the Importance of Feedback

It may not be critical for you to release a product with a target accuracy and precision; the consumer market suits this model best. But that alone should not be read as "quality tracking is inessential". I am simply advising you to track quality after the release (similar to the ship-then-test method). Detect which results are exit links, provide instant feedback tools so users can refine their results, and so on. Use the acquired feedback to improve the existing system. Testing does not end with the release: you may need to keep analyzing whether your product performs well, report back to your development team, and push them toward scalable, user-oriented improvements.

Addresses Not Found in High Traffic

My sister found herself a new downloading hobby, and I was not planning to be the hobby killer until everything became inaccessible for both of us. She has been downloading heavily lately; I'm not sure what the material is, but it's a heavy load. Pages were loading more slowly on my side, as expected, and I'm not claiming to have a wide pipe, but the overall bottleneck was not just slower uploads and downloads.

UDP 53, what’s wrong there?

I started to recognize a pattern. My downloads were even slower because name resolution was failing miserably every time I tried; I was not even able to resolve domain names to IP addresses, so I had to check what might be causing the problem. As a quick reminder: if your local DNS cache (managed by the operating system) doesn't have a record for the domain name you're trying to visit, you send a request to a nearby DNS server to return the associated IP. If that server doesn't have the record either, it asks the root servers, and so on. Most of my readers know the story well. This communication happens over UDP port 53. UDP is a connectionless way to transmit data: unlike TCP, you don't spend time on a three-way handshake to establish a connection both sides are aware of, but if your packets get lost, nobody is responsible. It's a gamble, with tradeoffs like every other engineering decision.
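
As a quick diagnostic sketch (the hostnames are just examples), this is one way to check from code whether name resolution, rather than bandwidth, is the bottleneck: time getaddrinfo() on a POSIX system.

#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <chrono>
#include <iostream>

int main() {
    // Hypothetical hostnames to probe; any names you visit often will do.
    const char* hosts[] = {"example.com", "example.org", "example.net"};

    for (const char* host : hosts) {
        addrinfo hints{};               // zero-initialized
        hints.ai_family = AF_UNSPEC;    // IPv4 or IPv6
        hints.ai_socktype = SOCK_STREAM;

        addrinfo* result = nullptr;
        auto start = std::chrono::steady_clock::now();
        int err = getaddrinfo(host, nullptr, &hints, &result);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();

        if (err != 0) {
            std::cout << host << ": failed (" << gai_strerror(err) << ") after " << ms << " ms\n";
        } else {
            std::cout << host << ": resolved in " << ms << " ms\n";
            freeaddrinfo(result);
        }
    }
    return 0;
}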

I gently asked my sister to pause for a while, and UDP answers started coming back without timing out; the resolution problem was fixed. But I had to convince myself that UDP was the best possible choice here. I realized the essential parameter is latency: we have to be as fast as possible. I wanted to step back and understand why it was designed this way, and the solution to my problem appeared within milliseconds.

Why does DNS use UDP?

Reliability versus speed. Remember the rule: if you don't have the address, ask a nearby name server. Isn't that implicitly saying "don't go too far"? Probably it is. If you're not on a very reliable connection and your traffic load is very high, there will be congestion, long delays and large jitter. My DNS requests most probably couldn't even make it to the name server. And since my ISP's name servers are not reliable, I was using OpenDNS. Translation: I was far, far away from the source.

I fixed the issue. Even with the crazy downloading back on, my domains are resolving quickly at the moment, and I'm extremely happy. If you're using OpenDNS at an office or on a LAN with more than 20 clients, do yourself a favor and set up a local caching name server today.
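
As a sketch of what that could look like (not a full guide, and the addresses are illustrative), a small caching forwarder such as dnsmasq needs only a few lines of configuration, with the OpenDNS resolvers kept as upstreams:

# /etc/dnsmasq.conf -- minimal caching forwarder
# Listen on the LAN interface that clients will query.
listen-address=192.168.1.1
# Don't forward plain names or reverse lookups for private ranges upstream.
domain-needed
bogus-priv
# Cache answers locally so repeated lookups never leave the LAN.
cache-size=10000
# Forward cache misses to the OpenDNS resolvers.
server=208.67.222.222
server=208.67.220.220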

Data Manipulation on Client Machines

It's my third week of discussing the scalability of the real-time web, and we've only been talking about text input, realtime search, trend extraction and so on. I've grown fond of this instant replier; it makes me feel more connected to real people (a kind of egoism). It's convenient because we have text: as it turns out, none of the realtime providers are doing much more than indexing the "text" for search. But text is only one medium of communication, and there are many more: images, audio, video and so on. I guess you're not really interested in multimedia, because text is cool: you can skim it, select it and process it easily. But the world is not man-made, and I can't even imagine maps being served as text, for example. Typing is the human's built-in analog-to-digital converter. Like it or not, nature ungracefully forces us to talk in multimedia when text alone is not efficient enough.

Realtime Data Processing

Realtime environments have to play with the data to make connections and provide smart searching that doesn't depend only on full-text comparison. Imagine they have to tokenize the text input to post-process what's going on in each message; that is roughly O(n) in the length of the input on the CPU. Twitter has about 1 million users. Let's assume, on average, every user posts a new tweet once every 3 days and the average entry is 100 characters long (not very realistic, but let's be optimistic). That comes to about 350k posts a day, and tokenizing 350k inputs of 100 characters means about 35 million characters processed per day. I tried tokenizing a 100-character string 350k times and it took only 41 seconds, partly because I was using the same string over and over again: with the help of caching, the CPU minimizes memory I/O, which makes a huge difference, so my 41 seconds are far from representative. There are also other operations to run besides tokenizing, but once the string has been fetched from memory the rest is cheap. So I don't believe tokenizing the input on the server side is a real extra load.
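
For reference, here is a rough re-creation of that experiment (the tweet text and the 350k figure are just the assumptions above, the timing will vary wildly by machine, and reusing one string keeps everything cache-hot, which is exactly why the number flatters the real workload):

#include <chrono>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace, roughly what a naive tweet tokenizer would do.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) tokens.push_back(token);
    return tokens;
}

int main() {
    // A made-up tweet of roughly 100 characters.
    const std::string tweet = "the quick brown fox jumps over the lazy dog while "
                              "the realtime web keeps asking what is going on";
    const int posts = 350000;

    auto start = std::chrono::steady_clock::now();
    std::size_t totalTokens = 0;
    for (int i = 0; i < posts; ++i) totalTokens += tokenize(tweet).size();
    double seconds = std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - start).count();

    std::cout << totalTokens << " tokens in " << seconds << " s\n";
    return 0;
}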

But what would you do if you had to post-process terabytes of imagery? I'm not sure whether you are aware of Microsoft's Virtual Earth 3D, but it is something like Google Earth running in your browser.

A very long time ago, I was giving a demo to my friends and showing Mt. Rainier in Washington in 3D mode. Virtual Earth 3D fetches higher-quality imagery for the foreground. Since there is no colour balance adjustment in VE imagery, many people thought the level differences in the scene were some sort of corruption and didn't look good. I suggested to the engineers that we could solve this problem with relative histogram equalization (a fancy name for an easy method). It didn't sound perfectly realistic, because our imagery was tens of terabytes and it was very risky to process it all for such a tiny improvement.
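
To show why I call it an easy method, here is a minimal sketch of plain histogram equalization for a single 8-bit channel; the "relative" variant would match each tile's histogram against a neighbouring reference tile rather than flattening it, but the per-pixel work stays this small:

#include <array>
#include <cstdint>
#include <vector>

// Equalize one 8-bit channel in place: build the histogram, turn it into a
// cumulative distribution, and remap every pixel through that distribution.
void equalizeChannel(std::vector<std::uint8_t>& pixels) {
    if (pixels.empty()) return;

    std::array<std::size_t, 256> histogram{};
    for (std::uint8_t p : pixels) ++histogram[p];

    std::array<std::size_t, 256> cdf{};
    std::size_t running = 0;
    for (int v = 0; v < 256; ++v) { running += histogram[v]; cdf[v] = running; }

    // Smallest non-zero CDF value, used to stretch the output to the 0..255 range.
    std::size_t cdfMin = 0;
    for (int v = 0; v < 256; ++v) if (cdf[v] != 0) { cdfMin = cdf[v]; break; }
    if (cdfMin == pixels.size()) return; // flat image, nothing to equalize

    const double scale = 255.0 / double(pixels.size() - cdfMin);
    std::array<std::uint8_t, 256> lookup{};
    for (int v = 0; v < 256; ++v)
        lookup[v] = cdf[v] == 0
                        ? 0
                        : static_cast<std::uint8_t>((cdf[v] - cdfMin) * scale + 0.5);

    for (std::uint8_t& p : pixels) p = lookup[p];
}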

Why Do I See Software Development Job Titles as Nothing but Virtual Positions?

There are so many different areas and people involved with software that the phrase "software development" alone doesn't tell you much about what the real job is. I'd claim there are stereotypes in this field: code monkeys, do-every-job developers, developers specialized in one area, revolutionary brains, problem solvers and so on. They all share some characteristics, some behaviours, a little common knowledge, but often the only mutual attribute is the title. We all know they can code, yet they each practise coding in entirely different ways.

I started to code with a basic function for finding the roots of a second-order equation, and quickly understood how programming can reduce the time I spend on trivial calculations. The Internet was becoming widely available (at least in the community I was living in); there was a technology boom, and access to the "new" was increasing wildly. I decided to learn as much as I could, and in the end I became a giant language-and-framework addict with no deep knowledge of anything. It kept driving me, though, because whatever technology I was into at the time helped me turn my fancy ideas into concrete products. So, doing it all for the ideas rather than for the technology itself doesn't make me a software developer now? How dramatic.

Being a software developer with no purpose

It took me far too long to become aware of the ecosystem of a software-dominated world without a formal education. Before I had reasons to learn, I had never dreamt of being a professional software developer, because on its own it was meaningless, especially when the software you're producing has no purpose other than making profit. Nobody dreams of being the greatest developer in this or that field. People like me usually want to be remembered for

  1. a groundbreaking product, a method that changes the concepts, a more efficient way of doing things.
  2. being a leader in a specific area who influences other people.

They put meaning into every single duty they take on. I personally don't look at programming jobs as mere jobs. They are a great opportunity to use a company's tools and existing audience to make a difference faster than you could by founding a new start-up, earning respect and dealing with the financial side yourself. And when you work with purpose, you obviously always have the "if you're not hiring me, I'll be doing the same thing on my own" advantage under your belt.

Using Paper and Pencil Before Coding

I can't remember ever starting to code, even a single tiny function, without sketching it on paper first. Do I write code on paper? Yes, I do that too. I adore coding on whiteboards, where many people can gather over a coffee break and discuss. Many young developers ignore this single rule of software development: to write decent code, you have to spend more time thinking, criticising and brainstorming, determining what the problem is before getting into the solution, defining the constraints, becoming aware of the exceptions, and so on.

Early Starting or Deep Thinking?

I wish the first idea that comes to mind were the perfect one. But usually it is not even close to being a good solution; it is often greedy and very straightforward. The more you learn about the problem and its characteristics, the better and more efficient the solution that pops up. If you start coding in the first hour, you will probably never finish the task. I'd like to share a case study from Jon Bentley's amazing book, Programming Pearls.

The story is about Andrew Appel and his experience reducing the run time of a system from one year to one day. He had to simulate the motions of N objects in 3D space, given their masses and initial positions and velocities, essentially to predict the positions of planets, stars and galaxies 10, 100 or 1000 years later. In 2D the input might look like the image on the left: a large set of vectors.

A regular simulation program divides time into small steps and computes the progress of each object at each step. Because every object gravitationally attracts every other object, the computational cost of each step is proportional to N². Appel estimated that 1000 time steps of such a system with N = 10000 objects would require more than one year on a VAX-11/780, or a day on a Cray-1. He had to develop something far more efficient than a one-year-running turtle. After several speedups, he reduced the runtime to less than a day on the VAX-11/780, almost 400 times faster than the initial setup.
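
The N² cost is easy to see in code. Below is a minimal sketch of the naive all-pairs step (this is not Appel's tree-based method, which clusters distant objects to cut down the pairwise work; units and constants are arbitrary):

#include <cmath>
#include <vector>

struct Body {
    double mass;
    double x, y, z;    // position
    double vx, vy, vz; // velocity
};

// One naive time step: every body feels every other body, so the nested loops
// make the cost of a single step proportional to N * N.
void step(std::vector<Body>& bodies, double dt) {
    const double G = 6.674e-11;    // gravitational constant
    const double softening = 1e-9; // avoids division by zero for very close pairs

    for (std::size_t i = 0; i < bodies.size(); ++i) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            const double dx = bodies[j].x - bodies[i].x;
            const double dy = bodies[j].y - bodies[i].y;
            const double dz = bodies[j].z - bodies[i].z;
            const double distSq = dx * dx + dy * dy + dz * dz + softening;
            const double invDist = 1.0 / std::sqrt(distSq);
            const double accel = G * bodies[j].mass / distSq; // acceleration magnitude
            ax += accel * dx * invDist;
            ay += accel * dy * invDist;
            az += accel * dz * invDist;
        }
        bodies[i].vx += ax * dt;
        bodies[i].vy += ay * dt;
        bodies[i].vz += az * dt;
    }
    for (Body& b : bodies) {
        b.x += b.vx * dt;
        b.y += b.vy * dt;
        b.z += b.vz * dt;
    }
}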

JPEG to Compress Vector Graphics?

Compression, entropy and information theory have something in common with economics: nobody is going to turn their head unless you stop talking maths. I truly understand these topics are a bit boring and complex for your grandmother, but surprisingly, most technical people don't bother to understand the underlying concepts either. In this post, I'm going to walk through the following topics in everyday language to sketch the overall picture in your mind:

  1. What is image compression and why do we use it?
  2. A brief overview of JPEG compression.
  3. A review of Google Maps and Live Search Maps for serving images as the primary content.

Introduction to Image Compression

Even if you understand the term "compression", that doesn't mean we are done with the definition. Hold on.

Image compression, the art and science of reducing the amount of data required to represent an image, is one of the most useful and commercially successful technologies in the field of digital image processing. (Digital Image Processing 3rd Ed., Gonzalez & Woods, page 525)

Let's first try to understand how compression became one of the most commercially successful fields in image processing. With the irrepressible popularity of television and the Internet (after the mid-90s), images and videos became significant carriers of information. Without compression, a standard colour TV broadcast at 640×480 with a refresh rate of 30 frames per second requires 27,648,000 bytes per second (640 × 480 pixels × 3 bytes of colour × 30 frames). Even with tomorrow's technology, dedicating a sustained connection of almost 30 MB/s to a single TV cast does not seem feasible. It's no surprise that there were plenty of failed predictions from the early 1900s claiming television would never find a place on the market.

Data versus Information

When we talk about compression, we mean the compression of data. Data is transferred to carry information, so we may be able to reduce the amount of data used to represent a given quantity of information. Imagine a parrot in a very busy downtown barber shop that loves to say "hello" to every new customer who walks in. How would you transmit the words it speaks, as text, most efficiently?

[Figure: how often each of the parrot's words is heard, with "hello" by far the most frequent]

Statistically, "hello" is the most common word, so representing it with a single bit instead of transferring 5 characters (5 × 8 = 40 bits) is a perfectly acceptable move. In the example above, the statistics clearly say "stranger" is the second most likely word to hear from the parrot, and so on. Converting (mapping) the stream of words into a bit stream saves about 94% of the bandwidth in this case. Huffman coding, which the method above should remind you of, guarantees the minimum possible number of bits for a symbol-by-symbol code if you have the statistics of the data.
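
As a rough sketch of the idea (the parrot's vocabulary and the counts are made up), classic Huffman coding builds the code directly from those frequencies, handing the shortest codeword to "hello":

#include <iostream>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <vector>

// A node of the Huffman tree: a leaf holding a word, or an internal node
// whose weight is the sum of its children's weights.
struct Node {
    std::string word;
    long weight;
    std::shared_ptr<Node> left, right;
};

struct HeavierFirst {
    bool operator()(const std::shared_ptr<Node>& a,
                    const std::shared_ptr<Node>& b) const {
        return a->weight > b->weight; // makes the priority_queue a min-heap
    }
};

// Walk the tree: left edges append "0", right edges append "1".
void assignCodes(const std::shared_ptr<Node>& node, const std::string& prefix,
                 std::map<std::string, std::string>& codes) {
    if (!node) return;
    if (!node->left && !node->right) { codes[node->word] = prefix; return; }
    assignCodes(node->left, prefix + "0", codes);
    assignCodes(node->right, prefix + "1", codes);
}

int main() {
    // Hypothetical counts of what the parrot says in a day.
    const std::vector<std::pair<std::string, long>> counts = {
        {"hello", 937}, {"stranger", 131}, {"coffee", 42}, {"goodbye", 17}};

    std::priority_queue<std::shared_ptr<Node>,
                        std::vector<std::shared_ptr<Node>>, HeavierFirst> heap;
    for (const auto& c : counts)
        heap.push(std::make_shared<Node>(Node{c.first, c.second, nullptr, nullptr}));

    // Repeatedly merge the two lightest nodes until a single tree remains.
    while (heap.size() > 1) {
        auto a = heap.top(); heap.pop();
        auto b = heap.top(); heap.pop();
        heap.push(std::make_shared<Node>(Node{"", a->weight + b->weight, a, b}));
    }

    std::map<std::string, std::string> codes;
    assignCodes(heap.top(), "", codes);
    for (const auto& entry : codes)
        std::cout << entry.first << " -> " << entry.second << "\n"; // "hello" gets the shortest code
    return 0;
}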

Image Compression Techniques

Generally, image compression techniques fall into two camps: lossy and lossless. Lossy methods take advantage of the limits of human vision, eliminating details and losing information to reduce the amount of data; they are mostly used for natural images. In lossless compression, the encoding process finds a smarter way to represent the same amount of information with less data, just like the example above with Mr. Parrot. Some schemes use hybrid models to combine the advantages of both sides.

Would you Like to Throw Exceptions from a Constructor?

Throwing an exception from a constructor has always been a debatable issue among my friends after a few beers of joy. It first came up for me on an exam, where the question asked me to design a class that allows no more than n instances to be constructed.

Instead of using the most straightforward trick, a private constructor, I decided to throw an exception from the constructor if more than n instances of that particular class already existed. Luckily, I was coding in Java and had the comfort of a managed environment under my belt. Thankfully my answer wasn't graded, but the case became a major concern for me and I ran several experiments back in the day.

From what I saw while debugging, Java simply initializes all of the fields of the class to null or zero until the constructor executes successfully. Moreover, if an exception is thrown from the constructor, the new expression never hands back a reference, and the newly created, invalid instance just waits for the garbage collector to pick it up and remove the useless object from memory for good. In short, a new instance is allocated, although the platform tends to murder it quickly and silently.
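
For the record, the exam idea itself is easy to sketch. Here is a rough C++ version (class name and limit are made up), which also sets the stage for the unmanaged discussion below: a static counter tracks live instances and the constructor throws once the limit would be exceeded.

#include <stdexcept>

// Hypothetical class that refuses to construct more than kMaxInstances objects.
class Limited {
public:
    Limited() {
        if (liveInstances >= kMaxInstances)
            throw std::runtime_error("instance limit reached");
        ++liveInstances;
    }
    ~Limited() { --liveInstances; }

    Limited(const Limited&) = delete; // copying would complicate the count

private:
    static const int kMaxInstances = 3;
    static int liveInstances;
};

int Limited::liveInstances = 0;

int main() {
    try {
        Limited a, b, c;
        Limited d; // the fourth construction throws
    } catch (const std::runtime_error&) {
        // d never came to life; a, b and c are destroyed during stack unwinding
    }
    return 0;
}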

What about Unmanaged Platforms?

It was nice to see that somebody was taking care of such exceptional situations, but what would you do if you were on your own? Fault tolerance isn't just catching all exceptions (from Bjarne Stroustrup's slides):

  1. The program must be in a valid state after an exception is caught
  2. Resource leaks must be avoided

This becomes a problem when memory management is entirely up to you. You have to clean up resources after an exception has been thrown; otherwise you leak memory and end up with a zombie object hanging around for the life of the process.

Cleaning Up the Mess of C++

An object is not even an object until its constructor has finished successfully, and its destructor will never be called for it, so you'd better release resources inside the constructor before you let an exception escape. For cases like this, I always use a double-exception-throwing pattern that I made up, and I apply it passionately in every similar situation I face.

Let's ask the compiler! Two lines of code are always more expressive than ten lines of words. Below I've written a class called "Foo", not functional but dangerous, where a StreamInitException is thrown if a problem occurs while longCharStream is being initialized. Right after we allocate 10k bytes for longCharStream, BOOM! Someone pushed the wrong button, somebody shut down the database, or troublemakers' spirits are near. Skip the catch inside the constructor and see what happens: many orphaned buffers, the ones we named longCharStream, are left all around and we can't even reach them.

#include <iostream>

// Exception carrying a short description of what went wrong during init.
struct StreamInitException {
    explicit StreamInitException(const char* m) : message(m) {}
    const char* message;
};

class Foo {
public:
    Foo() {
        try {
            initCharStream();
        } catch (const StreamInitException&) {
            // The destructor never runs for a half-constructed object, so
            // release the buffer here before letting the exception escape.
            delete[] longCharStream;
            longCharStream = nullptr;
            throw; // re-throw the exception originally thrown by initCharStream
        }
    }

    ~Foo() {
        delete[] longCharStream; // deleting nullptr is a no-op
    }

    void initCharStream() {
        // Create a long char array of 10k...
        longCharStream = new char[10000];
        // ...then simulate a failure right after the allocation.
        throw StreamInitException("Error occurred during init.");
    }

private:
    char* longCharStream = nullptr;
};

int main() {
    Foo* foo = nullptr;

    try {
        foo = new Foo;
    } catch (const StreamInitException& ex) {
        std::cout << ex.message << std::endl;
    }

    delete foo; // still nullptr here, since construction never succeeded
    return 0;
}

So I handle the object's init-related bad moments inside the constructor, deallocate the memory and so on, and finally re-throw the same exception once I have hold of the control flow. Seems acceptable!

Consequently, I have a final question: do you prefer to throw exceptions from constructors, or do you look for other workarounds, such as moving the work into a separate method called Init()?

Domain Name Server Fight is Back!

This is the second time in 12 years of my active Internet adventure that I've faced such a horrible situation. One or a few DNS servers ignore my change requests and keep pointing to the old IP, although two weeks have passed since my edits. Those rebel servers cause half of the resolvers to use the old IP address and keep my old endpoint alive. I'm not sure whether it's a TTL issue with an A record or a hacking attempt via DNS injection. I don't know the answer yet, because naturally I'm not able to access every name server, although I can traverse my resolution path and identify the evil ones.

Does anybody have a clue what is happening behind the scenes? Straightforward actions like switching to my own name servers may not help at this point; the new changes might end up deadlocked in the same cycle. Is there anybody out there who knows what's happening?