One of the greatest mathematicians and computer scientist we were contemporary with was Edsger W. Dijkstra. Among many of his contributions to these fields he let us a wonderful simple algorithm which helps us finding the shortest path between two points on a map (see example).
The algorithm itself is clever and simple and that's why probably the most of us, that work with such toys, we just love it. One sad thing I noticed is that the most websites where you can read about this topic do not provide an easy explanation of the algorithm in a way that even a child of 5 years old would understand and that's a shame. As I've said before, the algorithm is simple and clever and we should not hide its beauty under sophisticated formal mathematics!
A less formal formulation of Dijkstra shortest path algorithm
Having N distinct points (cities) on a graph (map) what is the shortest distance between two of them? Of course, if we are talking about the same city the answer is instantaneous, it's zero!
Obviously any other point could be located "at the next door" (in which case the distance would be minimal) or, if we link these points in a straight line then the farthest point could located at the distance equal with the sum of those small paths between every two neighbour points (ie the sum of the distances of the other N-1 points).
To translate a bit more the concept, the map of the points is called graph, the points are called nodes or vertices (singular vertex) and the arcs that link them are called edges. The edges may have orientation/direction (ie. they have an arrow that show their direction) in which case the graph is called ordered/directed graph or may not (undirected/unordered graph). In this post I will refer only the case of an ordered/directed graph (sometimes called digraph). Such a graph may look like the one below:
A,B,...H are the nodes (vertices), A is the initial/starting node/vertex, any connecting lines between these are the edges. For example the edge between the node A and B has the length 4. We could note that like L(A,B)=4 although I'm not sure if I'm using exactly the formal mathematical notation.
To go from node A to G we could, for instance, use the path A->D->G or A->G or A->C->F->G or even A->C->E->F->G. The path lengths may be different and thus we might want to follow one path or other. In real life we don't deal only with 8 different nodes. For instance there are roughly 2.5 millions cities around the world. There could be tremendous ways of going from a point A to B.
Things to read/watch:
- Shortest Paths: Properties, Dijkstra's Algorithm, Breadth-first Search (MIT lecture)
- Undirected graphs (Princeton paper)
- The CS Department of Rochester Institute of Technology (Lecture) <- recommended!
The steps of the algorithm
To find the shortest path between a starting/initial node A and a randomly chosen node X assumes that you have the information about all possible path such that you can compare them. Therefore would be reasonably to start building all possible routes from A to any other node on graph and finally to find that path which has the minimum length from A to X.
I wish I would be that smart to be able to keep all the numbers, traversed nodes and so on in memory but I'm not. Thus any other people like me would create a small table where would store all the steps done and any other information. We start by creating a table 3 rows x N+1 columns, where N is the number of nodes. In our case the table would look like the one below.
Note that our starting/initial node is A. Any node that is our current working node I will mark it with Gray background. For each node k in row 3 we determine the edge length from the starting node A to that node k. If A and k are directly linked then we know that exists a edge of length L(A,k) that takes us from A to k. We write that value L in the line Cost corresponding to that node k. If there is no such a node we cannot assume neither that exists or not exists, thus we assume that the cost to travel there would by extremely large. Strictly mathematically we could assume it as being infinity (in practice we could take sum of all edges at the same power or the largest number that our computer knows). To know what does that cost represent we note/write in the second row the name of the node/vertex that the cost is relative to. The initial values are relative to the initial node, ie. A, and therefore we must set the previous node for all nodes as being relative to A (with the exception of A as it cannot be its own parent). Whenever we start analyzing the paths from one node to the next we mark somehow (in our table) that node as begin processed (the Gray color) and the others that were already visited/processed/analyzed with yellow.
|Cost from A to nodes||0||4||1||9||inf||inf||7||inf|
Now, on the next step we identify the next unvisited node that starts from and is linked directly with A and which has (according to the table!!!) the minimal cost. That would be the node C. We should also mark the node C as being the current working node (with Gray color), otherwise we may forget where we are and/or what we are doing.
|Cost from A to nodes||0||4||1||9||2||4||7||inf|
From that node we look after the next unvisited nodes which starts from and are directly linked with C. We always start with that one which has the smallest edge value. That would be node E. If we take a look at the first version of the above table we'll see that when Node=E the Cost from A to E was infinity. The length of the current path from A to E which goes through C is L(A,C)+L(C,E)=1+1=2 which is smaller than the old value for L(A,E)=â and thus, we update the current version of our table with the following information: when we go through C the path from A to E is 2. So, we write 2 in the Cost row corresponding to the E node and moreover we note in the same column of row 1 (Prev node , ie. previous node) that the value noted earlier is valid when the previous node of E toward A is C. Just to be easier to see what we do at each step I will colour each update in red color.
I also use this chance (since I am already on C) to check if the route from A to F would be shorter via C than is directly from A->F. By doing the math we see that L(A,C)+L(C,F)=1+3=4 is shorter than the old value L(A,F)=â and thus, we update the current version of our table with the following information: when we go toward C the path from A to F the route costs 4. So, we write 4 in the Cost row corresponding to the F node and moreover we note in the same column of row 1 that the value noted earlier is valid when the previous node of F toward A is C. We colour this update in red so we know that it was changed here and not in the former version of table.
Since we are already on C node we could continue our endeavour by checking if going from A via C to B or G or H or D the length of the new path (to these later nodes) would become shorter than it currently is. It won't change too much because it would require enormous effort to go directly from C to B since there is no connection between these two (I usually imagine an edge equal with learning differential equation after you have already gone through algebra and calculus and not having a bridge equal with jumping from the first grade of primary school to the last year of Computer science at MIT (I was never there but my imagination is limitless); such a jump would require a tremendous effort like infinity). So, as I've said there is no way to go from A to either B,D,G or H via C, ie. the cost would be infinity. So we've got infinity but infinity was also before (check the older version of table) and due the fact that infinity is not smaller than infinity we cannot assert that we found a shorter path, thus we won't change/update anything for these nodes.
So far we are done with the nodes A and C. We move forward by looking for the next unvisited node that has (according to the table!!!) the smallest cost. The node E has the smallest cost thus E become our current working node. We mark it with Gray and any previously worked nodes in yellow.
|Cost from A to nodes||0||4||1||9||2||4||7||inf|
Likewise always we address this question: is there any path from A to any other unvisited node in graph that goes through E which is shorter that what we already know? Let's analyze for instance node B. Check in the table for the cost of going from A to column B: it's 4. What would be the cost if we follow the route A->C->our currently working node E->B? It would be L(A,E)+L(E,B)=2+7=9 is is much larger than the old value which is 4 (by going directly from A->B). No improve, no update. Ok, but what about the other nodes, eg. F? If we take a look in the table the current known cost for going from A to F (which was via C) is 4. Right now we are in E. The route through this node would cost L(A,E)+L(E,F)=2+2=4 which is the same like the old value. No matter if we take one path or the other the cost would be the same so it's up to us which path we take. We could update this new path or, since it's mathematically the same, we could let it unchanged. I have no reason of changing so I move forward. Since I am already in E I would like to know if there is a shorter path that starts from A and goes through E to any other unvisited node. Since E is not directly linked to any other node it's logically to assume that the cost would be tremendous, ie. infinity. No matter what value was before in the table, the infinity is anyway larger (or at least the same) than that so we have no reason to compare and eventually update/change the table. We move forward, like always.
What is the next unvisited node (according to the table!!!) that has the minimal cost? The table shows us only two possible candidates, F and B but F has a shorter edge, 2=L(E,F)<L(E,B)=7 so we choose the next working node as being F. We could choose B but since I can choose whatever I want (due to the fact they have the same value) I will try F In that case the current working version of our table will look like the one below.
|Cost from A to nodes
The same question as always: is there any path that starts from A and goes through F to any other unvisited node? We look in the graph, there is only one edge which start from F to outer nodes. The rest of them are oriented backwards, from the other nodes inward F. We are a bit advanced now, we already know that there is no point of studying the path from X to Y when there is no bridge between them 'cause if there is a path then traversing that path would cost us an infinity. So we'll spend our time only checking the path from A via F to G. The cost would be L(A,F)+L(F,G)=4+2=6. The old value (according to the older table) for traversing from A to G was 7, also greater the the last analyzed path. 6<7 thus 1 unit improvement, I'll buy it! We jump to the table, update this value in red like this: the cost from A to node G via F is 6. We already said the for the other nodes the cost would be infinity which is the largest possible ever, thus no update/change. We are done also with this node. Let's move forward. Which is the next unvisited node which has (according to the table!!!) the smallest cost? That's B (I think I said that before but I randomly chose F since B and F has the same cost; now it's B time). We mark B as our current working node:
|Cost from A to nodes
Is there any path from A to any other unvisited nodes which by going through B would be shorter than what we already know (according to the table) ? I cannot see any connection that exists from B toward any other nodes so obviously to go from B to any other node would cost infinity. We were lucky, there is really nothing more we can check for this node. We just move forward to the next unvisited node that (according to the table) has the minimal cost. There are only three left: D=9, G=6 and H=infinity. Obviously the answer is G. So we mark G as the next working node.
|Cost from A to nodes
How lucky could someone be? Two in a row! There is no way to go from G to anywhere since all the directions go toward G and there is no connection that starts from G to any other node. If there is a path (we cannot see it but if theoretically if there is such a path) it would cost tremendous ie. infinitely much. Nothing to change, just move forward you lucky bastard! OK, enough with the joy. Back to work! What's the next unvisited node that (according to the table!!!) has the minimal cost? Two possible choices (D or H) but only one, really: D.
|Cost from A to nodes
Now, is there any path by starting from A via D to any other unvisited node which would cost less? Well, we could evaluate the A->D->G route I believe. It would cost L(A,D)+L(D,G)=9+3=12 much larger than what we already have, ie. L(A,G)=6 through F (check the table!). Well, that being said there is really nothing more to check about this D. Move forward!
What's the next unvisited node? Well, there is only one, H which cost infinity.
|Cost from A to nodes
There is no connection/arrow that goes into H, only backwards. So there is nothing really we can do just to move forward, ie. mark that H node as visited (yellow). So the final version of the table looks pretty much the same like above.
|Cost from A to nodes
What we achieved? What we know? How can we use this info in a usable way?
Well, we know the cost from A to any node X by looking just to the table. We know even which is the parent of X and that is Prev node. So if I ask how much would it cost to go from A to F the answer is 4. What is the route such that the cost would be only 4? Well, the destination is F. The parent (prev node) of F is C. The parent of C is A. So the route would be A->C->F and this would cost only 4. We could use this table to determine the shortest path between A and G and I can assure you, it's not - despite what the intuition will tell you - the A->G arc. Let's check: the parent of G is F; the parent of F is C; the parent of C is A so the route is A->C->F->G and would cost (according to the table) only 6. The direct arc L(A,G) cost (according to the graph) 7. So the table doesn't lie. The table is clever. The table is simple.
The Dijkstra algorithm is clever and simple!
The algorithm has many uses and it's not only about the distance between cities, points, whatever. It has many applications. For instance I use it to export the data from a 1000 tables MS-SQL database. I use it to check which are my most included/required files of a 300 files PHP source files (and this is not always that simple as it seems; eg. A includes B which include C which includes D, thus A includes not only B but C,D). That has nothing to do with the shortest path, isn't it? Well, it has!
In the case of MS-SQL export I use it to calculate which are those tables that don't depend on any other table. I export theirs records first! The I check which are the next most independent tables and export theirs records. Then I continue the same process until I get to the other extreme, ie. "which are the most dependent tables?" and those are exported last. Why? Because I want the foreign keys/records to be there (at destination) before I refer them, so the data is exported almost in exactly the same order that the users have recorded them historically in the source database.
Both graphs above assumes the equal weight/cost of edges thus we can regard those graphs as unweighted graphs.
Likewise I use this algorithm for finding the most dependent/independent PHP files of my project. There are many uses of such information, probably in other article.
A PHP implementation of Dijkstra shortest path algorithm (that includes also the building of graph by adding nodes and edges) can be found at my public repository hosted by BitBucket.
The repository address is: https://bitbucket.org/eugenmihailescu/dijkstra-shortest-path.
The source contains the Dijkstra.php and the GenericGraph.php classes as well as three example files:
- an example for an directed/ordered graph that solves the problem shown above
- an example for an ordered graph that finds the most/least included files of a PHP project (see also this)
- two examples for finding the shortest route between two cities when the distances between any two adjacent cities is given (one with real cities, one with randomly generated cities)
Notes on journey
- the BFS algorithm may be used to find the shortest path in terms of the least number of cities visited which doesn't automatically imply the shortest distance
- probably the most easiest but also most costly alternative would be to search all possible paths - the exhaustive search (there are N2 different routes).
- another easy alternative is by using a greedy approach although it doesn't guarantee the optimal path
- the algorithm works only for positive weights/costs, weighted/unweighted and directed/undirected graphs
Now, if you think that this article was interesting don't forget to rate it. It shows me that you care and thus I will continue write about these things.