[Data Mining] HITS Algorithm Example in Tcl

Hyperlink-Induced Topic Search (HITS) is a system used for evaluating the usefulness of a web pages on the Internet. HITS establishes a website as an Authority or a Hub. Authorities are those sites that are linked to by Hubs. And you probably guessed it by now. Hubs are those sites that point to Authorities. A site can “easily” be identified as a Hub or an Authority by looked at its calculated weights. The higher weight will determine whether it is identified as an Hub or an Authority. For example, let’s say we have a site A. If after our calculation we end up with an authority weight for A of 24 and the hub weight for A Hub of 16, then we can say site-A is an authority because there are a lot of sites that point to A.

Let’s take another example. Given the following table of linked web sites, which forms a neighborhood graph:

+---------------------------+
| Site | Linked to the Site |
+------+--------------------+
| A    | C, B               |
| B    | A                  |
| C    | A, B               |
+------+--------------------+

We can find the authority and hub in the neighborhood graph using some, N, number of iterations of the HIS algorithm. It is said eight to nine iterations is usually enough, but that depends on the amount of data to be processed. Anyway, when we look at the above neighborhood graph we say that site A has in its source, ‘href’ tags that link to sites C and B or B and C. The order is unimportant. And sites C and B are referenced by site A. Hence, C and B are being pointed at by site A. The HITS algorithm in essence states, according to Wikipedia:

  1. Start with each node having a hub score and authority score of 1.
  2. Run the Authority Update Rule
  3. Run the Hub Update Rule
  4. Normalize the values by dividing each Hub score by the sum of the squares of all Hub scores, and dividing each Authority score by the sum of the squares of all Authority scores.
  5. Repeat from the second step as necessary.

In my code write up, the normalization rule is omitted. Here is a sample run of two iterations from the neighborhood graph above:

 Iteration 1 ==========================================
===>>> Calculating Auth(A)
>> 2 Linking us: Hub(B:1) Hub(C:1)
===>>> Auth(A) == 2

===>>> Calculating Hub(A)
>> Linking(2) to: Auth(B:1) Auth(C:1)
===>>> Hub(A) == 2

===>>> Calculating Auth(B)
>> 2 Linking us: Hub(A:1) Hub(C:1)
===>>> Auth(B) == 2

===>>> Calculating Hub(B)
>> Linking(1) to: Auth(A:1)
===>>> Hub(B) == 1

===>>> Calculating Auth(C)
>> 1 Linking us: Hub(A:1)
===>>> Auth(C) == 1

===>>> Calculating Hub(C)
>> Linking(2) to: Auth(A:1) Auth(B:1)
===>>> Hub(C) == 2

 Iteration 2 ==========================================
===>>> Calculating Auth(A)
>> 2 Linking us: Hub(B:1) Hub(C:2)
===>>> Auth(A) == 3

===>>> Calculating Hub(A)
>> Linking(2) to: Auth(B:2) Auth(C:1)
===>>> Hub(A) == 3

===>>> Calculating Auth(B)
>> 2 Linking us: Hub(A:2) Hub(C:2)
===>>> Auth(B) == 4

===>>> Calculating Hub(B)
>> Linking(1) to: Auth(A:2)
===>>> Hub(B) == 2

===>>> Calculating Auth(C)
>> 1 Linking us: Hub(A:2)
===>>> Auth(C) == 2

===>>> Calculating Hub(C)
>> Linking(2) to: Auth(A:2) Auth(B:2)
===>>> Hub(C) == 4

This source file and “executable” can be used to calculate the hub and authority weights from the initial graph trees above without normalization. Maybe I might add the normalization part later.

About these ads
This entry was posted in Grad School. Bookmark the permalink.

One Response to [Data Mining] HITS Algorithm Example in Tcl

  1. Akshay Raikwar says:

    What if i want output such as how many links are from site A to Site B

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s