Search Tool Data Analysis

Caitlin De Gregorio (Caitd in BIT330, Fall 2008)

Questions and queries

Web search engines

I want to learn more about RSS feeds. Where can I find general background information on RSS files? What is headline syndication, and what benefits would I gain from syndicating my headlines with RSS?

Windows Live Query:
“RSS feeds”
Google:
“RSS feeds”
Yahoo Web:
“RSS feeds”

Blog search engines

I would like to learn more information on cloud computing. What is cloud computing? What companies assisted with the creation of cloud computing? Who can use cloud computing today?

Technorati:
"Cloud computing"
Google Blog Search:
"Cloud computing"
Bloglines:
"Cloud computing"

Data that I collected

Search engine overlap data

Web search     Live    Google    Yahoo Web
Live           20      35        40
Google                 30        25
Yahoo Web                        25
All            25

Blog search    Technorati    Google Blog    Bloglines
Technorati     30            15             15
Google Blog                  25             10
Bloglines                                   35
All            0

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.

GY        Yahoo
Google    5     10    20
5         3     3     4
10        3     3     4
20        4     4     5

This table provides a measure of how much of Yahoo's responses are reproduced by Google.

YG        Google
Yahoo     5     10    20
5         3     3     4
10        3     3     4
20        4     4     5

This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.

BG           Google
Bloglines    5     10    20
5            0     0     0
10           0     0     0
20           1     2     2

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.

GB        Bloglines
GBlog     5     10    20
5         0     0     1
10        0     0     2
20        0     0     2

Results

Web search

Web search            Precision (%)                             Overlap (%)                               All (%)
                      Live          Google        Yahoo         L/G           L/Y           G/Y           L/G/Y
Mean                  42.73684211   54.57894737   51.68421053   18.47368421   20.36842105   20.78947368   10.21052632
Median                42            57            52            20            20            20            10
Standard Deviation    22.12550591   19.50873249   21.79462888   9.299751616   11.17144438   7.692557351   7.322559869
Mode                  15            70            70            10            10            25            10

The mean precision for Live was 42.74%, meaning that on average 42.74% of the results Live returned were relevant. Precision is B / (B + C), where B is the number of websites that are retrieved and relevant and C is the number of websites that are retrieved by the search engine but not relevant. One wants this ratio to be as high as possible because one wants the highest possible percentage of retrieved websites to be relevant. The mode for the Live precision data was 15%. The mode is the value that occurs most frequently in a data set, so the precision value reported most often for Live was 15%. A few high values can pull the mean well above the mode, which is what happened here. The median for Live precision was 42%. The median is the value that separates the upper half of the data from the lower half. The standard deviation for Live precision was 22.13 percentage points. The standard deviation is the square root of the variance, and it describes how far a typical value lies from the mean of 42.74%.

The mean for Google precision was 54.58%, so on average 54.58% of the results Google returned were relevant. The mode was 70%, the most common precision value reported for Google. The median was 57%, the value that splits the Google data into an upper and a lower half. The standard deviation was 19.51 percentage points around the mean of 54.58%.

The mean for Yahoo precision was 51.68%, meaning that on average 51.68% of the websites Yahoo returned were relevant. The mode was 70% and the median was 52%. The standard deviation was 21.79 percentage points. If the data were approximately normally distributed, about 68% of the values would fall within plus or minus 21.79 percentage points of the mean.

The mean for the Live/Google overlap was 18.47%, which means that on average 18.47% of the sites retrieved were shared between the two engines. The mode was 10%, the median was 20%, and the standard deviation was 9.30 percentage points. The mean for the Live/Yahoo overlap was 20.37%, with a mode of 10%, a median of 20%, and a standard deviation of 11.17. The mean for the Google/Yahoo overlap was 20.79%, with a mode of 25%, a median of 20%, and a standard deviation of 7.69. The mean for results shared by Live, Google, and Yahoo together was 10.21%, with a mode of 10%, a median of 10%, and a standard deviation of 7.32.
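
The definitions above can be checked with a few lines of Python. This is a minimal sketch, not the spreadsheet the class used: the function names `precision` and `summarize` and the sample values are made up for illustration, and `statistics.stdev` computes the sample standard deviation, which may or may not be the variant behind the table above.

```python
from statistics import mean, median, multimode, stdev


def precision(relevant_retrieved: int, irrelevant_retrieved: int) -> float:
    """Precision = B / (B + C), returned as a percentage."""
    total = relevant_retrieved + irrelevant_retrieved
    if total == 0:
        raise ValueError("no results were retrieved")
    return 100.0 * relevant_retrieved / total


def summarize(values):
    """Return the four descriptive statistics reported in the results tables."""
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),  # a list, because a data set can have ties
        "stdev": stdev(values),      # sample standard deviation (assumption)
    }


# Hypothetical precision percentages, NOT the class data.
sample = [15, 15, 40, 45, 70]
summary = summarize(sample)

# 7 relevant results out of 20 retrieved gives a precision of 35%.
example_precision = precision(7, 13)
```

In this made-up sample the mode (15) sits well below the mean (37) because a single high value of 70 pulls the mean up, the same effect seen in the Live precision data.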

GY
o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 1.058823529 1.352941176 1.647058824 1.294117647 2 2.647058824 1.647058824 2.470588235 3.705882353
Mode 1 0 0 1 1 4 1 3 5
Standard Deviation 1.197423705 1.32009358 1.411611511 1.212678125 1.322875656 1.729926894 1.221739358 1.545867356 2.114376559
Median 1 1 2 1 2 3 1 3 4
YG
o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 1.058823529 1.176470588 1.647058824 1.470588235 1.941176471 2.470588235 1.882352941 2.647058824 3.764705882
Mode 1 0 1 1 3 3 1 4 5
Standard Deviation 1.197423705 1.286239389 1.366618842 1.23073388 1.390619836 1.58578242 1.268973647 1.729926894 2.077540967
Median 1 1 1 1 2 3 2 3 4

These tables show how much each search engine reproduces the other's results. The numbers are not percentages but counts: for example, how many websites in the top five of Google's results were also in the top five of Yahoo's results, or how many of the top five of Google's results were in Yahoo's top ten. In the class results, the average reproduction of Google’s responses in Yahoo’s top five was equal to or slightly higher than the average reproduction of Yahoo’s responses in Google’s top five (1.35 versus 1.18 in the (10,5) column; the (5,5) and (20,5) columns were identical).
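
The counting described above could be sketched as follows. The function names, the argument order, and the short example lists are assumptions made for illustration; in particular, the report does not spell out which cutoff in o(m,n) belongs to which engine, so the sketch names both cutoffs explicitly.

```python
def reproduction(source, target, source_k, target_k):
    """Count how many of the top `source_k` URLs in `source`
    also appear among the top `target_k` URLs in `target`."""
    target_top = set(target[:target_k])
    return sum(1 for url in source[:source_k] if url in target_top)


def reproduction_table(source, target, cutoffs=(5, 10, 20)):
    """Build every (source_k, target_k) combination, like the o(...) columns."""
    return {
        (s, t): reproduction(source, target, s, t)
        for s in cutoffs
        for t in cutoffs
    }


# Hypothetical ranked result lists (placeholders, not real search results).
engine_a = ["a", "b", "c", "d", "e", "f"]
engine_b = ["c", "x", "a", "y", "z", "b"]

# Small cutoffs so the tiny example lists are meaningful.
table = reproduction_table(engine_a, engine_b, cutoffs=(3, 6))
```

Here `table[(3, 3)]` counts the URLs shared by both top-three lists, and `table[(3, 6)]` counts how many of the first engine's top three appear anywhere in the second engine's top six.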

Blog search

Blog search           Precision (%)                             Overlap (%)                               All (%)
                      Technorati    GBlog         Bloglines     T/G           T/B           G/B           T/G/B
Mean                  33.42105263   52.63157895   44.63157895   3.894736842   9.526315789   7.210526316   1.578947368
Median                30            45            48            0             10            5             0
Standard Deviation    20.61907367   21.56182461   13.95711896   6.943380035   7.662088641   6.373372972   3.355191491
Mode                  35            40            50            0             5             5             0

The Technorati precision mean was 33.42%, meaning that the average precision the class measured for Technorati was 33.42%. The mode was 35%, the standard deviation was 20.62, and the median was 30%. The Google Blog Search mean was 52.63%, the mode was 40%, the standard deviation was 21.56, and the median was 45%. The Bloglines mean was 44.63%, the mode was 50%, the standard deviation was 13.96, and the median was 48%. Precision is B / (B + C), where B is the number of websites that are retrieved and relevant and C is the number of websites that are retrieved by the search engine but not relevant. One wants this ratio to be as high as possible because one wants the highest possible percentage of retrieved websites to be relevant. The mode is the value that occurs most often in a data set. The standard deviation is a measure of how spread out the data are: it measures the variation between the values in the sample and the mean. For approximately normally distributed data, the first standard deviation on either side of the mean holds approximately 68% of the data. The median is the value that separates the upper half of the data set from the lower half. The class data show that there is less overlap between the blog search engines than between the web search engines.
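
The overlap columns can be illustrated with a short sketch. It rests on an assumption: overlap is taken here as the number of URLs shared by all the compared engines, divided by the size of the shortest result list, expressed as a percentage. The report does not state the exact denominator the class used, and the function name and example lists are hypothetical.

```python
def overlap_percent(*result_lists):
    """Percentage of results shared by every list, relative to the
    shortest list (the denominator is an assumption, see the lead-in)."""
    if len(result_lists) < 2:
        raise ValueError("need at least two result lists to compare")
    sets = [set(results) for results in result_lists]
    if any(not s for s in sets):
        raise ValueError("every result list must contain at least one URL")
    shared = set.intersection(*sets)
    smallest = min(len(s) for s in sets)
    return 100.0 * len(shared) / smallest


# Hypothetical blog search results (placeholders, not the class data).
technorati = ["p1", "p2", "p3", "p4"]
gblog = ["p2", "q1", "q2", "q3"]
bloglines = ["p2", "p3", "r1", "r2"]

t_g = overlap_percent(technorati, gblog)
t_b = overlap_percent(technorati, bloglines)
g_b = overlap_percent(gblog, bloglines)
all_three = overlap_percent(technorati, gblog, bloglines)
```

The three-way figure can never exceed any of the pairwise figures when the lists are the same length, which matches the pattern in both results tables, where the "All" column is the smallest.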

GB
o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 0.285714286 0.357142857 0.5 0.428571429 0.428571429 0.785714286 0.785714286 0.785714286 1.071428571
Mode 0 0 0 0 0 0 0 0 0
Standard Deviation 0.468807231 0.633323694 0.650443636 0.646206173 0.755928946 1.050902281 0.974961256 1.188313053 1.268814451
Median 0 0 0 0 0 0 0.5 0 1
BG
o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 0.285714286 0.357142857 0.642857143 0.428571429 0.5 0.857142857 0.571428571 0.857142857 1.142857143
Mode 0 0 0 0 0 0 0 0 1
Standard Deviation 0.468807231 0.633323694 0.928782732 0.646206173 0.759554525 1.167320591 0.646206173 1.027105182 1.231455852
Median 0 0 0 0 0 0.5 0.5 0.5 1

These tables show how much each search engine reproduces the other's results. Reproduction means the number of times a website from one search engine's results page also appears on another search engine's results page. This matters because it is important to know how unique each search engine's results really are. The class results for the blog search engines were much lower than the results for the web search engines. The class average for reproduction was above one only twice, both times in the (20,20) column, which compares the top 20 results of each engine (1.07 for GB and 1.14 for BG). The mode was zero everywhere except the (20,20) column of the BG table, where it was one. The standard deviation was never above 1.5 (the largest was 1.27). If the counts were normally distributed, approximately 68% of the data would lie within one standard deviation of the mean; with so many zeros in the data, that figure is only a rough guide.
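
The "approximately 68%" rule of thumb can be tested directly on a data set. The sketch below uses made-up reproduction counts with many zeros, similar in shape to the blog data but not taken from it; `share_within_one_sd` is a hypothetical helper name.

```python
from statistics import mean, stdev


def share_within_one_sd(values):
    """Percentage of values lying within one sample standard
    deviation of the mean, on either side."""
    centre = mean(values)
    spread = stdev(values)
    inside = [v for v in values if centre - spread <= v <= centre + spread]
    return 100.0 * len(inside) / len(values)


# Hypothetical reproduction counts, NOT the class data.
counts = [0, 0, 0, 0, 0, 1, 1, 2, 3, 4]
share = share_within_one_sd(counts)
```

For this skewed sample the share comes out at 80%, not 68%, which shows why the rule should be read as an approximation that holds best for bell-shaped data.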

Discussion

Web search

  • The first set of data measures the precision and the amount of overlap for similar searches. We can see from the class results that Google has the highest average precision among the web search engines. On average, the most overlap is between Google and Yahoo. Precision is the number of retrieved results that are relevant divided by the total number of retrieved results. One wants this number to be as large as possible. The second set of data measures the amount of reproduction. For example, the (5,5) column counts how many of the top five results from the Google search also appear in the top five of the Yahoo search. The (10,5) and (20,5) columns widen one of the two lists to the top ten and the top 20 results. Reproduction and overlap both measure how many of the same results similar search queries yield. Reproduction is reported as a count, while overlap is reported as a percentage of the total number of search results. This is important for the user. Should one use just one web search engine for all of his or her searching? It depends on the user. The data show that the highest average overlap was only approximately 20%. If the user wants the most complete search possible, he or she should try multiple web search engines.
  • I would recommend several things. The first is to use precise search queries rather than simple ones. A simple query is the most basic form of a search query, such as a single word or a loose group of words. The problem with these queries is that they return a high number of irrelevant results, which means the user’s precision is low. The goal is a high precision rate, and it is hard to pick the relevant results out of possibly thousands of irrelevant ones. I would recommend putting phrases within quotation marks. For example, instead of searching for common cold, remedy, I would search for “Common cold remedy” all within the quotation marks to get more precise results. Unique phrases make search results more precise because the search engine treats them as a linked group of words instead of processing each word separately. The user can also use operators to make the search more precise. There are two types of operators: the kind that lets the user search a particular portion of a site, and the kind that lets the user turn the query into a formula. For example, one can search for a specific webpage title by using intitle:, or one can restrict the search to the URL by using inurl:. Other operators change the way a search engine processes the query. By using the word AND or a plus sign, one can make the search engine require both terms, as in Cold + Remedy.
  • I first learned what properly constitutes similar search queries. When I first did the assignment, I used search queries that were not similar enough, and I got bad results. I also learned about the different kinds of search queries and that the best way to get the most relevant data is to use phrases and operators. I found out how to search a specific area of a website by using operators such as intitle: and inurl:. I also found that it would be helpful in my research not to rely on just one search engine. The data that we collected as a class show that Live, Google and Yahoo all return a fair share of relevant results (between about 43% and 55% precision on average). They also show that the three search engines do not return all the same results. I can find more relevant websites if I use more than one search engine.
  • I think that instead of similar search queries we should use the same search queries. I would also like to research how often people use different search engines. Students can really benefit from knowing that they can get better results by using different search engines and not just different search queries. Often, I vary only my search queries and not my search engine. Are there search engines that aggregate search results from other search engines so that the user does not have to do this task herself? I also think we should be required to use operators in our search queries to make them more precise.
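
The aggregation question in the last point can be illustrated with a small sketch of how results from several engines might be merged by hand. This is only one possible approach (round-robin interleaving with duplicates removed), not a description of how any real metasearch service works; the function name and the placeholder results are hypothetical.

```python
def merge_results(results_by_engine):
    """Interleave ranked result lists round-robin and drop duplicate URLs,
    so a URL returned by several engines appears only once."""
    merged = []
    seen = set()
    longest = max((len(r) for r in results_by_engine.values()), default=0)
    for rank in range(longest):
        for results in results_by_engine.values():
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


# Hypothetical ranked results (placeholders, not real search output).
results = {
    "live": ["a", "b"],
    "google": ["b", "c", "d"],
    "yahoo": ["a", "e"],
}
combined = merge_results(results)
```

With only about 20% overlap between engines, a merged list like this would be noticeably longer than any single engine's list, which is the practical benefit of searching more than one engine.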

Blog search

  • The first set of data measures the precision and the amount of overlap for similar searches. We can see from the class results that Google Blog Search has the highest average precision among the blog search engines. On average, the most overlap is between Technorati and Bloglines (9.53%), followed by Google Blog Search and Bloglines (7.21%). Precision is the number of retrieved results that are relevant divided by the total number of retrieved results. One wants this number to be as large as possible. The second set of data measures the amount of reproduction. For example, the (5,5) column counts how many of the top five results from the Google Blog search also appear in the top five of the Bloglines search. The (10,5) column widens one of the two lists to the top ten results. Reproduction and overlap both measure how many of the same results similar search queries yield. Reproduction is reported as a count, while overlap is reported as a percentage of the total number of search results. It is important to note that reproduction and overlap should correlate. There should not be a high percentage of overlap together with only a very small amount of reproduction.
  • Again, I would suggest making the search query as precise as possible. Do not use simple queries, such as a single word or a loose group of words. The problem with these queries is that they return a high number of irrelevant results, which means the user’s precision is low. This is especially true with blog search engines. The goal is a high precision rate. I would recommend putting phrases within quotation marks. Unique phrases make search results more precise because the search engine treats them as a linked group of words instead of processing each word separately. The user can also use operators to make the search more precise. There are two types of operators: the kind that lets the user search a particular portion of a site, and the kind that lets the user turn the query into a formula. For example, one can search for a specific webpage title by using intitle:, or one can restrict the search to the URL by using inurl:. I found intitle: to be particularly helpful with blog search engines for finding the types of blog posts I was looking for. It was usually easier to find relevant blog posts if the topic I was searching for was in the title of the post.
  • I learned that it is important to check blog search engines as well. Before this class, I was not familiar with them. I did not use intitle: in this assignment; however, while searching blog search engines on my own, I found it particularly useful. It helped me find results that were relevant and precise. In general, blog search engine results do not have as high a precision rate as web search engine results, but they do include some relevant blog posts to read. Also, the amount of overlap and reproduction among blog search engines is very small, so checking several of them turns up different information.
  • I would also like to know whether there is a particular operator that gives more precise data from blog search engines. Blogs often add an editorial perspective to a student's research; however, it is often hard to find the articles and blog posts one is searching for. It often takes several different search queries to find good results. The class results exemplify this problem: the blog search engines had lower precision and far less overlap than the web search engines. I think it would make sense, as a next step, to learn more about ways to get more precise information from blogs because they had the lowest precision.
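
The query advice in this section (quoted phrases, intitle:, inurl:, and a plus sign for required terms) can be sketched as a small helper that assembles a query string. The helper name `build_query` and its parameters are hypothetical, and operator support differs between search engines, so the output should be treated as an example query to try rather than something every engine will honor.

```python
from urllib.parse import quote_plus


def build_query(phrase=None, intitle=None, inurl=None, required_terms=()):
    """Assemble a search query from the operator types discussed above."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')          # exact phrase in quotation marks
    if intitle:
        parts.append(f"intitle:{intitle}")   # term must appear in the page title
    if inurl:
        parts.append(f"inurl:{inurl}")       # term must appear in the URL
    parts.extend(f"+{term}" for term in required_terms)  # required terms
    return " ".join(parts)


# The blog search topic from this report, narrowed with intitle:.
blog_query = build_query(phrase="cloud computing", intitle="cloud")

# The same query encoded for use in a URL query string.
encoded = quote_plus(blog_query)

# The "Cold + Remedy" example written with required-term operators.
cold_query = build_query(required_terms=("cold", "remedy"))
```

Building the query in one place makes it easy to send the same query to several search engines, which is what the suggestion to use the same search queries for every engine would require.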