CIPA: Which Filtering Software to Use?   
Derek Hansen of the University of Michigan's School of Information summarizes the key findings of a substantial filtering software research project--required reading if you are evaluating filtering solutions.

Whatever your personal opinion of CIPA, a few important facts about filtering will help you decide what product (if any) to purchase, what configuration options are available, and how to evaluate a filter's effectiveness. The University of Michigan's School of Information and Health System recently completed an extensive study of filtering software for the Kaiser Family Foundation, in a project headed by Professors Paul Resnick and Caroline Richardson. Derek Hansen, a member of the research team, offers this report.

Not All Filters Are Created Equal

There is an enormous amount of variation in the types of filtering software available. Some key variables and related questions that should be considered when selecting a product include:

  • Hardware/Software compatibility: Will the filtering software be installed on each individual computer (i.e., client) or on a central computer (i.e., server)? Does the filtering software require you to have additional software?

  • Cost: Are there any ongoing fees associated with updating the blocklist? Are there separate installation fees?

  • Ease of installation and maintenance: Does the company install the software for you? How frequently does it need to be updated, modified, etc.? How difficult is it to add/remove computers?

  • Monitoring and reporting capabilities: What statistics are kept and how can they be accessed? Are the reports you need available as standard, ready-made reports?

  • Effectiveness of filtering technology: How frequently are blacklists/whitelists updated? Does the filter have the ability to dynamically classify a site, even when that site has not already been placed on a blacklist/whitelist?

  • Configuration Options: Can different computers be configured differently? How many blocking categories (e.g., pornography, gambling, hate, web chat) are offered, and do those categories correspond with the library's policies?

  • Error handling capabilities: How difficult is it to turn on or off the filter for a given computer? Can custom messages be displayed when a site is blocked? How difficult is it to add a site to or remove it from a whitelist or blacklist? Who can initiate such a procedure?

Filters are Flexible--Make Them Stretch

Over the past several years, most commonly used filters (at least in the library and school setting) have become extremely versatile. As a result, the goals of the institution using the filter can be better met than in prior years. Notably, most products offer a wide range of categories that can be blocked (often numbering in the dozens), including categories like “pornography,” “hate,” “internet chat,” and “gambling,” as well as some categories that can be allowed through, like “health” and “sex education.” Currently, few (if any) filters include a category that only blocks sites that meet the legal criteria outlined in CIPA. Instead, to comply with CIPA, libraries must block all categories that include any of the content not permitted by CIPA. This results in more content being blocked than the law itself would technically require. In most cases, blocking only the “pornography” category (which may be called “sexually explicit” or something related) is sufficient; however, depending upon how the categories are defined there may be additional categories that must be blocked in order to comply with CIPA (e.g., “extreme” or “adult”).

The decision about which categories to block (i.e., how to configure the product) is at least as important as which filtering product to purchase, and should not be made in haste. In fact, in a recent study focused on finding health information, we found that the configuration of the product had a far greater impact on the amount of over- and underblocking than the choice of product itself (see report titled “See No Evil: How Internet Filters Affect the Search for Online Health Information” available at http://www.kff.org/entmedia/20021210a-index.cfm). It is also worth noting that most server-based filters allow different computer stations to be configured differently, so that terminals in the Youth or Young Adult area of the library can be configured to block more categories than the computers in the main section of the library, if desired.

Another important way in which filters are flexible is their ability to deal with errors. Many products allow custom messages that appear when an attempt is made to access a blocked site. These messages can include text or links that tell users what steps to take if they believe the site should not be blocked. Many products currently include a link which, if selected, sends a message to the filtering company asking it to re-evaluate the site. However, this process takes days at best and could take months, which is of little help to the patron who needs the site now. In addition, some products can be set up so that a library patron can submit a site (perhaps even anonymously) to a librarian, who can review the site and add it to the “allowed” list (if appropriate) without much difficulty. While this is a bit more labor intensive from the library's standpoint, it considerably reduces the negative effects of overblocking.
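To make the two kinds of flexibility described above concrete, here is a minimal Python sketch of how a blocking decision might combine per-station category settings with a locally maintained “allowed” list. It is an illustration only, not the logic of any actual product: the hostnames, the category assignments, the station names, and the function names (is_blocked, approve_after_review) are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical vendor category list; a real product ships and updates its own.
VENDOR_CATEGORIES = {
    "explicit.example": {"pornography"},
    "teenhealth.example": {"pornography"},  # imagined misclassification
    "cards.example": {"gambling"},
}

# Hypothetical per-station settings: the youth area blocks more categories.
STATION_BLOCKED = {
    "main": {"pornography"},
    "youth": {"pornography", "gambling", "web chat"},
}

# Hosts a librarian has reviewed and approved locally.
local_allow_list = set()


def is_blocked(url, station):
    """Block a URL if any of its categories is blocked at this station,
    unless a librarian has added the host to the local allow list."""
    host = urlparse(url).hostname or ""
    if host in local_allow_list:
        return False
    categories = VENDOR_CATEGORIES.get(host, set())
    return bool(categories & STATION_BLOCKED[station])


def approve_after_review(url):
    """Record that a librarian reviewed the site and chose to allow it."""
    local_allow_list.add(urlparse(url).hostname or "")


# A patron reports an overblocked health site; a librarian approves it.
print(is_blocked("http://teenhealth.example/condoms", "main"))
approve_after_review("http://teenhealth.example/condoms")
print(is_blocked("http://teenhealth.example/condoms", "main"))
```

In this sketch the local allow list is checked before the vendor categories, so a librarian's decision takes effect immediately rather than waiting for the filtering company to re-evaluate the site.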

No Filter is Free from Errors…But Some are More Error-Free than Others

Because of the enormous amount of uncensored, constantly changing information on the Internet, no filter will ever be free from over- or underblocking errors. Overblocking refers to situations where an “appropriate” site is erroneously blocked by the filter. Underblocking refers to situations where an “inappropriate” site is not blocked. The rates (i.e., percentages) of over- and underblocking are important measures of the effectiveness of a filter. There are a few things to keep in mind when looking at over- and underblocking rates.

  1. There is often a tradeoff between over- and underblocking. Similar to the interplay between recall and precision, it is often the case that as one measure improves the other worsens. The moral of the story: look at both the over- and underblocking rates when determining the effectiveness of a filter.

  2. Over- and underblocking rates depend upon a variety of factors, including the filtering product, the product's configuration (as described previously), and the set of URLs being tested. In fact, in a recent study performed for the Kaiser Family Foundation, we found that a product's configuration and the topic of the URL list made a larger difference in the amount of overblocking than did differences between products (see http://www.kff.org/content/2002/20021210a/ for details). In summary, comparisons of filtering products must use as similar a configuration as possible and be based upon the same set of URLs to be meaningful.

  3. Unfortunately, two completely different (and independent) statistics are used to measure the percentage of overblocking, and both are commonly referred to as the “overblocking rate.” (The same is true of underblocking, although there is generally more confusion and disagreement about overblocking.) One overblocking rate is calculated as a percentage of all “appropriate” sites in the test set, while the other is calculated as a percentage of all blocked sites. Filtering critics like to cite the fraction of all blocked items that are blocked in error, since this number is generally larger; however, even if that percentage is large, patrons may rarely experience a block while looking at innocuous information. (See the postscript below for a more thorough description of these error rates.) Always understand which error rate is being presented and how it should be interpreted.

  4. There have historically been many methodologically weak studies that have calculated and publicized misleading over- and underblocking rates. The primary problems with these studies relate to:

    1. the selection of URLs to be tested (e.g., no objective and repeatable process of selecting particular URLs; too small a sample of URLs)

    2. testing of the filtering software itself (e.g., unclear as to which configuration is used)

    3. classifying of sites (e.g., researchers do not follow or document a consistent procedure when classifying sites)

    4. error rate reporting (e.g., researchers don't present all of the over- and underblocking error rates and/or misinterpret their meaning - see the postscript below)

Lesson: don't believe everything you read.
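The difference between the two overblocking rates in point 3 is easiest to see with a small calculation. The Python sketch below uses a made-up test set (the counts are invented for illustration and do not come from any study), and the function name and the label “blocked-sites overblock rate” are our own. It assumes the test set contains at least one site of each kind and that at least one site is blocked.

```python
def blocking_rates(ok_blocked, ok_allowed, bad_blocked, bad_allowed):
    """Compute the error rates from counts of test-set URLs.

    ok_*  = "appropriate" sites, bad_* = "inappropriate" sites.
    """
    ok_total = ok_blocked + ok_allowed
    bad_total = bad_blocked + bad_allowed
    blocked_total = ok_blocked + bad_blocked
    return {
        # Share of all appropriate sites that were blocked.
        "ok_sites_overblock_rate": ok_blocked / ok_total,
        # Share of all blocked sites that were blocked in error.
        "blocked_sites_overblock_rate": ok_blocked / blocked_total,
        # Share of all inappropriate sites that were not blocked.
        "bad_sites_underblock_rate": bad_allowed / bad_total,
    }


# Invented test set: 10,000 appropriate sites and 100 inappropriate ones.
rates = blocking_rates(ok_blocked=100, ok_allowed=9900,
                       bad_blocked=90, bad_allowed=10)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

With these invented counts, only 1.0% of appropriate sites are blocked, yet 52.6% of all blocked sites are blocked in error, because appropriate sites vastly outnumber inappropriate ones in the test set. Both figures describe the same filter, which is why it matters which “overblocking rate” a study reports.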

With these comments in mind, let's look at a few actual numbers (in Table 1) from two of the better designed recent studies. These two studies are especially pertinent because they review some of the most popular products in use in libraries and schools, and because the researchers classified the sites themselves, using definitions similar to those outlined in CIPA, before testing whether the filters would block them. However, these studies are certainly not the last word, since the Department of Justice study (available at http://www.etestinglabs.com/clients/reports/usdoj/usdoj.pdf) includes a rather small sample of websites and the Kaiser study published in JAMA (available at http://jama.ama-assn.org/) focuses on health information.

Table 1

              Smartfilter  8e6    Websense  CyberPatrol  Symantec  N2H2

% of Non-Pornography URLs blocked (i.e., OK-sites overblock rate)

  DOJ         7.1          N/A    0.0*      6.1          N/A       1.0
  Kaiser**
    Least     2.3          1.1    0.6       1.6          1.9       0.8
    Moderate  5.8          4.5    3.8       2.8          7.6       6.5
    Most      18.2         15.1   35.4      22.4         33.5      19.5

% of Pornography URLs blocked (e.g., 1 - Bad-sites underblock rate)

  DOJ         94.4         N/A    92.4      82.7         N/A       98.0
  Kaiser**
    Least     87.2         89.1   83.9      85.7         87.8      89.5
    Moderate  88.7         90.9   91.3      85.7         89.3      92.8
    Most      89.0         92.1   93.8      87.2         90.5      94.0

Categories blocked under “Least” & presumably sufficient to comply with CIPA

  Product      Blocked                  Allowed
  Smartfilter  Sex and Extreme
  8e6          Pornography
  Websense     Sex (in Adult Material)
  CyberPatrol  Adult/Sexually Explicit  All Exceptions
  Symantec     Sex/acts
  N2H2         Pornography              All Exceptions

*Numbers with the lowest error rates are italicized

**Three different configurations were used in the Kaiser study including: Least restrictive (designed to only block out content forbidden by CIPA - this configuration exactly matches the DOJ configuration), Moderate restrictive (modeled after a large school system), and Most restrictive (which blocks all categories except educationally related ones)

So what do these numbers tell us? They tell us that there is a significant but small difference between products, and a large difference in the amount of overblocking when the configurations differ. When configured at the least restrictive setting (as required by CIPA), only a very small portion of the “appropriate” sites encountered by a library patron would be erroneously blocked. Even on the most restrictive setting, nearly 1 in 10 “inappropriate” sites are not blocked by filters, implying that librarians may need to rely upon other methods, such as education and monitoring, to completely eliminate pornography from the library.
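These observations can be checked directly against the Kaiser rows of Table 1. The short Python sketch below copies those figures and reports the lowest, highest, and average value for each configuration. The unweighted average across products is our own rough summary, not a statistic reported by the study, and the variable and function names are ours.

```python
# Kaiser figures copied from Table 1, one value per product, in the order
# Smartfilter, 8e6, Websense, CyberPatrol, Symantec, N2H2.
kaiser_overblock = {
    "Least": [2.3, 1.1, 0.6, 1.6, 1.9, 0.8],
    "Moderate": [5.8, 4.5, 3.8, 2.8, 7.6, 6.5],
    "Most": [18.2, 15.1, 35.4, 22.4, 33.5, 19.5],
}
kaiser_porn_blocked_most = [89.0, 92.1, 93.8, 87.2, 90.5, 94.0]


def summarize(rates):
    """Return (lowest, highest, unweighted mean) of a list of percentages."""
    return min(rates), max(rates), sum(rates) / len(rates)


for config, rates in kaiser_overblock.items():
    low, high, mean = summarize(rates)
    print(f"{config:<9} overblock: {low:.1f} to {high:.1f}, mean {mean:.1f}")

# Share of pornography URLs NOT blocked on the Most restrictive setting.
_, _, mean_blocked = summarize(kaiser_porn_blocked_most)
print(f"Most      underblock: mean {100 - mean_blocked:.1f}")
```

On the Least restrictive setting the products differ by less than two percentage points (0.6 to 2.3), while the average overblocking rate climbs from about 1.4% on Least to about 24.0% on Most. The average share of pornography URLs that slip through on the Most restrictive setting is about 8.9%, which is the “nearly 1 in 10” figure.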

Finally, one point that is not obvious from these numbers, but which comes out more clearly in the detailed Kaiser report (found at http://www.kff.org/content/2002/20021210a/), is the impact of different topics on overblocking rates in particular. For example, even on the Least restrictive setting, around 9% of websites that came up when searching for “safe sex” or “condom” were erroneously blocked. This percentage increases to over 50% on the Most restrictive setting. These numbers emphasize that overblocking can become a problem when searching for certain topics, even if the overblocking rates for most topics are low. When overblocking becomes a problem, it is helpful to have a workaround as described earlier in this paper.

Conclusions

Filters are a bit like children. They come in all shapes and sizes. They don't always do what they are told, although they generally get it right. They are at their best when they are taught to use all of their capabilities. And at times they require some discipline. In short, they'll never be perfect, but they can be influenced to reach their potential.

