A tiny guide to Web Content Filtering

Software

Although many filtering products exist today (please see the dmoz category), only two of them are open source, to the best of my knowledge.
SquidGuard (www.squidguard.org)
SquidGuard works with a Squid proxy server and restricts web access based on whitelists and blacklists of URLs. It does not check the actual content (i.e. words or phrases) of web pages.
DansGuardian (dansguardian.org)
Working as an HTTP proxy, DansGuardian checks both the URLs and the content of web pages. PICS is also supported.
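
To make the difference between the two approaches concrete, here is a minimal Python sketch of a URL-list check next to a content check. It is only an illustration and not how squidGuard or DansGuardian is actually implemented; the list entries, the phrase, and the function names are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list entries, for illustration only.
BLOCKED_DOMAINS = {"badsite.example"}
BLOCKED_URLS = {"hosting.example/~user/adult"}
BANNED_PHRASES = ("banned phrase",)


def url_is_listed(url):
    """URL-list check: match the host, a parent domain, or a URL prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    labels = host.split(".")
    # Try the host itself and then each parent domain.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in BLOCKED_DOMAINS:
            return True
    target = host + parts.path
    return any(target == entry or target.startswith(entry + "/")
               for entry in BLOCKED_URLS)


def content_is_banned(body):
    """Content check: look for banned phrases in the page text."""
    text = body.lower()
    return any(phrase in text for phrase in BANNED_PHRASES)
```

The first function only ever sees the requested URL, so it cannot react to what a page says; the second needs the page body, which is why a content filter has to sit in the path of the HTTP response.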

Blacklist

These are freely available blacklists for squidGuard and DansGuardian.
The squidGuard Blacklist
This list is generated by a dumb robot (squidGuardRobot) that crawls the Web to find web pages containing hazardous expressions.
dmozlists
With the script and data provided on this web page, you can use the URLs listed in any category of the Open Directory Project as blacklists or whitelists.

dmozlists

Warning: I no longer maintain the script and data. Jean-Francois Levesque updated the script and data. Please visit his web site at: http://isak.gplindustries.com/wiki/Dmozlists

Use of the Open Directory data is subject to the terms of the Open Directory License, which can be read at http://dmoz.org/license.html

Help build the largest human-edited directory on the web.
Submit a Site - Open Directory Project - Become an Editor
The lists of URLs distributed here have been modified for filtering purposes.

The following are the contents of the archive dmozlists-current.tar.gz (as of 2002/Apr/05).

    dmozlists/adult/domains         31995 lines
    dmozlists/adult/urls            61394 lines
    dmozlists/kids_and_teens/domains 5783 lines
    dmozlists/kids_and_teens/urls   10003 lines
    dmozlists/society/domains       84717 lines
    dmozlists/society/urls          77737 lines
    dmozlists/regional/domains      486905 lines
    dmozlists/regional/urls         185349 lines
    dmozlists/computers/domains     68493 lines
    dmozlists/computers/urls        36181 lines
    dmozlists/health/domains        30955 lines
    dmozlists/health/urls           17825 lines
    dmozlists/world/domains         337702 lines
    dmozlists/world/urls            169126 lines
    dmozlists/shopping/domains      97808 lines
    dmozlists/shopping/urls         11823 lines
    dmozlists/home/domains          9327 lines
    dmozlists/home/urls             13268 lines
    dmozlists/science/domains       28610 lines
    dmozlists/science/urls          38550 lines
    dmozlists/arts/domains          92133 lines
    dmozlists/arts/urls             125871 lines
    dmozlists/sports/domains        41171 lines
    dmozlists/sports/urls           36711 lines
    dmozlists/business/domains      182118 lines
    dmozlists/business/urls         20130 lines
    dmozlists/recreation/domains    56816 lines
    dmozlists/recreation/urls       39819 lines
    dmozlists/games/domains         15583 lines
    dmozlists/games/urls            23576 lines
    dmozlists/news/domains          4622 lines
    dmozlists/news/urls             31565 lines
    dmozlists/reference/domains     20358 lines
    dmozlists/reference/urls        22995 lines
    dmozlists/ages/mteen/domains    3060 lines
    dmozlists/ages/mteen/urls       5163 lines
    dmozlists/ages/kids/domains     3306 lines
    dmozlists/ages/kids/urls        6459 lines
    dmozlists/ages/teen/domains     4038 lines
    dmozlists/ages/teen/urls        7297 lines
These directories correspond to the top-level categories of dmoz.org, except for dmozlists/ages/{kids,teen,mteen}. The contents of the ages directories duplicate those of kids_and_teens, but are separated according to the <ages> tag. You can easily customize the script to use more specific categories, such as /Business/Investing/.
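
The directory layout described above can be sketched roughly as follows. This is not the distributed script: it assumes that (topic, URL, ages) records have already been extracted from the RDF dump, that the ages value is a slash-separated string such as "kids/teen", and that a URL with no path goes into `domains` while anything else goes into `urls`. All function names and sample records are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def list_dir(topic, depth=1):
    """Map 'Top/Kids_and_Teens/...' to 'dmozlists/kids_and_teens'."""
    parts = topic.split("/")
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return "dmozlists/" + "/".join(p.lower() for p in parts[:depth])


def split_entry(url):
    """Return ('domains', host) for a whole site, else ('urls', host/path)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    if path:
        return "urls", host + path
    return "domains", host


def build_lists(records, depth=1, ages_tags=("kids", "teen", "mteen")):
    """Group (topic, url, ages) records into {list file path: set of entries}."""
    lists = defaultdict(set)
    for topic, url, ages in records:
        kind, entry = split_entry(url)
        lists[f"{list_dir(topic, depth)}/{kind}"].add(entry)
        # Entries carrying an ages value are also filed under each age group.
        for tag in (ages or "").split("/"):
            if tag in ages_tags:
                lists[f"dmozlists/ages/{tag}/{kind}"].add(entry)
    return dict(lists)


# Hypothetical sample records.
records = [
    ("Top/Adult/Example", "http://adult.example/", None),
    ("Top/Kids_and_Teens/School_Time", "http://kids.example/homework/", "kids/teen"),
]
lists = build_lists(records)
```

Calling `list_dir("Top/Business/Investing/Stocks", depth=2)` yields `dmozlists/business/investing`, which is one way a more specific category could get its own list.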

You will probably want to use dmozlists/adult/{domains,urls} as blacklists and dmozlists/kids_and_teens/{domains,urls} as whitelists. This usually works, but you should check the lists before using them. For example, dmozlists/kids_and_teens/ contains several web sites about sex, such as safersex.org. You may want to use only dmozlists/ages/kids as a whitelist.
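
One way to check a whitelist before using it is to flag entries that also appear in the blacklist or that contain a suspicious keyword, and then review those by hand. The sketch below assumes the list files hold one entry per line; the function names and the keyword list are hypothetical and would need tuning.

```python
def load_list(path):
    """Read a list file with one entry per line into a set."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return {line.strip() for line in f if line.strip()}


def review_whitelist(whitelist, blacklist, keywords=("sex", "adult")):
    """Return whitelist entries that are blacklisted or contain a keyword."""
    flagged = {}
    for entry in whitelist:
        reasons = []
        if entry in blacklist:
            reasons.append("also in blacklist")
        reasons += [f"contains '{kw}'" for kw in keywords if kw in entry.lower()]
        if reasons:
            flagged[entry] = reasons
    return flagged


# Small in-memory example; real use would call load_list() on the files,
# e.g. load_list("dmozlists/kids_and_teens/domains").
flagged = review_whitelist({"safersex.org", "kids.example"}, {"adult.example"})
```

Substring matching produces false positives (a place name containing "sex", for instance), so the result is a list of candidates for manual review, not a verdict.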

The script generates rules from the RDF dump as follows:

Download the script and data
This is free software. Licence: GPL2
Copyright (c) 2002 Masanori Harada. All rights reserved.
I no longer update the script and data. Please visit Jean-Francois Levesque's site for newer versions.
Open Directory RDF Dump
the location of the latest RDF dump


Last-Update: 2002/Apr/05
harada@ingrid.org