Information filtering in the Internet
 


Written: 17/7/97
Updated: 27/7/97
Version: 1.3
Author: Boaz Chen

 
Background

Since its inception, the Internet has been global and "anarchic", promoting a wide range of activities and freedom of expression and information. As it has expanded and millions of new subscribers have joined, however, the problems associated with this openness have surfaced: sites, discussion groups and IRC channels that carry hard-core pornography or incitement to violence, racism, terror and the like. Attempts to pass laws against criminal use of the Internet (based on existing laws for other media) have met with strong opposition as well as problems of enforcement.
The issue has become a topic of public discussion because growing numbers of children and teenagers use the Internet, and there is a wish to protect them from information of this nature.

The Problem

Attempts to bridge the gap between freedom of expression and the preservation of the values we cherish have led to a search for solutions that enable the filtering of information.

The requirements for filtering information vary, for example:
  • Protecting children and youth from pornography, violence, racism
  • Prohibiting entry to amusement sites from the workplace
  • Prohibiting entry to types of services or sites during peak hours
and so on.

Mechanized Filtering

The original solutions were based on analyzing the information at the moment the user tries to enter a site. These are programs that check the information before it is displayed to the user, to ensure that it meets certain prescribed standards. For example: programs that scan the text for specific words, or programs that measure the amount of uncovered flesh in a picture before displaying it.
Such solutions have a number of difficulties:

  • The solutions are specific to particular criteria and do not cope with information transferred or presented in a different format. For example, undesirable text that is displayed as a picture, embedded in a program, or written in a different language will not be filtered effectively.
  • Mechanical filtering falls short because it is difficult to define the boundaries between categories properly and to take the overall context into account. Here are a number of examples:
      • Programs that prevent entry to a site containing the word SEX will also block academic information on the subject, and even sites that offer sexual advice for teenagers. On the other hand, sites that contain only pornographic pictures will not be filtered out.
      • Attempts to prohibit entry to racist sites will often block anti-racist sites as well.
      • Programs that filter pictures will also filter out artistic images as well as medical pictures, instructional pictures, etc.
  • Checking information at the time of loading slows down the data acquisition process.
  • The inherent difficulty of defining filtering thresholds, and the great difficulty of translating them into programs, are compounded by the differences in approach and cultural background among the peoples of the virtual world.
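
The over-blocking and under-blocking described above can be illustrated with a minimal sketch of a keyword filter. The blocklist, the function name and the sample pages are hypothetical and are not taken from any real filtering product.

```python
import re

# Hypothetical blocklist; a real product would ship a much longer one.
BLOCKED_WORDS = {"sex"}

def keyword_filter_blocks(page_text: str) -> bool:
    """Return True if the page contains any blocked word (case-insensitive)."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not BLOCKED_WORDS.isdisjoint(words)

# Hypothetical sample pages.
academic_page = "Lecture notes: sex determination in reptiles"
advice_page = "Answers to teenagers' questions about sex and health"
picture_only_page = ""  # a page made only of images exposes no text to scan

results = {
    "academic": keyword_filter_blocks(academic_page),
    "advice": keyword_filter_blocks(advice_page),
    "pictures": keyword_filter_blocks(picture_only_page),
}
```

In this sketch the academic and advice pages are blocked while the picture-only page passes, which is the opposite of what the person configuring the filter presumably wants.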

The Classification and Labeling Solution


An innovative approach to this problem, which we will discuss in detail in this survey, is based on human classification and evaluation of information.
To ease the filtering process, the information is checked, graded and labeled according to predefined criteria, in a format that allows programs to use the labels automatically.
This is similar to the classification methods already in use in the fields of movies and videocassettes, newspapers or books, except that the programs that will base their filtering criteria on labeling will enable monitoring and enforcement while making the filtering suited to the user.
In general terms, there are three elements involved in the process: the distributor of the information, the evaluator responsible for classification based on standards and the user whose filtering program checks the information's labeling and decides whether or not to allow its retrieval.
This is shown in the diagram:

[Diagram: the distributor of the information, the evaluator, and the user's filtering program]

This approach solves many of the problems associated with automatic analytical tools, but it creates new dilemmas and even raises the suspicion that it amounts to censorship. The major questions it raises are:
  • Who is responsible for the classifications?
  • According to what categories will the information be classified?
  • What will be the criteria used for grading within each level?
  • Who will set the filtering parameters?
These questions and others led the World Wide Web Consortium [1] to set a new standard called PICS [2] (Platform for Internet Content Selection).

The standard defines two things:
  • A standard label structure - the technical structure that defines the format of the label.
  • A standard structure for the labeling server - a definition of how the labeling service provider describes the categories on which its labels are based, the levels within each category, and the interface for presenting the grading criteria.

These definitions are purely technical, and as such they separate the content of a classification from its technical implementation (similar to what the HTML standard has done for WWW pages).
This separation enables different authorities to classify information without depending on particular filtering programs, and vice versa: software developers can build filtering mechanisms without committing to a specific classification authority or standard.
In this way, the planners of the standard hope to preserve the democratic atmosphere of the Internet, so that every user can choose the software, the classification authorities, the categories, and the level within each category that he wishes to use.

For example:

A user will be able to connect to an authority specializing in the fight against racism and choose which level in each category constitutes the limit, using the criteria published by that authority. Such a choice might look like the following:

Category          | Level | Criteria
Type of publisher | 3     | Institutions and organizations, no private information
History           | 5     | Show all historical facts
Language          | 2     | Allow derogatory nicknames and jokes, but nothing beyond that

The information labeled by this authority will be displayed to the user only if it meets the standards (it is possible to define how to deal with totally unlabeled information).
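
A minimal sketch of this decision follows. It assumes that a higher level means more permissive content and that a page is shown only when every rating is at or below the user's limit. The short category keys and the function name are hypothetical; the limits are the ones from the example above.

```python
from typing import Dict, Optional

# Maximum level the user accepts in each category (values from the example
# above; the short category keys are hypothetical).
USER_LIMITS = {"publisher": 3, "history": 5, "language": 2}

def is_allowed(ratings: Optional[Dict[str, int]],
               limits: Dict[str, int],
               allow_unlabeled: bool = False) -> bool:
    """Decide whether a page may be displayed, given its label ratings."""
    if ratings is None:
        # The page carries no label from this authority at all.
        return allow_unlabeled
    for category, limit in limits.items():
        if category not in ratings:
            # Assumption: a missing category is treated like a missing label.
            return allow_unlabeled
        if ratings[category] > limit:
            return False
    return True
```

For instance, `is_allowed({"publisher": 2, "history": 4, "language": 3}, USER_LIMITS)` is rejected because the language rating exceeds the limit of 2, while an unlabeled page (`None`) is rejected unless `allow_unlabeled=True` is passed.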

Labels

The label is formatted information, that is to say, information that a computer can read and that a filtering program can therefore use. The standard does not define the types of information in the label, and so allows labeling according to different categories. For an example of the technical form of a label, see Appendix A.

Labels may be used in two main ways:
Embedded in the information page - the label may be included as part of the page header. This applies to publishers who wish to cooperate with the classification authority (a good reason for doing so is that most filtering programs can block access to all information that has no classification).

Separated - It is also possible to keep groups of labels, which relate to information all across the Internet, on servers separate from the information. Technically, when the information is accessed, the filtering program also queries the server holding the labels and looks for the appropriate label. In this way, information whose distributors do not want to cooperate can be classified as well.

The advantage of the first method is that classification can be carried out on the publisher's own initiative, or even independently, and that moving the information to a new location does not affect it. The authenticity of the label can be confirmed by means of a digital signature that its creator places on the label.

The second method has two advantages:

Independence between the publisher of the information and its various evaluators. The publisher does not have to do anything for the information to be labeled (or even know that it has been labeled).

No growth in the size of the files containing the information. This is especially important when there are many evaluators: rather than attaching a large number of labels to the information, each user fetches the labels he requires from the appropriate classification server. The procedure can be made more manageable by collecting these labels and copying them to a local server (caching). This has the added advantage of allowing storage of local labels that reflect the language and norms of the user's location.
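
The separate-server method with local caching can be sketched as follows. This is only an illustration: the class name is hypothetical, and a plain dictionary stands in for the remote label server so that the example runs without a network connection.

```python
from typing import Callable, Dict, Optional

Ratings = Dict[str, int]

class CachingLabelClient:
    """Looks up labels on a (simulated) label server and caches the answers."""

    def __init__(self, fetch_remote: Callable[[str], Optional[Ratings]]):
        self._fetch_remote = fetch_remote
        self._cache: Dict[str, Optional[Ratings]] = {}
        self.remote_calls = 0  # counts how often the server was contacted

    def ratings_for(self, url: str) -> Optional[Ratings]:
        if url not in self._cache:
            self.remote_calls += 1
            # "No label" answers (None) are cached too.
            self._cache[url] = self._fetch_remote(url)
        return self._cache[url]

# Stand-in for a remote label server: URL -> ratings.
REMOTE_LABELS = {"http://www.company.com/filmname.htm": {"r": 2}}

client = CachingLabelClient(REMOTE_LABELS.get)
first = client.ratings_for("http://www.company.com/filmname.htm")
second = client.ratings_for("http://www.company.com/filmname.htm")  # from cache
missing = client.ratings_for("http://www.example.org/unlabeled.htm")
```

Only the first request for each URL reaches the server; repeated requests are answered locally. A real cache would also have to honor the validity dates carried in each label, which this sketch ignores.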

Every label relates to a URL, so labels can refer to sites and HTML pages, FTP sites and specific files, chat servers, their rooms and the notices posted in them, and the like.

Classification

Classification can be done by the information's distributor (by independently adding a label to the document) or by an external authority.

In addition to defining the label format, PICS also defines how data is transferred between classification servers and the programs that use them. The server holds a vocabulary that defines the categories used to label sites (every "word" constitutes one evaluation dimension for the site being labeled). For a technical definition of the classification element, see Appendix A.

In this manner, every program can work with every server, once the categories of that server's labeling scheme have been defined in it. For example: the Recreational Software Advisory Council, which classifies computer games, decided to classify them according to five levels in each of the following dimensions: violence, nudity, sex and foul language.

The classifiers are generally experts in a particular field (software, games, databases, etc.) or in specific types of danger (racism, violence, fraud, etc.). It is reasonable to expect that authorities already performing similar classification in other environments will provide some of the classification services.

Two types of classifier can be distinguished:
Professionals - companies or organizations that employ subject specialists, e.g. SafeSurf.
Democrats - authorities that classify based on evaluations from other users. This method provides an evaluation of information based on what like-minded people have said about it (like-mindedness being inferred from agreement on earlier evaluations).
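
One possible reading of the "democratic" method is a similarity-weighted average: other users' ratings of a page count for more when those users agreed with us on pages we have both rated. The sketch below is an assumption about how such an authority might work, not a description of any existing service; the similarity formula and all names are hypothetical.

```python
from typing import Dict, List, Optional

Ratings = Dict[str, float]  # URL -> rating given by one user

def similarity(a: Ratings, b: Ratings) -> float:
    """1.0 for identical ratings on shared URLs, approaching 0 as they diverge."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_diff = sum(abs(a[url] - b[url]) for url in shared) / len(shared)
    return 1.0 / (1.0 + mean_diff)

def predicted_rating(user: Ratings, others: List[Ratings], url: str) -> Optional[float]:
    """Average the other users' ratings of url, weighted by similarity to user."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other in others:
        if url in other:
            weight = similarity(user, other)
            weighted_sum += weight * other[url]
            total_weight += weight
    return weighted_sum / total_weight if total_weight else None

# Hypothetical data: "alice" agrees with "me" on pages a and b, "bob" does not.
me = {"a": 1, "b": 4}
alice = {"a": 1, "b": 4, "c": 0}
bob = {"a": 4, "b": 1, "c": 4}
prediction = predicted_rating(me, [alice, bob], "c")
```

Here the prediction for page `c` lies much closer to Alice's rating than to Bob's, because Alice's earlier evaluations match ours.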

Filtering

The filtering programs are those which, upon requesting information, analyze the related labels and prevent the viewing of information which has not been authorized (according to the user's preferences).

There are three main methods of filtering:

Filtering by Browser
Microsoft Internet Explorer version 3.02 already supports filtering using PICS, and it is anticipated that Netscape Communicator will also support it in its forthcoming version.

In one of the option screens, the user can set, for the categories of each rating authority, the threshold level beyond which the browser will not display information. This solution is free of charge and requires no special installation. It is limited, however:

1. Only information retrieved by the browser is filtered; there is no protection against information that arrives via IRC, FTP, e-mail or discussion groups.

2. The restrictions can be overridden by installing a different browser.

Designated Filtering Programs
Comprehensive software packages that monitor the flow of information into and out of a computer. These programs control other types of information as well, by filtering the computer's communication channels (according to the user's preferences).

The programs also support individual definitions of access filtering for the different sources of information, according to different user profiles.

These packages generally include additional safeguards. For example, CyberPatrol [3] (today's market leader) can block the transmission of personal particulars (name, address, credit card number, etc.), even in the course of an IRC conversation; define the permitted hours of activity per user; lock local applications (accounting programs, for example); and protect against attempts to disable or erase the program.

Software programs of this nature are the best solution for the home and private businesses.

Filtering by the Server
Software programs similar to those mentioned above can be purchased as server versions. These programs connect to the proxy server and enable filtering for everyone using the network. This capability is very important for large organizations who wish to prevent improper use of the network during working hours, or to enable such use only during restricted hours (breaks, night shifts, etc.).

The advantage of such an installation is the ease of maintenance since it eliminates the need to install and update versions in end-user stations.

Large organizations can also set up their own evaluation authorities to sort and label information, as the general authorities do, using categories and criteria suited to the organization. This makes it possible to compartmentalize information even within the organization, using standard, universal tools. Another advantage of filtering mechanisms of this type is that they add a further layer of defense against viruses and other harmful programs found on the Internet.

It should be remembered that adding such a program is likely to slow the connection to the network somewhat, because it becomes a bottleneck for the flow of information.
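
The restricted-hours policy mentioned above (use permitted only during breaks or night shifts) can be sketched as a simple time-window check on the server. The policy table, category name and function names are hypothetical.

```python
from datetime import time

# Hypothetical policy: category -> time windows during which access is allowed.
# Categories that do not appear here are unrestricted.
ALLOWED_WINDOWS = {
    "amusement": [(time(12, 0), time(13, 0)),   # lunch break
                  (time(22, 0), time(6, 0))],   # night shift, wraps past midnight
}

def in_window(now: time, start: time, end: time) -> bool:
    """True if start <= now < end, also for windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end

def access_allowed(category: str, now: time) -> bool:
    windows = ALLOWED_WINDOWS.get(category)
    if windows is None:
        return True
    return any(in_window(now, start, end) for start, end in windows)
```

With this policy, `access_allowed("amusement", time(10, 0))` is refused during working hours, while the same request at 12:30 or 23:30 is allowed.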

Additional Uses

The labeling mechanism was designed primarily to prevent access to undesirable information, but in the near future the use of labels (in the PICS standard) is expected to extend to other purposes in sorting and organizing information on the Internet.

This expectation arises from the fact that as the amount of information and the number of Internet users grow, the problem of searching the Internet becomes one of locating relevant, high-quality information.

Examples of uses on the horizon:

Qualitative assessments of sites and information on different subjects - similar to what a number of catalogs offer at the moment. In future it will be possible to obtain information about the quality of a site and the kind of information it contains. This will be done by organizations that create labels carrying this information, independently of the sites that publish it.

Adding the information contained in the labels will make search engines more capable. Searches will become more efficient by combining automated location of information with human evaluation.

The use of labels for classifying users in discussion groups, or the messages they post. In addition to the filtering mentioned above, it will be possible to evaluate message writers based on their messages. Categories such as level of professional interest, style, succinctness and more will let the user filter messages by defining thresholds.

And so on.

The Fly in the Ointment

The labeling process and the filtering abilities described above pose a degree of danger to two major principles of the Internet:

Censorship - in spite of the standardization of labels and the intent to establish a free market of different evaluators and classifiers, there are those who claim that a small number of large organizations will control this field and establish a kind of elite that controls the accessibility of information on a global basis.

Accessibility - the resources required to label information by hand, and to keep the labels up to date, mean that not every site can be labeled. The option of blocking all unlabeled information is expected to create a situation in which access to minority group sites is barred. The same goes for non-institutional sites, whether or not this was the original intention. This will affect the special character of the Internet, in which every site is accessible worldwide.

Summary - Filtering as a Service to Government Offices

The major advantages of the Internet arise from the great accessibility of its information to users worldwide. When evaluating this working tool, the advantage also turns out to be a disadvantage, inasmuch as most institutions, as currently organized, are unable to monitor the use of this tool or to enforce regulations concerning its use.

Like all organizations, the various Government offices can use the filtering possibilities described here to solve this problem, should it exist.

The Government Internet Committee decided neither to issue policy guidelines nor to enforce use of this nature, because the decision to use this tool and the method of filtering are issues that each office should determine for itself.

In the event that a Government office does decide to use filtering, we recommend that it do so by purchasing specialized filtering software that supports PICS and installing it on the office server. This will enable central regulation and bring about the greatest degree of efficiency.

Resources

We recommend the sites mentioned in the footnotes and in addition:
Article in Scientific American by Prof. Paul Resnick, Chairman, PICS
      www.sciam.com/0397issue/0397resnick.html
Article in HotWired, by Simson Garfinkel
      www.hotwired.com/packet/garfinkel/nc_today.html
PICS pages on the W3C site
      www.w3.org/PICS
      www.w3.org/pub/WWW/PICS

Appendix A. - Technical Definitions

The technical definitions of the PICS standard are text-based, with the intention of making them easy to construct and update.
As we have already mentioned, the standard defines two elements:
  • The classifying authority's definition of the vocabulary, or categories, on which its labels are based.
  • The label itself.
For demonstration purposes, we will write the definitions and a label based on the American standard for movie classification (MPAA).

Classification Factor

The page at the URL that defines the classification should include the following text:

The Text | The Meaning
((PICS-version 1.1) |
(rating-system "http://MPAAscale.org/Ratings/Description/") | Site address of access
(rating-service "http://MPAAscale.org/v1.0") | Organization site address
(icon "icons/MPAAscale.gif") | Organization symbol
(name "The MPAA's Movie-rating Service") | Organization name
(description "A rating service based on the MPAA's movie-rating scale") | Organization description
(category |
(transmit-as "r") | Definition of the category (word)
(name "Rating") |
(label (name "G") (value 0) (icon "icons/G.gif")) | For each (numerical) level of the category, a name (or a literal description of the criteria) and an icon that represents it graphically in filtering programs
(label (name "PG") (value 1) (icon "icons/PG.gif")) |
(label (name "PG-13") (value 2) (icon "icons/PG-13.gif")) |
(label (name "R") (value 3) (icon "icons/R.gif")) |
(label (name "NC-17") (value 4) (icon "icons/NC-17.gif")))) |

The Labels

The following text can be included in the headers of a page of information or separately.

The Text | The Meaning
(PICS-1.1 |
"http://MPAAscale.org/v1.0" labels | Type of label (the rating organization that defines it)
on "1996.11.05T08:15-0500" | Date it became valid
until "1997.12.31T23:59-0000" | Date it expires
for "http://www.company.com/filmname.htm" | Information referred to (enables separate placement)
by "John Doe" | Evaluator
ratings (r 2)) | Rating in each category (one in this case)
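
To show how a filtering program might read such a label, here is a minimal sketch of a parser. It handles only a single label with a flat list of options, as in the example above, and it assumes that the issue date is introduced by the option name `on`, as in the PICS label syntax. The function names are hypothetical, and this is not a complete implementation of the PICS label grammar.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare words.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def parse_sexpr(text):
    """Parse one parenthesized expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if tok.startswith('"'):
            return tok[1:-1]  # strip the quotes
        return tok

    return read()

def parse_label(text):
    """Turn a simple PICS-1.1 label into a dictionary."""
    items = parse_sexpr(text)
    label = {"version": items[0], "service": items[1]}
    rest = items[3:]  # items[2] is the keyword "labels"
    for key, value in zip(rest[0::2], rest[1::2]):
        if key == "ratings":
            # The ratings list alternates category name and numeric value.
            label["ratings"] = {c: float(v) for c, v in zip(value[0::2], value[1::2])}
        else:
            label[key] = value
    return label

LABEL = '''(PICS-1.1 "http://MPAAscale.org/v1.0" labels
 on "1996.11.05T08:15-0500"
 until "1997.12.31T23:59-0000"
 for "http://www.company.com/filmname.htm"
 by "John Doe"
 ratings (r 2))'''

label = parse_label(LABEL)
```

The resulting dictionary gives the filtering program the rated URL (`label["for"]`), the rating service, and the rating per category, which it can then compare with the user's thresholds.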

1. The World Wide Web Consortium (W3C for short) is a joint venture of industry and universities in the USA, Europe and Asia for the development of protocols.
2. For details see www.w3.org/pics or www.w3.org/pub/www/pics
3. The program includes a list of recommended and forbidden sites, and thereby acts as a labeling authority as well - www.cyberpatrol.com
© Copyright 2002 The State of Israel. All Rights Reserved.