Indexing Systems and Techniques

1) Discuss the processes of subject indexing.
The processes of subject indexing consist of two stages: (1) establishing the concepts expressed in a document; and (2) translating these concepts into the components of the given indexing language. Establishing the concepts expressed in a document calls for understanding the overall content of the document by examining the important parts of the text, such as the title, abstract, introduction, the opening phrases of chapters and paragraphs, illustrations, tables, diagrams and conclusions. After the document has been examined, the indexable concepts are identified and then selected in the light of the purpose for which the indexing data will be used. Finally, the selected concepts are translated into the components of the given indexing language.

2) What do you mean by ‘Exhaustivity’ and ‘Specificity’ in indexing?
Exhaustivity is a measure of the extent to which all the distinct topics discussed in a document are considered for indexing. In other words, it is reflected in the number of index terms assigned per document in an indexing system. A high level of exhaustivity increases recall. Exhaustivity is a matter of indexing policy. Specificity, on the other hand, is the degree of precision with which the index terms express the thought content of the document. Specificity is an intrinsic quality of the indexing language itself: its capacity to represent the specific subject exactly and co-extensively. A high level of specificity increases precision.
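The effect on recall and precision can be illustrated with a small calculation. This is a minimal sketch, assuming the usual definitions (recall = relevant items retrieved / all relevant items; precision = relevant items retrieved / all items retrieved). The document identifiers and result sets are hypothetical.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search, given sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical collection in which d1..d4 are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# Exhaustive indexing: many terms per document, so many documents match.
exhaustive_result = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# Specific indexing: precise terms, so fewer documents match.
specific_result = {"d1", "d2"}

print(recall_and_precision(exhaustive_result, relevant))  # (1.0, 0.5)
print(recall_and_precision(specific_result, relevant))    # (0.5, 1.0)
```

In this made-up case the exhaustive result finds every relevant document but half of what it returns is unwanted, while the specific result returns only relevant documents but misses half of them.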

3) “Kaiser started from the point where Cutter left”—Discuss.
C. A. Cutter was the first to give a generalized set of rules for subject indexing, in his Rules for Dictionary Catalog, published in 1876. Cutter provided rules for specific as well as compound subject headings. He regarded subjects as specific and classes as broad, but in practice it was under these broad classes that he entered his specific subjects. Further, according to Cutter, the order of the component terms in a compound subject heading should be determined by which term is decidedly more significant. But Cutter could not provide any guideline on how to decide which term is more significant; the question of significance varies from user to user and from indexer to indexer. J. O. Kaiser started from this point, where Cutter had failed to provide a guideline on the question of significance. Kaiser, in his Systematic Indexing, published in 1911, prescribed that a compound subject should be analysed by determining the relative significance of its component terms through a classificatory approach, the categorization of terms. He categorized the component terms into two fundamental categories: concrete and process. Kaiser held that concrete is more significant than process, and so he laid down the rule that a process should follow the concrete.
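Kaiser's rule can be sketched as a simple ordering step. This is only an illustration: the category of each term is assumed to have been assigned by the indexer, and the " - " separator, the function name and the sample terms are hypothetical.

```python
# Concrete terms rank ahead of process terms, following Kaiser's rule.
CATEGORY_RANK = {"concrete": 0, "process": 1}


def kaiser_heading(terms):
    """Build a heading from (term, category) pairs, concrete before process."""
    # sorted() is stable, so terms of the same category keep their given order.
    ordered = sorted(terms, key=lambda pair: CATEGORY_RANK[pair[1]])
    return " - ".join(term for term, _ in ordered)


print(kaiser_heading([("Heat treatment", "process"), ("Steel", "concrete")]))
# Steel - Heat treatment
```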

4) Mention the different types of relationships and their respective relational operators as proposed by J. E. L. Farradane in his Relational Indexing.
J. E. L. Farradane identified nine relationships in his Relational Indexing. These nine relationships and their respective relational operators are:
a) Concurrence / 0
b) Self-activity / *
c) Association / ;
d) Equivalence / =
e) Dimensional / +
f) Appurtenance / (
g) Distinctness / )
h) Reaction / -
i) Causation / :
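The list above can be held as a lookup table and used to link two isolates into an analet. This is a rough sketch: the symbols are copied from the list, while the "/symbol" rendering between the isolates, the function name and the sample isolates are assumptions made for illustration.

```python
# Relationship names and operator symbols as given in the list above.
OPERATORS = {
    "Concurrence": "0",
    "Self-activity": "*",
    "Association": ";",
    "Equivalence": "=",
    "Dimensional": "+",
    "Appurtenance": "(",
    "Distinctness": ")",
    "Reaction": "-",
    "Causation": ":",
}


def analet(first, relationship, second):
    """Link two isolates with the operator for the named relationship."""
    return f"{first} /{OPERATORS[relationship]} {second}"


print(analet("Heat", "Causation", "Expansion"))  # Heat /: Expansion
```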

5) How are the syntactical and semantic relationships dealt with in PRECIS?
Syntactical relationships in PRECIS are handled by means of a set of logical rules, role operators and codes. They regulate the organisation of terms in the input string by the indexer and their manipulation by the computer to generate index entries. Role operators act as instructions to the computer. Semantic relationships in PRECIS are regulated by a machine-held thesaurus that serves as a source of See and See also references in the index. The thesaurus is generated simultaneously with the preparation of input strings.

6) Discuss the basic assumptions that led to the development of POPSI-Specific.
The basic assumption leading to the development of POPSI-Specific is that subject indexing is always a specific purpose-oriented activity. Traditionally, however, indexers have depended on the designer of a subject indexing language, and such dependency has always been found inadequate to meet the specific requirements of subject indexing at the local level. Differences in requirements call for differences in the syntax of the subject proposition. It has been stated that flexibility, not rigidity, should be the rule of syntax. Based on this assumption, POPSI tries to find out what is logically basic, known as POPSI-Basic, and is readily amenable to systematic manipulation to generate purpose-oriented specific versions, known as POPSI-Specific.

7) Distinguish between pre-coordinate and post-coordinate indexing systems.
Pre-coordinate indexing involves coordination of the component terms in a compound subject by the indexer at the time of indexing, in anticipation of the users' approach. In post-coordinate indexing, the component terms in a compound subject are kept separate and uncoordinated by the indexer, and the user coordinates the terms in accordance with his requirements at the time of searching. In a pre-coordinate indexing system, the rigidity of the significance order associated with the syntactical rules may not meet the approaches of all the users of the index file. In post-coordinate indexing, however, the searcher has wide options for the free manipulation of the terms at the time of searching in order to achieve whatever logical operations are required.
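Coordination at search time can be sketched with sets. The example below assumes a small post-coordinate file in which each term carries the numbers of the items indexed under it; the terms, item numbers and function names are hypothetical.

```python
# Hypothetical post-coordinate file: each term lists the items posted on it.
postings = {
    "libraries": {1, 2, 4},
    "automation": {2, 3, 4},
    "india": {4, 5},
}


def search_and(*terms):
    """Items indexed under every one of the given terms (logical AND)."""
    if not terms:
        return set()
    return set.intersection(*(postings.get(term, set()) for term in terms))


def search_or(*terms):
    """Items indexed under at least one of the given terms (logical OR)."""
    return set().union(*(postings.get(term, set()) for term in terms))


print(search_and("libraries", "automation"))           # {2, 4}
print(search_and("libraries", "automation", "india"))  # {4}
print(search_or("automation", "india"))                # {2, 3, 4, 5}
```

The indexer only records the separate terms; which terms are combined, and with which logical operation, is decided by the searcher at the time of the query.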

8) What are the devices used to eliminate false drops in post-coordinate indexing?
The following devices are used to eliminate false drops in post-coordinate indexing:
a) Use of bound terms;
b) Links;
c) Roles; and
d) Weighting.
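How a false drop arises, and how links suppress it, can be shown with a small sketch. It assumes each document's terms are stored in groups, one group per link; the documents, terms and function names are hypothetical.

```python
# Each document lists its index terms in groups; one group stands for one link,
# that is, one set of concepts that actually belong together in the document.
linked_index = {
    "doc1": [{"teaching", "mathematics"}, {"research", "physics"}],
    "doc2": [{"research", "mathematics"}],
}


def search_unlinked(query):
    """Ignore links: match when the document carries all query terms anywhere."""
    query = set(query)
    return sorted(
        doc for doc, groups in linked_index.items()
        if query <= set().union(*groups)
    )


def search_linked(query):
    """Use links: all query terms must occur within a single linked group."""
    query = set(query)
    return sorted(
        doc for doc, groups in linked_index.items()
        if any(query <= group for group in groups)
    )


print(search_unlinked({"research", "mathematics"}))  # ['doc1', 'doc2']
print(search_linked({"research", "mathematics"}))    # ['doc2']
```

Without links, doc1 is retrieved although it does not deal with research in mathematics; that is the false drop. With links, only doc2 is retrieved.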

9) What are the different versions of keyword indexing?
Different versions of keyword indexing are:
a) KWIC (Keyword-In-Context) Index;
b) KWOC (Keyword Out-of-Context) Index;
c) KWAC (Keyword Augmented-in-Context) Index;
d) KWWC (Keyword-with-Context) Index;
e) KEYTALPHA (Key Term Alphabetical) Index;
f) WADEX (Word and Author Index);
g) DKWTC (Double KWIC) Index;
h) KLIC (Key-Letter-In-Context) Index;
i) KWIT (Keyword-In-Title) Index; and
j) SWIFT (Selected Words In Full Titles) Index.
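As an illustration of the first version in the list, a KWIC-style index can be sketched by rotating a title around each significant word. This is a simplified sketch: the stop-word list, the "/" end-of-title marker, the layout and the sample title are assumptions, and a printed KWIC index would normally align the keywords in a central column.

```python
# A small, hypothetical stop list of words that are not treated as keywords.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic(title, doc_id):
    """Return sorted (keyword, rotated title, document id) entries for a title."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word.lower() in STOP_WORDS:
            continue
        # Rotate the title so the keyword leads; "/" marks the end of the title.
        rotated = words[i:] + ["/"] + words[:i]
        entries.append((word.lower(), " ".join(rotated), doc_id))
    return sorted(entries)


for keyword, context, doc_id in kwic("Indexing of Chemical Literature", "D1"):
    print(f"{context}    {doc_id}")
# Chemical Literature / Indexing of    D1
# Indexing of Chemical Literature /    D1
# Literature / Indexing of Chemical    D1
```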

10) Mention the methods adopted in measuring word significance in a computerized indexing system.
Different methods adopted in measuring word significance in computerized indexing are:
a) Weighting by location;
b) Relative frequency weighting;
c) Use of noun phrase;
d) Use of thesaurus;
e) Use of association factor; and
f) Maximum-depth indexing.
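Relative frequency weighting, from the list above, can be sketched as follows. One common reading is assumed here: a word is significant when it occurs proportionally more often in the document than in the collection as a whole. The function name, the ratio used as the weight and the sample tokens are illustrative assumptions, not a prescribed formula.

```python
from collections import Counter


def relative_frequency_weights(doc_tokens, collection_tokens):
    """Weight each word by its rate in the document divided by its rate
    in the whole collection (the collection is assumed to include the document)."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    weights = {}
    for term, count in doc_counts.items():
        doc_rate = count / len(doc_tokens)
        coll_rate = coll_counts[term] / len(collection_tokens)
        # A word absent from the collection counts gets an unbounded weight.
        weights[term] = doc_rate / coll_rate if coll_rate else float("inf")
    return weights


# Hypothetical data: "the" is common everywhere, "index" only in this document.
doc = ["index", "index", "term", "the", "the"]
collection = doc + ["the"] * 5

weights = relative_frequency_weights(doc, collection)
print(weights["index"] > weights["the"])  # True
```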

11) What are the different levels of knowledge used in natural language understanding for the NLP-based subject indexing system?
Different levels of knowledge used in natural language understanding for the NLP-based subject indexing system fall into the following groups:
a) Morphological knowledge;
b) Lexical knowledge;
c) Syntactic knowledge;
d) Semantic knowledge;
e) Pragmatic knowledge; and
f) World knowledge.

12) What are the different parts of Science Citation Index?
There are three parts in Science Citation Index:
a) Citation Index;
b) Source Index; and
c) Permuterm Subject Index.
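The relation between the Source Index and the Citation Index can be sketched by inverting a list of sources and their references. This is a minimal illustration; the data structure, the function name and the entries are hypothetical and do not reproduce the actual layout of Science Citation Index.

```python
from collections import defaultdict

# Hypothetical source records: each citing (source) item with its references.
source_index = {
    "Source A (2001)": ["Cited X (1990)", "Cited Y (1995)"],
    "Source B (2003)": ["Cited X (1990)"],
}


def build_citation_index(sources):
    """Invert source -> references into cited item -> citing sources."""
    citation_index = defaultdict(list)
    for source, references in sources.items():
        for cited in references:
            citation_index[cited].append(source)
    # Order both the cited items and their citing sources for display.
    return {cited: sorted(citing) for cited, citing in sorted(citation_index.items())}


for cited, citing in build_citation_index(source_index).items():
    print(cited, "<-", ", ".join(citing))
# Cited X (1990) <- Source A (2001), Source B (2003)
# Cited Y (1995) <- Source A (2001)
```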

13) Give the meaning and scope of ‘Web indexing’.
The term ‘Web indexing’ refers to (a) search engine indexing of the Web, (b) creation of metadata, (c) organization of Web links by category, and (d) creation of a Web site index that looks and functions like a back-of-book index. Such an index is usually alphabetically organised, gives detailed access to information, and contains index entries with subdivisions and cross-references. In the most general sense, Web indexing means providing access points for online information materials that are available through World Wide Web browsing software. The following issues also fall within the scope of Web indexing:
a) Uploading of ‘traditional’ indexes (and the documents to which they refer) on to the Web to provide a wider audience with access to them.
b) ‘Micro’ indexing of a single Web page, in order to provide users with hyperlinked access points to the materials on the page.
c) ‘Midi’ indexing of multiple pages, largely or wholly contained within a single Web site and falling under the responsibility of a single Webmaster.
d) ‘Web-wide’ indexing, providing users with centralised access to widely scattered materials, which fall under a single heading (e.g. every web page dealing authoritatively with ‘breast cancer’).
e) ‘Macro’ schemes designed to simplify or unify access to a large number of Web pages falling under many different headings (e.g. every web page dealing authoritatively with any medical topic).
f) The addition of comments and annotations to provide users with some guidance before they link to selected sites and pages.

14) Discuss the functions of the different components of a Search Engine.
Functions of the different components of a search engine are:
a) Spider: A computer program that visits Web sites and follows all the links it comes across, identifying and reading Web pages and collecting data for the search engine's index;
b) Index: A searchable database containing indexing terms created from the Web pages by the spider; and
c) Search engine mechanism: Software that enables users to query the index and that usually returns results in relevancy ranked order.
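The three components can be sketched together on a tiny in-memory "Web". Everything here is illustrative: the page addresses, texts and links are made up, no network access takes place, and ranking by simple term frequency is an assumption rather than how any particular search engine ranks results.

```python
import re
from collections import defaultdict, deque

# Hypothetical pages: address -> (text, outgoing links).
WEB = {
    "a.example/home": ("indexing systems and indexing techniques",
                       ["a.example/kwic", "a.example/web"]),
    "a.example/kwic": ("keyword indexing with KWIC", ["a.example/home"]),
    "a.example/web": ("web search engines", []),
}


def spider(start):
    """Spider: follow links from the start page, yielding (address, text)."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        text, links = WEB[url]
        yield url, text
        queue.extend(links)


def build_index(pages):
    """Index: map each word to the pages containing it, with occurrence counts."""
    index = defaultdict(dict)
    for url, text in pages:
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Search mechanism: rank pages by how often the query words occur."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return sorted(scores, key=lambda url: (-scores[url], url))


index = build_index(spider("a.example/home"))
print(search(index, "indexing"))  # ['a.example/home', 'a.example/kwic']
```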

15) What are the key technologies involved in the development of the Semantic Web?
Key technologies involved in the development of the Semantic Web are:
a) Uniform Resource Identifier (URI);
b) eXtensible Markup Language (XML);
c) Resource Description Framework (RDF);
d) Ontology; and
e) Agent software.
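How URIs and RDF fit together can be sketched with plain tuples. RDF describes resources as subject-predicate-object statements (triples) in which resources and properties are named by URIs. In the sketch below the example.org addresses and the match function are hypothetical, the property names are taken from the Dublin Core element set, and no RDF library is used.

```python
# Namespace of the Dublin Core element set, used here for property names.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical statements as (subject, predicate, object) triples.
triples = [
    ("http://example.org/doc1", DC + "title", "Indexing Systems and Techniques"),
    ("http://example.org/doc1", DC + "subject", "Subject indexing"),
    ("http://example.org/doc2", DC + "subject", "Subject indexing"),
]


def match(statements, s=None, p=None, o=None):
    """Return the triples that agree with every part that is not None."""
    return [
        t for t in statements
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which resources are about "Subject indexing"?
for subject, _, _ in match(triples, p=DC + "subject", o="Subject indexing"):
    print(subject)
# http://example.org/doc1
# http://example.org/doc2
```

An agent program could answer such questions across many sites only because the statements share the same identifiers and vocabulary.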

KEYWORDS
Amplified Phrase Order : The order of component terms in a phrase achieved by using the necessary prepositions in between them. It is a corollary of the order of significance provided by E J Coates.
Analet : Two or more isolates linked by relational operators according to J E L Farradane’s Relational Indexing system constitute an analet.
Analysis : It refers to conceptual analysis, which involves deciding what a document is about, that is, identifying the different component ideas (concepts) associated with the thought content of the document.
Associative Classification: It refers to the association of a subject with other subjects without reference to its COSSCO relationships and results in a relative index.
Chain Indexing : A method of deriving alphabetical subject entries from the chain of successive subdivisions leading from the general to the most specific level that needs to be indexed. To derive the subject index entries, it takes the class number of the document concerned from a preferred classification scheme.
Citation Index : An ordered list of cited articles (references), each of which is accompanied by a list of citing articles (sources).
Citation Indexing : Techniques of bringing together the documents (cited documents) which manifest association of ideas to establish the relevancy of information in a document (citing document) through mechanical sorting of citations in a citation index. The relationship existing between the cited documents and citing documents forms the basis of Citation indexing.
Concrete : One of the fundamental categories propounded by Kaiser, which refers to things, places and abstract terms not signifying an action.
Consistency in Indexing : It is a measure that relates to the work of two or more indexers.
Controlled Vocabulary : A controlled vocabulary refers to an authority list of terms showing their interrelationships and indicating ways in which they may usefully be combined to represent specific subject of a document.
COSSCO Relationships : It refers to Coordinate-Superordinate-Subordinate-Collateral relationships in organising classification.
Default Syntax : The logical meaning that a Web search engine assigns to the space between words when more than one word is entered for a search.
Exhaustivity : The use of enough terms to cover all the topics discussed in a document. It relates to the breadth of coverage in indexing. It is sometimes called depth indexing.
Expert System : It is the embodiment, within the computer, of a knowledge-based component derived from an expert in such a form that the computer can offer intelligent advice or take an intelligent decision about the processing function.
eXtensible Markup Language (XML) : A subset of Standard Generalized Markup Language (SGML), a widely used international text-processing standard. XML is designed to bring the power and flexibility of generic SGML to the Web, while maintaining interoperability with full SGML and HTML. XML is the first step in bringing meaning to the Web.
False Drops : Retrieval of unwanted items because of the false coordination of terms at the time of searching.
Hypertext Markup Language (HTML) : The standard text-formatting language for documents on the World Wide Web. HTML text files contain content that is rendered on a computer screen, and markup, or tags, that tell the computer how to format that content. HTML tags can also be used to encode metadata and to tell the computer how to respond to certain user actions, such as a mouse click.
Links : Special symbols used to group all the related concepts in a document separately for the elimination of false drops.
Index : A systematic guide to the contents of documents, comprising a series of entries, with headings arranged in alphabetical or other chosen order and with references to show where each item indexed is located.
Indexing : The process of evaluating information entities and creating indexing terms, normally subject or topical terms, that aid in finding and accessing the entity. Index terms may be in natural language or controlled vocabulary or a classification notation.
Indexing Language : An indexing language is an artificial language consisting of a set of terms and devices for handling the relationship between them for providing index description. It is also referred to as a retrieval language.
Indexing Program : Computer software used to order things; frequently used to refer to software that alphabetizes some or all of the terms in one or more electronic documents.
Input String : It refers to a set of terms arranged according to the role operators in PRECIS.
Invisible Web : It refers to those pages that do not appear on the results pages when we run a search on the Web. The pages remain hidden from search engines for a variety of technical reasons: some are excluded because they do not allow search engines access, and others restrict entry through a log-in password.
Item Entry System : A type of post coordinate indexing system in which terms are posted on the item. A single entry is prepared for each item, which permits access to the entry from all appropriate headings.
Keyword Indexing : An indexing system based on the usage of natural language terminology for deriving the index entries. Significant words denoting the subject, known as keywords, are taken mainly from the title and/or sometimes from abstract or full text of the document for the purpose of indexing.
Knowledge Representation : Method of codifying knowledge to enable a computer to store, process and to draw inference from the codified knowledge.
Metadata : In general, it is data about data. Functionally, metadata is structured data about data, which describes the characteristics of a resource. A metadata record consists of a number of predefined elements representing specific attributes of a resource, and each element can have one or more values.
Meta Search Engine : A search engine that simultaneously searches multiple search engines in response to a query.
Meta tag : The HTML element used to demarcate metadata on a Web page.
Natural Language Processing (NLP) : It refers to the area that attempts to make the computer understand natural language.
Ontology : A specification of a representational vocabulary for a shared domain of discourse—definitions of classes, relations, functions, and other objects.
Organising Classification : It refers to the categorization of concepts and their organisation on the basis of genus-species, whole-part, and other inter-facet relationships. It is used to distinguish and rank each subject from all other subjects with reference to its COSSCO relationships.
Parser : A computational process that takes individual sentences or connected texts and converts them to some representational structure useful for further processing.
Parsing : It refers to the use of syntax to determine the functions of words in the input sentences in order to create a data structure that can be used to get at the meaning of the sentence.
Post Coordinate Indexing : An indexing system in which the component terms in a compound subject are kept separate and uncoordinated by the indexer, and the searcher coordinates the component terms at the time of searching. Also called Coordinate indexing.
Pre-coordinate Indexing : An indexing system in which the indexer coordinates the component terms in a compound subject at the time of indexing, following the syntactical rules of a given indexing language.
Process : One of the fundamental categories propounded by Kaiser, which includes mode of treatment of the subject by the author, an action or process described in a document, and an adjective related to the Concrete as component of the subject.
Quality of Indexing : The ability to retrieve what is wanted and to avoid what is not wanted.
Relational Indexing : An indexing system developed by J E L Farradane, which involves the identification of the relationship between each pair of terms of a given subject statement and representation of those relations by relational operators.
Relational Operators : Special symbols used to link the isolates in Relational Indexing to create analets.
Relative Index : An index showing various aspects of an idea and its relationship with other ideas.
Resource Description Framework (RDF) : A generic framework for describing and interchanging metadata on the Internet. RDF metadata expresses the meaning of terms and concepts in XML, in a form that is understandable to computers.
Role Operators : A set of notations that specify the grammatical role or function of the term that follows each operator and that regulate the order of the terms in an input string in PRECIS. The rules associated with role operators serve as computer instructions for generating index entries and determine the format, typography and punctuation associated with each index entry.
Role : A symbol attached to the index term to indicate the context in which the term has been used.
Search Engine : A searchable database of Internet files collected by a computer program (called a wanderer, crawler, robot, worm, or spider).
Semantic Net : A directed graph in which nodes represent entities and arcs represent relationships between entities. Arcs are labelled with the names of the relation types, that is, the binary relation to which each relationship belongs. A single node represents a single entity. Semantic nets can be used to represent various types of knowledge.
Semantics : Semantics is the study of meaning. In an indexing language, semantic relationship refers to the hierarchical and non-hierarchical relationships between subjects, which are governed by See and See also references in an index file. A controlled vocabulary serves as the source for See and See also references.
Semantic Web : The term Semantic Web, introduced by Tim Berners-Lee, is the extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The infrastructure of the Semantic Web would allow machines as well as humans to make deductions and organize information. The architectural components include semantics (meaning of the elements), structure (organization of the elements), and syntax (communication).
Software Agent : A computer program that carries out tasks on behalf of another entity. Frequently used to refer to a program that searches the Internet for information meeting the specified requirements of an individual user.
Specificity : It refers to the use of a much smaller number of terms to cover only the central subject matter of a document. The more specific the terms used, the fewer the entries per term on average. Specificity is a property of the vocabulary used in indexing, and it relates to the depth of treatment of the content of a document in indexing.
Spider : A computer program used in indexing and retrieving Web resources with reference to their URLs that contain the given keywords or phrases. A spider traverses the Web from link to link, identifying and reading pages. Also called crawler, robot, etc.
Standard Generalized Markup Language (SGML) : A non-proprietary language/enabling technology for describing information. Information in SGML is structured like a database, supporting rendering in and conversion between different formats. Both XML and later versions of HTML are instances of SGML.
Subject Directory : A search engine service that offers a collection of links to Internet resources submitted by the site creators or evaluators and organised into subject categories.
Subject Gateway : A collection of specialised resources in a particular field, compiled by people, not robots. The resources in a subject gateway include Internet catalogues, subject directories, virtual libraries and gateways, and these resources are organised into hierarchical subject categories. Also called Information Gateway.
Syntax : The grammatical structure consisting of a set of rules that govern the sequence of occurrence of the terms in a subject heading.
Term Entry System : A type of post coordinate indexing system in which index entries for a document are made under each of the component terms associated with the thought content of the document. Here, items are posted on the term.
Term Relationship : It refers to the relationship between two or more equally concrete things or phrases in a subject, which may lead to the absence of order of significance and modification of the amplified phrase order as propounded by E J Coates in his subject indexing system. Coates has identified 20 different kinds of relationships by means of prepositions: of, for, against, with, and by.
Term Significance : It refers to the order of significance developed by E J Coates in his subject indexing system. It states that the most significant term in a compound subject heading is the one that is most readily available in the memory of the enquirer, and this leads to the order of significance Thing-Part-Material-Action.
Uniform Resource Identifier (URI) : The syntax for all names/addresses that refer to resources on the Web.
Uniform Resource Locator (URL) : A technique for indicating the name and location of Internet resources. The URL specifies the name and type of the resource, as well as the computer, device and directory where the resource may be found. The URL is a subset of URI. For example, the URL for Dublin Core Metadata Initiative is http://dublincore.org/.
Uniform Resource Name (URN) : A URI (name and address of an object on the Internet) that has some assurance of persistence beyond that normally associated with an Internet domain or host name.
Web Indexing : Web indexing means providing access points for online information materials that are available through World Wide Web browsing software.
Weighting : A method of the allocation of values to indexing terms by using quantitative figures according to their importance in the document.
World Wide Web (WWW) : The panoply of Internet resources (text, graphics, audio, video, etc.) that is accessible via a Web browser. Also called Web or W3.