Lecture 1: Introduction to Information Retrieval


Some material is from Yannis Tzitzikas' (UoC) slides and the class material of the two textbooks.


Information Retrieval
Collection: a fixed set of documents (information items).
Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.

SIGIR 2005


Information Retrieval
Information item: usually text (often with structure), but possibly also image, audio, video, etc. Text items are often referred to as documents and may be of different scope (book, article, paragraph, etc.).


Example IR Systems
IR systems:
– Verity, Fulcrum, Excalibur, Eurospider
– Hummingbird, Documentum
– Inquery, Smart, Okapi, Lemur, Indri
Web search and in-house systems:
– West, LEXIS/NEXIS, Dialog
– Lycos, AltaVista, Excite, Yahoo, Google, Northern Light, Teoma, HotBot, Direct Hit, …
– Ask Jeeves
– eLibrary, Inquira
– Vivisimo (www.vivisimo.com)
– ...


IR: Emphasis on the User's Information Need
– Example: find all docs containing information on college tennis teams which (1) are maintained by a USA university and (2) participate in the NCAA tournament.
– Translate this need into a query (natural language, keyword, proximity, XQuery, sketch-based, etc.)


The classic search model
Task: get rid of mice in a politically correct way
  ↓ (mis-conception)
Info need: info about removing mice without killing them
  ↓ (mis-translation)
Verbal form: "How do I trap mice alive?"
  ↓ (mis-formulation)
Query: mouse trap
The query is sent to the search engine, which returns results from the corpus; query refinement feeds back from the results to the query.


IR
IR deals with the:
• representation,
• storage,
• organization of, and
• access to information items.
Emphasis is on the retrieval of information (not data).


Data vs. Information Retrieval
Data retrieval:
– Which docs contain a set of keywords?
– Well-defined semantics.
– A single erroneous object implies failure! (sound and complete)
Information retrieval:
– Information about a subject or topic.
– Semantics is frequently loose.
– Small errors are tolerated.
IR system:
– Interprets the content of information items.
– Generates a ranking which reflects relevance.
– The notion of relevance is most important.


Basic Concepts: User Task
Two complementary forms of information or data retrieval over a database:
– Retrieval (ανάκτηση)
– Browsing (πλοήγηση)


Querying (retrieval) vs. Browsing
Querying:
– Information need (retrieval goal) is focused and crystallized.
– Contents of repository are well-known.
– Often, user is sophisticated.
Browsing:
– Information need (retrieval goal) is vague and imprecise (or there is no goal!)
– Contents of repository are not well-known.
– Often, user is naive.


Querying (retrieval) vs. Browsing (cont.)
Kinds of browsing:
– Flat (list of documents)
– Structure-guided (hierarchical structure: file folders, Yahoo! Directory, ODP)
  – also inside a document (abstract, sections, etc.)
– Hypertext (following links)


Querying (retrieval) vs. Browsing (cont.)
– Querying and browsing are often interleaved (in the same session).
– Example: present a query to a search engine, browse the results, restate the original query, etc.


Pulling (ad hoc querying) vs. Pushing (filtering) information
– Querying and browsing are both initiated by users (information is "pulled" from the sources).
– Alternatively, information may be "pushed" to users:
  – Newly received items are dynamically compared against standing statements of user interests (profiles), and matching items are delivered to user mail files.
  – This is an asynchronous (background) process.
  – A profile defines all areas of interest (whereas an individual query focuses on a specific question).
  – Each item is compared against many profiles (whereas each query is compared against many items).
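A minimal sketch (not from the course material) of such push-style filtering: each newly received item is matched against every standing profile, and matches are delivered to that user's mail file. The profiles, the keyword-overlap matching rule, and the mail-file stand-in are illustrative assumptions.

```python
# Hypothetical sketch of push-style filtering: every newly received item is
# matched against all standing user profiles (the reverse of ad hoc querying,
# where one query is matched against many items).
profiles = {
    "alice": {"information", "retrieval", "indexing"},
    "bob":   {"social", "networks"},
}
mail_files = {user: [] for user in profiles}   # stand-in for user mail files

def filter_item(item_text):
    """Deliver a newly received item to every user whose profile it matches."""
    item_terms = set(item_text.lower().split())
    for user, interests in profiles.items():
        if item_terms & interests:             # simple keyword-overlap match
            mail_files[user].append(item_text)

filter_item("New indexing techniques for information retrieval")
print(mail_files["alice"])   # ['New indexing techniques for information retrieval']
print(mail_files["bob"])     # []
```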

Basic Concepts: Logical View of Documents
The logical view of a document ranges from a set of keywords (assigned by tagging or extracted automatically) to the full text. Text operations map the full text to index terms (possibly keeping structure as well).


Basic Concepts: Logical View of Documents
(Figure: the text operations pipeline) Docs → structure recognition → accents, spacing → stopword removal → noun groups → stemming → (automatic or manual) indexing → index terms; at intermediate stages the logical view is the structure or the full text.
Document representation is viewed as a continuum: the logical view of the documents might shift.
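A toy sketch (my own illustration, not the course's code) of how such text operations move a document along this continuum, using stopword removal and a very naive suffix-stripping stemmer; the stopword list and suffix rules are assumptions for the example only.

```python
# Toy sketch of text operations: full text -> tokens -> stopword removal ->
# naive stemming -> index terms. Real systems use proper tokenizers and
# stemmers (e.g. Porter); this only illustrates the pipeline shape.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def naive_stem(token):
    # Crude suffix stripping for illustration only.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(full_text):
    tokens = [t.strip(".,;:!?").lower() for t in full_text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOPWORDS]

print(index_terms("The cats of Athens are sleeping in the libraries."))
# ['cat', 'athen', 'are', 'sleep', 'librari']
```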


The Retrieval Process
(Figure: architecture of the retrieval process) The user expresses a need through the user interface. Text operations transform both the user need and the documents of the text database into a logical view. Query operations turn the logical view of the need into a query; the indexing module, together with the DB manager module, builds an inverted file index over the text database. Searching runs the query against the index to obtain retrieved docs, ranking orders them, and the ranked docs are returned to the user, whose feedback can refine the query.


The Retrieval Process
– Model the documents to be used: text operations produce the text model.
– Build an index.
– Translate user needs via query operations.
– Rank retrieved documents based on likelihood of relevance.
– Present the results.
– User feedback phase.

Objectives
– Overall objective (efficiency): minimize search overhead.
– Measurement of success (effectiveness): precision and recall.
– Facilitating the overall objective: good search tools, helpful presentation of results.


Minimize search overhead
– Minimize the overhead of a user who is locating needed information.
– Overhead: time spent in all steps leading to the reading of items containing the needed information (query generation, query execution, scanning results, reading non-relevant items, etc.).
– Needed information: either
  – sufficient information in the system to complete a task, or
  – all information in the system relevant to the user's needs.
– Example, shopping:
  – looking for an item to purchase;
  – looking for an item to purchase at minimal cost.
– Example, researching:
  – looking for a bibliographic citation that explains a particular term;
  – building a comprehensive bibliography on a particular subject.


Measurement of success
Two dual measures:
– Precision: proportion of retrieved items that are relevant.
  Precision = relevant retrieved / total retrieved = |Answer ∩ Relevant| / |Answer|
– Recall: proportion of relevant items that are retrieved.
  Recall = relevant retrieved / relevant existing = |Answer ∩ Relevant| / |Relevant|
– These are the most popular measures, but others exist.
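A small sketch of these two formulas in code; the document ID sets below are made up purely for illustration.

```python
# Precision and recall computed from an answer set and a relevance judgment.
answer = {1, 2, 5, 8, 9}          # documents retrieved by the system
relevant = {2, 3, 5, 7, 8}        # documents judged relevant

retrieved_relevant = answer & relevant
precision = len(retrieved_relevant) / len(answer)     # |Answer ∩ Relevant| / |Answer|
recall = len(retrieved_relevant) / len(relevant)      # |Answer ∩ Relevant| / |Relevant|

print(precision, recall)   # 0.6 0.6
```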



Measurement of success (cont.)
Relevance (with respect to the information need)?
– Related to the topic
– Timely
– From a reliable source
– …


Presentation of results
Present search results in a format that helps the user determine relevant items:
– Arbitrary (physical) order
– Relevance order
– Clustered (e.g., conceptual similarity)
– Graphical (visual) representation


Support user search
Support user search by providing tools to overcome obstacles such as:
– Ambiguities inherent in languages.
  – Homographs: words with identical spelling but with multiple meanings.
  – Example: Chinon (Japanese electronics, French chateau).
– Limits to the user's ability to express needs:
  – lack of system experience or aptitude;
  – lack of expertise in the area being searched;
  – initially only a vague concept of the information sought;
  – differences between the user's vocabulary and the authors' vocabulary: different words with similar meanings.


History

Library search


Past, present and future
1960s-1970s: Initial exploration of text retrieval systems for "small" corpora of scientific abstracts and law and business documents; basic Boolean and vector-space models of retrieval; Salton (Cornell).
1980s: Large document database systems, many run by companies: Lexis-Nexis, Dialog, Medline.


Past, present and future
1990s:
– Searching FTP-able documents on the Internet: Archie, WAIS
– Searching the World Wide Web: Lycos, Yahoo, AltaVista
– Recommender systems: Ringo, Amazon, NetPerceptions
– Automatic text categorization and clustering

Past, present and future
2000s:
– Link analysis for Web search: Google. The Web changed everything.
– Automated information extraction: Whizbang, Fetch, Burning Glass
– Question answering: TREC Q/A track
– Multimedia IR, cross-language IR
– Document summarization
– Additional sources: social networking, Wikipedia, etc.
– New issues: trust, privacy, etc.


Unstructured (text) vs. structured (database) data in 1996
(Chart: relative data volume and market capitalization of unstructured vs. structured data in 1996.)


Unstructured (text) vs. structured (database) data in 2006
(Chart: relative data volume and market capitalization of unstructured vs. structured data in 2006.)


Search Engines and the Web Today
– Indexed web: at least 45.84 billion pages.
– About 2 exabytes (2^60 bytes each) of new information per year, roughly 90% of it in digital form.
– About 50% increase per year.


Case Study


Unstructured data in 1680
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is grep not a good solution?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) are not feasible
– No ranked retrieval (no way to return the best documents first)


Term-document incidence

Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony    | 1 | 1 | 0 | 0 | 0 | 1
Brutus    | 1 | 1 | 0 | 1 | 0 | 0
Caesar    | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy     | 1 | 0 | 1 | 1 | 1 | 1
worser    | 1 | 0 | 1 | 1 | 1 | 0

An entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar but NOT Calpurnia


Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) → bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.
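A minimal sketch of this query evaluation in code, using the incidence vectors from the matrix above; the helper functions are just for illustration.

```python
# Answering "Brutus AND Caesar AND NOT Calpurnia" with 0/1 incidence vectors.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors copied from the term-document matrix above.
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def bit_and(u, v):
    return [a & b for a, b in zip(u, v)]

def bit_not(u):
    return [1 - a for a in u]

answer = bit_and(bit_and(incidence["Brutus"], incidence["Caesar"]),
                 bit_not(incidence["Calpurnia"]))
print(answer)                                        # [1, 0, 0, 1, 0, 0]
print([p for p, bit in zip(plays, answer) if bit])   # ['Antony and Cleopatra', 'Hamlet']
```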


Answers to the query
– Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus, / When Antony found Julius Caesar dead, / He cried almost to roaring; and he wept / When at Philippi he found Brutus slain.
– Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: / I was killed i' the Capitol; Brutus killed me.


Bigger collections
– Consider N = 1M documents, each with about 1K terms.
– Average 6 bytes/term, including spaces and punctuation.
– So about 6 GB of data in the documents.
– Say there are m = 500K distinct terms among these.


Can't build the matrix
– A 500K x 1M matrix has half a trillion 0's and 1's.
– But it has no more than one billion 1's. Why? The matrix is extremely sparse.
– What's a better representation? We only record the 1 positions.
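To see why, under the slide's own assumptions: the matrix has 500K × 1M = 5 × 10^11 cells, but each of the 1M documents contributes at most about 1K term occurrences, so there are at most 10^6 × 10^3 = 10^9 ones; at least 99.8% of the cells are zero.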


Inverted index
For each term T, we must store a list of all documents that contain T.
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
What happens if the word Caesar is added to document 14?

Inverted index (continued)
– Linked lists are generally preferred to arrays:
  – dynamic space allocation;
  – insertion of terms into documents is easy;
  – but there is a space overhead for the pointers.
– Postings are sorted by docID (more later on why).
Dictionary → postings lists:
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
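A rough sketch (not the course's implementation) of the dictionary-plus-postings structure as a Python dict of docID-sorted lists; it also answers the earlier question about adding Caesar to document 14.

```python
# Dictionary + postings sketch: each term maps to a docID-sorted postings list.
import bisect

index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted by docID."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

# The earlier slide's question: what happens if Caesar is added to document 14?
add_posting(index, "Caesar", 14)
print(index["Caesar"])   # [13, 14, 16]
```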


Inverted index construction
(Figure: the indexing pipeline)
Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → token stream: Friends | Romans | Countrymen
→ Linguistic modules → modified tokens: friend | roman | countryman
→ Indexer → inverted index:
  friend     → 2 4
  roman      → 1 2
  countryman → 13 16

Indexer steps
– Produce a sequence of (modified token, document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Resulting (Term, Doc #) pairs:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1,
so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2


Indexer steps (continued)
– Sort the pairs by term. This is the core indexing step.
After sorting (Term, Doc #):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2


Indexer steps (continued)
– Multiple entries of a term in a single document are merged.
– Frequency information is added. Why frequency? Will discuss later.
After merging (Term, Doc #, Term freq):
ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1


Indexer steps (continued)
– The result is split into a Dictionary file and a Postings file.
Dictionary (Term, N docs, Coll freq):
ambitious 1 1, be 1 1, brutus 2 2, capitol 1 1, caesar 2 3, did 1 1, enact 1 1, hath 1 1, I 1 2, i' 1 1, it 1 1, julius 1 1, killed 1 2, let 1 1, me 1 1, noble 1 1, so 1 1, the 2 2, told 1 1, you 1 1, was 2 2, with 1 1
Postings: for each dictionary entry, the list of (Doc #, Term freq) pairs from the merged table above.
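The indexer steps above can be sketched end to end as follows; this is a toy illustration with a crude tokenizer, not the course's indexer.

```python
# Sketch of sort-based indexing: collect (term, docID) pairs, sort them,
# merge duplicates with a frequency, and split into dictionary and postings.
from collections import Counter, defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Crude tokenization/normalization for this toy example only.
    return [tok.strip(".,;").lower() for tok in text.split() if tok.strip(".,;")]

# 1. Sequence of (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# 2. Sort by term (then docID) -- the core indexing step.
pairs.sort()

# 3. Merge multiple entries of a term in a document and record the frequency.
freqs = Counter(pairs)                      # (term, docID) -> term frequency

# 4. Split into dictionary and postings.
postings = defaultdict(list)                # term -> [(docID, term freq), ...]
for (term, doc_id), tf in sorted(freqs.items()):
    postings[term].append((doc_id, tf))
dictionary = {term: (len(plist), sum(tf for _, tf in plist))   # (N docs, Coll freq)
              for term, plist in postings.items()}

print(dictionary["caesar"])   # (2, 3)
print(postings["brutus"])     # [(1, 1), (2, 1)]
```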


Using the index
• How do we process a query?
• Later: what kinds of queries can we process?


Query processing: AND
• Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary; retrieve its postings.
– Locate Caesar in the Dictionary; retrieve its postings.
– "Merge" (intersect) the two postings lists:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34


The merge
– Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Intersection → 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: the postings are sorted by docID.
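A sketch of this linear-time merge as code, run on the Brutus and Caesar postings lists above.

```python
# Two-pointer intersection of two postings lists sorted by docID, O(x + y).
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```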


Boolean queries: Exact match
– The Boolean retrieval model allows asking queries that are Boolean expressions:
  – Boolean queries use AND, OR and NOT to join query terms.
  – Each document is viewed as a set of words.
  – The model is precise: a document either matches the condition or it does not.
– It was the primary commercial retrieval tool for 3 decades.
– Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.


Example: WestLaw (http://www.westlaw.com/)
– Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
– Tens of terabytes of data; 700,000 users
– The majority of users still use Boolean queries
– Example query:
  – What is the statute of limitations in cases involving the federal tort claims act?
  – LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  – /3 = within 3 words, /S = in the same sentence


Exercise
– Try the search feature at http://www.rhymezone.com/shakespeare/
– Write down five search features you think it could do better


What's ahead in IR? Beyond term search
– What about phrases? E.g., "Stanford University".
– Proximity: find Gates NEAR Microsoft. The index needs to capture position information in docs. More later.
– Zones in documents: find documents with (author = Ullman) AND (text contains automata).
– Frequency information
– One document as a singleton or as part of a group
– Content clustering and classification
– Concept queries (vs. keyword queries)


Course Content
– Retrieval models
– Retrieval evaluation
– Indexing
– Query operations (relevance feedback, query expansion, clustering, etc.)
– Web search
– Parallel and distributed IR (P2P and MapReduce)
– Social networks

