Saturday, January 27, 2007

How to Make a Dictionary; Session 12, Tuesday, 2007-01-22


Computational Lexicography

Review of Lexicography Principles

It is important to know about the theory and the function of computational lexicography in order to understand how lexical information is related to a larger context.

Linguists use computational lexicography to learn more about new vocabulary and its relations within a given context.

Words are located in a text corpus, isolated, regrouped and reintegrated into an immediate context.

The following summary shows exactly how this is done with a KWIC (KeyWord In Context) concordance and explains the basic notion of a concordance.


Criteria for Good Lexicography

Quantity:
-Completeness of coverage:
--extensional coverage: number of entries
--intensional coverage: number of types of lexical information

Quality:
-Correctness of information:
-Types of lexical information

Consistency of structure:
-Macrostructure
-Mesostructure
-Microstructure



Lexicographic workflow cycle:

Data acquisition → Lexicon construction → Access to data → Lexical evaluation

1.)Data acquisition:
-Recordings
-Text collection
-Concordance
-Dictionaries
2.)Lexicon construction:
-Metadata
-Information retrieval
-Linguistic analyses
3.)Access to data:
-Traditional print media
-Hyperlexicon: CD, internet
-Software with lexicon component: word processing, speech processing
4.)Lexical evaluation:
-Internal: consistency, completeness
-External: utility for users




1.) Lexical Data Acquisition

From Corpus to lexicon


LEXICON
-Layer 4: LEXICON WITH GENERALISATION HIERARCHIES (general type, default inheritance)
-Layer 3: LEXICON WITH SELECTED GENERALISATION (procedurally optimised: semasiologically, onomasiologically)
-Layer 2: LEXICON MATRIX (entries x data categories, no generalisation)
-Layer 1: CORPUS LEXICON (wordlist, concordance, HMM, ...)
___________________________________________________________________________
CORPUS
-Layer 2: SECONDARY DATA (transcription, annotation, metadata)
-Layer 1: PRIMARY DATA (audio/video recording)





From Corpus to Lexicon...

Concordance

A KWIC (Key Word In Context) concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus.
(Google, for instance, is a very special form of KWIC concordance!)

Example of the process:

Bill Bryson: Notes from a Small Island

"My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais."



ALPHABETICALLY ORDERED KWIC

Keywords with right- hand contexts

Keyword ---- Right-hand context (up to three words)
1973 ------- when i arrived
a ---------- foggy march night
arrived ---- on the midnight
calais ----- (end of text)
england ---- was on a
ferry ------ from calais
first ------ sight of england
foggy ------ march night in
from ------- calais
i ---------- arrived on the
in --------- 1973 when i
march ------ night in 1973
midnight --- ferry from calais
my --------- first sight of
night ------ in 1973 when
of --------- england was on
on --------- a foggy march
on --------- the midnight ferry
sight ------ of england was
the -------- midnight ferry from
was -------- on a foggy
when ------- i arrived on
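The table above can be reproduced with a few lines of Perl. The following is only a minimal sketch, assuming plain ASCII text and a right-hand context of three words; the helper name right_hand_kwic is made up for this illustration and is not part of the lecture code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the lecture code): pair each token with up to
# $n following tokens and return the rows in alphabetical keyword order.
sub right_hand_kwic {
    my ($text, $n) = @_;
    # Assumption: ASCII text only; every other character becomes a space.
    (my $norm = lc $text) =~ tr/a-z0-9/ /c;
    my @tokens = split ' ', $norm;
    my @rows;
    for my $i (0 .. $#tokens) {
        my $last = $i + $n;
        $last = $#tokens if $last > $#tokens;    # fewer context words near the end
        push @rows, [ $tokens[$i], join(' ', @tokens[$i + 1 .. $last]), $i ];
    }
    # Alphabetical by keyword; repeated keywords stay in text order.
    my @sorted = sort { $a->[0] cmp $b->[0] or $a->[2] <=> $b->[2] } @rows;
    return @sorted;
}

my $sentence = 'My first sight of England was on a foggy March night in 1973 '
             . 'when I arrived on the midnight ferry from Calais.';
my @kwic = right_hand_kwic($sentence, 3);
printf "%-10s %s\n", $_->[0], $_->[1] for @kwic;
```

Texts with non-ASCII letters (Ibibio, for example) would need a different character class in the normalisation line.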




CONCORDANCING ON THE WEB:
The first:
--------- HyprLex
--------- VerbMobil HyprLex
Some more:
--------- General information on concordancing
--------- Corpus Linguistics



A KWIC CONCORDANCE ENGINE

KWIC concordance construction

1. CORPUS CREATION → 2. TOKENISATION → 3. KEYWORD LIST EXTRACTION → 4. CONTEXT COLLATION → 5. KEYWORD SEARCH → 6. OUTPUT FORMATTING



SIMPLEST KWIC PROCEDURE

1.)Corpus creation: make a corpus of texts in electronic format
2.)Tokenisation (pre-process each text):
-----1.process punctuation marks
-----2.break the text into context units (lines/sentences)
3.)Keyword list extraction (all words in the text)
4.)Context collation (for each keyword)
5.)Search for each keyword in its contexts in the corpus
6.)Store and format the output (for printing, hypertext [CD, web])



SIMPLE KWIC CONCORDANCE


KWIC procedure: 1. Corpus creation

My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.


KWIC procedure: 2. Tokenisation

In the text:
My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.

Process:
-upper case (capital) letters
-punctuation marks

To produce:

my first sight of england was on a foggy march night in 1973 when i arrived on the midnight ferry from calais
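The tokenisation step can be sketched in Perl as follows. This is only an illustration under stated assumptions: the helper name normalise and the exact set of punctuation marks are choices made here, not taken from the lecture.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case the text and replace punctuation by spaces.
sub normalise {
    my ($text) = @_;
    my $norm = lc $text;              # upper case -> lower case
    $norm =~ tr/.,;:!?"()/ /;         # assumed punctuation set -> spaces
    $norm =~ s/\s+/ /g;               # collapse runs of whitespace
    $norm =~ s/^ | $//g;              # trim leading and trailing space
    return $norm;
}

my $text = 'My first sight of England was on a foggy March night in 1973 '
         . 'when I arrived on the midnight ferry from Calais.';
print normalise($text), "\n";
```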


KWIC procedure: 3. Keyword List

1.)Replace each SP (space) sequence by a LF (linefeed) / NL (newline)
2.)Sort the list alphabetically
3.)Remove duplicate words

After step 1 (one word per line):
my
first
sight
of
england
was
on
a
foggy
march
night
in
1973
when
i
arrived
on
the
midnight
ferry
from
calais


After steps 2 and 3 (sorted, duplicates removed):
1973
a
arrived
calais
england
ferry
first
foggy
from
i
in
march
midnight
my
night
of
on
sight
the
was
when
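The three keyword list steps can be sketched in Perl as shown here. Note that this sketch removes duplicates with a hash before sorting, which is a different technique from the sort-then-compare approach of the lecture code; the variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $normalised = 'my first sight of england was on a foggy march night in 1973 '
               . 'when i arrived on the midnight ferry from calais';

# Step 1: one token per list element (instead of one per line).
my @tokens = split ' ', $normalised;

# Step 3 before step 2: keep only the first occurrence of each word.
# %seen also ends up holding the frequency of every word.
my %seen;
my @unique = grep { !$seen{$_}++ } @tokens;

# Step 2: sort alphabetically (digits sort before letters).
my @keywords = sort @unique;

print "$_\n" for @keywords;
```

For this sentence the only duplicate is "on", so 22 tokens give 21 keywords.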



KWIC procedure: 4. Contexts

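-Pick the context unit:
--left and right contexts
--m words of left context
--n words of right context
-Add m boundary marks (#) at the beginning and n at the end
-Split into overlapping units of length m + 1 + n (in this example m = n = 1, so each unit has three words)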

# my first
my first sight
first sight of
sight of england
of england was
england was on
was on a
on a foggy
a foggy march
foggy march night
march night in
night in 1973
in 1973 when
1973 when i
when i arrived
i arrived on
arrived on the
on the midnight
the midnight ferry
midnight ferry from
ferry from calais
from calais #
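The context step amounts to a sliding window over the padded token list. A minimal Perl sketch, assuming "#" as the boundary mark and a hypothetical helper name context_units:

```perl
use strict;
use warnings;

# Hypothetical helper: pad the tokens with $m boundary marks on the left and
# $n on the right, then cut overlapping units of length $m + 1 + $n.
sub context_units {
    my ($m, $n, @tokens) = @_;
    my @padded = (('#') x $m, @tokens, ('#') x $n);
    my $width  = $m + 1 + $n;
    my @units;
    for my $start (0 .. @padded - $width) {
        push @units, [ @padded[$start .. $start + $width - 1] ];
    }
    return @units;
}

my $text  = 'my first sight of england was on a foggy march night in 1973 '
          . 'when i arrived on the midnight ferry from calais';
my @units = context_units(1, 1, split ' ', $text);
print join(' ', @$_), "\n" for @units;
```

With m = n = 1 every word of the text is the middle word of exactly one unit, so the 22 tokens yield 22 units.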



KWIC procedure: 5. Search

For example:

-on is found in the middle of the following context units:
------ was on a
------ arrived on the

-arrived is found in the middle of the following context units:
------i arrived on




KWIC procedure: 6. Output (to be stored in a professional layout)

1973: ----- in -------- 1973 ------ when
a: -------- on -------- a --------- foggy
arrived: -- i --------- arrived --- on
calais: --- from ------ calais ---- #
england: -- of -------- england --- was
ferry: ---- midnight -- ferry ----- from
first: ---- my -------- first ----- sight
foggy: ---- a --------- foggy ----- march
from: ----- ferry ----- from ------ calais
i: -------- when ------ i --------- arrived
in: ------- night ----- in -------- 1973
march: ---- foggy ----- march ----- night
midnight: - the ------- midnight -- ferry
my: ------- # --------- my -------- first
night: ---- march ----- night ----- in
of: ------- sight ----- of -------- england
on: ------- arrived --- on -------- the
on: ------- was ------- on -------- a
sight: ---- first ----- sight ----- of
the: ------ on -------- the ------- midnight
was: ------ england --- was ------- on
when: ----- 1973 ------ when ------ i




COMPUTING A KWIC CONCORDANCE

From Text Corpus to KWIC Concordance

NOW THE SAME PROCESS IN PERL!


KWIC procedure: 1. Preprocess

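$wordlist = "";
while (<>) {
    chomp;
    s/e\.g\./EG/g;     # protect abbreviations before punctuation is removed
    s/M\.A\./MA/g;
    tr/.,;:"()-/ /;    # punctuation marks -> spaces
    tr/A-Z/a-z/;       # upper case -> lower case
    tr/\t/ /;          # tabs -> spaces
    s/ +/ /g;          # collapse runs of spaces
    $wordlist = $wordlist . " " . $_;    # the space keeps words on adjacent lines apart
}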

NORMALISED TEXT



KWIC procedure: 2. Contexts

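# OUTPUT is assumed to be a filehandle that has already been opened for writing.
@wordlist = split(' ', $wordlist);    # tokens of the normalised text
$contextlength = 5;
@contextlist = ();
for ($i = 0; $i <= @wordlist - $contextlength; $i++) {
    print OUTPUT $wordlist[$i];
    $contextlist[$i] = $wordlist[$i];
    for ($j = 1; $j < $contextlength; $j++) {
        print OUTPUT " " . $wordlist[$i + $j];
        $contextlist[$i] = $contextlist[$i] . " " . $wordlist[$i + $j];
    }
    print OUTPUT "\n";
}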

CONTEXTS



KWIC procedure: 3. Keyword List

@wordlist = split(' ', $wordlist);    # split on whitespace, ignoring leading spaces
@sortedwordlist = sort { $a cmp $b } @wordlist;
$prev = "";
$count = 0;
@uniquewordlist = ();
for ($i = 0; $i < @sortedwordlist; $i++) {
    $keyword = $sortedwordlist[$i];
    if ($keyword ne $prev) {          # first occurrence of this word in the sorted list
        $prev = $keyword;
        print OUTPUT $keyword . "\n";
        $uniquewordlist[$count] = $keyword;
        $count++;
    }
}

KEYWORDLIST




KWIC procedure: 4. Search

for ($i = 0; $i < @uniquewordlist; $i++) {
    $keyword = $uniquewordlist[$i];
    for ($j = 0; $j < @contextlist; $j++) {
        @context = split(/ /, $contextlist[$j]);
        if ($keyword eq $context[2]) {    # keyword is the middle word of the five-word context
            $context = $context[0] . " " . $context[1] . " " . $context[2] . " " . $context[3] . " " . $context[4];
            print OUTPUT $context;
            # ....
        }
    }
}

CONCORDANCE



KWIC procedure: 5. Format

1.)Design a page layout with text objects:
-----1.Title
-----2.Headings
-----3.Body Text
-----4.Tables
2.)Implement in HTML (to test the algorithm)
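One way to test the layout in HTML is to emit the concordance as a table with a title and a heading. The following Perl lines are a sketch only: the helper names escape_html and kwic_html and the choice of markup are assumptions of this illustration, not the layout used in the course.

```perl
use strict;
use warnings;

# Escape the characters that have a special meaning in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical formatter: title, heading and one table row per
# [left, keyword, right] context unit.
sub kwic_html {
    my ($title, @rows) = @_;
    my $t    = escape_html($title);
    my $html = "<html><head><title>$t</title></head><body>\n";
    $html   .= "<h1>$t</h1>\n<table>\n";
    for my $row (@rows) {
        my ($left, $key, $right) = map { escape_html($_) } @$row;
        $html .= "<tr><td align=\"right\">$left</td>"
               . "<td><b>$key</b></td><td>$right</td></tr>\n";
    }
    $html .= "</table>\n</body></html>\n";
    return $html;
}

print kwic_html('KWIC concordance',
                [ 'arrived', 'on', 'the' ],
                [ 'was',     'on', 'a'   ]);
```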



KWIC procedure: Source

-The Perl implementation follows the procedure exactly
-However, the code is for demonstration purposes only, because it does not allow:
----- flexible handling of contexts and filenames
----- treatment of more than one text
----- modularity of organisation
----- format scalability and search efficiency
Project: re-write the code to do these things
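As a starting point for the project, the procedure can be split into small subroutines that take the context sizes as parameters and accept several texts. This is a rough sketch under stated assumptions, not a finished solution: the names tokenise and concordance, the "#" boundary mark, the ASCII-only normalisation and the second sample text ("toy") are all made up for this illustration. Search efficiency and reading from files are left open.

```perl
use strict;
use warnings;

# Split one text into lower-case tokens.
# Assumption: ASCII text only; every other character becomes a space.
sub tokenise {
    my ($text) = @_;
    (my $norm = lc $text) =~ tr/a-z0-9/ /c;
    return split ' ', $norm;
}

# Build [source, left, keyword, right, position] rows for every token of
# every text, with $m words of left and $n words of right context.
sub concordance {
    my ($texts, $m, $n) = @_;
    my @rows;
    for my $name (sort keys %$texts) {
        my @t = (('#') x $m, tokenise($texts->{$name}), ('#') x $n);
        for my $i ($m .. $#t - $n) {
            push @rows, [ $name,
                          join(' ', @t[$i - $m .. $i - 1]),
                          $t[$i],
                          join(' ', @t[$i + 1 .. $i + $n]),
                          $i ];
        }
    }
    # Order: keyword, then source text, then position in the text.
    my @sorted = sort { $a->[2] cmp $b->[2]
                     or $a->[0] cmp $b->[0]
                     or $a->[4] <=> $b->[4] } @rows;
    return @sorted;
}

my %texts = (
    bryson => 'My first sight of England was on a foggy March night in 1973 '
            . 'when I arrived on the midnight ferry from Calais.',
    toy    => 'A foggy night on the ferry.',    # invented second text
);
my @rows = concordance(\%texts, 2, 2);
printf "%-8s %20s  %-10s %s\n", @$_[0 .. 3] for @rows;
```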

PERL SOURCE CODE



KWIC: Scaling UP

-The Ibibio Concordance was made using exactly the same procedure, but:
----- using UNIX (Linux) shell scripting, not Perl
→ because this is much more flexible
→ the Toolbox system uses the RTF format for output:
----- Multi-Dictionary Formatter (MDF), or
----- Lexique Pro
-Today one could also use XML stylesheets


IBIBIO CONCORDANCE




DICTIONARY MAKING:
Why KWIC is used....

KWIC: Dictionary Making

-The function of a KWIC concordance is:
--to make searching for lexical information more efficient by putting context information about words in one place
--to provide material for making "Word Sketches" (Adam Kilgarriff):
---grammatical descriptions: part of speech
---dictionaries: examples of use, collocations, ...
Project: Make concordances from your text corpora and use them to collect lexical information for your Toolbox lexical databases



THE STATUS OF DICTIONARIES

Remember that the dictionary is:
-one of the three main components of language documentation:
----→ corpus of recordings and texts
----→ dictionary
----→ sketch grammar
-the central component of any linguistic description
-the most useful linguistic product for use by the speech community, or non-linguists in general



THE IBIBIO DICTIONARY

-The Ibibio Dictionary
--uses information from Elaine Kaufmann’s Ibibio Dictionary
-the information was re-typed into an Office table format
-this was converted into:
→ Toolbox format for further lexicographic extension
→ LaTeX for formatting (cf. the Ibibio Concordance)
-Project: extend the Ibibio corpus and concordance in scope and content




CONCLUSION!

-It is faster to do it this way if you:
----- have a large text corpus
----- want to make:
---------- a detailed syntagmatic or morphological description
---------- a large dictionary
----- have little time to do this
