Computational Lexicography
Review of Lexicography Principles
It is important to know the theory and function of computational lexicography in order to understand how lexical information relates to a larger context.
Linguists use computational lexicography to learn more about new vocabulary and how words relate to one another within a given context.
Words are located in a text corpus, isolated, regrouped and reintegrated into an immediate context.
The following summary shows exactly how this is done with KWIC (KeyWord In Context) and explains the basic notion of a concordance.
Criteria for Good Lexicography
Quantity:
-Completeness of coverage:
--- extensional coverage: number of entries
--- intensional coverage: number of types of lexical information
Quality:
-Correctness of information
-Types of lexical information
Consistency of structure:
-Macrostructure
-Mesostructure
-Microstructure
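The two quantity criteria can be made concrete with a toy lexicon. The following is only a sketch in Python (not the Perl used later in these notes); the entries and the category names `pos`, `gloss` and `example` are invented for illustration, and counting distinct data categories is just one possible reading of intensional coverage.

```python
# Toy lexicon: headword -> {data category: value}. All entries are invented examples.
lexicon = {
    "ferry": {"pos": "noun", "gloss": "boat for passengers", "example": "the midnight ferry"},
    "foggy": {"pos": "adjective", "gloss": "full of fog"},
    "arrive": {"pos": "verb"},
}

def extensional_coverage(lex):
    """Number of entries in the lexicon."""
    return len(lex)

def intensional_coverage(lex):
    """Number of distinct types of lexical information (data categories) in use."""
    categories = set()
    for fields in lex.values():
        categories.update(fields)  # adds the category names of this entry
    return len(categories)

print("entries:", extensional_coverage(lexicon))
print("data categories:", intensional_coverage(lexicon))
```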
Lexicographic workflow cycle:
Data acquisition → Lexicon construction → Access to data → Lexical evaluation → (back to) Data acquisition
Data acquisition:
- Recordings
- Text collection
- Concordance
- Dictionaries
Lexicon construction:
- Metadata
- Information retrieval
- Linguistic analyses
Access to data:
- Traditional print media
- Hyperlexicon: CD, internet
- Software with lexicon component: word processing, speech processing
Lexical evaluation:
- Internal: consistency, completeness
- External: utility for users
1.) Lexical Data Acquisition
From Corpus to Lexicon
LEXICON
↑ Layer 4: LEXICON WITH GENERALISATION HIERARCHIES (general type, default inheritance)
↑ Layer 3: LEXICON WITH SELECTED GENERALISATION (procedurally optimised: semasiologically, onomasiologically)
↑ Layer 2: LEXICON MATRIX (entries x data categories, no generalisation)
↑ Layer 1: CORPUS LEXICON (wordlist, concordance, HMM, ...)
___________________________________________________________________________
CORPUS
↑ Layer 2: SECONDARY DATA (transcription, annotation, metadata)
↑ Layer 1: PRIMARY DATA (audio/video recording)
From Corpus to Lexicon...
Concordance
A KWIC (Key Word In Context) concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus!
(Google, for instance, is a very special form of KWIC concordance!)
Example of the process:
Bill Bryson: Notes from a Small Island
"My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais."
ALPHABETICALLY ORDERED KWIC
Keywords with right-hand contexts (up to three words)
Keyword     Right-hand context
1973        when i arrived
a           foggy march night
arrived     on the midnight
calais
england     was on a
ferry       from calais
first       sight of england
foggy       march night in
from        calais
i           arrived on the
in          1973 when i
march       night in 1973
midnight    ferry from calais
my          first sight of
night       in 1973 when
of          england was on
on          a foggy march
on          the midnight ferry
sight       of england was
the         midnight ferry from
was         on a foggy
when        i arrived on
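A table like the one above can be produced in a few lines. This is a minimal sketch in Python (the notes themselves use Perl further down); the function name `right_context_kwic` and the default width of three words are assumptions made for this illustration.

```python
import re

def right_context_kwic(text, width=3):
    """Return (keyword, right-hand context) pairs, sorted alphabetically by keyword."""
    # Lower-case the text and keep only runs of letters and digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    rows = [(w, " ".join(words[i + 1:i + 1 + width])) for i, w in enumerate(words)]
    return sorted(rows)

sentence = ("My first sight of England was on a foggy March night in 1973 "
            "when I arrived on the midnight ferry from Calais.")
for keyword, context in right_context_kwic(sentence):
    print(f"{keyword:<12}{context}")
```

Every occurrence of a word gets its own row, which is why "on" appears twice.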
CONCORDANCING ON THE WEB:
The first:
- HyprLex
- VerbMobil HyprLex
Some more:
- General information on concordancing
- Corpus Linguistics
A KWIC CONCORDANCE ENGINE
KWIC concordance construction
1. CORPUS CREATION → 2. TOKENISATION
2. TOKENISATION → 3. KEYWORD LIST EXTRACTION
2. TOKENISATION → 4. CONTEXT COLLATION
3. KEYWORD LIST EXTRACTION + 4. CONTEXT COLLATION → 5. KEYWORD SEARCH
5. KEYWORD SEARCH → 6. OUTPUT FORMATTING
SIMPLEST KWIC PROCEDURE
1.) Corpus creation: make a corpus of texts in electronic format
2.) Tokenisation (pre-process each text):
-----1. process punctuation marks
-----2. break the text into context units (lines/sentences)
3.) Keyword list extraction (all words in text)
4.) Context collation (for each keyword)
5.) Keyword search: look up each keyword in the collated contexts
6.) Output: format and store (for printing, hypertext [CD, web])
SIMPLE KWIC CONCORDANCE
KWIC procedure: 1. Corpus creation
My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.
KWIC procedure: 2. Tokenisation
In the text:
My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais.
Process:
- upper-case (capital) letters: convert to lower case
- punctuation marks: remove
To produce:
my first sight of england was on a foggy march night in 1973 when i arrived on the midnight ferry from calais
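The tokenisation step can be sketched as follows. This is an illustrative Python version, not the code from the lecture; the function name `normalise` is an assumption, and treating every ASCII punctuation mark as a separator is a simplification (abbreviations such as "e.g." would need special handling).

```python
import string

def normalise(text):
    """Lower-case the text and replace ASCII punctuation marks by spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    lowered = text.lower().translate(table)
    return " ".join(lowered.split())  # collapse runs of whitespace into one space

sentence = ("My first sight of England was on a foggy March night in 1973 "
            "when I arrived on the midnight ferry from Calais.")
print(normalise(sentence))
```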
KWIC procedure: 3. Keyword List
-Replace each SP (space) sequence by an LF (linefeed) / NL (newline)
-Sort the list alphabetically
-Remove duplicate words
After replacing spaces by newlines:
my
first
sight
of
england
was
on
a
foggy
march
night
in
1973
when
i
arrived
on
the
midnight
ferry
from
calais
After sorting and removing duplicates:
1973
a
arrived
calais
england
ferry
first
foggy
from
i
in
march
midnight
my
night
of
on
sight
the
was
when
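The three keyword-list operations (split, sort, remove duplicates) can be sketched in Python as below. The function name `keyword_list` is an assumption for this illustration, and the input is assumed to be already normalised text.

```python
def keyword_list(normalised_text):
    """Split on whitespace, remove duplicates and sort alphabetically."""
    return sorted(set(normalised_text.split()))

text = ("my first sight of england was on a foggy march night in 1973 "
        "when i arrived on the midnight ferry from calais")
keywords = keyword_list(text)
print("\n".join(keywords))  # one keyword per line, as in the list above
```

The 22 running words give 21 keywords, because "on" occurs twice in the sentence.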
KWIC procedure: 4. Contexts
-Pick the context unit:
----- left and right contexts
----- m words at beginning (left context)
----- n words at end (right context)
-Add m boundary marks (#) at the beginning of the text and n at the end
-Split into units of length m + 1 + n
With m = 1 and n = 1, the example text yields these context units:
# my first
my first sight
first sight of
sight of england
of england was
england was on
was on a
on a foggy
a foggy march
foggy march night
march night in
night in 1973
in 1973 when
1973 when i
when i arrived
i arrived on
arrived on the
on the midnight
the midnight ferry
midnight ferry from
ferry from calais
from calais #
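The context step is a sliding window over the padded word list. The sketch below is in Python and only illustrates the idea; the function name `context_units` and the choice of "#" as the boundary mark follow the example above but are otherwise assumptions.

```python
def context_units(words, m=1, n=1, boundary="#"):
    """Return every window of m left words, one keyword and n right words."""
    padded = [boundary] * m + list(words) + [boundary] * n
    size = m + 1 + n
    # After padding there is exactly one window per word of the original text.
    return [padded[i:i + size] for i in range(len(padded) - size + 1)]

words = "my first sight of england".split()
for unit in context_units(words):
    print(" ".join(unit))
```

Because of the boundary marks, the first and last words of the text also appear in the middle of a unit, so they can be found by the search step.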
KWIC procedure: 5. Search
For example:
-"on" is found in the middle of the following context units:
----- was on a
----- arrived on the
-"arrived" is found in the middle of the following context unit:
----- i arrived on
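The search step only has to compare the keyword with the middle word of each context unit. A minimal Python sketch, with the hypothetical function name `search` and units of length 3 (m = n = 1) built inline so that the example is self-contained:

```python
def search(keyword, units, m=1):
    """Return the context units whose middle word (at index m) is the keyword."""
    return [unit for unit in units if unit[m] == keyword]

words = ("my first sight of england was on a foggy march night in 1973 "
         "when i arrived on the midnight ferry from calais").split()
padded = ["#"] + words + ["#"]
units = [padded[i:i + 3] for i in range(len(words))]  # one unit per word

for unit in search("on", units):
    print(" ".join(unit))
```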
KWIC procedure: 6. Output (which should be stored in a professionally designed layout)
Entry       Left        Keyword     Right
1973:       in          1973        when
a:          on          a           foggy
arrived:    i           arrived     on
calais:     from        calais      #
england:    of          england     was
ferry:      midnight    ferry       from
first:      my          first       sight
foggy:      a           foggy       march
from:       ferry       from        calais
i:          when        i           arrived
in:         night       in          1973
march:      foggy       march       night
midnight:   the         midnight    ferry
my:         #           my          first
night:      march       night       in
of:         sight       of          england
on:         arrived     on          the
on:         was         on          a
sight:      first       sight       of
the:        on          the         midnight
was:        england     was         on
when:       1973        when        i
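For plain-text output, the keywords line up in one column if the left context is right-aligned in a fixed-width field. This is a small Python sketch of that layout idea; `format_row` and the column width of 10 characters are assumptions for the illustration.

```python
def format_row(left, keyword, right, width=10):
    """Right-align the left context so that all keywords start in the same column."""
    return f"{left:>{width}}  {keyword:<{width}}{right}"

rows = [("was", "on", "a"), ("arrived", "on", "the"), ("#", "my", "first")]
# Sort by keyword first, then by left context, as in the table above.
for left, keyword, right in sorted(rows, key=lambda r: (r[1], r[0])):
    print(format_row(left, keyword, right))
```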
COMPUTING A KWIC CONCORDANCE
From Text Corpus to KWIC Concordance
NOW THE SAME PROCESS IN PERL (WITH HTML OUTPUT)!
KWIC procedure: 1. Preprocess
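$wordlist = "";
while (<>) {
    chomp;
    s/e\.g\./EG/g;       # protect abbreviations before punctuation is removed
    s/M\.A\./MA/g;
    tr/.,;:"()-/ /;      # replace punctuation marks by spaces
    tr/A-Z/a-z/;         # convert upper case to lower case
    tr/\t/ /;            # replace tabs by spaces
    s/  +/ /g;           # collapse runs of spaces
    $wordlist = $wordlist . $_ . " ";   # the added space keeps lines from running together
}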
NORMALISED TEXT
KWIC procedure: 2. Contexts
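$contextlength = 5;
@wordlist = split(' ', $wordlist);   # word array; must exist before contexts are built
@contextlist = ();
# OUTPUT is assumed to be a filehandle opened elsewhere in the script
for ($i = 0; $i <= @wordlist - $contextlength; $i++) {
    $contextlist[$i] = $wordlist[$i];
    for ($j = 1; $j < $contextlength; $j++) {
        $contextlist[$i] = $contextlist[$i] . " " . $wordlist[$i + $j];
    }
    print OUTPUT $contextlist[$i] . "\n";
}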
CONTEXTS
KWIC procedure: 3. Keyword List
@wordlist = split(' ', $wordlist);
@sortedwordlist = sort { $a cmp $b } @wordlist;
$prev = "";
$count = 0;
@uniquewordlist = ();
for ($i = 0; $i < @sortedwordlist; $i++) {
    $word = $sortedwordlist[$i];
    if ($word ne $prev) {            # keep only the first of each run of identical words
        $prev = $word;
        print OUTPUT $word . "\n";
        $uniquewordlist[$count] = $word;
        $count++;
    }
}
KEYWORDLIST
KWIC procedure: 4. Search
for ($i = 0; $i < @uniquewordlist; $i++) {
    $word = $uniquewordlist[$i];
    for ($j = 0; $j < @contextlist; $j++) {
        @context = split(/ /, $contextlist[$j]);
        if ($word eq $context[2]) {  # keyword must be the middle word of the 5-word context
            $context = $context[0] . " " . $context[1] . " " . $context[2] . " " . $context[3] . " " . $context[4];
            print OUTPUT $context . "\n";
        }
    }
}
CONCORDANCE
KWIC procedure: 5. Format
1.) Design a page layout with text objects:
-----1. Title
-----2. Headings
-----3. Body text
-----4. Tables
2.) Implement in HTML (to test the algorithm)
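A page layout with a title, a heading and a table can be generated directly from the concordance rows. The following Python sketch is one possible way to do it, not the lecture's implementation; the function name `kwic_to_html` and the very plain markup are assumptions for the illustration.

```python
from html import escape

def kwic_to_html(title, rows):
    """Render (left, keyword, right) rows as a minimal HTML page with a table."""
    lines = ["<html>",
             f"<head><title>{escape(title)}</title></head>",
             "<body>",
             f"<h1>{escape(title)}</h1>",
             "<table>"]
    for left, keyword, right in rows:
        # escape() protects the markup if a word contains <, > or &.
        lines.append(f'<tr><td align="right">{escape(left)}</td>'
                     f"<td><b>{escape(keyword)}</b></td>"
                     f"<td>{escape(right)}</td></tr>")
    lines += ["</table>", "</body>", "</html>"]
    return "\n".join(lines)

page = kwic_to_html("KWIC concordance", [("was", "on", "a"), ("arrived", "on", "the")])
print(page)
```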
KWIC procedure: Source
-The Perl implementation follows the procedure exactly
-However, the code is for demonstration purposes only, because it does not allow:
----- flexible handling of contexts and filenames
----- treatment of more than one text
----- modularity of organisation
----- format scalability and search efficiency
Project: re-write the code to do these things
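One possible direction for the project, sketched in Python rather than Perl: build an index from each word to its positions once, so that a search does not have to scan every context, and make the context sizes and the set of texts parameters. All names here (`build_index`, `concordance`, the file names) are hypothetical, and the two short texts are invented stand-ins for corpus files.

```python
from collections import defaultdict

def build_index(texts):
    """Map each word to a list of (text name, position) pairs."""
    index = defaultdict(list)
    tokens = {}
    for name, text in texts.items():
        tokens[name] = text.lower().split()
        for pos, word in enumerate(tokens[name]):
            index[word].append((name, pos))
    return index, tokens

def concordance(keyword, index, tokens, m=2, n=2):
    """Yield (text name, left context, keyword, right context) for each hit."""
    for name, pos in index.get(keyword, []):
        words = tokens[name]
        left = " ".join(words[max(0, pos - m):pos])
        right = " ".join(words[pos + 1:pos + 1 + n])
        yield name, left, keyword, right

texts = {  # invented stand-ins for the contents of corpus files
    "a.txt": "i arrived on the midnight ferry",
    "b.txt": "the ferry was late",
}
index, tokens = build_index(texts)
hits = list(concordance("ferry", index, tokens))
for hit in hits:
    print(hit)
```

The context sizes m and n are arguments here instead of a fixed context length, and more texts can be added to the dictionary without changing the functions.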
PERL SOURCE CODE
KWIC: Scaling Up
-The Ibibio Concordance was made using exactly the same procedure, but:
----- using UNIX (Linux) shell scripting, not Perl, because this is much more flexible
-The Toolbox system uses the RTF format for output:
----- Multi-Dictionary Formatter (MDF), or
----- Lexique Pro
-Today one could also use XML stylesheets
IBIBIO CONCORDANCE
DICTIONARY MAKING:
KWIC: Dictionary Making
-The function of a KWIC is to make searching for lexical information more efficient by putting context information about words in one place, for making:
----- "Word Sketches" (Adam Kilgarriff)
----- grammatical descriptions: part of speech
----- dictionaries: examples of use, collocations, ...
Project: Make concordances from your text corpora and use them to collect lexical information for your Toolbox lexical databases
THE STATUS OF DICTIONARIES
Remember that the dictionary is:
-one of the three main components of language documentation:
----→ corpus of recordings and texts
----→ dictionary
----→ sketch grammar
-the central component of any linguistic description
-the most useful linguistic product for use by the speech community, or by non-linguists in general
THE IBIBIO DICTIONARY
-The Ibibio Dictionary uses information from Elaine Kaufman’s Ibibio Dictionary
-The information was re-typed into an Office table format
-This was converted into:
----- Toolbox format for further lexicographic extension
----- LaTeX for formatting (cf. the Ibibio Concordance)
-Project: extend the Ibibio corpus and concordance in scope and content
CONCLUSION!
-It is faster to work this way if you
----- have a large text corpus
----- want to make a detailed syntagmatic or morphological description, or a large dictionary
----- have little time to do this
