Thursday, January 04, 2007

How to Make a Dictionary, Session 6, Tuesday 2006-11-21


Lexicon data and their structure


In the last lessons we learned about the different lexical database structures:
-Microstructure (single lexical articles/entries)
-Mesostructure (interrelation of lexicon entries and their relation to external information)
-Macrostructure (the order of all lexical entries)
-Megastructure (the whole dictionary/lexicon and its metadata)

They can be defined as different structuring elements within a dictionary:

MICROSTRUCTURE: is the smallest structural unit of a dictionary and deals with the individual lexical entries.
The lexicon microstructure contains information on lexical entries (single words): grammatical information (syntax, part of speech (POS), inflectional class, valence of verbs etc.), information on spelling and pronunciation (phonetic transcription), information on the representation of meaning (semantics, definitions) and corpus references (usage examples and words in context).
The information on syntax, grammar and meaning (definition in terms of semantics and pragmatics) and the corpus references (usage examples) can be defined as data categories (DatCats).
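
As a rough illustration, the data categories of a microstructure could be modelled as the fields of a record. The sketch below is only one possible layout: the class name LexicalEntry, its field names and the sample entry for "walk" are assumptions made for this example and were not part of the lesson.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexicalEntry:
    """One lexical entry; each field stands for one data category (DatCat)."""
    headword: str                  # spelling
    pronunciation: str             # phonetic transcription
    pos: str                       # part of speech
    inflection_class: str = ""     # inflectional class
    valence: str = ""              # valence (relevant for verbs)
    definitions: List[str] = field(default_factory=list)  # meaning
    examples: List[str] = field(default_factory=list)     # corpus references


# Invented sample entry, for illustration only.
walk = LexicalEntry(
    headword="walk",
    pronunciation="wɔːk",
    pos="verb",
    inflection_class="regular",
    valence="intransitive",
    definitions=["to move along on foot"],
    examples=["She walks to school every day."],
)

print(walk.headword, walk.pos, walk.definitions[0])
```

Keeping every data category in its own field means that each kind of information can later be searched, checked or exported separately.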

MESOSTRUCTURE: deals mainly with cross-references and relations between lexical entries and their definitions. Single entries may be linked not only to the rest of the dictionary (interrelation of lexicon entries) but also to external information (external references).

MACROSTRUCTURE: is the organisation and order of the content in the body of a dictionary, e.g. the ordered list of lexical entries.

MEGASTRUCTURE: is the overall structure of the dictionary. It contains the metadata and can be defined as the complete document.
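
The three larger structures can be sketched on top of a handful of entries. This is a simplified illustration under stated assumptions: the headwords, the "see_also" field, the helper dangling_references and the metadata values are all invented for this example.

```python
# Hypothetical mini dictionary; all names and data are invented for illustration.
entries = {
    "walk": {"pos": "verb", "see_also": ["go", "run"]},
    "go": {"pos": "verb", "see_also": ["come"]},
    "run": {"pos": "verb", "see_also": ["walk"]},
    "come": {"pos": "verb", "see_also": ["go"]},
}


# Mesostructure: cross-references between entries. A reference is
# "dangling" if it points to a headword that has no entry of its own.
def dangling_references(entries):
    return sorted(
        (head, target)
        for head, entry in entries.items()
        for target in entry["see_also"]
        if target not in entries
    )


# Macrostructure: the order in which the entries are presented,
# here plain alphabetical order of the headwords.
macrostructure = sorted(entries)

# Megastructure: the complete document, i.e. metadata plus the ordered body.
dictionary = {
    "metadata": {"title": "Sample dictionary", "language": "English"},
    "body": [(head, entries[head]) for head in macrostructure],
}

print(macrostructure)
print(dangling_references(entries))
```

Alphabetical order is only one possible macrostructure; a thesaurus or a picture lexicon would order the same entries by topic instead.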



*Detour: information on the CORPUS

The term CORPUS can be defined as a collection of language material. It can consist of written texts, as found in newspapers or books for instance, or of transcriptions of speech, which are usually written in IPA.
Moreover, it may contain additional annotations: the part of speech of each word, its lemma (the base form to which inflected forms belong, e.g. "walked", with the -ed suffix expressing the past, belongs to the lemma "walk"), phonetic transcriptions, or other kinds of annotation.


Problematic issues in lexicography

A lexicographer meets various problems while working with lexical databases:

1.) The problem of the ambiguity of terms:
-some words have synonyms: two different word forms that have the same meaning
-there is also polysemy, meaning that one word form has two or more different meanings

2.) The problem of finding and searching for words:
-how to find related words in a language with inflectional prefixes?
-how to cope with orthographic ambiguity?
-how to structure picture lexicons?

3.) The problem of constant language change:
-how to integrate "new" words and/ or new word meanings?

How the lexicographer solves some of these problems:
-The problem of ambiguity and polysemy can be solved by enumerating the meanings of a word and by linking related entries throughout the entire dictionary.
-In order to cope with constant language change, the dictionary has to be revised and reprinted regularly. One needs new editions.
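
Enumeration and linkage can be pictured as numbered senses plus pointers between senses. The following is a minimal sketch: the data layout, the function name describe and the sample words "head" and "chief" are assumptions made for illustration.

```python
# Hypothetical data: each headword has a numbered list of senses (enumeration),
# and a sense may point to a synonymous sense of another headword (linkage).
senses = {
    "head": [
        {"definition": "the part of the body that contains the brain",
         "synonyms": []},
        {"definition": "the person in charge of a group",
         "synonyms": [("chief", 1)]},
    ],
    "chief": [
        {"definition": "the leader of a group",
         "synonyms": [("head", 2)]},
    ],
}


def describe(headword):
    """Return one printable line per sense, with its number and its links."""
    lines = []
    for number, sense in enumerate(senses[headword], start=1):
        line = f"{headword} {number}. {sense['definition']}"
        for target, target_number in sense["synonyms"]:
            line += f" (see also: {target} {target_number})"
        lines.append(line)
    return lines


for line in describe("head"):
    print(line)
```

Because a link names both the headword and the sense number, the synonym relation holds between individual senses and not between whole entries.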



Different methods of creating lexicons

There are three main methods for creating dictionaries and lexicons:
-Introspection based lexicon creation
-Questionnaire based lexicon creation
-Corpus based lexicon creation

1.) Introspection based lexicon creation
-a trained linguist looks inside the language, so to speak, and reflects on his own language use
-he takes into account that language acts as a social filter: utterances have to be relevant, important and adequate

2.) Questionnaire based lexicon creation
Questionnaires are mainly used in comparative linguistics.
They are very useful for exploring unknown languages, meaning languages the linguist does not speak or know himself. In order to explore the unknown language in detail, the linguist asks different native speakers to explain certain language features and phenomena and to complete questionnaires.
They are asked for translations and explanations. Here, too, the fact that language acts as a social filter should be considered: the interviewed native speakers should come from various social classes within the society.
Questionnaires are used to do research in morphology, to provide translations, and to create language software/computer systems that can be used by other experts, linguists and translators.

___________________________________________________________________________
*Detour: Detailed information on the use of questionnaires (from: http://www.spectrum.uni-bielefeld.de/~ttrippel/htmd/questionnaire_short.html):

An example questionnaire used for the exploration of Australian languages:

Questionnaire on Motion in Australian Languages (modified)
David Wilkins, David Nash and Jane Simpson (used with permission)
April 1998


Introduction
The purpose of this questionnaire is to gain a first comparative picture of the lexical resources Australian languages draw on for the expression of motion, and the manner in which motion descriptions are "packaged". In the nature of our design, and our discussion, we rely heavily on Talmy's (1985) notion of lexicalization patterns, in particular his cross-linguistic discussion of systems of motion description. We are interested, for instance, in patterns of semantic conflation (that is, what other semantic information besides 'motion' may be encoded in a verb root) and patterns of semantic distribution (that is, what types of information are encoded in the different morphemes that come together to build a description of a motion event).
We will assume a "pretheoretical" understanding of what constitutes a motion event and a motion description. In this questionnaire, the primary focus is on "translocational motion" (i.e. change of location of an entity along a path from one place to another). We further restrict our focus to motion descriptions in which the Subject argument of a verb (in an active clause) is the entity ('figure') in motion (an accompanying entity may also be in motion, but that is not our focus of interest). In narrowing our focus in this way, we depart from Talmy's own manner of investigation, since he was also interested in patterns of location, causative location and causative motion.

The questionnaire
This questionnaire is designed in a "modular fashion". There are four independent modules, and we would be glad to receive answers to any of the "modules". A researcher should not feel that they need to answer the whole questionnaire if that seems too daunting. Where you do not know the answer to a question, please say so (rather than leaving a part of a module blank). The ordering of modules reflects our own sense of which types of information are more important to enable us to do some cross-language comparison.
Name of Researcher:
Name of Language:
Primary Place of Research:
Primary Data Resources:
May we distribute your filled in questionnaire?: YES NO
How many inflecting, unanalysable, mono-morphemic verb roots does the language possess: (tick one of the following)
LESS THAN 50_________ 50 to 200 __________MORE THAN 200________
Can you give us a more precise figure? (If so, what source(s) is the figure based on?):

MODULE I : Motion Verbs and Patterns of Motion Expression
Below we present 26 English motion verbs or descriptions. We would like you to provide any (and all) expressional equivalents for the language under discussion. We are not only interested in mono-morphemic verb roots, we are also interested in more complex expressions. For instance, in Arrernte, there is no monomorphemic root for 'to fly'. However, Arrernte speakers do commonly talk about the motion of birds, airplanes and insects by combining a general motion verb and the locative phrase alkere-le (sky-LOC) 'in the sky' in the same clause - e.g. alkere-le alhe-me ('in sky going') = 'flying'; alkere-le unthe-me ('in sky wandering') = 'flying around'; alkere-le apetye-me ('in sky coming') = 'flying this way', and so on.
(N.B. While it would be nice to know translation equivalents, it is more important for us to know what expressions people actually use, no matter how infrequently.)
We do not assume that the following will provide a one-to-one list of equivalents. In some cases the same verb or expression may cover several notions we have distinguished on the list, and in other cases the distinctions won't be fine-grained enough and you'll need to provide several equivalents, detailing the distinctions. We simply ask you to give us as much detail as is feasible.
Please include the following information in any response:
the transitivity of the verb in the expression (in relation to the meaning expressed)
a morphemic break down and gloss of each morpheme in all complex expressions
where relevant, an indication of any animacy or category constraints which apply to the moving entity in the expression (e.g. does the moving entity have to be a liquid?)

The List
a. "to go"
b. "to come"
c. "to return" ("to go back")
d. "to take to" ("take along"; "carry")
e. "to bring"
f. "to move" (from one place to another e.g. they shifted into the shade; they moved camp)
g. "to leave behind" ("to abandon"; "to leave something somewhere and go off")
h. "to move" (with no overall change of location; move on the spot or about a fixed point e.g. the bush is moving, his eyes/hair moved)
i. "to move quickly" ("hurry away"; "hurry off")
j. "to walk"
k. "to run"
l. "to crawl (of baby)"
m. "to fly (of bird)"
n. "to hover" ("to flutter" - e.g. of hawk; butterfly)
o. "to swim" (of fish? of person?)
p. "to roll" (e.g. of ball or boulder or tumbleweed)
q. "to creep up on" ("to sneak along"; "sneak up on")
r. "to follow someone/something"
s. "to track someone/something"
t. "ascend" ("get up on to"; "to climb up")
u. "to descend" ("get down off/out of")
v. "to fall" (down from a height) (does this contrast with "to fall over"?; "collapse"?)
w. "emerge" ("exit"; "appear"; "come out"; "rise (of sun)")
x. "to enter" ("to go into" (e.g. a house, a camp))
y. "to cross over" ("go across")
z. "to pass by"

MODULE II : Motion-Rich 'Textlet' or Text Fragment
So that one can get a feel about how motion description really works in the language, could you please provide a piece of natural continuous text which is rich in motion expression, and which you feel is representative. All that is needed is a small text or text fragment of between 5 and 20 clauses in length, in which the focus is the motion of one or more of the "protagonists". Of course, we need you to provide morphemic breaks, interlinear glosses, and a free translation. It would also be useful if you could provide notes, as you go along, to any specific motion related features that the 'outsider' should attend to. (An example will be provided. - Note that, we'd prefer it if you did not rely on a translation from English, but instead used a small text that was generated directly from the mind and mouth (or pen) of a native speaker.)

MODULE III: Grammatical Marking of Ground and Path
In Talmy's (1985:61) terms the basic components of a motion event are:
Figure = the entity that is in motion
Ground = the entity or entities that the Figure is moving in relation to
Path = the course followed (and trajectory) of the Figure (often deduced from the Ground which is specified)
Motion = the actual predication of a motion act.
So, in the sentence 'the baby crawled up the hill', the Figure is 'the baby', the Ground is 'the hill', the Path is specified with 'up', and the assertion of Motion is encoded in the verb 'crawl' .
This module of the questionnaire is particularly concerned with the way in which Grounds and Paths (including direction) may be grammatically coded. We would appreciate it if you used some of the expressions from the list in Module I of this questionnaire in glossed example sentences to illustrate the types of marking asked about below.

A. Marking of grounds
a) How are "goals" of motion marked? (i.e. what cases, adpositions, or other means are used to mark ground NPs functioning as "goals" of motion?) (e.g. The child crawled to(wards) the tree.; They returned to camp.; The lizard got up onto the rock.)
b) Can one make a distinction between 'to X' and 'towards X'? For all motion verbs? How? (e.g. The leaf fell towards the ground. vs. The leaf fell to the ground.)
c) How are "sources" of motion marked? (e.g. The woman moved away from the fire.; They travelled from Sydney.; The baby bird fell out of the tree.; The dog fell off of the truck.)
d) How are ground NPs which refer to the route or path along/on which motion takes place marked? (e.g. He's walking along the track.; The horse wandered along the sides of the fence.)
e) How are ground NPs which refer to the medium in which motion takes place marked? (e.g. The bird is flying through the air.; The children are running through the sand?)
f) How does one mark a ground NP which refers to a place through (or via) which the figure travels in order to get to another place? (e.g. They travelled from Alice Springs to Elliott via Tennant Creek ; She came through here on her way to church.)
g) With expressions like "enter" (or "go into") and "exit" (or "come out of"), how are the ground NPs which refer to the space "entered" and "exited" marked? (e.g. The snake entered its burrow.; The owl came out from the hollow of the tree.)
h) With expressions of "crossing" and "passing" how are grounds indicating the entity 'crossed' and 'passed' marked? (e.g. Those people ran past our house.; A dingo crossed the road.)
i) Languages like English can string several Grounds together with one motion verb (e.g. The dog carried the meat from the creek along the path to the tree.). Other languages have strong restrictions, preferring one Ground per motion verb. Do you have a sense of how many grounds can occur naturally with a motion verb? Is it possible (natural) to say things like:
- He went from the tree to the rock.
- He went into the house through the rear door.
- He came along the road towards our car.
-The dog carried the meat from the creek along the path to the tree.
j) If you use adpositions or case endings to express these ideas, can they occur independently as the main predicate in a sentence as in? (If they are possible, what do they mean? Can they have motion readings or only static spatial readings?)
- The dog (is) from the tree
- The dog (is) to the tree
- The dog (is) along the road
- The dog (is) into the house
- The rabbit (is) out of its burrow

B. Path Direction
Is there any form of directionals (i.e. grammaticised directional elements like Warlpiri -rni 'hither, to here', -rra 'thither, to there', -mpa 'past, by, across')? If so, what part of speech class do they attach to, or co-occur with? If they combine with verbs, are they restricted to motion verbs or can they, for instance, occur with perception verbs or speech act verbs (or all verbs)?
Does the language have anything akin to the 'associated motion' category discussed by Koch (1984); Tunbridge (1988); and Wilkins (1989, 1991)? If a language has anything like this, it is usually some form of verb affix, verb compounding or fixed construction, and the most commonly coded notions tend to be 'do verb action while going along' ('she cried all along the way') or 'go/come and do verb action' ('she came and told me'; 'she went and hit him'). Please describe any phenomena that seem to be relevant. (In a language like Adnyamathanha (Tunbridge 1988), where this category is very elaborate, you find the following verb affixes: -mana- 'come and V', -namana- 'quickly come and V', -vara- 'go and V', -navara- 'quickly go and V', -ndhena- 'V once while coming'; -nali- 'V continuously while coming';-ndheli- 'V once while going', -nangga- 'V all the way along', -enhi- 'V while keeping moving'; and -wandha- 'V and leave'. In origin such suffixes (or compounding elements) are very often general motion verbs)
MODULE IV: What Element of the Clause Encodes Path?:
The verb-framed vs. satellite-framed typology
Talmy (1985) observed that, in motion descriptions, a language like English differs typologically from a language like Spanish, by virtue of the fact that Spanish tends to conflate 'motion' and 'path' together in the verb root, while English tends to code path in a separate (adverbial/prepositional) element which functions as a satellite to the verb. He judges patterns of expression to be characteristic for a language if they are (i) colloquial in style (rather than formal or stilted), (ii) frequent; and (iii) pervasive (rather than limited) in application. Thus, in English, the characteristic mode of expression is to say "go up", "go down", "go in", "go out" and so on, while it is less characteristic to say "ascend", "descend", "enter", "exit", and so on. The former pattern exemplifies "satellite-framing" (i.e. 'go' provides the motion concept, while 'up', 'down', 'in', 'out' realizes the path). For languages like Spanish, verbs like "enter" and "ascend" are the characteristic mode of expression, and the verb roots can be seen to simultaneously code "motion" and "path" (i.e. "verb-framing"). (Note: Satellites to the verb-root may be affixes on the motion verb root; or clitics; or path adverbs; or particles; or preverbs)
Please try to assess whether the language you are working on is verb-framed or satellite framed (or somewhere in between or something else), by answering the following 'diagnostic' questions:
Are verb roots meaning 'enter', 'exit', 'descend', 'climb up' a more characteristic form of expression, in Talmy's terms, than more analytic counterparts such as "go into", "go out of", "go down", "go up"?
How common is it for verbs in the language to conflate both 'motion' and 'manner' (that is, are there a rich class of verb roots like 'run', 'swim', 'slither', 'hop', 'limp', 'crawl', 'stroll', etc.)? According to Talmy, if a language characteristically conflates 'motion' with 'manner' in verb roots, it is NOT common for the same language to also characteristically conflate 'motion' with 'path'.
When both manner and path notions appear in a motion description, how does information get distributed among elements? To answer this question we list sentences below which try to elicit some of the relevant distinctions. Again, don't go for word-for-word translations. Give us what you think would be the normal ("characteristic") way of expressing the idea (or something close to it). And, please include the following information:
- the transitivity of the verb in question in relation to the meaning expressed (including the expected case on the subject of the sentence)
- an interlinear morpheme-by-morpheme gloss
i) The child ran to the other side of the street/path/creek.
ii) The child ran across the street.
iii) The baby crawled into the house/shed/camp. (Where the "into" path is to be stressed, is the form of expression done more like: "crawlingly enter" or "crawl into" or "crawl to the inside of"?)
iv) The baby crawled up the rock (Can one distinguish "crawl to the top of the rock" and "ascend the rock by crawling"?).
v) The snake slithered into the string bag.
vi) The boy fell to the ground. (while standing on the ground? vs from out of a tree?)
vii) The rock/boy fell down into the water. (where entry into the water is stressed)
viii) The girl climbed up onto the branch of the tree.
Can one "accumulate" path notions with just one verb? In English, one is not only able to string a number of different Grounds together, one can also accumulate a string of simple Path-satellites. As an example, Slobin (1996:83) notes that it is quite normal for English speakers to say things like "The bird flew down from out of the hole in the tree" (where down-from-out-of specifies the trajectory). In this English sentence, there is only one specified ground ('the hole in the tree'), but a complex of three units of Path information ('down', 'from', and 'out of'). The closest Spanish approximation would be "El pájaro salió del agujero del árbol volando hacia abajo" which translates literally as 'The bird exited of the hole of the tree flying towards below'. Thus, in contrast to English, Spanish, like other verb-framed languages, tends to render complex Path information through multiple clauses, since they do not allow for the accumulation of path expressions. So, what about the language under investigation?

OTHER INFORMATION
Please provide any other information on the language that you feel is relevant to this research endeavour. In particular, if there are publications or sections of publications concerning the language which deal directly with motion description, we would be grateful if you brought this to our attention (and we will collate and share all such references).
THANKS FOR ALL YOUR HELP

References cited in questionnaire:
Koch, Harold. 1984.
'The Category of "Associated Motion" in Kaytej', Language in Central Australia, 1: 23-34
Slobin, Dan. 1996.
'From "thought and language" to "thinking for speaking"' in Gumperz and Levinson eds. Rethinking Linguistic Relativity. CUP. 70-96
Talmy, Leonard. 1985.
'Lexicalization patterns: semantic structure in lexical forms'. in Shopen ed. Language Typology and Syntactic Description III: Grammatical categories and the lexicon. CUP. 57-149
Tunbridge, Dorothy. 1988.
'Affixes of Motion and Direction in Adnyamathanha' in Austin ed. Complex Sentence Constructions in Australian Languages. John Benjamins. 267-283
Wilkins, David P. 1989.
Mparntwe Arrernte (Aranda): Studies in the structure and semantics of grammar. Unpublished PhD dissertation. A.N.U.
Wilkins, David P. 1991.
'The Semantics, Pragmatics and Diachronic Development of "Associated Motion" in Mparntwe Arrernte'. Buffalo Papers in Linguistics, 207-257.
___________________________________________________________________________

3.) Corpus based lexicon creation
Corpus based lexicon creation mainly deals with the forms of words as they are actually found in texts.
It is based on various corpora, mainly on words in context (texts), from which concordances of words can be derived, but also on wordlists and distribution analysis.
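
A wordlist and a concordance can be derived from a text with very little machinery. The sketch below is a minimal illustration, not a full corpus tool: the sample sentence, the simple tokenizer (lower-case ASCII letters only) and the function names are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny invented corpus; a real corpus would be read from text files.
corpus = "The dog went to the creek. The dog came back from the creek."


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def wordlist(tokens):
    """Frequency-ordered wordlist: (word form, number of occurrences)."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=2):
    """Keyword-in-context lines: up to `width` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [f"[{keyword}]"] + right))
    return lines


tokens = tokenize(corpus)
print(wordlist(tokens)[:3])
for line in concordance(tokens, "creek"):
    print(line)
```

The wordlist shows which word forms occur and how often, while the concordance shows each occurrence together with its neighbouring words.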


Hierarchy of lexicon and corpus types
The Complexity of Lexicography

LEXICON

4. ORDER LEXICON (abstract lexicon):
maximally declarative generalisation network

3. ORDER LEXICON (optimised lexicon):
procedurally optimised local generalisations

2. ORDER LEXICON (protolexicon):
flat tabular lexicon

1. ORDER LEXICON (corpus lexicon):
wordlist, concordance, HMM


CORPUS
Tertiary corpus:
classificatory markup annotation

Secondary corpus:
transcription, symbol-signal labelling annotation

Primary corpus:
recorded audio-visual corpus; manuscript


The corpus based lexicon creation application:
The Summer Institute of Linguistics is well known for its fieldwork tools and for the creation of language databases. Before suitable computer programs were available, fieldwork data were collected on cards and arranged in ordinary boxes, which is why these databases came to be called "shoebox". Later, data such as base texts, morphology, phonology, syntax, grammar, part of speech, tense and aspect, valence of verbs and translations were collected in online databases. These lexicon databases consist of lists and tables.
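
Shoebox-style lexicon files are commonly stored as plain text in which every field starts with a backslash marker. The sketch below shows how such records might be read into a table. The markers used here (\lx for the lexeme, \ps for part of speech, \ge for an English gloss), the blank-line record separator and the sample records are assumptions chosen for illustration; an actual project defines its own marker set.

```python
# Sample records in a backslash-marker layout (assumed for illustration).
# Records are separated by blank lines; each line is "\marker value".
data = r"""
\lx alhe
\ps verb
\ge go

\lx apetye
\ps verb
\ge come
"""


def parse_records(text):
    """Turn marker-coded text into a list of {marker: value} dictionaries."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:                  # a blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        marker, _, value = line.partition(" ")
        current[marker.lstrip("\\")] = value
    if current:                       # last record without a trailing blank line
        records.append(current)
    return records


lexicon = parse_records(data)

# The result is a flat table: one row per entry, one column per marker.
for row in lexicon:
    print(row["lx"], row["ps"], row["ge"], sep="\t")
```

Each dictionary corresponds to one card in the original shoebox, and the list of dictionaries corresponds to the table form mentioned above.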
