Kikuyu.com Forums | General Discussions | Main Board
New Gikuyu Chat box
Guy
Silver Member
Posted
Hi there,

I am a researcher at the University of Antwerp (Belgium). I am a computational linguist, which means I develop computer programs that help the computer deal with human language.

I have been to Nairobi a couple of times and I would really like to work with Kenyan local languages. Gikuyu is a logical first choice.

To get to know the colloquial language, I've set up a free real-time chat box for Gikuyu-speaking people:
http://pcger33.cde.ua.ac.be/gikuyuchat

I would like as many people as possible to go on this chat site and converse in Gikuyu. Don't worry: I don't speak a word of the language, so I'm not "listening in". So please, tell as many people as you can about this chat box.

Later, I will add accent restoration, which can place the ~ on the i's and u's where needed. You can find an example of this system here:
http://pcger33.cde.ua.ac.be/gikuyu

I will also add chat boxes for other local languages if the Gikuyu one is successful.

Note: this is not supposed to be a competitor to these wonderful forums. Real-time chat is a different medium altogether.
 
Posts: 6 | Location: Antwerp, Belgium | Registered: 05 April 2006
"PanAfricanist.Com Member"
Silver Member
Posted
Hey Guy,

I doubt you are a researcher. What are you working on? If you want people to come to your site then develop something better. We have nothing but honest people here and we don't need your shadiness.

Regards


PanAfricanist.Com Member
 
Posts: 64 | Location: Nairobi/Boston/New York | Registered: 13 April 2005
Guy
Silver Member
Posted
quote:
Originally posted by Boss:
Hey Guy,

I doubt you are a researcher. What are you working on? If you want people to come to your site then develop something better. We have nothing but honest people here and we don't need your shadiness.

Regards


My shadiness? I am not sure why I deserve that particular qualification. You can find out all about my research on my academic homepage:
http://pcger33.cde.ua.ac.be/guy

I mainly do research on the syntax and morphology of Dutch and English. I was a visiting professor at the Computer Science department of the University of Nairobi in May and September 2005 and in March 2006. These visits got me interested in working with Bantu languages.

Since then, I have developed a Kiswahili part-of-speech tagger:
http://pcger33.cde.ua.ac.be/swahiliTagger/

and then there's the accent restoration for Gikuyu, which turns Gikuyu text without accents into accented text:
http://pcger33.cde.ua.ac.be/gikuyu/

Let me know if you want the scientific publications that describe this work.

I am sorry my chat box does not appeal to you, but could you please let me know what is wrong with it? What would you like to see improved?

Furthermore, I am saddened that my research goals are classified as 'shadiness'.
 
Posts: 6 | Location: Antwerp, Belgium | Registered: 05 April 2006
"PanAfricanist.Com Member"
Silver Member
Posted
Sorry Guy - I misunderstood and was too quick to judge. All the best!


PanAfricanist.Com Member
 
Posts: 64 | Location: Nairobi/Boston/New York | Registered: 13 April 2005
"Ithe wa Nyambura na Wambui"
Platinum Member
Picture of sajini
Posted
Hi Guy,
This is a great idea. I tested some examples in both the Gîkûyû accent marker and the Swahili part-of-speech tagger.

This was my Swahili sentence
Mwaka mmoja baadaye watoto walikuwa wamekaa chini ya mnazi

This was the way it was tagged
Mwaka/N mmoja/NUM baadaye,//N watoto/N walikuwa/V wamekaa/V chini/N ya/GEN-CON mnazi/N

I was surprised that 'baadaye' was tagged as a noun and that 'chini ya', instead of being taken as a single construction, was broken up, which alters the semantics of the sentence. I am sure you are aware of the problems with the 'parts of speech' theory. Take number modifiers, for instance 'mmoja' in the sentence above: while 'moja' is a number, we know that numbers modify nouns, which gives them an adjectival attribute. How will this be captured?

Coming to Gîkûyû, I decided to test the name 'mûmumunyano', a deverbal meaning the kissing or the sucking. I was trying to test whether your program would discriminate between the 'û' and 'u' sequence. Instead, I was given 'mûmûmûnyano', which is gibberish. I must admit that I have not subjected this to more data, but I will be doing so soon enough.

Gîkûyû is a very tricky language because wrong accent marks might completely distort the meaning.

I will regard these as teething problems, but must warn you that you have chosen a very difficult path, and you have a long way to go. All the same, this is cool, and best of luck.


Emotions are the greatest enemy of rational arguments
 
Posts: 3133 | Location: Neither here nor there | Registered: 03 May 2005
Guy
Silver Member
Posted
quote:

This was my Swahili sentence
Mwaka mmoja baadaye watoto walikuwa wamekaa chini ya mnazi

This was the way it was tagged
Mwaka/N mmoja/NUM baadaye,//N watoto/N walikuwa/V wamekaa/V chini/N ya/GEN-CON mnazi/N



Try it with the comma separated from 'baadaye'. It then gives you the tag 'adv'. Don't know if this makes more sense. I don't speak Kiswahili either. Our tools are language independent: given an annotated corpus, we just retrain our software. For this tagger, we used the Helsinki corpus of Swahili. In any case, this is actually a previous version of the tagger. We're doing much better now, but did not have time to redo the demo.
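
The effect of separating the comma can be sketched with a small pre-tokenization step in Python. This is only an illustration of splitting punctuation off before a sentence reaches a tagger, not the preprocessing used in the demo; the function name `separate_punctuation` is hypothetical, and the comma in the example sentence is an assumption based on the tagged output quoted above.

```python
import re


def separate_punctuation(sentence: str) -> list[str]:
    """Split a sentence into tokens, giving every punctuation mark its own token."""
    # \w+ matches a run of letters or digits (Unicode-aware in Python 3);
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)


# Assumed input: the sentence from the thread with a comma after 'baadaye'.
tokens = separate_punctuation("Mwaka mmoja baadaye, watoto walikuwa wamekaa chini ya mnazi")
print(tokens)
# ['Mwaka', 'mmoja', 'baadaye', ',', 'watoto', 'walikuwa', 'wamekaa', 'chini', 'ya', 'mnazi']
```

A rule this simple would also split apostrophes inside words, so a real tokenizer for Kiswahili would need more care.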

quote:

I was surprised that 'baadaye' was tagged as a noun and that 'chini ya', instead of being taken as a single construction, was broken up, which alters the semantics of the sentence.


If you leave a space between 'chini' and 'ya', they will be treated as separate words.

quote:

I am sure you are aware of the problems with the 'parts of speech' theory. Take number modifiers, for instance 'mmoja' in the sentence above: while 'moja' is a number, we know that numbers modify nouns, which gives them an adjectival attribute. How will this be captured?


Numbers can also be used in a non-adjectival sense. We solve this problem by simply assigning them the tag 'numeral'. Their function within the NP is handled at a higher level, i.e. syntax.

quote:

Coming to Gîkûyû, I decided to test the name 'mûmumunyano', a deverbal meaning the kissing or the sucking. I was trying to test whether your program would discriminate between the 'û' and 'u' sequence. Instead, I was given 'mûmûmûnyano', which is gibberish. I must admit that I have not subjected this to more data, but I will be doing so soon enough.


Yes, the accent restoration is not perfect. Our current tests show 90% accuracy, meaning that approximately 1 out of 10 words is marked incorrectly. That is a lot, but we use the accent restoration to speed up corpus development. A human who has to correct only 1 word out of 10 will process much more data than someone who has to do it from scratch.
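
For readers who want to check a figure like this themselves, word-level accuracy is just the share of words that match a hand-corrected reference. The Python sketch below shows the arithmetic only; the function name `word_accuracy` is hypothetical and the example strings are placeholders, not Gikuyu data.

```python
def word_accuracy(predicted: str, reference: str) -> float:
    """Return the fraction of whitespace-separated words that match the reference exactly."""
    pred_words = predicted.split()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must have the same number of words")
    correct = sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / len(ref_words)


# Placeholder tokens: one wrong word out of ten gives 90% accuracy.
print(word_accuracy("a b c d e f g h i X", "a b c d e f g h i j"))  # 0.9
```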

quote:

Gîkûyû is a very tricky language because wrong accent marks might completely distort the meaning.

I will regard these as teething problems, but must warn you that you have chosen a very difficult path, and you have a long way to go. All the same, this is cool, and best of luck.


There's no such thing as an easy problem in computational linguistics. There really is no problem in natural language that we can solve with 100% accuracy with computers. All languages are tricky in their own way. But I really like the challenges I am facing with Bantu languages and Gikuyu in particular. We're just scratching the surface and there's a long road ahead, but it's going to be an interesting one.

Anyway, thank you very much for trying out the demos and keep the comments coming!
 
Posts: 6 | Location: Antwerp, Belgium | Registered: 05 April 2006
"Ithe wa Nyambura na Wambui"
Platinum Member
Picture of sajini
Posted
I have tried some paradigmatic tests on only one type of construction. I would put the degree of accuracy at around 75%. I don't know what others might say. Still, this is a good resource. Is the program available for download?

Input: ciana ciathiite gutua ndare
Output: ciana ciathiĩte gũtua ndare

Input: ikira maguta tawaini
Output: ĩkĩra magũta tawainĩ

Input: njikirira mai ma mbura
Output: njikĩrĩra maĩ ma mbura

Input: maitu athiire guthia mutu
Output: maitũ athiire gũthia mũtũ

Input: kaana karia gekirite githii
Output: kaana karĩa gekĩrĩte gĩthiĩ

Input: Njoroge aikariire giti kia baba
Output: Njoroge aikarĩire gĩtĩ kĩa baba

Input: githuurano kia mwaka turorete
Output: gĩthũũrano kĩa mwaka tũrorete

Input: muciinga wa muthigari
Output: mũciinga wa mũthigarĩ

Input: mwiri wa mutigairi
Output: mwĩrĩ wa mũtigairĩ

Input: ruhiu rwakwa rwina gutu
Output: rũhiũ rwakwa rwĩna gũtũ

Input: ndwara cia maguru
Output: ndwara cia magũrũ

Input: gicugirira kia iria
Output: gĩcũgirĩra kĩa iria

Input: mucemanio wa iguuta
Output: mũcemanio wa iguũta

Input: ituura ria Nairobi
Output: itũũra rĩa Nairobi

Input: mbia cia kuiya
Output: mbĩa cia kũiya

Input: Guuriai gikunjo ibuku ria mathayo murango wa mirongo iiri na inya, kamuhari ga ikumi na igiri. Namba cia maguruini ni ngiri imwe na magana meeri ma mirongo mugwanja na ithano

Output: gũũrĩai gĩkunjo ibuku rĩa mathayo mũrango wa mĩrongo ĩĩrĩ na inya, kamũharĩ ga ikũmi na ĩgĩri. Namba cia magũrũinĩ nĩ ngĩrĩ imwe na magana meeri ma mĩrongo mũgwanja na ithano


Emotions are the greatest enemy of rational arguments
 
Posts: 3133 | Location: Neither here nor there | Registered: 03 May 2005
Guy
Silver Member
Posted
quote:
Originally posted by sajini:
I have tried some paradigmatic tests on only one type of construction. I would put the degree of accuracy at around 75%. I don't know what others might say. Still, this is a good resource. Is the program available for download?


75% sounds about right for the version of the software behind the demo. In the meantime, we have done extensive optimization and we're doing much better now. Most of our time is spent developing the software and writing the publications, so we haven't had time to update the demo yet. As you can imagine, there aren't that many people trying it out, so it's not a priority.

You can download a publication on this topic here:
http://pcger33.cde.ua.ac.be/guy/lrec06.pdf

We will update the demo soon though and then you should see a significant improvement. The software is not yet downloadable, because there's actually a pretty big machine learning architecture behind it.

Again: thanks for taking our demo for a test drive. Much appreciated!
 
Posts: 6 | Location: Antwerp, Belgium | Registered: 05 April 2006
Silver Member
Posted
I have tried the site and I can say it is exciting. For those of us who have to write Gikuyu constantly on a normal keyboard, it can provide an easy way of writing, not only for students but for most people, since I have noticed we are so poor at writing our language.
I tried these sentences

I- Hari mai na mai nikii kiega
O- harĩ maĩ na maĩ, nĩkĩĩ kĩega?
Did not distinguish between 'water' and 'poop'.

I- Kimaara na Maara merana maranirie.
O- kĩmaara na maara merana maranĩrie
Got it right

I- Kaihua na gaicukuru
O- kaihũa na gaicukuru
Did not sense gaicũkũyũ but rather 'school'

I- Kuhoya ti kuiya
O- kũhoya ti kũiya
got it right


I- Nguhoya ngaiu ahe uumiriria
O- ngũhoya ngai ahe ũũmĩrĩria
Got it right

I- Kwanyu ngaatirio ni ngwa!
O- kwanyu ngaatĩrio nĩ ngwa!
Correct...

But
I- Kwanyu ngatirwo na ikingi
O- kwanyu ngatĩrwo na ĩkingĩ
instead of 'kwanyu ngatirwo ni ikĩngĩ'

The problem, it seems to me, is words that have very close pronunciations. For learners this is very dangerous, because the absence of the cap on a single i or u changes the meaning. There is more that needs to be done. One option is to have multiple outputs; that is, the program should be able to produce the different combinations.
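
Producing the different combinations could look roughly like the Python sketch below. It simply enumerates every way of putting a tilde on the i's and u's of a word; the function name `accent_variants` is hypothetical, and nothing here reflects how the demo works internally.

```python
from itertools import product

# Plain vowels that can carry a tilde, mapped to their accented (precomposed) forms.
ACCENTED = {"i": "ĩ", "u": "ũ", "I": "Ĩ", "U": "Ũ"}


def accent_variants(word: str) -> list[str]:
    """Return every spelling of `word` obtained by accenting any subset of its i's and u's."""
    # Each i or u has two options (plain, accented); every other character has one.
    options = [(ch, ACCENTED[ch]) if ch in ACCENTED else (ch,) for ch in word]
    return ["".join(combo) for combo in product(*options)]


print(accent_variants("mai"))  # ['mai', 'maĩ']
print(len(accent_variants("ikingi")))  # 8 candidates for a word with three i's
```

The number of candidates doubles with every i or u, so a program offering them to the user would still need some way to rank or shorten the list.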

Keep up though.


Cia mburi ni hia!
 
Posts: 25 | Location: Kwa Waithaka (No Nongainuka!) | Registered: 09 April 2006
Guy
Silver Member
Posted
quote:
The problem, it seems to me, is words that have very close pronunciations. [...] That is, the program should be able to produce the different combinations.

Keep up though.


Thank you very much for trying out the program. This kind of 'accent restoration' is typically done using a digital dictionary in languages like German and French.

The novelty of our approach is that it circumvents the need for a digital dictionary (which we don't have for Gikuyu) and predicts the placement of accents on the basis of the letters themselves.

So when it sees 'mai' it will infer that there is most likely an accent on 'i', because 'maĩ' is more probable than 'mai' (there's actually more going on than simple statistics, but that's the general idea).
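
As a rough illustration of predicting accents from the letters alone, here is a toy Python sketch. It is not the system behind the demo (which, as noted, does more than simple statistics). It only counts, for each i or u, how often that letter carried a tilde between the same two neighbouring characters in some accented training text. All function names are hypothetical, the input is assumed to use precomposed characters, and the five training words are copied from demo outputs quoted in this thread purely for illustration, not as verified Gikuyu.

```python
from collections import Counter, defaultdict

STRIP = {"ĩ": "i", "ũ": "u"}  # accented -> plain
ADD = {"i": "ĩ", "u": "ũ"}    # plain -> accented


def strip_accents(word: str) -> str:
    """Remove the tilde from every ĩ and ũ."""
    return "".join(STRIP.get(ch, ch) for ch in word)


def contexts(plain_word: str):
    """Yield (index, (previous char, letter, next char)) for each i/u; '#' marks a word edge."""
    padded = "#" + plain_word + "#"
    for i, ch in enumerate(plain_word):
        if ch in ADD:
            yield i, (padded[i], ch, padded[i + 2])


def train(accented_words):
    """Count, per letter context, how often the i/u was accented (True) or plain (False)."""
    counts = defaultdict(Counter)
    for word in accented_words:
        plain = strip_accents(word)
        for i, ctx in contexts(plain):
            counts[ctx][word[i] != plain[i]] += 1
    return counts


def restore(plain_word: str, counts) -> str:
    """Accent an i/u when its context was accented more often than not; unseen contexts stay plain."""
    chars = list(plain_word)
    for i, ctx in contexts(plain_word):
        seen = counts.get(ctx)
        if seen and seen[True] > seen[False]:
            chars[i] = ADD[chars[i]]
    return "".join(chars)


# Illustrative training words only (assumption: taken from outputs quoted in this thread).
model = train(["maĩ", "kĩa", "rĩa", "nĩ", "cia"])
print(restore("mai", model))  # maĩ
print(restore("cia", model))  # cia
```

Because every decision is made from the surrounding letters, the same spelling always gets the same output.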

This approach is therefore limited in that it cannot disambiguate at the word level. That's why it cannot, and never will, distinguish between water and poop. To tackle this problem, we need to involve syntactic and semantic analysis. And that's where it gets really interesting.

Note to self: next time I come to Kenya, be careful when asking for water in Gikuyu :)
 
Posts: 6 | Location: Antwerp, Belgium | Registered: 05 April 2006
<PGithinji>
Posted
Impressive.
Tried it with the few words I learned.
quote:
http://pcger33.cde.ua.ac.be/gikuyuchat

This page is not available :(
quote:
Note to self: next time I come to Kenya, be careful when asking for water in Gikuyu

maĩ =maĩ ? Confused
 