Comments on "Truly Madly Wordly: Creating a text corpus from Wikipedia" (Stephen Marquard)

John (2017-12-20):
What are the command-line arguments for the mkvocab.pl script? Please help!

Anonymous (2017-08-03):
Hi, first of all thank you for your blog; it has really helped me out. Secondly, I am a bit stuck trying to use the Perl script. Would you have an example of how you converted the HTML output from the Java code to clean text? At the moment I'm running the Java program on the XML dump file, and I thought that while it's running I'd try to get the next step working on some smaller data before it finishes.

Anonymous (2017-05-27):
No sir, actually I am trying to prepare the corpus manually. It would be helpful if somebody could share the information.

Anonymous (2017-05-08):
@Habi, did you get any help?

Habi (2017-05-04):
Sir, I want to extract some books or articles in the Malayalam language from the web and also want to create the language model. Can you please help me to do this? What toolkits are available for pre-processing the text?

Anonymous (2017-03-17):
Hi, I need to recognize only 100 specific words, so I don't want other unnecessary words in my dictionary. How could I do that? Please help me.

Anonymous (2016-11-01):
Hi, did you find a solution? I need to do this for the Korean language, so any help would be welcome. Can you please contact me at banna.kbet@gmail.com?

rezaee (2016-10-01):
Hi, I found the link to this page from the CMUSphinx language-model building page: http://cmusphinx.sourceforge.net/wiki/tutoriallm
I have downloaded and installed gwtwiki, but I don't know how to use it. I have also downloaded the wiki dump file in .xml. Could you explain step number 2 in more detail? Thanks a lot.

Abhishek Sharma (2014-12-12):
Is your final model or cleaned wiki corpus available for download?

Chris (2014-03-05):
Out of interest, what version of mitlm did you use? I ask because I get segmentation faults when using the same commands (estimate-ngram). Thanks.

Anonymous (2014-01-22):
How do I need to change Wikipedia2Txt.java so that it leaves paragraphs in one piece, i.e. does not break them into separate sentences?

Anonymous (2014-01-02):
Hey, I dug up this article because I am trying to get PocketSphinx working in French. I was thinking about aggregating several French books into a single file and jumping straight to step 3. My understanding is that this would do the trick, right? Regards.

nob0dy (2013-08-16):
A naive solution will run in O(n): suppose phrases are made of 2 words; then split a sentence of n words into n-1 "phrases" and calculate.

Anonymous (2013-06-12):
We can assume this could be done with any of the Wikipedias, for example French or Spanish. This would appear to pick out single morphemes well. Is it adaptable to pulling out multi-word collocations? For example, "food poisoning", "an easy read", "light snack", and "absorbed in her book", while made up of adjective and noun or verb and prepositional or adverbial phrase, act as commonly occurring sequences of words whose collocation means more or other than the "sum of its parts". Are any algorithms available for extracting collocative phrases?

Vignesh C (2013-03-23):
Thanks for your post. Could you upload the .lm file you created with these steps?

Anonymous (2013-03-20):
Thank you very much! With your help, I solved the whole problem in just one hour :)

Stephen Marquard (2011-04-30):
Marwen and danajaatcse, to do that you probably don't want to use the technique described here of processing the entire large Wikipedia dump file, because most of the articles are not relevant.
To construct a corpus on a specific topic, you'd need a collection of articles that are related to the topic. One approach would be to use Wikipedia categories or Wikipedia Book pages, and retrieve all linked articles.
Another approach, which I am working on at the moment, is to start with a Wikipedia search for some keywords, and then follow links based on the similarity of articles to each other. I will post a new article about that when I'm done. My approach uses vector space modelling and LSA to establish similarity, using the gensim toolkit (http://nlp.fi.muni.cz/projekty/gensim/).

Marwen (2011-04-30):
Hi! I would also like to know whether there is a way to extract a corpus from only those articles that are specific to a given domain, e.g. the water domain [water, river, tsunami, purification, freshwater...].

Anonymous (2011-04-08):
Say I wanted to extract articles on history or computer science?

Stephen Marquard (2011-04-07):
How do you define "a specific domain" in the Wikipedia context?

Anonymous (2011-04-07):
Hi, do you know a quick way to extract and create the corpus for only the articles of a specific domain?
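
The collocation question in the thread (2013-06-12) is commonly answered by scoring adjacent word pairs with pointwise mutual information (PMI), building directly on nob0dy's observation that a sentence of n words yields n-1 candidate two-word phrases. A minimal sketch in plain Python follows; the function name, the toy corpus, and the `min_count` threshold are invented for illustration and are not from the original post or comments:

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    Pairs that co-occur far more often than chance score high,
    which makes them collocation candidates."""
    unigrams = Counter(tokens)
    # nob0dy's n-1 split: every adjacent pair in the token stream.
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:  # skip rare pairs; their PMI is unreliable
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# On a toy corpus where "food poisoning" recurs, that pair outranks
# incidental neighbours such as punctuation-word pairs.
tokens = ("food poisoning is bad . food poisoning hurts . "
          "the food is good . the snack is light . a light snack helps").split()
print(pmi_bigrams(tokens))
```

On a real Wikipedia-derived corpus you would raise `min_count` substantially, since raw PMI overrates rare pairs; a likelihood-ratio test is the usual better-behaved alternative for low counts.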