I Love Natural Language Processing

Moses Support Digest: Moses Error in training phrase

[Moses-support] Moses: Error in training phrase model

Hi,

I have compiled Moses, GIZA++ and SRILM on Fedora Core 11 using the steps
described in http://www.statmt.org/moses_steps.html and other Moses support
links.

While training on my parallel corpus of English and Hindi (~100,000 sentences
each), I get the error shown below when I execute:

nohup nice
./tools/moses-scripts/scripts-20091002-0031//training/train-factored-phrase-model.perl
-scripts-root-dir ./tools/moses-scripts/scripts-20091002-0031/ -root-dir
work3 -corpus ./work3/corpus/IRL-clean -f hi2 -e en2 -alignment
grow-diag-final-and -reordering msd-bidirectional-fe -lm
0:3:/home/danish/FIRE2010/work3/lm/IRL-en.lm >& ./work3/training.out &

In one step of the training process, I get the following error and the tool
quits:

*Last few lines of output (training.out) :*

Use of uninitialized value $a in split at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 856.
Use of uninitialized value $a in scalar chomp at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 853.
Use of uninitialized value $a in split at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 856.
Use of uninitialized value $a in scalar chomp at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 853.
Use of uninitialized value $a in split at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 856.
Use of uninitialized value $a in scalar chomp at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 853.
Use of uninitialized value $a in split at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 856.
Use of uninitialized value $a in scalar chomp at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 853.
Use of uninitialized value $a in split at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 856.
Use of uninitialized value $a in scalar chomp at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 853.
Use of uninitialized value $a in split at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 856.

Saved: ./work3//model/lex.f2e and ./work3//model/lex.e2f
FILE: ./work3/corpus/IRL-clean.en2
FILE: ./work3/corpus/IRL-clean.hi2
FILE: ./work3//model/aligned.grow-diag-final-and
(5) extract phrases @ Sat Oct 3 02:46:00 IST 2009
./tools/moses-scripts//scripts-20091002-0031//training/phrase-extract/extract
./work3/corpus/IRL-clean.en2 ./work3/corpus/IRL-clean.hi2
./work3//model/aligned.grow-diag-final-and ./work3//model/extract 7
--NoFileLimit orientation
Executing:
./tools/moses-scripts//scripts-20091002-0031//training/phrase-extract/extract
./work3/corpus/IRL-clean.en2 ./work3/corpus/IRL-clean.hi2
./work3//model/aligned.grow-diag-final-and ./work3//model/extract 7
--NoFileLimit orientation
PhraseExtract v1.4, written by Philipp Koehn
phrase extraction from an aligned parallel corpus
………Executing: gzip ./work3//model/extract.inv
gzip: ./work3//model/extract.inv: No such file or directory
Exit code: 1
ERROR at
./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl
line 963.

My clean sentence files have the extensions hi2 (for Hindi) and en2 (for
English).
I have tried the solutions available on the Moses support forums for similar
problems, but they have not helped.

The following is a listing of the files & folders in my work folder (work3)

*corpus* folder
total 76384
-rw-rw-r--. 1 danish danish 27717737 2009-10-02 23:29 IRL-clean.hi2
-rw-rw-r--. 1 danish danish 11502887 2009-10-02 23:29 IRL-clean.en2
-rw-r--r--. 1 root root 1781671 2009-10-03 17:44 hi2.vcb.classes
-rw-r--r--. 1 root root 1579583 2009-10-03 17:44 hi2.vcb.classes.cats
-rw-r--r--. 1 root root 704087 2009-10-03 17:50 en2.vcb.classes
-rw-r--r--. 1 root root 534277 2009-10-03 17:50 en2.vcb.classes.cats
-rw-r--r--. 1 root root 2158362 2009-10-03 17:50 hi2.vcb
-rw-r--r--. 1 root root 1013926 2009-10-03 17:50 en2.vcb
-rw-r--r--. 1 root root 15605740 2009-10-03 17:50 hi2-en2-int-train.snt
-rw-r--r--. 1 root root 15605740 2009-10-03 17:51 en2-hi2-int-train.snt

*giza.en2-hi2* folder
total 124088
-rw-r--r--. 1 root root 109989326 2009-10-03 18:44 en2-hi2.cooc
-rw-r--r--. 1 root root 1651 2009-10-03 18:44 en2-hi2.gizacfg
-rw-r--r--. 1 root root 17070807 2009-10-03 19:22 en2-hi2.A3.final.gz

*giza.hi2-en2* folder
total 124052
-rw-r--r--. 1 root root 110088686 2009-10-03 17:51 hi2-en2.cooc
-rw-r--r--. 1 root root 1651 2009-10-03 17:51 hi2-en2.gizacfg
-rw-r--r--. 1 root root 16928263 2009-10-03 18:43 hi2-en2.A3.final.gz

*lm* folder
total 100388
-rw-rw-r--. 1 danish danish 27717737 2009-10-02 23:29 IRL-clean.hi2
-rw-rw-r--. 1 danish danish 11502887 2009-10-02 23:29 IRL-clean.en2
-rw-r--r--. 1 root root 22834140 2009-10-03 17:29 IRL-en.lm
-rw-r--r--. 1 root root 40731568 2009-10-03 17:30 IRL-hi.lm

*model* folder
total 7992
-rw-r--r--. 1 root root 0 2009-10-03 19:23 aligned.grow-diag-final-and
-rw-r--r--. 1 root root 4089006 2009-10-03 19:23 lex.f2e
-rw-r--r--. 1 root root 4089006 2009-10-03 19:23 lex.e2f

I can see that the model folder does not contain the extract.inv file, which
seems to cause the error. I have redone the steps three times and face the
exact same error each time.

I have ensured that the parallel text has been lowercased (for English) and
cleaned (both English and Hindi).
May I request you to kindly help me resolve this issue at the earliest.
Thanks!

Thank you,
Regards,

Danish Contractor


Re: [Moses-support] Moses: Error in training phrase model

Hi,

The problem lies in the word alignment step (step 3). You can run that step in isolation to check in more detail what is going wrong.
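
One symptom is already visible in the listing of the model folder above: aligned.grow-diag-final-and has a size of 0 bytes, so the phrase extraction step has no alignments to read. A minimal sketch of a check for such empty outputs follows. The function name find_empty_files is hypothetical and is not part of the Moses scripts; the demo directory is a throwaway stand-in for ./work3/model.

```python
import os
import tempfile


def find_empty_files(directory):
    """Return the sorted names of regular files in `directory` that are 0 bytes."""
    empty = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(name)
    return empty


if __name__ == "__main__":
    # Demo on a temporary directory that mimics the model folder in the thread.
    with tempfile.TemporaryDirectory() as model_dir:
        # An empty alignment file, as in the listing.
        open(os.path.join(model_dir, "aligned.grow-diag-final-and"), "w").close()
        # A non-empty lexical table (the content is a made-up placeholder).
        with open(os.path.join(model_dir, "lex.f2e"), "w") as handle:
            handle.write("placeholder\n")
        print(find_empty_files(model_dir))
```

Running such a check after each training step would flag the failed alignment before the later gzip error appears.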

One common problem with word alignment is that GIZA++ is sensitive
to bad data, i.e. empty lines, long sentences, or excessive mismatch in sentence length. The clean-corpus-n.perl script is designed to take care of these problems. Did you run this on your original corpus?
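
The three kinds of bad data named above can be looked for directly. The following is a rough sketch, not a replacement for clean-corpus-n.perl: the function name check_parallel_corpus is hypothetical, the length bounds of 1 and 60 follow the values mentioned in this thread, and the length ratio of 9 is an assumption chosen for illustration.

```python
def check_parallel_corpus(src_lines, tgt_lines, min_len=1, max_len=60, max_ratio=9.0):
    """Return (line_number, reason) pairs for sentence pairs that may upset word alignment.

    Line numbers are 1-based. Lengths are counted in whitespace-separated tokens.
    """
    problems = []
    if len(src_lines) != len(tgt_lines):
        # Line 0 is used to report a problem with the files as a whole.
        problems.append((0, "line counts differ"))
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            problems.append((number, "empty line"))
        elif n_src < min_len or n_tgt < min_len:
            problems.append((number, "too short"))
        elif n_src > max_len or n_tgt > max_len:
            problems.append((number, "too long"))
        elif max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            problems.append((number, "length mismatch"))
    return problems


if __name__ == "__main__":
    # Tiny made-up example; real input would be the lines of the two corpus files.
    source = ["a b c", "", "a"]
    target = ["x y z", "q", "p " * 10]
    for number, reason in check_parallel_corpus(source, target):
        print(number, reason)
```

If a corpus that has already been cleaned still produces reports here, the cleaned files passed to training may not be the ones that were checked.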

-phi

Re: [Moses-support] Moses: Error in training phrase model

Hi,

Thanks for the reply. Yes, I did run the clean-corpus-n.perl script. I also had to replace all occurrences of “|” in the Hindi text with another character, as it seems “|” is of special significance to the scripts.

The “|” is used in the Hindi language as a full stop (“.”, the end-of-sentence marker).
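
In Moses factored corpora, "|" separates the factors of a token, which is why it has to be removed from plain text before training. A minimal sketch of the replacement described above follows. The function name replace_pipes is hypothetical, and the choice of the Devanagari danda (U+0964) as the replacement in the demo is an assumption for illustration; the thread does not say which character was actually used.

```python
def replace_pipes(line, replacement):
    """Replace every "|" in `line` with `replacement`.

    The replacement must not itself contain "|", otherwise nothing is gained.
    """
    if "|" in replacement:
        raise ValueError("replacement must not contain '|'")
    return line.replace("|", replacement)


if __name__ == "__main__":
    # A made-up Hindi sentence ending in "|" used as a full stop.
    sentence = "यह एक वाक्य है |"
    print(replace_pipes(sentence, "\u0964"))
```

Whatever character is chosen, the same replacement has to be applied to any text later given to the decoder so that training and test data stay consistent.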

Could you please let me know if there is a limit on the maximum sentence length? I gave a range of 1 to 60 when running the script. In addition, is there a limit on the allowable difference in sentence length between the two sides of the parallel text?

Thanks.
–Danish

Re: [Moses-support] Moses: Error in training phrase model

Hi,

GIZA++ has a limit of 100 words per sentence. It usually makes little sense to include sentences longer than 60 words in training, since the word alignment is difficult to compute.

-phi

NOTICE: This is digested from the Moses-support mailing list, which provides support for the Moses SMT decoder.

Written by 52nlp

December 16th, 2009 at 9:34 pm

Posted in Moses,SMT