Error “join: multi-character tab `\\t'” when using a tab separator with join

The default separator for Linux join is whitespace. When I tried to use a tab (“\t”) instead of whitespace, with the command “join file1 file2 -t "\t"”, the following error occurred:

join: multi-character tab `\t'

The fix, found via Google, is to use the following command:

join file1 file2 -t $'\t'

I learned this from Stack Overflow and am recording it here as a note: http://stackoverflow.com/questions/1722353/unix-join-separator-char
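
The same pitfall can show up when join is driven from a script. The following is only a minimal Python sketch, assuming the join utility is installed and on the PATH; the helper name join_on_tab is hypothetical. It passes a single real tab character as the -t argument, which is what $'\t' produces in bash, whereas the two-character string backslash + t is what join rejects as a “multi-character tab”.

```python
import subprocess

TAB = "\t"          # one character: a real tab, like bash's $'\t'
NOT_A_TAB = r"\t"   # two characters: backslash + t, which join rejects


def join_on_tab(file1, file2):
    """Run `join -t <TAB> file1 file2` and return its standard output.

    Hypothetical helper; assumes the `join` utility is installed and
    both files are already sorted on the first field.
    """
    result = subprocess.run(
        ["join", "-t", TAB, file1, file2],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if join reports an error
    )
    return result.stdout
```

Because the arguments are passed as a list, no shell quoting is involved, so the tab reaches join unchanged.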

Man Join

Name
join – join lines of two files on a common field
Synopsis
join [OPTION]… FILE1 FILE2
Description

For each pair of input lines with identical join fields, write a line to standard output. The default join field is the first, delimited by whitespace. When FILE1 or FILE2 (not both) is -, read standard input.

-a FILENUM
print unpairable lines coming from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
-e EMPTY
replace missing input fields with EMPTY
-i, --ignore-case
ignore differences in case when comparing fields
-j FIELD
equivalent to ‘-1 FIELD -2 FIELD’
-o FORMAT
obey FORMAT while constructing output line
-t CHAR
use CHAR as input and output field separator
-v FILENUM
like -a FILENUM, but suppress joined output lines
-1 FIELD
join on this FIELD of file 1
-2 FIELD
join on this FIELD of file 2
--help
display this help and exit
--version
output version information and exit

Unless -t CHAR is given, leading blanks separate fields and are ignored, else fields are separated by CHAR. Any FIELD is a field number counted from 1. FORMAT is one or more comma or blank separated specifications, each being ‘FILENUM.FIELD’ or ‘0’. Default FORMAT outputs the join field, the remaining fields from FILE1, the remaining fields from FILE2, all separated by CHAR.

Important: FILE1 and FILE2 must be sorted on the join fields.
Author
Written by Mike Haertel.
Reporting Bugs
Report bugs to .
Copyright
Copyright © 2006 Free Software Foundation, Inc.
This is free software. You may redistribute copies of it under the terms of the GNU General Public License . There is NO WARRANTY, to the extent permitted by law.
See Also
The full documentation for join is maintained as a Texinfo manual. If the info and join programs are properly installed at your site, the command

info join

should give you access to the complete manual.
Referenced By
combine(1)

Posted in Programming Skill

Two methods to redirect output for “set -x”

I found that ordinary output redirection (>) does not capture the output of “set -x”, because the trace is written to standard error rather than standard output. Here are two methods to redirect the output of “set -x”, found via Google and from my co-worker:

1: ./test.sh > test.log 2>&1

2: ./test.sh 2>&1 | tee -a test.log
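
Both commands work because bash writes the “set -x” trace to standard error, so standard error has to be merged into standard output before it can be captured. As a rough illustration only, the same idea can be sketched in Python, assuming bash is on the PATH; the helper names run_traced and run_stdout_only are hypothetical. Here stderr=subprocess.STDOUT plays the role of 2>&1.

```python
import subprocess


def run_traced(script):
    """Run a bash snippet with tracing on; return stdout and stderr merged.

    Hypothetical helper; assumes `bash` is on the PATH.
    """
    result = subprocess.run(
        ["bash", "-x", "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1
        text=True,
    )
    return result.stdout


def run_stdout_only(script):
    """Same command, but capture only standard output (like a plain `>`)."""
    result = subprocess.run(
        ["bash", "-x", "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # the trace goes to stderr and is dropped
        text=True,
    )
    return result.stdout
```

Calling run_traced("echo hello") returns the trace line together with the command output, while run_stdout_only("echo hello") returns only the command output.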

About “tee” from linux man page:

Name
tee – read from standard input and write to standard output and files
Synopsis
tee [OPTION]… [FILE]…
Description

Copy standard input to each FILE, and also to standard output.
-a, --append
append to the given FILEs, do not overwrite
-i, --ignore-interrupts
ignore interrupt signals
--help
display this help and exit
--version
output version information and exit
If a FILE is -, copy again to standard output.
Author
Written by Mike Parker, Richard M. Stallman, and David MacKenzie.
Reporting Bugs
Report bugs to .
Copyright
Copyright © 2006 Free Software Foundation, Inc.
This is free software. You may redistribute copies of it under the terms of the GNU General Public License . There is NO WARRANTY, to the extent permitted by law.
See Also
The full documentation for tee is maintained as a Texinfo manual. If the info and tee programs are properly installed at your site, the command
info tee
should give you access to the complete manual.
Referenced By
auto-build(1), pee(1), tee(2), tpipe(1)

Posted by 52nlp

Posted in Programming Skill

“set -x”: Prints executed commands and their arguments

I wrote a shell script, and every time I wanted to print the executed commands and their arguments I used “echo”, which is not convenient. When I used my co-worker’s shell script, I found that it prints a lot of useful information, including the executed commands and their arguments. I looked at the code and was very surprised that it contains only a few echo statements. I’m not familiar with shell, but I found that he puts “set -x” at the beginning of every script. So I googled “set -x”:

set -x: Prints executed commands and their arguments

For me, one “set -x” can replace many “echo”s. It seems that I should learn shell scripting systematically, and “Learning the bash Shell, Third Edition” may be the best choice for me:

This refreshed edition serves as the most valuable guide yet to the bash shell. It’s full of practical examples of shell commands and programs guaranteed to make everyday use of Linux that much easier. Includes information on key bindings, command line editing and processing, integrated programming features, signal handling, and much more!

Posted by 52nlp

Posted in Programming Skill

How to start new process or run shell commands within python?

Starting a new process or running shell commands is a common task in Python, and there are many ways to do it. After googling the problem, I found that the existing summaries of “start new process or run shell commands within python” are messy. Following is my own summary, which covers os.system(), os.popen(), the commands module and the subprocess module. It references some materials from the web, and I keep it here as a note.

1. os.system(command)

Execute the command (a string) in a subshell. This is implemented by calling the Standard C function system(), and has the same limitations. Changes to sys.stdin, etc. are not reflected in the environment of the executed command.

On Unix, the return value is the exit status of the process encoded in the format specified for wait(). Note that POSIX does not specify the meaning of the return value of the C system() function, so the return value of the Python function is system-dependent.

On Windows, the return value is that returned by the system shell after running command, given by the Windows environment variable COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always 0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit status of the command run; on systems using a non-native shell, consult your shell documentation.

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

os.system(command) is the first way I found to run shell commands within Python, and it is simple:

>>> import os
>>> os.system("md5sum -c file.md5")
file: OK
0

Here the “0” is the return value of os.system("md5sum -c file.md5"), i.e. the exit status of the process encoded in the format specified for wait(). As noted above, this value is system-dependent. But I need the “OK” itself to tell my Python program to proceed to the next step, so os.system() is not my first choice.
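
Since os.system() only hands back the exit status, one way to get at the printed text (the “OK” above) is the subprocess module that the documentation recommends. The following is a minimal sketch, assuming Python 3.7 or later; the helper name run_and_capture is hypothetical, and a small Python one-liner stands in for md5sum so that the example runs anywhere.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command (list of arguments); return (exit_code, stdout_text)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


# Stand-in for `md5sum -c file.md5`: a child process that prints "file: OK".
code, output = run_and_capture([sys.executable, "-c", "print('file: OK')"])

# Decide the next step from the captured text, not only from the exit code.
if code == 0 and output.strip().endswith("OK"):
    print("checksum verified, continuing")
```

In real use the argument list would be something like ["md5sum", "-c", "file.md5"]; passing a list rather than a single string avoids going through a shell.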

Posted in Programming Skill

Mean Absolute Error (MAE) and Mean Square Error (MSE)

I need to calculate the MAE and MSE values for a model we trained. Following is a summary of Mean Absolute Error (MAE) and Mean Square Error (MSE), taken from Wikipedia and other web sites, which I note here.

Mean Absolute Error (MAE):
In statistics, the mean absolute error is a quantity used to measure how close forecasts or predictions are to the eventual outcomes. The mean absolute error (MAE) is given by

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |f_i - y_i| = \frac{1}{n} \sum_{i=1}^{n} |e_i|$$

As the name suggests, the mean absolute error is an average of the absolute errors |e_i| = |f_i − y_i|, where f_i is the prediction and y_i the true value. Note that alternative formulations may include relative frequencies as weight factors.

The mean absolute error is a common measure of forecast error in time series analysis, where the term “mean absolute deviation” is sometimes used in confusion with the more standard definition of mean absolute deviation. The same confusion exists more generally.

The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables. The equation is given in the library references. Expressed in words, the MAE is the average over the verification sample of the absolute values of the differences between forecast and the corresponding observation. The MAE is a linear score which means that all the individual differences are weighted equally in the average.
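
As a small worked example of the description above, the MAE can be computed directly from paired forecasts and observations. This is only a sketch: the function name mean_absolute_error and the sample numbers are made up for illustration.

```python
def mean_absolute_error(predictions, targets):
    """Average of |f_i - y_i| over paired predictions f_i and true values y_i."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(f - y) for f, y in zip(predictions, targets))
    return total / len(predictions)


# Made-up numbers, only for illustration.
forecasts = [2.5, 0.0, 2.0, 8.0]
observed = [3.0, -0.5, 2.0, 7.0]
print(mean_absolute_error(forecasts, observed))  # 0.5
```

The absolute errors here are 0.5, 0.5, 0 and 1, so every difference contributes in proportion to its size, which is what “linear score” means.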

Mean Square Error (MSE):
In statistics, the mean square error or MSE of an estimator is one of many ways to quantify the difference between an estimator and the true value of the quantity being estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. MSE measures the average of the square of the “error.” The error is the amount by which the estimator differs from the quantity to be estimated. The difference occurs because of randomness or because the estimator doesn’t account for information that could produce a more accurate estimate.

The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator and its bias. For an unbiased estimator, the MSE is the variance. Like the variance, MSE has the same unit of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root mean squared error or RMSE, which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error.

Definition and basic properties

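The MSE of an estimator \hat{\theta} with respect to the estimated parameter θ is defined as

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{E}\left[(\hat{\theta} - \theta)^2\right]$$

The MSE is equal to the sum of the variance and the squared bias of the estimator:

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \left(\operatorname{Bias}(\hat{\theta}, \theta)\right)^2$$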

The MSE thus assesses the quality of an estimator in terms of its variation and unbiasedness. Note that the MSE is not equivalent to the expected value of the absolute error.
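
The decomposition into variance plus squared bias can be checked numerically on a finite sample, with averages standing in for expectations. The sketch below is an empirical analogue under that assumption, not the definition itself; the function names and the numbers are made up for illustration.

```python
import math


def mse(estimates, theta):
    """Mean of the squared errors (estimate - theta)^2."""
    return sum((e - theta) ** 2 for e in estimates) / len(estimates)


def variance(estimates):
    """Population variance (divide by n) of the estimates."""
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


def bias(estimates, theta):
    """Mean estimate minus the true value."""
    return sum(estimates) / len(estimates) - theta


# Made-up repeated estimates of a true value theta = 10.
theta = 10.0
estimates = [9.0, 11.0, 12.0, 12.0]

total = mse(estimates, theta)                             # 2.5
parts = variance(estimates) + bias(estimates, theta) ** 2  # 1.5 + 1.0
rmse = math.sqrt(total)  # same units as the estimated quantity
```

With these numbers the variance is 1.5 and the bias is 1, so both sides come to 2.5.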

Since MSE is an expectation, it is a scalar, and not a random variable. It may be a function of the unknown parameter θ , but it does not depend on any random quantities. However, when MSE is computed for a particular estimator of θ the true value of which is not known, it will be subject to estimation error. In a Bayesian sense, this means that there are cases in which it may be treated as a random variable.

Alternative usage

The term mean squared error is sometimes used to refer to residual sum of squares, divided by the number of degrees of freedom. This is an observed quantity, whereas the definition above is a function of an often unknown parameter. For more details, see errors and residuals in statistics.

Posted by 52nlp

Posted in Mathematics

“Nohup” and “Screen”

“Nohup” and “Screen” can be used to run a command even if the session is disconnected or the user logs out. I use them both, but “Screen” is better.

What is nohup? As the man page states:

“nohup – run a command immune to hangups, with output to a non-tty
Synopsis. Run COMMAND, ignoring hangup signals.”

How to use nohup:
$ nohup ./command.sh &
The shell then shows:
[~]$ appending output to nohup.out
Press Enter to get the prompt back; the output of the command is saved in nohup.out.

What is screen? As the man page states:

“Screen is a full-screen window manager that multiplexes a physical terminal between several processes (typically interactive shells). Each virtual terminal provides the functions of a DEC VT100 terminal and, in addition, several control functions from the ISO 6429 (ECMA 48, ANSI X3.64) and ISO 2022 standards (e.g. insert/delete line and support for multiple character sets). There is a scrollback history buffer for each virtual terminal and a copy-and-paste mechanism that allows moving text regions between windows.

When screen is called, it creates a single window with a shell in it (or the specified command) and then gets out of your way so that you can use the program as you normally would. Then, at any time, you can create new (full-screen) windows with other programs in them (including more shells), kill existing windows, view a list of windows, turn output logging on and off, copy-and-paste text between windows, view the scrollback history, switch between windows in whatever manner you wish, etc. All windows run their programs completely independent of each other. Programs continue to run when their window is currently not visible and even when the whole screen session is detached from the user’s terminal. When a program terminates, screen (per default) kills the window that contained it. If this window was in the foreground, the display switches to the previous window; if none are left, screen exits.”

How to use screen?
1. Create a task:
$ screen -S task
2. Execute a command in the task window. If your task has not finished, press
Ctrl+a, then d
to detach from the session and leave the task running. It will show the following info:
[detached]

If your task has finished, use “exit” to exit screen:
$ exit
[screen is terminating]

3. You can use “screen -ls” to list your screen sessions:
$ screen -ls
There is a screen on:
10000.task (Detached)

4. Use “screen -r” to reattach to the task:
$ screen -r 10000

Posted by 52nlp

Posted in Programming Skill

Reference of Scheme or Lisp

These are references on Scheme and Lisp recommended by Dr. Wang:

1. The Origin of Scheme programming language:
http://groups.csail.mit.edu/mac/projects/scheme/

2. Compiler and Interpreter
http://www.gnu.org/software/mit-scheme/

3. Learning Materials
http://www.scheme.com/tspl3/

4. The Mathematical Model behind the Lisp Language
http://www.inf.fu-berlin.de/lehre/WS03/alpi/lambda.pdf

About Scheme (from Wikipedia):
Scheme is one of the two main dialects of the programming language Lisp. Unlike Common Lisp, the other main dialect, Scheme follows a minimalist design philosophy specifying a small standard core with powerful tools for language extension. Its compactness and elegance have made it popular with educators, language designers, programmers, implementors, and hobbyists, and this diverse appeal is seen as both a strength and, because of the diversity of its constituencies and the wide divergence between implementations, one of its weaknesses.

Scheme was developed at the MIT AI Lab by Guy L. Steele and Gerald Jay Sussman who introduced it to the academic world via a series of memos, now referred to as the Lambda Papers, over the period 1975-1980. The Scheme language is standardized in the official IEEE standard, and a de facto standard called the Revised^n Report on the Algorithmic Language Scheme (RnRS). The most widely implemented standard is R5RS (1998), and a new standard R6RS was ratified in 2007.

Scheme was the first dialect of Lisp to choose lexical scope and the first to require implementations to perform tail-call optimization. It was also one of the first programming languages to support first-class continuations. It had a significant influence on the effort that led to the development of its sister, Common Lisp.

Posted by 52nlp

Posted in Programming Skill

Google’s Protocol Buffers: Beautiful and Easy

I ran into a problem when I wanted to store a “vector<int>” as data in Berkeley DB, and Dr. Wang told me that Protocol Buffers may be one of the best choices. Until then Protocol Buffers were a stranger to me; I had not even heard of them. Now it’s time to learn Protocol Buffers.

First, what are Protocol Buffers? The following is from the official Google Code home page:

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages – Java, C++, or Python.

For a detailed overview of Protocol Buffers, I recommend:
http://code.google.com/apis/protocolbuffers/docs/overview.html

Second, how to get started with Protocol Buffers?
First, download and install it:
http://code.google.com/p/protobuf/downloads/list

Choose a suitable package to download and install it according to the INSTALL.txt file in the package. The simplest way to compile this package on Linux is:
1. `cd' to the directory containing the package's source code and type `./configure' to configure the package for your system.
2. Type `make' to compile the package.
3. Optionally, type `make check' to run any self-tests that come with the package.
4. Type `make install' to install the programs and any data files and documentation.
5. You can remove the program binaries and object files from the source code directory by typing `make clean'.

Then you can use Google’s Protocol Buffers. I chose the “Protocol Buffer Basics: C++” tutorial as the starting point: http://code.google.com/apis/protocolbuffers/docs/cpptutorial.html.

This tutorial provides a basic C++ programmer’s introduction to working with protocol buffers. By walking through creating a simple example application, it shows you how to
• Define message formats in a .proto file.
• Use the protocol buffer compiler.
• Use the C++ protocol buffer API to write and read messages.

This isn’t a comprehensive guide to using protocol buffers in C++. For more detailed reference information, see the Protocol Buffer Language Guide, the C++ API Reference, the C++ Generated Code Guide, and the Encoding Reference.

For my problem, first define a doc_id.proto:
message DocId {
  repeated int32 id = 1;
}

Then compile it with the following command:
protoc --cpp_out=./ doc_id.proto

This generates the following files in the same directory:
• doc_id.pb.h, the header which declares your generated classes.
• doc_id.pb.cc, which contains the implementation of your classes.

Now, my C++ program can include the doc_id.pb.h file and use the related Protocol Buffer API:

// repeated int32 id = 1;
inline int id_size() const;
inline void clear_id();
static const int kIdFieldNumber = 1;
inline ::google::protobuf::int32 id(int index) const;
inline void set_id(int index, ::google::protobuf::int32 value);
inline void add_id(::google::protobuf::int32 value);
inline const ::google::protobuf::RepeatedField< ::google::protobuf::int32 >&
    id() const;
inline ::google::protobuf::RepeatedField< ::google::protobuf::int32 >*
    mutable_id();

But my problem is how to put the “vector<int>” as data into Berkeley DB. How can I do this with Protocol Buffers? As the tutorial explains, each protocol buffer class has methods for writing and reading messages of your chosen type using the protocol buffer binary format. These include:

* bool SerializeToString(string* output) const;: serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
* bool ParseFromString(const string& data);: parses a message from the given string.
* bool SerializeToOstream(ostream* output) const;: writes the message to the given C++ ostream.
* bool ParseFromIstream(istream* input);: parses a message from the given C++ istream.

SerializeToString is the key function for me: it serializes the “vector<int>” (held in the repeated id field) into a byte string that can be stored in Berkeley DB:

string docid_string;
if (!docid.SerializeToString(&docid_string)) {
  // serialization failed; handle the error here
}

When I want to get the “vector<int>” back from Berkeley DB, I can use ParseFromString:

if (docid.ParseFromString(docid_string)) {
  // parsing succeeded; docid.id(i) now returns the stored values
}

That’s all! Google’s Protocol Buffers are really beautiful and easy!

Posted by 52nlp

Posted in Programming Skill

Moses Support Digest: CALL FOR PAPERS – PBML

[Moses-support] CALL FOR PAPERS – PBML

CALL FOR PAPERS:
OPEN SOURCE TOOLS FOR MACHINE TRANSLATION
The Fifth Machine Translation Marathon, which will take place September
13-18 in Le Mans, France, is hosting an Open Source Convention to
advance the state of the art in machine translation. The MT Marathon
is organised by the computer science laboratory of Le Mans (LIUM) and the
University of Le Mans on behalf of the EuroMatrixPlus Consortium.
We invite developers of open source tools to present their work and
submit a paper of up to 10 pages that (a) describes the underlying
methodology and (b) includes instructions how to use the tools.
We are looking for stand-alone tools and extensions of existing tools,
such as the Moses open source systems. Accepted papers will be
presented during the MT Marathon and published as a special issue of
the Prague Bulletin of Mathematical Linguistics
(http://ufal.mff.cuni.cz/pbml).

Possible topics:
* training of machine translation models
* machine translation decoders
* tuning of machine translation systems
* evaluation of machine translation
* visualization, annotation or debugging tools
* tools for human translators
* interfaces for web-based services or APIs
* extensions of existing tools
* other tools for machine translation
This is the third time that the MT Marathon will host the Open Source
Convention. The papers from previous years are available online:

http://ufal.mff.cuni.cz/pbml-91-100.html

Papers will be reviewed by two reviewers appointed by the program
committee, e.g. papers which will be marked as the best by the reviewers
(with no substantial modifications required) will be printed as usual in
the PBML journal in time for the MT Marathon.
Others will be printed after requested modifications in the next PBML
issue.
Important dates:
Deadline for paper submission: August 1 2010
Notification of acceptance: August 7 2010
Camera-ready paper due: August 14 2010
Presentations: September 13-17 2010 (at the MT Marathon in Le Mans)
Camera-ready (next issue) : 15 December 2010
Please send full non-anonymous submissions in PDF to Philipp Koehn
<pkoehn at inf.ed.ac.uk>
and the full Xe(La)TeX source for technical pre-review to Ondřej
Bojar .
Please use the PBML style files from

http://ufal.mff.cuni.cz/pbml-instructions.html

(follow the “short paper” track instructions).
Program Committee
Philipp Koehn
Ondrej Bojar
Holger Schwenk
Loïc Barrault

Posted in Moses, SMT

Moses Support Digest: moses-irstlm memory racing with 5-gram lm

[Moses-support] moses-irstlm memory racing with 5-gram lm

I’m troubleshooting a new moses system with these components:
1) GIZA++ (SVN rev 8, v 1.0.3)
2) IRSTLM (SVN rev 38, v 5.40.01)
3) Moses (SVN rev 3210, dated 4-26-2010)
4) Ubuntu-server 10.04 LTS 64-bit.
5) 3.4 Ghz Pentium-D with 4gb ram.

Using a 3-gram lm, the system works as expected. Training, tuning and
evaluation of a small (135K pairs) en-nl subset of europarl.v5 work fine. BLEU
score was 23.

I then built a 5-gram model, edited the moses.ini config and started
mert-moses-new. It creates a filtered model, and then launches moses. The
memory usage grows and within 10 minutes, the system kills moses.

In both cases, the lm is only the target half of the bitext corpus, about
135K lines.

The moses.ini files:

[lmodel-file]
1 0 3 /media/models/irstlm/europarl.v5.mini/3-gram.nl.blm

[lmodel-file]
1 0 5 /media/models/irstlm/europarl.v5.mini/5-gram.nl.blm

I know of one other person who has had the same problem with the 4-1-2010
moses build and irstlm from March/April last year.

Any suggestions? Could it be the new Ubuntu or the g++-4.4.1 compiler?

Thanks,
Tom

Posted in Moses, SMT