train-kytea: KyTea, a word segmentation/pronunciation estimation tool

OPTIONS

A summary of options is included below.

-encode: The text encoding to be used (utf8/euc/sjis; default: utf8)
-full: A fully annotated training corpus (multiple possible)
-tok: A training corpus that is tokenized with no tags (multiple possible)
-part: A partially annotated training corpus (multiple possible)
-conf: A confidence annotated training corpus (multiple possible)
-feat: A file containing features generated by -featout
-dict: A dictionary file (one 'word/pron' entry per line, multiple possible)
-subword: A file of subword units. This enables pronunciation estimation (PE) for unknown words.
-model: The file to write the trained model to
-modtext: Print a text model (instead of the default binary)
-featout: Write the features used in training the model to this file

-nows: Don't train a word segmentation model
-notags: Skip the training of tagging, do only word segmentation
-global: Train the nth tag with a global model, where n is the value given (good for POS, bad for PE)
-debug: The debugging level during training (0=silent, 1=normal, 2=detailed)

-charw: The character window to use for WS (default: 3)
-charn: The character n-gram length to use for WS (default: 3)
-typew: The character type window to use for WS (default: 3)
-typen: The character type n-gram length to use for WS (default: 3)
-dictn: Dictionary words longer than -dictn will be grouped together (default: 4)
-unkn: Language model n-gram order for unknown words (default: 3)
-eps: The epsilon stopping criterion for classifier training
-cost: The cost hyperparameter for classifier training
-nobias: Don't use a bias value in classifier training
-solver: The solver (1=SVM, 7=logistic regression, etc.; default: 1; see the LIBLINEAR documentation for more details)
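
EXAMPLE

The options above can be combined into a single invocation. The following Python sketch only assembles and prints such a command line; it does not run the tool. It assumes each option takes its value as the next argument, and the helper name build_train_command and all file names (train.full, dict.txt, model.bin) are hypothetical placeholders, not part of train-kytea.

```python
import shlex


def build_train_command(model_path, full=(), dicts=(), encode="utf8",
                        solver=None, cost=None, notags=False):
    """Assemble an argument list for train-kytea (illustrative helper only)."""
    argv = ["train-kytea", "-encode", encode]
    # -full and -dict are documented as "multiple possible",
    # so each file gets its own flag.
    for corpus in full:
        argv += ["-full", corpus]
    for dictionary in dicts:
        argv += ["-dict", dictionary]
    if notags:
        # Word segmentation only; skip training of tagging.
        argv.append("-notags")
    if solver is not None:
        # 1=SVM, 7=logistic regression (LIBLINEAR solver numbers).
        argv += ["-solver", str(solver)]
    if cost is not None:
        argv += ["-cost", str(cost)]
    argv += ["-model", model_path]
    return argv


# Hypothetical file names, chosen for illustration.
cmd = build_train_command("model.bin", full=["train.full"],
                          dicts=["dict.txt"], solver=7)
print(shlex.join(cmd))
# prints: train-kytea -encode utf8 -full train.full -dict dict.txt -solver 7 -model model.bin
```

Building the argument list rather than a single string keeps file names with spaces intact if the list is later passed to a process launcher.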