experiments/subword-nmt/.git
3 years ago Merge pull request #70 from alvations/patch-4 (refs: master, origin/HEAD, origin/master, wahrwolf/master)
Rico Sennrich [Mon, 14 Jan 2019 16:13:57 +0000 (16:13 +0000)]
Merge pull request #70 from alvations/patch-4

Use a single regex match with optional operator

3 years ago Merge pull request #69 from alvations/patch-3
Rico Sennrich [Mon, 14 Jan 2019 16:12:35 +0000 (16:12 +0000)]
Merge pull request #69 from alvations/patch-3

re.split can catch groups and save the delimiter

3 years ago Cast filter generator to list for Python3
alvations [Mon, 14 Jan 2019 15:12:35 +0000 (23:12 +0800)]
Cast filter generator to list for Python3
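
A minimal sketch of why such a cast matters (not the repository's actual code): in Python 3, `filter()` returns a lazy iterator, so code that indexes the result or takes its length needs an explicit `list()`. The `segments` data is made up for illustration.

```python
# Hypothetical data; the empty strings stand for segments to be dropped.
segments = ["low", "", "er", ""]

lazy = filter(None, segments)  # Python 3: a lazy iterator, not a list
# len(lazy) or lazy[0] would raise TypeError in Python 3.

non_empty = list(filter(None, segments))  # a real list in Python 2 and 3
print(len(non_empty), non_empty[0])
```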

3 years ago re.split can catch groups and save the delimiter
alvations [Mon, 14 Jan 2019 06:27:43 +0000 (14:27 +0800)]
re.split can catch groups and save the delimiter
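
The behaviour named in this commit title can be sketched as follows. This is an illustration of `re.split` with a capturing group, not the code from the commit; the helper name `split_keep_delimiter` and the sample strings are assumptions.

```python
import re

def split_keep_delimiter(text, delimiter_pattern):
    """Split text on a pattern but keep the matched delimiters (hypothetical helper)."""
    # A capturing group makes re.split return the delimiters as list items.
    parts = re.split("(" + delimiter_pattern + ")", text)
    # re.split yields empty strings at the edges; drop them.
    return [part for part in parts if part]

print(split_keep_delimiter("wordUSAword", "USA"))
```

Without the parentheses around the pattern, the matched delimiter would be discarded and would have to be re-inserted by hand.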

3 years ago added missing parameter
alvations [Mon, 14 Jan 2019 14:53:07 +0000 (22:53 +0800)]
added missing parameter

3 years ago Use a single regex match with optional operator
alvations [Mon, 14 Jan 2019 07:42:59 +0000 (15:42 +0800)]
Use a single regex match with optional operator

3 years ago Update README.md
Rico Sennrich [Fri, 11 Jan 2019 09:32:00 +0000 (09:32 +0000)]
Update README.md

3 years ago version bump
Rico Sennrich [Tue, 11 Dec 2018 14:46:24 +0000 (14:46 +0000)]
version bump

3 years ago enable encoding fix in subword-bpe
Rico Sennrich [Mon, 12 Nov 2018 17:56:02 +0000 (17:56 +0000)]
enable encoding fix in subword-bpe

relevant code was not run because subword_bpe.py is never executed as a script.

3 years ago fix subword-bpe learn-bpe in Python 2
Rico Sennrich [Mon, 17 Sep 2018 10:53:36 +0000 (11:53 +0100)]
fix subword-bpe learn-bpe in Python 2

fixes regression from commit 06352. Error was:
AttributeError: 'Namespace' object has no attribute 'separator'

3 years ago Merge pull request #62 from bastings/master
Rico Sennrich [Thu, 23 Aug 2018 09:38:12 +0000 (10:38 +0100)]
Merge pull request #62 from bastings/master

pass `total_symbols` to learn_bpe

3 years ago pass `total_symbols` to learn_bpe
Joost Bastings [Wed, 22 Aug 2018 20:09:08 +0000 (22:09 +0200)]
pass `total_symbols` to learn_bpe

pass `total_symbols` to learn_bpe when using the `subword-nmt learn-bpe` command

3 years ago support argument --total-symbols in learn_joint_bpe_and_vocab
Rico Sennrich [Mon, 20 Aug 2018 11:07:21 +0000 (12:07 +0100)]
support argument --total-symbols in learn_joint_bpe_and_vocab

3 years ago version bump
Rico Sennrich [Fri, 17 Aug 2018 12:49:09 +0000 (13:49 +0100)]
version bump

3 years ago fix best practice instructions
Rico Sennrich [Fri, 17 Aug 2018 12:40:43 +0000 (13:40 +0100)]
fix best practice instructions

thx to @bastings.

3 years ago Merge pull request #57 from jsenellart/fix_unicode_separator
Rico Sennrich [Tue, 17 Jul 2018 22:15:03 +0000 (08:15 +1000)]
Merge pull request #57 from jsenellart/fix_unicode_separator

enable unicode separator/glossaries in cli

3 years ago condition parameter conversion to python 2
Jean A. Senellart [Tue, 17 Jul 2018 21:36:11 +0000 (07:36 +1000)]
condition parameter conversion to python 2

3 years ago Merge branch 'master' into fix_unicode_separator
Jean Senellart [Tue, 17 Jul 2018 21:25:48 +0000 (07:25 +1000)]
Merge branch 'master' into fix_unicode_separator

3 years ago enable unicode separators in Python2
Rico Sennrich [Tue, 17 Jul 2018 06:40:51 +0000 (16:40 +1000)]
enable unicode separators in Python2

thanks @jsenellart

3 years ago same for glossaries
Jean A. Senellart [Thu, 12 Jul 2018 19:23:54 +0000 (04:23 +0900)]
same for glossaries

3 years ago enable unicode separator
Jean A. Senellart [Thu, 12 Jul 2018 02:52:30 +0000 (11:52 +0900)]
enable unicode separator

3 years ago Merge pull request #56 from Proyag/master
Rico Sennrich [Tue, 10 Jul 2018 16:17:09 +0000 (17:17 +0100)]
Merge pull request #56 from Proyag/master

Extending --glossaries to handle regex

3 years ago add unittest (and fix python3 integer division in unittest)
Proyag [Mon, 9 Jul 2018 09:05:15 +0000 (11:05 +0200)]
add unittest (and fix python3 integer division in unittest)

3 years ago handle regex as glossaries
Proyag [Fri, 6 Jul 2018 15:15:57 +0000 (17:15 +0200)]
handle regex as glossaries

3 years ago fix typo in previous commit
Rico Sennrich [Thu, 28 Jun 2018 10:48:40 +0000 (11:48 +0100)]
fix typo in previous commit

3 years ago new option --total-symbols in learn-bpe
Rico Sennrich [Thu, 28 Jun 2018 10:41:08 +0000 (11:41 +0100)]
new option --total-symbols in learn-bpe

redefines "--symbols" to be the number of merge operations,
minus the character vocabulary size, so that "--symbols" becomes
an estimate of the final symbol vocabulary size.

thx @phikoehn
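
The arithmetic behind this option can be sketched as below. This is a simplified illustration, not the repository's code: the helper name `merges_for_total_symbols` and the toy word list are assumptions, and the real implementation may count characters differently (for example, treating word-final characters separately).

```python
def merges_for_total_symbols(words, symbols):
    """Number of merge operations to learn when `symbols` is read as the
    target size of the final symbol vocabulary (hypothetical helper)."""
    characters = set()
    for word in words:
        characters.update(word)
    # Every distinct character is already a symbol before any merge,
    # so only the remainder is available for merge operations.
    return max(symbols - len(characters), 0)

# Toy corpus with 8 distinct characters: l, o, w, e, r, n, s, t.
print(merges_for_total_symbols(["low", "lower", "newest"], 10))
```

With 8 distinct characters and a target of 10 symbols, only 2 merge operations are learned, so the final vocabulary stays close to the requested size.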

3 years ago Merge pull request #52 from en-dash/master
Rico Sennrich [Wed, 6 Jun 2018 13:34:56 +0000 (21:34 +0800)]
Merge pull request #52 from en-dash/master

Improve library usability

3 years ago new method segment_tokens that takes and returns a list
Lenz [Tue, 5 Jun 2018 20:13:51 +0000 (23:13 +0300)]
new method segment_tokens that takes and returns a list

3 years ago fix: spurious .format() operation
Lenz [Tue, 5 Jun 2018 20:06:43 +0000 (23:06 +0300)]
fix: spurious .format() operation

4 years ago fix pip package with Python3
Rico Sennrich [Mon, 21 May 2018 09:53:59 +0000 (10:53 +0100)]
fix pip package with Python3

4 years ago proper markdown on PyPi
Rico Sennrich [Thu, 17 May 2018 12:28:21 +0000 (13:28 +0100)]
proper markdown on PyPi

4 years ago pypi repository
Rico Sennrich [Thu, 17 May 2018 12:08:24 +0000 (13:08 +0100)]
pypi repository

4 years ago more consistent command line names for get-vocab
Rico Sennrich [Wed, 16 May 2018 15:44:15 +0000 (16:44 +0100)]
more consistent command line names for get-vocab

4 years ago recommend subword_nmt.py as alternative to pip install in README
Rico Sennrich [Wed, 16 May 2018 15:32:55 +0000 (16:32 +0100)]
recommend subword_nmt.py as alternative to pip install in README

4 years ago help text for subword-nmt command (and remove little-used segment_char_ngrams from...
Rico Sennrich [Wed, 16 May 2018 15:09:57 +0000 (16:09 +0100)]
help text for subword-nmt command (and remove little-used segment_char_ngrams from command)

4 years ago documentation of pip package
Rico Sennrich [Wed, 16 May 2018 13:48:08 +0000 (14:48 +0100)]
documentation of pip package

4 years ago bugfixes to packaging
Rico Sennrich [Wed, 16 May 2018 13:47:59 +0000 (14:47 +0100)]
bugfixes to packaging

4 years ago create symlink in old script location (with deprecation warning)
Rico Sennrich [Wed, 16 May 2018 13:35:47 +0000 (14:35 +0100)]
create symlink in old script location (with deprecation warning)

4 years ago modify files for packaging; thanks to universome
Rico Sennrich [Wed, 16 May 2018 11:22:01 +0000 (12:22 +0100)]
modify files for packaging; thanks to universome

4 years ago move files to package structure; add setup.py
Rico Sennrich [Wed, 16 May 2018 10:44:24 +0000 (11:44 +0100)]
move files to package structure; add setup.py

4 years ago fix regression from 7bb1c: don't duplicate empty line
Rico Sennrich [Tue, 1 May 2018 20:27:22 +0000 (21:27 +0100)]
fix regression from 7bb1c: don't duplicate empty line

fixes #48

4 years ago don't strip UTF-8 whitespace
Rico Sennrich [Tue, 1 May 2018 18:14:13 +0000 (19:14 +0100)]
don't strip UTF-8 whitespace

4 years ago more tests
Rico Sennrich [Thu, 26 Apr 2018 10:41:43 +0000 (11:41 +0100)]
more tests

4 years ago testing
Rico Sennrich [Thu, 26 Apr 2018 08:56:33 +0000 (09:56 +0100)]
testing

4 years ago whitespace-only splitting everywhere.
Rico Sennrich [Wed, 25 Apr 2018 15:58:51 +0000 (16:58 +0100)]
whitespace-only splitting everywhere.

4 years ago update changelog
Rico Sennrich [Wed, 28 Mar 2018 08:19:42 +0000 (09:19 +0100)]
update changelog

4 years ago get_vocabulary: don't crash on double whitespace or empty line
Rico Sennrich [Mon, 26 Mar 2018 18:03:55 +0000 (19:03 +0100)]
get_vocabulary: don't crash on double whitespace or empty line

4 years ago don't break on leading whitespace
Rico Sennrich [Mon, 26 Mar 2018 09:35:33 +0000 (10:35 +0100)]
don't break on leading whitespace

4 years ago don't assume trailing whitespace has length 1.
Rico Sennrich [Sun, 25 Mar 2018 13:57:46 +0000 (14:57 +0100)]
don't assume trailing whitespace has length 1.

fixes bug introduced in commit 30e5be, and issue #38.

4 years ago don't break when same BPE file is used multiple times in script.
Rico Sennrich [Sun, 25 Mar 2018 13:45:14 +0000 (15:45 +0200)]
don't break when same BPE file is used multiple times in script.

fixes #39.

4 years ago make get_vocab consistent with new whitespace-only splitting in apply_bpe
Rico Sennrich [Wed, 14 Mar 2018 21:14:57 +0000 (21:14 +0000)]
make get_vocab consistent with new whitespace-only splitting in apply_bpe
(thanks to Ondrej Bojar)

4 years ago fix regression in commit 30e5be (end-of-line token was segmented wrong)
Rico Sennrich [Tue, 6 Mar 2018 17:22:41 +0000 (17:22 +0000)]
fix regression in commit 30e5be (end-of-line token was segmented wrong)

4 years ago don't crash on double spaces
Rico Sennrich [Tue, 6 Mar 2018 14:49:47 +0000 (14:49 +0000)]
don't crash on double spaces

4 years ago don't silently replace unicode characters with space or newline.
Rico Sennrich [Tue, 6 Mar 2018 12:01:29 +0000 (12:01 +0000)]
don't silently replace unicode characters with space or newline.

should fix #29.

4 years ago reference to package
Rico Sennrich [Thu, 1 Mar 2018 19:02:04 +0000 (19:02 +0000)]
reference to package

4 years ago fix number of arguments in test_glossaries.encode_mock
Rico Sennrich [Mon, 22 Jan 2018 17:48:59 +0000 (17:48 +0000)]
fix number of arguments in test_glossaries.encode_mock

fixes #36

4 years ago Merge pull request #35 from obo/patch-1
Rico Sennrich [Fri, 12 Jan 2018 15:44:15 +0000 (15:44 +0000)]
Merge pull request #35 from obo/patch-1

add up repeated entries with --dist-input

4 years ago add up repeated entries with --dist-input
Ondrej Bojar [Fri, 5 Jan 2018 09:59:48 +0000 (10:59 +0100)]
add up repeated entries with --dist-input

This allows directly reading the concatenation of several outputs of get_vocab.py, instead of adding them up with a separate script.
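
A minimal sketch of the idea, assuming the dictionary input has one "word count" pair per line (the option is spelled `--dist-input` in this message and `--dict-input` in the rename commit further down). The function name `read_word_counts` and the sample lines are hypothetical; this is not the code from the commit.

```python
from collections import Counter

def read_word_counts(lines):
    """Sum the counts of repeated words across concatenated vocabulary files."""
    vocab = Counter()
    for line in lines:
        line = line.strip("\r\n ")
        if not line:
            continue  # tolerate blank lines between concatenated files
        word, count = line.rsplit(" ", 1)
        vocab[word] += int(count)  # add up instead of overwriting
    return vocab

# Two vocabulary outputs concatenated: "the" appears in both.
print(read_word_counts(["the 3", "cat 1", "the 2"]))
```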

4 years ago Merge pull request #34 from ozancaglayan/fixes
Rico Sennrich [Wed, 3 Jan 2018 16:19:50 +0000 (16:19 +0000)]
Merge pull request #34 from ozancaglayan/fixes

Fixes

4 years ago remove unused imports, fix trailing whitespace
Ozan Caglayan [Thu, 21 Dec 2017 20:42:38 +0000 (21:42 +0100)]
remove unused imports, fix trailing whitespace

4 years ago do not force system's default python
Ozan Caglayan [Thu, 21 Dec 2017 20:38:22 +0000 (21:38 +0100)]
do not force system's default python

This is to make sure that the scripts are executed with the interpreter
defined in the environment instead of what has been installed as
/usr/bin/python.

4 years ago add .gitignore file
Ozan Caglayan [Thu, 21 Dec 2017 20:37:07 +0000 (21:37 +0100)]
add .gitignore file

4 years ago Merge pull request #32 from ozancaglayan/master
Rico Sennrich [Mon, 18 Dec 2017 16:51:56 +0000 (16:51 +0000)]
Merge pull request #32 from ozancaglayan/master

learn_joint_bpe_and_vocab: Fix parameter passing

4 years ago learn_joint_bpe_and_vocab: Fix parameter passing
Ozan Çağlayan [Mon, 18 Dec 2017 16:24:24 +0000 (17:24 +0100)]
learn_joint_bpe_and_vocab: Fix parameter passing

args.separator was being passed to merges instead of separator.

4 years ago Merge pull request #31 from Proyag/master
Rico Sennrich [Fri, 15 Dec 2017 13:47:05 +0000 (13:47 +0000)]
Merge pull request #31 from Proyag/master

Option to apply fewer BPE operations than learned

4 years ago Option to apply fewer merge operations than learned
Proyag [Thu, 14 Dec 2017 12:08:19 +0000 (12:08 +0000)]
Option to apply fewer merge operations than learned

4 years ago update toy example to BPE use same representation again as learn_bpe.py
Rico Sennrich [Thu, 5 Oct 2017 13:53:18 +0000 (14:53 +0100)]
update toy example to BPE use same representation again as learn_bpe.py

learn_bpe.py was changed in commit a749a7 to make end-of-word representation more consistent.

4 years ago line buffering for apply_bpe.py
Rico Sennrich [Wed, 30 Aug 2017 16:20:19 +0000 (18:20 +0200)]
line buffering for apply_bpe.py

python 3 only so far; not sure how to make this work in python 2

4 years ago cache persists within BPE instance, but not across BPE instances
Rico Sennrich [Fri, 9 Jun 2017 10:01:56 +0000 (13:01 +0300)]
cache persists within BPE instance, but not across BPE instances

5 years ago typo
Rico Sennrich [Sat, 20 May 2017 14:29:52 +0000 (15:29 +0100)]
typo

5 years ago Merge pull request #23 from jvdbogae/master
Rico Sennrich [Wed, 10 May 2017 09:14:24 +0000 (10:14 +0100)]
Merge pull request #23 from jvdbogae/master

chmod +x apply_bpe.py

5 years ago Somehow, apply_bpe.py ended up non-executable, resulting in an empty training corpus...
Joachim Van den Bogaert [Tue, 9 May 2017 10:00:17 +0000 (12:00 +0200)]
Somehow, apply_bpe.py ended up non-executable, resulting in an empty training corpus and a failed AMUNMT training. When cleaning afterwards, the subword-nmt repo is deleted and cloned again by the AMUNMT example training script, resulting in apply_bpe.py being non-executable again (should it have been chmod +x'ed).

5 years ago Merge pull request #22 from Unbabel/feat/glossaries
Rico Sennrich [Mon, 1 May 2017 15:36:02 +0000 (16:36 +0100)]
Merge pull request #22 from Unbabel/feat/glossaries

Feat/glossaries

5 years ago fix glossaries feature
dimesq [Mon, 1 May 2017 10:19:43 +0000 (11:19 +0100)]
fix glossaries feature

5 years ago comments
Rico Sennrich [Fri, 28 Apr 2017 09:46:08 +0000 (10:46 +0100)]
comments

5 years ago Implement glossaries feature
dimesq [Tue, 25 Apr 2017 20:11:35 +0000 (21:11 +0100)]
Implement glossaries feature

5 years ago Add tests
dimesq [Tue, 25 Apr 2017 19:48:30 +0000 (20:48 +0100)]
Add tests

5 years ago changelog
Rico Sennrich [Fri, 21 Apr 2017 10:25:06 +0000 (11:25 +0100)]
changelog

5 years ago update README
Rico Sennrich [Fri, 21 Apr 2017 10:13:09 +0000 (11:13 +0100)]
update README

5 years ago Merge branch 'master' into vocab
Rico Sennrich [Fri, 21 Apr 2017 10:00:39 +0000 (11:00 +0100)]
Merge branch 'master' into vocab

5 years ago remove subword marker at end-of-line (refs: v0.1)
Rico Sennrich [Fri, 7 Apr 2017 13:13:26 +0000 (15:13 +0200)]
remove subword marker at end-of-line

5 years ago fix merge conflict
Rico Sennrich [Sat, 1 Apr 2017 20:25:05 +0000 (21:25 +0100)]
fix merge conflict

5 years ago Merge branch 'master' into vocab
Rico Sennrich [Thu, 9 Mar 2017 14:07:07 +0000 (14:07 +0000)]
Merge branch 'master' into vocab

5 years ago rename --is-dict to --dict-input
Rico Sennrich [Mon, 27 Feb 2017 15:56:55 +0000 (15:56 +0000)]
rename --is-dict to --dict-input

5 years ago Allow passing in a word - count file instead of iterating through the whole dataset
Martin Boyanov [Sat, 25 Feb 2017 12:01:52 +0000 (14:01 +0200)]
Allow passing in a word - count file instead of iterating through the whole dataset

5 years ago make max deterministic by using symbol pair as secondary sort key
Rico Sennrich [Wed, 22 Feb 2017 13:58:21 +0000 (13:58 +0000)]
make max deterministic by using symbol pair as secondary sort key
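
A minimal sketch of the tie-breaking idea, assuming pair frequencies are kept in a dictionary keyed by symbol pairs; the function name `most_frequent_pair` and the toy statistics are hypothetical.

```python
def most_frequent_pair(stats):
    """Pick the most frequent symbol pair; ties are broken by the pair itself."""
    # Without the secondary key, max() returns whichever tied pair it meets
    # first, which depends on dictionary iteration order.
    return max(stats, key=lambda pair: (stats[pair], pair))

stats = {("a", "b"): 3, ("c", "d"): 3, ("e", "f"): 1}
print(most_frequent_pair(stats))
```

Because tuples compare element by element, two pairs with the same frequency are ordered by their symbols, so the result no longer depends on insertion or hash order.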

5 years ago acknowledgements
Rico Sennrich [Mon, 20 Feb 2017 10:54:15 +0000 (10:54 +0000)]
acknowledgements

5 years ago documentation fix
Rico Sennrich [Fri, 10 Feb 2017 12:36:33 +0000 (12:36 +0000)]
documentation fix

5 years ago consistent utf-8 encoding across python versions and environment variables
Rico Sennrich [Fri, 10 Feb 2017 11:27:47 +0000 (11:27 +0000)]
consistent utf-8 encoding across python versions and environment variables

5 years ago Merge branch 'master' into vocab
Rico Sennrich [Fri, 10 Feb 2017 11:21:24 +0000 (11:21 +0000)]
Merge branch 'master' into vocab

5 years ago consistently use UTF-8 across python versions and environment variables
Rico Sennrich [Fri, 10 Feb 2017 11:11:45 +0000 (11:11 +0000)]
consistently use UTF-8 across python versions and environment variables

5 years ago Merge branch 'unicode'
Rico Sennrich [Fri, 10 Feb 2017 11:07:08 +0000 (11:07 +0000)]
Merge branch 'unicode'

5 years ago script to automate learning of joint BPE and extracting vocabulary
Rico Sennrich [Fri, 10 Feb 2017 11:06:52 +0000 (11:06 +0000)]
script to automate learning of joint BPE and extracting vocabulary

5 years ago documentation
Rico Sennrich [Mon, 6 Feb 2017 11:45:20 +0000 (11:45 +0000)]
documentation

5 years ago allow prevention of BPE merges that produce OOVs or rare subword units (by providing...
Rico Sennrich [Fri, 3 Feb 2017 17:56:12 +0000 (17:56 +0000)]
allow prevention of BPE merges that produce OOVs or rare subword units (by providing vocabulary file and threshold)
after applying BPE merges, any subword unit that is invalid is split by reverting the BPE merges recursively
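
The recursive reverting described here can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: it ignores the end-of-word marker and the separator, and the names `recursive_split`, `reverse_merges` and `vocab` are hypothetical.

```python
def recursive_split(segment, reverse_merges, vocab):
    """Undo merges on `segment` until every piece is in `vocab`
    or cannot be split any further (e.g. a single character)."""
    if segment in vocab or segment not in reverse_merges:
        return [segment]
    left, right = reverse_merges[segment]  # the pair that was merged
    return (recursive_split(left, reverse_merges, vocab)
            + recursive_split(right, reverse_merges, vocab))

# Maps each merged unit back to the pair it was built from.
reverse_merges = {
    "lower": ("low", "er"),
    "low": ("lo", "w"),
    "lo": ("l", "o"),
    "er": ("e", "r"),
}
vocab = {"lo", "w", "er"}  # units frequent enough to keep

print(recursive_split("lower", reverse_merges, vocab))
```

Here "lower" and "low" are not in the vocabulary, so both merges are reverted, while "lo" and "er" are kept because they are valid units.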

5 years ago more consistent handling of end-of-word token
Rico Sennrich [Fri, 3 Feb 2017 17:55:00 +0000 (17:55 +0000)]
more consistent handling of end-of-word token
prevents apply_bpe.py from applying word-internal merge operations in word-final position

5 years ago consistent, cross-version unicode handling (refs: origin/unicode, wahrwolf/unicode)
Rico Sennrich [Tue, 10 Jan 2017 14:52:42 +0000 (14:52 +0000)]
consistent, cross-version unicode handling

5 years ago Merge pull request #10 from rmeertens/master
Rico Sennrich [Tue, 8 Nov 2016 09:09:06 +0000 (09:09 +0000)]
Merge pull request #10 from rmeertens/master

using python3 print function

5 years ago using python3 print function
roland [Tue, 8 Nov 2016 09:00:31 +0000 (10:00 +0100)]
using python3 print function

5 years ago frequency threshold for learning
Rico Sennrich [Mon, 17 Oct 2016 15:35:53 +0000 (16:35 +0100)]
frequency threshold for learning

Closes #8