wolfpit.net Git - experiments/subword-nmt/.git/log
Rico Sennrich [Mon, 14 Jan 2019 16:13:57 +0000 (16:13 +0000)]
Merge pull request #70 from alvations/patch-4
Use a single regex match with optional operator
Rico Sennrich [Mon, 14 Jan 2019 16:12:35 +0000 (16:12 +0000)]
Merge pull request #69 from alvations/patch-3
re.split can catch groups and save the delimiter
alvations [Mon, 14 Jan 2019 15:12:35 +0000 (23:12 +0800)]
Cast filter generator to list for Python3
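A minimal sketch of the Python 3 behaviour this commit works around: `filter` returns a lazy iterator there rather than a list, so code that indexes or measures the result needs an explicit `list(...)`. The token list is made up for illustration and is not taken from the repository.

```python
# Hypothetical token list; empty strings stand in for entries to be dropped.
tokens = ["low", "", "er", ""]

# In Python 3, filter() returns an iterator: it has no len() and no indexing.
lazy = filter(None, tokens)
has_len = hasattr(lazy, "__len__")

# Casting to a list restores list behaviour on both Python 2 and Python 3.
non_empty = list(filter(None, tokens))
print(non_empty, len(non_empty))
```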
alvations [Mon, 14 Jan 2019 06:27:43 +0000 (14:27 +0800)]
re.split can catch groups and save the delimiter
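A short illustration of the `re.split` behaviour the commit title refers to: when the pattern contains a capturing group, the matched delimiters are kept in the result list. The helper name `split_keep_delimiter` and the sample pattern are hypothetical, not the repository's code.

```python
import re

def split_keep_delimiter(text, delimiter_pattern):
    """Split text on delimiter_pattern and keep the delimiters in the output.

    Assumes delimiter_pattern has no capturing groups of its own; extra
    groups would add extra items to the result.
    """
    # The outer parentheses form a capturing group, so re.split returns
    # the matched delimiters alongside the text between them.
    parts = re.split("(" + delimiter_pattern + ")", text)
    # re.split can yield empty strings at the edges; drop them.
    return [part for part in parts if part]

segments = split_keep_delimiter("see USA and UK today", r"USA|UK")
print(segments)
```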
alvations [Mon, 14 Jan 2019 14:53:07 +0000 (22:53 +0800)]
added missing parameter
alvations [Mon, 14 Jan 2019 07:42:59 +0000 (15:42 +0800)]
Use a single regex match with optional operator
Rico Sennrich [Fri, 11 Jan 2019 09:32:00 +0000 (09:32 +0000)]
Update README.md
Rico Sennrich [Tue, 11 Dec 2018 14:46:24 +0000 (14:46 +0000)]
version bump
Rico Sennrich [Mon, 12 Nov 2018 17:56:02 +0000 (17:56 +0000)]
enable encoding fix in subword-bpe
relevant code was not run because subword_bpe.py is never executed as a script.
Rico Sennrich [Mon, 17 Sep 2018 10:53:36 +0000 (11:53 +0100)]
fix subword-bpe learn-bpe in Python 2
fixes regression from commit 06352. Error was:
AttributeError: 'Namespace' object has no attribute 'separator'
Rico Sennrich [Thu, 23 Aug 2018 09:38:12 +0000 (10:38 +0100)]
Merge pull request #62 from bastings/master
pass `total_symbols` to learn_bpe
Joost Bastings [Wed, 22 Aug 2018 20:09:08 +0000 (22:09 +0200)]
pass `total_symbols` to learn_bpe
pass `total_symbols` to learn_bpe when using the `subword-nmt learn-bpe` command
Rico Sennrich [Mon, 20 Aug 2018 11:07:21 +0000 (12:07 +0100)]
support argument --total-symbols in learn_joint_bpe_and_vocab
Rico Sennrich [Fri, 17 Aug 2018 12:49:09 +0000 (13:49 +0100)]
version bump
Rico Sennrich [Fri, 17 Aug 2018 12:40:43 +0000 (13:40 +0100)]
fix best practice instructions
thx to @bastings.
Rico Sennrich [Tue, 17 Jul 2018 22:15:03 +0000 (08:15 +1000)]
Merge pull request #57 from jsenellart/fix_unicode_separator
enable unicode separator/glossaries in cli
Jean A. Senellart [Tue, 17 Jul 2018 21:36:11 +0000 (07:36 +1000)]
condition parameter conversion to python 2
Jean Senellart [Tue, 17 Jul 2018 21:25:48 +0000 (07:25 +1000)]
Merge branch 'master' into fix_unicode_separator
Rico Sennrich [Tue, 17 Jul 2018 06:40:51 +0000 (16:40 +1000)]
enable unicode separators in Python2
thanks @jsenellart
Jean A. Senellart [Thu, 12 Jul 2018 19:23:54 +0000 (04:23 +0900)]
same for glossaries
Jean A. Senellart [Thu, 12 Jul 2018 02:52:30 +0000 (11:52 +0900)]
enable unicode separator
Rico Sennrich [Tue, 10 Jul 2018 16:17:09 +0000 (17:17 +0100)]
Merge pull request #56 from Proyag/master
Extending --glossaries to handle regex
Proyag [Mon, 9 Jul 2018 09:05:15 +0000 (11:05 +0200)]
add unittest (and fix python3 integer division in unittest)
Proyag [Fri, 6 Jul 2018 15:15:57 +0000 (17:15 +0200)]
handle regex as glossaries
Rico Sennrich [Thu, 28 Jun 2018 10:48:40 +0000 (11:48 +0100)]
fix typo in previous commit
Rico Sennrich [Thu, 28 Jun 2018 10:41:08 +0000 (11:41 +0100)]
new option --total-symbols in learn-bpe
with this option, the number of merge operations is "--symbols"
minus the character vocabulary size, so that "--symbols" becomes
an estimate of the final symbol vocabulary size.
thx @phikoehn
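The arithmetic behind this option can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the function name is hypothetical, and counting word-internal and word-final characters as separate symbols is one plausible reading of "character vocabulary size".

```python
def merges_from_total_symbols(words, total_symbols):
    """Estimate how many merge operations fit a target vocabulary size.

    Assumption: a character in word-final position counts as a different
    symbol from the same character inside a word.
    """
    internal_chars = set()
    final_chars = set()
    for word in words:
        if not word:
            continue
        internal_chars.update(word[:-1])  # every character except the last
        final_chars.add(word[-1])         # the word-final character
    char_vocab_size = len(internal_chars) + len(final_chars)
    # Each merge adds at most one new symbol, so the merge budget is the
    # target size minus the symbols that already exist as characters.
    return max(total_symbols - char_vocab_size, 0)

num_merges = merges_from_total_symbols(["low", "lower", "newest"], 20)
print(num_merges)
```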
Rico Sennrich [Wed, 6 Jun 2018 13:34:56 +0000 (21:34 +0800)]
Merge pull request #52 from en-dash/master
Improve library usability
Lenz [Tue, 5 Jun 2018 20:13:51 +0000 (23:13 +0300)]
new method segment_tokens that takes and returns a list
Lenz [Tue, 5 Jun 2018 20:06:43 +0000 (23:06 +0300)]
fix: spurious .format() operation
Rico Sennrich [Mon, 21 May 2018 09:53:59 +0000 (10:53 +0100)]
fix pip package with Python3
Rico Sennrich [Thu, 17 May 2018 12:28:21 +0000 (13:28 +0100)]
proper markdown on PyPI
Rico Sennrich [Thu, 17 May 2018 12:08:24 +0000 (13:08 +0100)]
pypi repository
Rico Sennrich [Wed, 16 May 2018 15:44:15 +0000 (16:44 +0100)]
more consistent command line names for get-vocab
Rico Sennrich [Wed, 16 May 2018 15:32:55 +0000 (16:32 +0100)]
recommend subword_nmt.py as alternative to pip install in README
Rico Sennrich [Wed, 16 May 2018 15:09:57 +0000 (16:09 +0100)]
help text for subword-nmt command (and remove little-used segment_char_ngrams from command)
Rico Sennrich [Wed, 16 May 2018 13:48:08 +0000 (14:48 +0100)]
documentation of pip package
Rico Sennrich [Wed, 16 May 2018 13:47:59 +0000 (14:47 +0100)]
bugfixes to packaging
Rico Sennrich [Wed, 16 May 2018 13:35:47 +0000 (14:35 +0100)]
create symlink in old script location (with deprecation warning)
Rico Sennrich [Wed, 16 May 2018 11:22:01 +0000 (12:22 +0100)]
modify files for packaging; thanks to universome
Rico Sennrich [Wed, 16 May 2018 10:44:24 +0000 (11:44 +0100)]
move files to package structure; add setup.py
Rico Sennrich [Tue, 1 May 2018 20:27:22 +0000 (21:27 +0100)]
fix regression from 7bb1c: don't duplicate empty line
fixes #48
Rico Sennrich [Tue, 1 May 2018 18:14:13 +0000 (19:14 +0100)]
don't strip UTF-8 whitespace
Rico Sennrich [Thu, 26 Apr 2018 10:41:43 +0000 (11:41 +0100)]
more tests
Rico Sennrich [Thu, 26 Apr 2018 08:56:33 +0000 (09:56 +0100)]
testing
Rico Sennrich [Wed, 25 Apr 2018 15:58:51 +0000 (16:58 +0100)]
whitespace-only splitting everywhere.
Rico Sennrich [Wed, 28 Mar 2018 08:19:42 +0000 (09:19 +0100)]
update changelog
Rico Sennrich [Mon, 26 Mar 2018 18:03:55 +0000 (19:03 +0100)]
get_vocabulary: don't crash on double whitespace or empty line
Rico Sennrich [Mon, 26 Mar 2018 09:35:33 +0000 (10:35 +0100)]
don't break on leading whitespace
Rico Sennrich [Sun, 25 Mar 2018 13:57:46 +0000 (14:57 +0100)]
don't assume trailing whitespace has length 1.
fixes bug introduced in commit 30e5be, and issue #38.
Rico Sennrich [Sun, 25 Mar 2018 13:45:14 +0000 (15:45 +0200)]
don't break when same BPE file is used multiple times in script.
fixes #39.
Rico Sennrich [Wed, 14 Mar 2018 21:14:57 +0000 (21:14 +0000)]
make get_vocab consistent with new whitespace-only splitting in apply_bpe
(thanks to Ondrej Bojar)
Rico Sennrich [Tue, 6 Mar 2018 17:22:41 +0000 (17:22 +0000)]
fix regression in commit 30e5be (end-of-line token was segmented wrong)
Rico Sennrich [Tue, 6 Mar 2018 14:49:47 +0000 (14:49 +0000)]
don't crash on double spaces
Rico Sennrich [Tue, 6 Mar 2018 12:01:29 +0000 (12:01 +0000)]
don't silently replace unicode characters with space or newline.
should fix #29.
Rico Sennrich [Thu, 1 Mar 2018 19:02:04 +0000 (19:02 +0000)]
reference to package
Rico Sennrich [Mon, 22 Jan 2018 17:48:59 +0000 (17:48 +0000)]
fix number of arguments in test_glossaries.encode_mock
fixes #36
Rico Sennrich [Fri, 12 Jan 2018 15:44:15 +0000 (15:44 +0000)]
Merge pull request #35 from obo/patch-1
add up repeated entries with --dist-input
Ondrej Bojar [Fri, 5 Jan 2018 09:59:48 +0000 (10:59 +0100)]
add up repeated entries with --dist-input
This allows reading the concatenation of several get_vocab.py outputs directly, instead of adding them up with a separate script.
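A minimal sketch of summing repeated entries, assuming each input line has the form "word count" separated by a single space. The function name `read_word_counts` is hypothetical.

```python
from collections import Counter

def read_word_counts(lines):
    """Sum the counts of repeated words across concatenated vocabulary files."""
    vocab = Counter()
    for line in lines:
        line = line.strip("\r\n ")
        if not line:
            continue  # skip empty lines instead of crashing
        # Split on the last space so only the final field is read as the count.
        word, count = line.rsplit(" ", 1)
        vocab[word] += int(count)
    return vocab

# Two vocabulary files concatenated; "low" appears in both.
counts = read_word_counts(["low 5", "lower 2", "", "low 3"])
print(counts)
```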
Rico Sennrich [Wed, 3 Jan 2018 16:19:50 +0000 (16:19 +0000)]
Merge pull request #34 from ozancaglayan/fixes
Fixes
Ozan Caglayan [Thu, 21 Dec 2017 20:42:38 +0000 (21:42 +0100)]
remove unused imports, fix trailing whitespace
Ozan Caglayan [Thu, 21 Dec 2017 20:38:22 +0000 (21:38 +0100)]
do not force system's default python
This is to make sure that the scripts are executed with the interpreter
defined in the environment instead of what has been installed as
/usr/bin/python.
Ozan Caglayan [Thu, 21 Dec 2017 20:37:07 +0000 (21:37 +0100)]
add .gitignore file
Rico Sennrich [Mon, 18 Dec 2017 16:51:56 +0000 (16:51 +0000)]
Merge pull request #32 from ozancaglayan/master
learn_joint_bpe_and_vocab: Fix parameter passing
Ozan Çağlayan [Mon, 18 Dec 2017 16:24:24 +0000 (17:24 +0100)]
learn_joint_bpe_and_vocab: Fix parameter passing
args.separator was being passed to merges instead of separator.
Rico Sennrich [Fri, 15 Dec 2017 13:47:05 +0000 (13:47 +0000)]
Merge pull request #31 from Proyag/master
Option to apply fewer BPE operations than learned
Proyag [Thu, 14 Dec 2017 12:08:19 +0000 (12:08 +0000)]
Option to apply fewer merge operations than learned
Rico Sennrich [Thu, 5 Oct 2017 13:53:18 +0000 (14:53 +0100)]
update toy example so that BPE uses the same representation as learn_bpe.py again
learn_bpe.py was changed in commit a749a7 to make end-of-word representation more consistent.
Rico Sennrich [Wed, 30 Aug 2017 16:20:19 +0000 (18:20 +0200)]
line buffering for apply_bpe.py
python 3 only so far; not sure how to make this work in python 2
Rico Sennrich [Fri, 9 Jun 2017 10:01:56 +0000 (13:01 +0300)]
cache persists within BPE instance, but not across BPE instances
Rico Sennrich [Sat, 20 May 2017 14:29:52 +0000 (15:29 +0100)]
typo
Rico Sennrich [Wed, 10 May 2017 09:14:24 +0000 (10:14 +0100)]
Merge pull request #23 from jvdbogae/master
chmod +x apply_bpe.py
Joachim Van den Bogaert [Tue, 9 May 2017 10:00:17 +0000 (12:00 +0200)]
Somehow, apply_bpe.py ended up non-executable, resulting in an empty training corpus and a failed AMUNMT training. When cleaning up afterwards, the AMUNMT example training script deletes and re-clones the subword-nmt repo, so apply_bpe.py becomes non-executable again (even if it had been chmod +x'ed).
Rico Sennrich [Mon, 1 May 2017 15:36:02 +0000 (16:36 +0100)]
Merge pull request #22 from Unbabel/feat/glossaries
Feat/glossaries
dimesq [Mon, 1 May 2017 10:19:43 +0000 (11:19 +0100)]
fix glossaries feature
Rico Sennrich [Fri, 28 Apr 2017 09:46:08 +0000 (10:46 +0100)]
comments
dimesq [Tue, 25 Apr 2017 20:11:35 +0000 (21:11 +0100)]
Implement glossaries feature
dimesq [Tue, 25 Apr 2017 19:48:30 +0000 (20:48 +0100)]
Add tests
Rico Sennrich [Fri, 21 Apr 2017 10:25:06 +0000 (11:25 +0100)]
changelog
Rico Sennrich [Fri, 21 Apr 2017 10:13:09 +0000 (11:13 +0100)]
update README
Rico Sennrich [Fri, 21 Apr 2017 10:00:39 +0000 (11:00 +0100)]
Merge branch 'master' into vocab
Rico Sennrich [Fri, 7 Apr 2017 13:13:26 +0000 (15:13 +0200)]
remove subword marker at end-of-line
Rico Sennrich [Sat, 1 Apr 2017 20:25:05 +0000 (21:25 +0100)]
fix merge conflict
Rico Sennrich [Thu, 9 Mar 2017 14:07:07 +0000 (14:07 +0000)]
Merge branch 'master' into vocab
Rico Sennrich [Mon, 27 Feb 2017 15:56:55 +0000 (15:56 +0000)]
rename --is-dict to --dict-input
Martin Boyanov [Sat, 25 Feb 2017 12:01:52 +0000 (14:01 +0200)]
Allow passing in a word-count file instead of iterating through the whole dataset
Rico Sennrich [Wed, 22 Feb 2017 13:58:21 +0000 (13:58 +0000)]
make max deterministic by using symbol pair as secondary sort key
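The idea of a secondary sort key can be sketched like this: when two pairs have the same frequency, a bare `max` over the counts depends on dictionary iteration order, whereas a `(count, pair)` tuple key always resolves the tie the same way. The pair counts below are made up for illustration.

```python
# Hypothetical pair statistics with a tie between two pairs.
pair_counts = {("e", "r"): 5, ("l", "o"): 5, ("o", "w"): 3}

# Primary key: frequency. Secondary key: the pair itself, so ties are
# broken by comparing the pairs rather than by insertion order.
best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))

# The same statistics inserted in a different order give the same winner.
reordered = {("l", "o"): 5, ("o", "w"): 3, ("e", "r"): 5}
best_reordered = max(reordered, key=lambda pair: (reordered[pair], pair))
print(best, best_reordered)
```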
Rico Sennrich [Mon, 20 Feb 2017 10:54:15 +0000 (10:54 +0000)]
acknowledgements
Rico Sennrich [Fri, 10 Feb 2017 12:36:33 +0000 (12:36 +0000)]
documentation fix
Rico Sennrich [Fri, 10 Feb 2017 11:27:47 +0000 (11:27 +0000)]
consistent utf-8 encoding across python versions and environment variables
Rico Sennrich [Fri, 10 Feb 2017 11:21:24 +0000 (11:21 +0000)]
Merge branch 'master' into vocab
Rico Sennrich [Fri, 10 Feb 2017 11:11:45 +0000 (11:11 +0000)]
consistently use UTF-8 across python versions and environment variables
Rico Sennrich [Fri, 10 Feb 2017 11:07:08 +0000 (11:07 +0000)]
Merge branch 'unicode'
Rico Sennrich [Fri, 10 Feb 2017 11:06:52 +0000 (11:06 +0000)]
script to automate learning of joint BPE and extracting vocabulary
Rico Sennrich [Mon, 6 Feb 2017 11:45:20 +0000 (11:45 +0000)]
documentation
Rico Sennrich [Fri, 3 Feb 2017 17:56:12 +0000 (17:56 +0000)]
allow prevention of BPE merges that produce OOVs or rare subword units (by providing vocabulary file and threshold)
after applying BPE merges, any subword unit that is invalid is split by reverting the BPE merges recursively
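The recursive reverting described here can be sketched roughly as below. This is a simplified illustration: it ignores separators and end-of-word markers, and the names `split_until_in_vocab` and `reverse_merges` are hypothetical rather than taken from the repository.

```python
def split_until_in_vocab(symbol, reverse_merges, vocab):
    """Undo merges recursively until every piece is in the vocabulary.

    reverse_merges maps a merged symbol to the pair it was built from.
    Pieces that cannot be split further are returned as they are.
    """
    if symbol in vocab or symbol not in reverse_merges:
        return [symbol]
    left, right = reverse_merges[symbol]
    return (split_until_in_vocab(left, reverse_merges, vocab)
            + split_until_in_vocab(right, reverse_merges, vocab))

# Toy merge table: "lower" was built from "low" + "er", and so on.
reverse_merges = {
    "lower": ("low", "er"),
    "low": ("lo", "w"),
    "lo": ("l", "o"),
    "er": ("e", "r"),
}
# Units assumed frequent enough to pass the threshold.
vocab = {"lo", "w", "er"}

pieces = split_until_in_vocab("lower", reverse_merges, vocab)
print(pieces)
```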
Rico Sennrich [Fri, 3 Feb 2017 17:55:00 +0000 (17:55 +0000)]
more consistent handling of end-of-word token
prevents apply_bpe.py from applying word-internal merge operations in word-final position
Rico Sennrich [Tue, 10 Jan 2017 14:52:42 +0000 (14:52 +0000)]
consistent, cross-version unicode handling
Rico Sennrich [Tue, 8 Nov 2016 09:09:06 +0000 (09:09 +0000)]
Merge pull request #10 from rmeertens/master
using python3 print function
roland [Tue, 8 Nov 2016 09:00:31 +0000 (10:00 +0100)]
using python3 print function
Rico Sennrich [Mon, 17 Oct 2016 15:35:53 +0000 (16:35 +0100)]
frequency threshold for learning
Closes #8