Commit c4d5d13

Support flickr30k.
1 parent 8118670 commit c4d5d13

File tree

7 files changed: +211 -53 lines

README.md

Lines changed: 4 additions & 49 deletions
@@ -23,56 +23,11 @@ Pretrained models are provided [here](https://drive.google.com/open?id=0B7fNdx_j
 
 If you want to do evaluation only, you can then follow [this section](#generate-image-captions) after downloading the pretrained models (and also the pretrained resnet101).
 
-## Train your own network on COCO
+## Train your own network on COCO/Flickr30k
 
-### Download COCO captions and preprocess them
+### Prepare data.
 
-Download preprocessed coco captions from [link](http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip) from Karpathy's homepage. Extract `dataset_coco.json` from the zip file and copy it in to `data/`. This file provides preprocessed captions and also standard train-val-test splits.
-
-Then do:
-
-```bash
-$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
-```
-
-`prepro_labels.py` will map all words that occur <= 5 times to a special `UNK` token, and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into `data/cocotalk.json` and discretized caption data are dumped into `data/cocotalk_label.h5`.
-
-### Download COCO dataset and pre-extract the image features (Skip if you are using bottom-up feature)
-
-Download the coco images from [link](http://mscoco.org/dataset/#download). We need 2014 training images and 2014 val. images. You should put the `train2014/` and `val2014/` in the same directory, denoted as `$IMAGE_ROOT`.
-
-Then:
-
-```
-$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT
-```
-
-
-`prepro_feats.py` extract the resnet101 features (both fc feature and last conv feature) of each image. The features are saved in `data/cocotalk_fc` and `data/cocotalk_att`, and resulting files are about 200GB.
-
-(Check the prepro scripts for more options, like other resnet models or other attention sizes.)
-
-**Warning**: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See [this issue](https://github.com/karpathy/neuraltalk2/issues/4) for the fix, it involves manually replacing one image in the dataset.
-
-### Download Bottom-up features (Skip if you are using resnet features)
-
-Download pre-extracted feature from [link](https://github.com/peteanderson80/bottom-up-attention). You can either download adaptive one or fixed one.
-
-For example:
-```
-mkdir data/bu_data; cd data/bu_data
-wget https://storage.googleapis.com/bottom-up-attention/trainval.zip
-unzip trainval.zip
-
-```
-
-Then:
-
-```bash
-python script/make_bu_data.py --output_dir data/cocobu
-```
-
-This will create `data/cocobu_fc`, `data/cocobu_att` and `data/cocobu_box`. If you want to use bottom-up feature, you can just follow the following steps and replace all cocotalk with cocobu.
+We now support both flickr30k and COCO. See details in `data/README.md`. (Note: the later sections assume COCO dataset; it should be trivial to use flickr30k.)
 
 ### Start training
 
@@ -108,7 +63,7 @@ $ bash scripts/copy_model.sh fc fc_rl
 
 Then
 ```bash
-$ python train.py --id fc_rl --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30
+$ python train.py --id fc_rl --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --cached_tokens coco-train-idxs
 ```
 
 You will see a huge boost on CIDEr score. :)

data/README.md

Lines changed: 100 additions & 0 deletions

# Prepare data

## COCO

### Download COCO captions and preprocess them

Download the preprocessed COCO captions from [this link](http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip) on Karpathy's homepage. Extract `dataset_coco.json` from the zip file and copy it into `data/`. This file provides preprocessed captions and the standard train-val-test splits.

Then do:

```bash
$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
```

`prepro_labels.py` maps all words that occur <= 5 times to a special `UNK` token and builds a vocabulary for the remaining words. The image information and vocabulary are dumped into `data/cocotalk.json`, and the discretized caption data are dumped into `data/cocotalk_label.h5`.

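For intuition, here is a toy sketch of the thresholded-vocabulary idea described above. It is illustrative only; the actual options, tokenization, and special tokens are handled inside `scripts/prepro_labels.py`.

```python
from collections import Counter

def build_vocab(captions, count_thr=5):
    """Toy version of the vocabulary step: rare words collapse to UNK."""
    counts = Counter(w for cap in captions for w in cap.split())
    vocab = [w for w, n in counts.items() if n > count_thr]
    vocab.append('UNK')
    return vocab

def encode(caption, vocab):
    """Replace out-of-vocabulary words with the UNK token."""
    return [w if w in vocab else 'UNK' for w in caption.split()]

captions = ['a man riding a horse', 'a man riding a motorcycle']
vocab = build_vocab(captions, count_thr=1)
print(encode('a man riding a zebra', vocab))  # ['a', 'man', 'riding', 'a', 'UNK']
```
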
### Download COCO dataset and pre-extract the image features (skip if you are using bottom-up features)

Download the COCO images from the [MS COCO download page](http://mscoco.org/dataset/#download). We need the 2014 training images and 2014 val. images. Put the `train2014/` and `val2014/` folders in the same directory, denoted as `$IMAGE_ROOT`.

Then:

```bash
$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT
```

`prepro_feats.py` extracts the resnet101 features of each image (both the pooled fc feature and the last conv feature map). The features are saved in `data/cocotalk_fc` and `data/cocotalk_att`; the resulting files are about 200GB.

(Check the prepro scripts for more options, like other resnet models or other attention sizes.)

**Warning**: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See [this issue](https://github.com/karpathy/neuraltalk2/issues/4) for the fix; it involves manually replacing one image in the dataset.

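As a rough illustration of what this step computes (not the actual `prepro_feats.py` code, which has its own resnet wrapper, batching, and attention-size options), here is a minimal single-image sketch using a plain torchvision `resnet101`; `example.jpg` is a placeholder path:

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# pretrained=True is the older torchvision API; newer versions use weights=...
resnet = models.resnet101(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    x = resnet.conv1(img)
    x = resnet.bn1(x)
    x = resnet.relu(x)
    x = resnet.maxpool(x)
    x = resnet.layer1(x)
    x = resnet.layer2(x)
    x = resnet.layer3(x)
    x = resnet.layer4(x)                      # last conv feature map, (1, 2048, 7, 7)
    att_feat = x.squeeze(0).permute(1, 2, 0)  # "att" feature, (7, 7, 2048)
    fc_feat = x.mean(dim=(2, 3)).squeeze(0)   # pooled "fc" feature, (2048,)
```
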
### Download Bottom-up features (skip if you are using resnet features)

Download the pre-extracted features from [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention). You can download either the adaptive or the fixed version.

For example:

```bash
mkdir data/bu_data; cd data/bu_data
wget https://storage.googleapis.com/bottom-up-attention/trainval.zip
unzip trainval.zip
```

Then:

```bash
python scripts/make_bu_data.py --output_dir data/cocobu
```

This will create `data/cocobu_fc`, `data/cocobu_att` and `data/cocobu_box`. If you want to use the bottom-up features, just follow the remaining steps and replace every `cocotalk` with `cocobu`.

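For reference, `make_bu_data.py` essentially decodes the tsv files inside `trainval.zip` into per-image numpy arrays. A minimal sketch under the assumptions documented by the bottom-up-attention release (base64-encoded float32 blobs per row); the tsv filename below is a placeholder for whichever file you extracted:

```python
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows carry large base64 blobs

FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

with open('some_split_resnet101_faster_rcnn_genome.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
    for item in reader:
        num_boxes = int(item['num_boxes'])
        # (num_boxes, 2048) region features and (num_boxes, 4) boxes
        feats = np.frombuffer(base64.b64decode(item['features']),
                              dtype=np.float32).reshape(num_boxes, -1)
        boxes = np.frombuffer(base64.b64decode(item['boxes']),
                              dtype=np.float32).reshape(num_boxes, -1)
        fc = feats.mean(axis=0)  # the "fc" feature is the mean over region features
        break  # just the first image, for illustration
```
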
## Flickr30k

The preprocessing is similar to COCO:

```bash
python scripts/prepro_labels.py --input_json data/dataset_flickr30k.json --output_json data/f30ktalk.json --output_h5 data/f30ktalk

python scripts/prepro_ngrams.py --input_json data/dataset_flickr30k.json --dict_json data/f30ktalk.json --output_pkl data/f30k-train --split train
```

The following generates the coco-like annotation file needed for evaluation with coco-caption:

```bash
python scripts/prepro_reference_json.py --input_json data/dataset_flickr30k.json --output_json data/f30k_captions4eval.json
```

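A quick sanity check of the generated file; its layout mirrors coco-caption's `captions_val2014.json` (see `scripts/prepro_reference_json.py` below):

```python
import json

refs = json.load(open('data/f30k_captions4eval.json'))
print(sorted(refs.keys()))     # ['annotations', 'images', 'info', 'licenses', 'type']
print(refs['images'][0])       # {'id': <image id>}
print(refs['annotations'][0])  # {'image_id': <image id>, 'caption': '...', 'id': 0}
```
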
### Feature extraction

For resnet features, you can do the same thing as for COCO.

For bottom-up features, you can download the pre-extracted features from [SCAN](https://github.com/kuanghuei/SCAN):

`wget https://scanproject.blob.core.windows.net/scan-data/data.zip`

and then convert them to a `.pth` file using the following script:

```python
import numpy as np
import os
import torch
from tqdm import tqdm

out = {}

def transform(id_file, feat_file):
    # id_file: one image id per line; feat_file: per-image region features,
    # first axis aligned with the ids
    ids = open(id_file, 'r').readlines()
    ids = [_.strip('\n') for _ in ids]
    feats = np.load(feat_file)
    assert feats.shape[0] == len(ids)
    for _id, _feat in tqdm(zip(ids, feats)):
        out[str(_id)] = _feat

# run where the *_ids.txt / *_ims.npy files extracted from data.zip live
transform('dev_ids.txt', 'dev_ims.npy')
transform('train_ids.txt', 'train_ims.npy')
transform('test_ids.txt', 'test_ims.npy')

torch.save(out, 'f30kbu_att.pth')
```
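
The saved `f30kbu_att.pth` is just an `{image_id: feature}` dict; the new `.pth` branch added to `HybridLoader` in `dataloader.py` (see below) loads it with `torch.load` and indexes it by key. A quick check of the converted file:

```python
import torch

feats = torch.load('f30kbu_att.pth')
some_id = next(iter(feats))
print(some_id, feats[some_id].shape)  # e.g. an image id and its (num_boxes, 2048) features
```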

dataloader.py

Lines changed: 7 additions & 0 deletions
@@ -33,6 +33,11 @@ def __init__(self, db_path, ext):
             self.env = lmdb.open(db_path, subdir=os.path.isdir(db_path),
                                  readonly=True, lock=False,
                                  readahead=False, meminit=False)
+        elif db_path.endswith('.pth'):  # Assume a key,value dictionary
+            self.db_type = 'pth'
+            self.feat_file = torch.load(db_path)
+            self.loader = lambda x: x
+            print('HybridLoader: ext is ignored')
         else:
             self.db_type = 'dir'
 
@@ -43,6 +48,8 @@ def get(self, key):
             with env.begin(write=False) as txn:
                 byteflow = txn.get(key)
             f_input = six.BytesIO(byteflow)
+        elif self.db_type == 'pth':
+            f_input = self.feat_file[key]
         else:
             f_input = os.path.join(self.db_path, key + self.ext)
 

eval.py

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@
 
 
 # Set sample options
+opt.dataset = opt.input_json
 loss, split_predictions, lang_stats = eval_utils.eval_split(model, crit, loader,
     vars(opt))
 

eval_utils.py

Lines changed: 4 additions & 1 deletion
@@ -28,7 +28,10 @@ def count_bad(sen):
 def language_eval(dataset, preds, model_id, split):
     import sys
     sys.path.append("coco-caption")
-    annFile = 'coco-caption/annotations/captions_val2014.json'
+    if 'coco' in dataset:
+        annFile = 'coco-caption/annotations/captions_val2014.json'
+    elif 'flickr30k' in dataset or 'f30k' in dataset:
+        annFile = 'coco-caption/f30k_captions4eval.json'
     from pycocotools.coco import COCO
     from pycocoevalcap.eval import COCOEvalCap
 

scripts/prepro_labels.py

Lines changed: 6 additions & 3 deletions
@@ -168,9 +168,12 @@ def main(params):
 
     jimg = {}
     jimg['split'] = img['split']
-    if 'filename' in img: jimg['file_path'] = os.path.join(img['filepath'], img['filename']) # copy it over, might need
-    if 'cocoid' in img: jimg['id'] = img['cocoid'] # copy over & mantain an id, if present (e.g. coco ids, useful)
-
+    if 'filename' in img: jimg['file_path'] = os.path.join(img.get('filepath', ''), img['filename']) # copy it over, might need
+    if 'cocoid' in img:
+        jimg['id'] = img['cocoid'] # copy over & maintain an id, if present (e.g. coco ids, useful)
+    elif 'imgid' in img:
+        jimg['id'] = img['imgid']
+
     if params['images_root'] != '':
         with Image.open(os.path.join(params['images_root'], img['filepath'], img['filename'])) as _img:
             jimg['width'], jimg['height'] = _img.size

scripts/prepro_reference_json.py

Lines changed: 89 additions & 0 deletions

# coding: utf-8
"""
Build a COCO-style reference (annotation) json from a Karpathy-format dataset json
(e.g. data/dataset_flickr30k.json), so that coco-caption can be used for evaluation.

Input: a json file with an 'images' list, where each image has a 'split', an id
('cocoid' or 'imgid'), and a list of 'sentences', each with a 'tokens' list.

Output: a json file with 'info', 'licenses', 'type', 'images' and 'annotations'
fields, laid out like coco-caption's captions_val2014.json and containing all
non-train images together with their reference captions.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import json
import argparse
import sys
import hashlib
from random import shuffle, seed


def main(params):

    imgs = json.load(open(params['input_json'][0], 'r'))['images']
    # tmp = []
    # for k in imgs.keys():
    #     for img in imgs[k]:
    #         img['filename'] = img['image_id'] # k+'/'+img['image_id']
    #         img['image_id'] = int(
    #             int(hashlib.sha256(img['image_id']).hexdigest(), 16) % sys.maxint)
    #         tmp.append(img)
    # imgs = tmp

    # create output json file
    # (boilerplate header mimicking captions_val2014.json; 'images' and 'annotations' are filled in below)
    out = {u'info': {u'description': u'This is stable 1.0 version of the 2014 MS COCO dataset.', u'url': u'http://mscoco.org', u'version': u'1.0', u'year': 2014, u'contributor': u'Microsoft COCO group', u'date_created': u'2015-01-27 09:11:52.357475'}, u'licenses': [{u'url': u'http://creativecommons.org/licenses/by-nc-sa/2.0/', u'id': 1, u'name': u'Attribution-NonCommercial-ShareAlike License'}, {u'url': u'http://creativecommons.org/licenses/by-nc/2.0/', u'id': 2, u'name': u'Attribution-NonCommercial License'}, {u'url': u'http://creativecommons.org/licenses/by-nc-nd/2.0/', u'id': 3, u'name': u'Attribution-NonCommercial-NoDerivs License'}, {u'url': u'http://creativecommons.org/licenses/by/2.0/', u'id': 4, u'name': u'Attribution License'}, {u'url': u'http://creativecommons.org/licenses/by-sa/2.0/', u'id': 5, u'name': u'Attribution-ShareAlike License'}, {u'url': u'http://creativecommons.org/licenses/by-nd/2.0/', u'id': 6, u'name': u'Attribution-NoDerivs License'}, {u'url': u'http://flickr.com/commons/usage/', u'id': 7, u'name': u'No known copyright restrictions'}, {u'url': u'http://www.usa.gov/copyright.shtml', u'id': 8, u'name': u'United States Government Work'}], u'type': u'captions'}
    out.update({'images': [], 'annotations': []})

    cnt = 0
    empty_cnt = 0
    for i, img in enumerate(imgs):
        if img['split'] == 'train':
            continue
        out['images'].append(
            {u'id': img.get('cocoid', img['imgid'])})
        for j, s in enumerate(img['sentences']):
            if len(s['tokens']) == 0:  # skip empty captions
                continue
            s = ' '.join(s['tokens'])
            out['annotations'].append(
                {'image_id': out['images'][-1]['id'], 'caption': s, 'id': cnt})
            cnt += 1

    json.dump(out, open(params['output_json'], 'w'))
    print('wrote ', params['output_json'])


if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # input json
    parser.add_argument('--input_json', nargs='+', required=True,
                        help='input dataset json file(s) in Karpathy format')
    parser.add_argument('--output_json', default='data.json',
                        help='output json file')

    args = parser.parse_args()
    params = vars(args)  # convert to ordinary dict
    print('parsed input parameters:')
    print(json.dumps(params, indent=2))
    main(params)
