Dataset: zouharvi/bio-mqm-dataset
We introduce a new, extensive dataset annotated with Multidimensional Quality Metrics (MQM), covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics that are fine-tuned on human-generated MT quality judgments are robust to domain shifts between training and inference. We find that, in the unseen-domain scenario, fine-tuned metrics exhibit a substantial performance drop relative to metrics that rely on the surface form and to pre-trained metrics that are not fine-tuned on MT quality judgments.
arXiv: 2402.18747