Bei Li

I received my Ph.D. from the Department of Computer Science and Technology at Northeastern University, China. I work in the Natural Language Processing Lab under the supervision of Prof. Tong Xiao and Prof. Jingbo Zhu.

I received my bachelor's degree in 2017 from Northeastern University, majoring in Computer Science and Technology, and my master's degree in 2020 from Northeastern University, majoring in Computer Software and Theory.

I joined the NEUNLP LAB in the fourth year of my undergraduate studies. My research interests include complex architecture modeling, deep Transformers, multimodal modeling, and machine learning. Currently, I am focusing on large language models and have completed two papers: one on prompt engineering (DTG) and the other on prompt search (EvoPrompt). Feel free to contact me if you are interested in what I am doing or have any questions to discuss!

Email  /  Resume  /  Google Scholar  /  Github

Research

My primary focus lies in sequence generation tasks, including machine translation, abstractive summarization, and more. Presently, my central research objective is developing parameter-efficient backbones for natural language processing, striving for models that achieve superior performance with a minimal computational footprint. I am also interested in designing highly effective prompts for large language models to activate their underlying abilities.

News
  • [Oct'2024] One paper accepted by NeurIPS 2024.
  • [Sep'2024] Five main-conference papers accepted by EMNLP 2024.
  • [May'2024] Three papers (1 main and 2 Findings) accepted by ACL 2024.
  • [Dec'2023] One paper on automatic prompt search accepted by ICLR 2024.
  • [Dec'2023] One paper on speech translation accepted by ICASSP 2024.
  • [Dec'2023] One paper on RL sampling accepted by AAAI 2024.
  • [Oct'2023] Two papers (1 main and 1 Findings) accepted by EMNLP 2023.
  • [May'2023] Three papers (1 main and 2 Findings) accepted by ACL 2023.
  • [Dec'2022] Finished my internship at NLC and started a new internship at the Machine Learning (ML) group.
  • [May'2022] Started my internship at Microsoft Research Asia (MSRA), Natural Language Computing (NLC).
  • [Apr'2022] One paper on learning multiscale Transformer models for sequence generation accepted by ICML 2022.
  • [Feb'2022] Two papers on parameter-efficient backbone and multimodal machine translation accepted by ACL 2022.
  • [Apr'2021] One paper on knowledge distillation accepted to ACL 2021.
  • [Nov'2020] One paper on deep Transformer compression accepted to AAAI 2021.
  • [Sep'2020] One paper on shallow-to-deep training for deep Transformer models accepted to EMNLP 2020.
  • [Apr'2020] One paper on context-aware machine translation accepted to ACL 2020.
  • [May'2019] One paper on the NiuTrans submission of WMT19 accepted to WMT 2019.
  • [May'2019] One paper on learning deep Transformer models accepted to ACL 2019.
Publications
Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning
Bei Li, Tong Zheng, Rui Wang, Jiahao Liu, Qingyan Guo, Junliang Guo, Xu Tan, Tong Xiao, Jingbo Zhu, Jingang Wang and Xunliang Cai
The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
[pdf] / [code]

In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error relative to the true "solution." First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT'14 English-German and English-French tasks, our model achieves BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU while using only one-third of the parameters. Notably, it also beats LLaMA models by 5.7 accuracy points on the LM Harness evaluation.
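
For readers unfamiliar with the numerical-analysis background, the sketch below shows a textbook (Heun-type) predictor-corrector step. It is only meant to illustrate the general predict-then-correct idea; it is not the paper's exact high-order formulation, and the EMA-learned coefficients are only indicated in the comment.

```latex
% Textbook (Heun-type) predictor-corrector step for y' = F(y) with step size h:
\begin{aligned}
\tilde{y}_{n+1} &= y_n + h\,F(y_n)
  && \text{(predictor: explicit guess)} \\
y_{n+1} &= y_n + \tfrac{h}{2}\,\bigl(F(y_n) + F(\tilde{y}_{n+1})\bigr)
  && \text{(corrector: refined update)}
\end{aligned}
% The paper instead uses a higher-order predictor whose combination coefficients
% are learned via an exponential moving average rather than fixed constants.
```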

Connecting large language models with evolutionary algorithms yields powerful prompt optimizers
Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, Yujiu Yang
The Twelfth International Conference on Learning Representations (ICLR), 2024
[pdf] / [code]

In this paper, we introduce EvoPrompt, a novel framework for optimizing discrete prompts in Large Language Models (LLMs) using evolutionary algorithms (EAs). This method effectively integrates the language processing strengths of LLMs with the optimization capabilities of EAs, eliminating the need for gradients or parameters. EvoPrompt starts with a set of prompts and evolves them through LLM-based iterations, showing significant improvement over human-crafted prompts and existing automated methods by up to 25% and 14% respectively. Tested on nine datasets across language understanding and generation tasks with both closed- and open-source LLMs like GPT-3.5 and Alpaca, EvoPrompt demonstrates the potential for further research in combining LLMs with traditional algorithms.
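
As a rough illustration of the loop described above (not the actual implementation), the following sketch treats the LLM as the evolutionary operator; `llm_generate` and `score_on_dev` are hypothetical stand-ins for an LLM completion call and a development-set metric, and the crossover/mutation instruction is illustrative only.

```python
import random

def evolve_prompts(init_prompts, llm_generate, score_on_dev,
                   generations=10, population_size=10):
    """Evolve a population of task prompts with an LLM as the variation operator.

    llm_generate(text) -> str     : hypothetical LLM completion call
    score_on_dev(prompt) -> float : hypothetical dev-set score for a prompt
    """
    population = list(init_prompts)
    for _ in range(generations):
        # Evaluate every prompt on the development set (no gradients needed).
        scored = sorted(population, key=score_on_dev, reverse=True)
        parents = scored[: population_size // 2]

        # Ask the LLM to "crossover" and "mutate" two parent prompts in words.
        children = []
        while len(children) < population_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            instruction = (
                "Combine the useful parts of Prompt A and Prompt B, then make a "
                "small variation, and output a single new prompt.\n"
                f"Prompt A: {p1}\nPrompt B: {p2}\nNew prompt:"
            )
            children.append(llm_generate(instruction).strip())

        # Survivor selection: keep the best parents plus the new children.
        population = parents + children
    return max(population, key=score_on_dev)
```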

Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs
Bei Li*, Yuxin Zuo*, Chuanhao Lv, Tong Zheng, Tong Xiao and Jingbo Zhu
The 2023 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP), 2023
[pdf] / [code]

Building on our previous work presented at ACL 2022, this study aims to enhance cross-modal interaction in language models. We propose a new approach that generates Visual Question Answering (VQA)-style pairs from text and incorporates probing signals during the training process. Our extensive experiments confirm that this multi-task learning framework effectively alleviates the issue of insufficient cross-modal interaction, offering a significant advancement over our prior work.

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation
Yuhao Zhang, Chen Xu, Bei Li, Tong Xiao and Jingbo Zhu
The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
[pdf] / [code]

In this work, we find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations. Furthermore, we propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.

Deliberate then Generate: Enhanced Prompting Framework for Text Generation
Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao, Jiang Bian and Jingbo Zhu
In progress, coming soon.
[pdf] / [code]

We encourage the model to deliberate by proposing a novel Deliberate then Generate (DTG) prompting framework, which consists of error detection instructions and candidates that may contain errors. DTG is a simple yet effective technique that can be applied to various text generation tasks with minimal modifications. We conduct extensive experiments on 20+ datasets across 7 text generation tasks, including summarization, translation, dialogue, and more. We show that DTG consistently outperforms existing prompting methods and achieves state-of-the-art performance on multiple text generation tasks.
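
The snippet below is a hypothetical example of what a deliberate-then-generate style prompt could look like for translation; the template wording and the use of a candidate output are illustrative assumptions, not the exact prompt from the paper.

```python
def dtg_prompt(source_sentence, candidate_translation):
    """Build a deliberate-then-generate style prompt: the model is first asked
    to detect possible errors in a candidate output, then to produce its own."""
    return (
        "You are a careful translator.\n"
        f"Source: {source_sentence}\n"
        f"Candidate translation: {candidate_translation}\n"
        "Step 1: List any errors in the candidate translation (or say 'none').\n"
        "Step 2: Based on your analysis, write the best translation of the source.\n"
        "Answer:"
    )
```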

Augmenting Large Language Model Translators via Translation Memories
Yongyu Mu, Abudurexiti Reheman, Zhiquan Cao, Yuchun Fan, Bei Li, Yinqiao Li, Tong Xiao, Chunliang Zhang and Jingbo Zhu
61st Annual Meeting of the Association for Computational Linguistics (Findings of ACL), 2023
[pdf] / [code]

In-context learning (ICL) augments the capabilities of large language models (LLMs) in various downstream tasks by leveraging input and output exemplars. This paper explores the use of translation memory (TM) as a form of prompting to aid LLMs in machine translation tasks. Notably, the LLM's inherent ability to comprehend these prompts significantly bolsters the use of TM. Experimental results indicate that incorporating TM considerably enhances the translation proficiency of the LLM, elevating its BLEU score to levels commensurate with state-of-the-art neural machine translation systems.
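
Below is a simplified, assumed illustration of how retrieved translation-memory pairs might be packed into an in-context prompt; the retrieval step and formatting are placeholders for illustration rather than the setup used in the paper.

```python
def tm_prompt(source, tm_pairs):
    """Format retrieved translation-memory pairs (src, tgt) as in-context
    exemplars, followed by the sentence to translate."""
    lines = ["Translate the last sentence, following the examples."]
    for src, tgt in tm_pairs:  # tm_pairs: fuzzy matches retrieved from a TM index
        lines.append(f"Source: {src}\nTranslation: {tgt}")
    lines.append(f"Source: {source}\nTranslation:")
    return "\n\n".join(lines)
```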

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che and Nan Duan
61st Annual Meeting of the Association for Computational Linguistics (ACL, Oral), 2023
[pdf] / [code]

We propose ManagerTower, a novel Vision-Language model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms prior work.

TranSFormer: Slow-Fast Transformer for Machine Translation
Bei Li, Yi Jing, Xu Tan, Zhen Xing, Tong Xiao and Jingbo Zhu
61st Annual Meeting of the Association for Computational Linguistics (Findings of ACL), 2023
[pdf] / [code]

Building upon our previous ICML work, we refine the extraction of fine-grained character-level features by developing a multiscale Transformer model with a two-branch architecture. The Slow-Fast framework effectively mitigates the computational overhead associated with capturing long-term dependencies among character-level sequences, while employing a cross-granularity attention mechanism to learn interactions between the fast and slow branches. Comprehensive experiments conducted on multiple machine translation benchmarks attest to the efficacy of our proposed TranSFormer model.

Learning Multiscale Transformer Models for Sequence Generation
Bei Li, Tong Zheng, Yi Jing, Chengbo Jiao, Tong Xiao and Jingbo Zhu
International Conference on Machine Learning (ICML, Spotlight), 2022
[pdf] / [code]

We redefine the concept of scale for NLP, covering the sub-word, word, and phrase scales. Our intention is to leverage word boundaries and phrase-level prior knowledge to compensate for sub-word features. We then establish the relationships among different scales, resulting in a multiscale Transformer model.

On Vision Features in Multimodal Machine Translation
Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong Xiao, Anxiang Ma and Jingbo Zhu
60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022
[pdf] / [code]

This work investigates the effect of vision features in multimodal machine translation (MMT). We propose three probing tasks to evaluate MMT systems, which can help future researchers. The main contribution is revealing the importance of strong vision features.

ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation
Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao and Jingbo Zhu
60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022
[pdf] / [code]

This work establishes the relationship between ODEs and the design of the Transformer architecture. Inspired by the lower truncation error achieved by high-order ODE solvers, we redesign the Transformer accordingly. The ODE Transformer delivers much better translation performance within the same model capacity, and experimental results on three sequence generation tasks demonstrate its effectiveness.

Weight Distillation: Transferring the Knowledge in Neural Network Parameters
Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu
59th Annual Meeting of the Association for Computational Linguistics (ACL, Oral), 2021
[pdf] / [code]

This work attempts to further enhance the standard sequence-level knowledge distillation method by taking full advantage of the teacher parameters and generating the parameters for the student.

Learning Light-Weight Translation Models from Deep Transformer
Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, Jingbo Zhu
Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021
[pdf] / [code]

This work attempts to learn a lightweight translation model from a deep Transformer teacher network. It introduces a group-permutation based knowledge distillation method to compress a strong deep Transformer teacher into a much shallower counterpart with only a minor BLEU degradation. Furthermore, to enhance the teacher network, we also propose a skipping sub-layer regularization method that randomly omits some sub-layers vertically during training. Both methods are readily applicable to the teacher training process.

Shallow-to-Deep Training for Neural Machine Translation
Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang, Jingbo Zhu
The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
[pdf] / [code]

Deep Transformer systems have been widely investigated in the MT community recently. However, as models go deeper, a crucial challenge is the huge memory cost and extremely long training time. We investigate the behavior of trained systems and find that adjacent layers behave similarly. Thus, we propose a shallow-to-deep training method, instead of learning from scratch, which speeds up training by up to 1.5 times with no loss in BLEU.
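
As an informal sketch of the idea (the layer-copying scheme shown here is an assumption, not necessarily the one used in the paper), a shallow encoder can be grown into a deeper one by duplicating its trained layers and then continuing training:

```python
import copy
import torch.nn as nn

def deepen_encoder(trained_layers: nn.ModuleList, extra: int) -> nn.ModuleList:
    """Initialize a deeper encoder from a trained shallow one by copying layers.

    The shallow stack is trained first; its layers are then duplicated
    (here, cycling down from the top layer) to fill the added depth, and
    training continues on the deeper model instead of starting from scratch.
    """
    copies = [copy.deepcopy(trained_layers[-(i % len(trained_layers)) - 1])
              for i in range(extra)]
    return nn.ModuleList(list(trained_layers) + copies)
```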

Does Multi-Encoder Help? A Case Study on Context-Aware Neural Machine Translation
Bei Li, Hui Liu, Ziyang Wang, Yufan Jiang, Tong Xiao, Jingbo Zhu, Tongran Liu, Changliang Li
58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020
[pdf] / [code]

We investigate a commonly used multi-encoder framework for document-level machine translation, which employs an additional context encoder to capture the relationship between the current sentence and its contextual information. However, through specially designed context inputs, we find that the context encoder acts more like a noise generator than an encoder of contextual information, similar to dropout. Notably, when we turn off the context encoder during inference, we even observe slight improvements in BLEU.

Learning Deep Transformer Models for Machine Translation
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, Lidia S Chao
57th Annual Meeting of the Association for Computational Linguistics (ACL, Oral), 2019
[pdf] / [code]

This work studies deep encoders in the Transformer and mathematically explains the importance of the location of layer normalization for deep models. It also proposes a novel connection schema to successfully train a 30-layer Transformer, the deepest encoder at that time. It is also one of the most highly cited NMT papers.

The NiuTrans Machine Translation Systems for WMT19
Bei Li, Yinqiao Li, Chen Xu, Ye Lin, Jiqiang Liu, Hui Liu, Ziyang Wang, Yuhao Zhang, Nuo Xu, Zeyang Wang, Kai Feng, Hexuan Chen, Tengbo Liu, Yanyang Li, Qiang Wang, Tong Xiao, Jingbo Zhu
Fourth Conference on Machine Translation (WMT, Workshop of ACL), 2019
[pdf] / [code]

It describes the NiuTrans submissions to WMT 2019 on both supervised and unsupervised tasks, covering 13 language directions. The paper details the model architectures, data augmentation methods, ensemble knowledge distillation, and system combination strategies.

Honors & Awards

  • Baidu Scholarship Finalist.
    2022
  • National Scholarship (Ph.D.).
    2022
  • Top Ten Graduate Students of Northeastern University (May 4th Medal).
    2022
  • National Scholarship (Ph.D.).
    2021
  • Outstanding Reviewer of EMNLP 2021.
    2021
  • 1st rank in Chinese-English translation (human evaluation) at WMT21.
    2021
  • Excellent Master's Thesis of Liaoning Province.
    2020
  • Excellent Master's Graduate of Liaoning Province.
    2020
  • Excellent Master's Graduate of Northeastern University.
    2020
  • 1st rank in Japanese-English news translation (human evaluation) at WMT20.
    2020
  • National Scholarship (Master).
    2019
  • 1st rank in 3 news translation directions (automatic evaluation) at WMT19.
    2019
  • 2nd rank in 3 news translation directions (automatic evaluation) at WMT19.
    2019
  • National Scholarship (Master, Rank 1/230).
    2018
  • Excellent Graduate of Shenyang.
    2018
  • 1st rank in Chinese-English news translation (human evaluation) at WMT18.
    2018
  • 2nd rank in English-Chinese news translation (automatic evaluation) at WMT18.
    2018

Intern Experiences
Research Intern, Microsoft Research Asia, Natural Language Computing
May. 2022 - Dec. 2022
Advisor: Chenfei Wu
Text-to-Image Generation, Diffusion Models, Multimodal Modeling
Research Intern, Microsoft Research Asia, Machine Learning
Dec. 2022 - Nov. 2023
Advisor: Xu Tan, Rui Wang
Machine Translation, Ordinary Differential Equation, Large Language Models
Professional activities

  • Conference Reviewer for ACL, EMNLP, ICML, ICLR, NeurIPS, AAAI, IJCAI, NAACL, COLING, EACL

       © Bei Li | Last updated: Nov. 2023.