(#: equal contribution; ★: open resources available on GitHub)
[Preprints]
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general setting is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using our simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness (i.e., action accuracy) of multimodal agents, our findings indicate that these agents are prone to environmental distractions, resulting in unfaithful behaviors. Furthermore, we switch to the adversarial perspective and implement environment injection, demonstrating that such unfaithfulness can be exploited, leading to unexpected risks.
@article{ma2024caution, title={Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions}, author={Ma, Xinbei and Wang, Yiting and Yao, Yao and Yuan, Tongxin and Zhang, Aston and Zhang, Zhuosheng and Zhao, Hai}, journal={arXiv preprint arXiv:2408.02544}, year={2024} }
The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper, we investigate this critical issue by constructing a detailed threat model and a comprehensive simulation environment that mirrors real-world multi-agent deployments in a trusted platform. Subsequently, we propose a novel two-stage attack method involving Persuasiveness Injection and Manipulated Knowledge Injection to systematically explore the potential for manipulated knowledge (i.e., counterfactual and toxic knowledge) spread without explicit prompt manipulation. Our method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which attackers can exploit to make agents unwittingly spread fabricated information. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation frameworks, where several benign agents store and retrieve manipulated chat histories for future interactions. This persistence indicates that even after the interaction has ended, the benign agents may continue to be influenced by manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, emphasizing the imperative need for robust defenses against manipulated knowledge spread, such as introducing "guardian" agents and advanced fact-checking tools.
@article{ju2024flooding, title={Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities}, author={Ju, Tianjie and Wang, Yiting and Ma, Xinbei and Cheng, Pengzhou and Zhao, Haodong and Wang, Yulong and Liu, Lifeng and Xie, Jian and Zhang, Zhuosheng and Liu, Gongshen}, journal={arXiv preprint arXiv:2407.07791}, year={2024} }
Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in natural language processing (NLP). Backdoor attacks were among the first to show that LLMs can be substantially harmed at every stage of their pipeline, but their cost and robustness have been criticized: attacking LLMs is inherently risky under security review, while prohibitively expensive, and the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which employs a joint backdoor attack on Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, constraining the triggering conditions to a parameter subspace to improve matching. To improve the recall of the RAG for the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both the attacker's and the user's perspectives, and further verify whether the context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG exhibits versatile threats while maintaining retrieval capability on normal queries.
@article{cheng2024trojanrag, title={TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models}, author={Cheng, Pengzhou and Ding, Yidong and Ju, Tianjie and Wu, Zongru and Du, Wei and Yi, Ping and Zhang, Zhuosheng and Liu, Gongshen}, journal={arXiv preprint arXiv:2405.13401}, year={2024} }
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, they also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This position paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
@article{tang2024prioritizing, title={Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science}, author={Tang, Xiangru and Jin, Qiao and Zhu, Kunlun and Yuan, Tongxin and Zhang, Yichi and Zhou, Wangchunshu and Qu, Meng and Zhao, Yilun and Tang, Jian and Zhang, Zhuosheng and Cohan, Arman and Lu, Zhiyong and Gerstein, Mark}, journal={arXiv preprint arXiv:2402.04247}, year={2024} }
Large language models (LLMs) have dramatically enhanced the field of language intelligence, as evidenced by their formidable empirical performance across a spectrum of complex reasoning tasks. Additionally, theoretical proofs have illuminated their emergent reasoning capabilities, providing a compelling showcase of their advanced cognitive abilities in linguistic contexts. Critical to their remarkable efficacy in handling complex reasoning tasks, LLMs leverage the intriguing chain-of-thought (CoT) reasoning technique, which obliges them to formulate intermediate steps en route to deriving an answer. CoT reasoning has proven effective not only in amplifying reasoning performance but also in enhancing interpretability, controllability, and flexibility. In light of these merits, recent research has extended CoT reasoning methodologies to nurture the development of autonomous language agents, which adeptly adhere to language instructions and execute actions within varied environments. This survey offers a thorough discussion of vital research dimensions, encompassing: (i) the foundational mechanics of CoT techniques, with a focus on elucidating the circumstances and justification behind their efficacy; (ii) the paradigm shift in CoT; and (iii) the burgeoning of language agents fortified by CoT approaches. Prospective research avenues include explorations into generalization, efficiency, customization, scaling, and safety. We hope to offer readers a comprehensive understanding of prevalent research areas such as CoT reasoning and language agents and to illuminate the interconnections between these areas. This paper caters to a wide audience, including beginners seeking comprehensive knowledge of CoT reasoning and language agents, as well as experienced researchers interested in foundational mechanics and cutting-edge discussions on these topics. A repository of related papers is available at https://github.com/Zoeyyao27/CoT-Igniting-Agent.
@article{zhang2023igniting, title={Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents}, author={Zhang, Zhuosheng and Yao, Yao and Zhang, Aston and Tang, Xiangru and Ma, Xinbei and He, Zhiwei and Wang, Yiming and Gerstein, Mark and Wang, Rui and Liu, Gongshen and others}, journal={arXiv preprint arXiv:2311.11797}, year={2023} }
Real-world data deviating from the independent and identically distributed (i.i.d.) assumption of in-distribution training data poses security threats to deep networks, thus advancing out-of-distribution (OOD) detection algorithms. Detection methods in generative language models (GLMs) mainly focus on uncertainty estimation and embedding distance measurement, with the latter proven to be most effective in traditional linguistic tasks like summarization and translation. However, another complex generative scenario, mathematical reasoning, poses significant challenges to embedding-based methods due to the high density of its output space, yet this very feature causes larger discrepancies in the embedding shift trajectories of different samples in latent space. Hence, we propose TV score, a trajectory-based method that uses trajectory volatility for OOD detection in mathematical reasoning. Experiments show that our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios and can be extended to more applications with high-density features in output spaces, such as multiple-choice questions.
@article{wang2024trajectory, title={Trajectory Volatility for Out-of-Distribution Detection in Mathematical Reasoning}, author={Wang, Yiming and Zhang, Pei and Yang, Baosong and Wong, Derek F and Zhang, Zhuosheng and Wang, Rui}, journal={arXiv preprint arXiv:2405.14039}, year={2024} }
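The trajectory idea above lends itself to a compact sketch. The Python fragment below is a minimal, illustrative take on trajectory volatility, assuming per-layer hidden states have already been extracted; the paper's exact TV score definition and threshold calibration differ in detail.

    import numpy as np

    def trajectory_volatility(hidden_states: np.ndarray) -> float:
        # hidden_states: (num_layers, hidden_dim) -- one sample's embedding
        # "trajectory" across the model's layers.
        shifts = np.linalg.norm(np.diff(hidden_states, axis=0), axis=1)
        # Volatility of the layer-to-layer shifts; the paper builds its TV
        # score on this kind of trajectory signal.
        return float(np.var(shifts))

    def is_ood(hidden_states: np.ndarray, threshold: float) -> bool:
        # threshold is an assumption, e.g., tuned on in-distribution data.
        return trajectory_volatility(hidden_states) > threshold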
Large language models (LLMs) have played a pivotal role in building communicative AI to imitate human behaviors but face the challenge of efficient customization. To tackle this challenge, recent studies have delved into the realm of model editing, which manipulates specific memories of language models and changes the related language generation. However, the robustness of model editing remains an open question. This work seeks to understand the strengths and limitations of editing methods, thus facilitating robust, realistic applications of communicative AI. Concretely, we conduct extensive analysis to address the three key research questions. Q1: Can edited LLMs behave consistently resembling communicative AI in realistic situations? Q2: To what extent does the rephrasing of prompts lead LLMs to deviate from the edited knowledge memory? Q3: Which knowledge features are correlated with the performance and robustness of editing? Our experimental results uncover a substantial disparity between existing editing methods and the practical application of LLMs. On rephrased prompts that are complex and flexible but common in realistic applications, the performance of editing experiences a significant decline. Further analysis shows that more popular knowledge is memorized better, easier to recall, and more challenging to edit effectively.
@article{ma2024possible, title={Is it Possible to Edit Large Language Models Robustly?}, author={Ma, Xinbei and Ju, Tianjie and Qiu, Jiyang and Zhang, Zhuosheng and Zhao, Hai and Liu, Lifeng and Wang, Yulong}, journal={arXiv preprint arXiv:2402.05827}, year={2024} }
Despite the rapid progress of large language models (LLMs), their task performance remains sensitive to prompt design. Recent studies have explored leveraging the LLM itself as an optimizer to identify optimal prompts that maximize task accuracy. However, when evaluating prompts, such approaches heavily rely on manually annotated gold labels, which are often elusive, to calculate task accuracy for each candidate prompt; this hinders widespread implementation and generality. To overcome this limitation, this work proposes gold label-agnostic prompt evaluation (GLaPE) to alleviate dependence on gold labels. Motivated by the observed correlation between self-consistency and answer accuracy, we adopt self-consistency as the initial evaluation score. Subsequently, we refine the scores of prompts producing identical answers to be mutually consistent. Experimental results show that GLaPE provides reliable evaluations uniform with accuracy, even in the absence of gold labels. Moreover, on six popular reasoning tasks, our GLaPE-based prompt optimization yields effective prompts comparable to accuracy-based ones.
@article{zhang2024glape, title={GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Model}, author={Zhang, Xuanchang and Zhang, Zhuosheng and Zhao, Hai}, journal={arXiv preprint arXiv:2402.02408}, year={2024} }
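As a rough sketch of the GLaPE procedure, the Python fragment below computes self-consistency as the initial label-free score and then averages scores across prompts whose majority answers coincide; sample is an assumed LLM sampler, and the refinement rule is simplified relative to the paper.

    from collections import Counter, defaultdict

    def majority_and_consistency(prompt, question, sample, n=8):
        # sample(prompt, question) -> answer string from any stochastic LLM.
        answers = [sample(prompt, question) for _ in range(n)]
        answer, count = Counter(answers).most_common(1)[0]
        return answer, count / n

    def glape_scores(prompts, questions, sample):
        # Step 1: self-consistency as the gold-label-agnostic initial score.
        majority, score = {}, {}
        for p in prompts:
            per_q = [majority_and_consistency(p, q, sample) for q in questions]
            majority[p] = tuple(a for a, _ in per_q)
            score[p] = sum(c for _, c in per_q) / len(per_q)
        # Step 2: prompts producing identical answers receive mutually
        # consistent (here: averaged) scores.
        groups = defaultdict(list)
        for p in prompts:
            groups[majority[p]].append(p)
        for members in groups.values():
            mean = sum(score[p] for p in members) / len(members)
            for p in members:
                score[p] = mean
        return score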
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Whereas most prior studies center on the safety of LLM-generated content, this work addresses the imperative need to benchmark the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging safety risks given agent interaction records. R-Judge comprises 162 agent interaction records, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. It incorporates human consensus on safety with annotated safety risk labels and high-quality risk descriptions. Utilizing R-Judge, we conduct a comprehensive evaluation of 8 prominent LLMs commonly employed as the backbone for agents. The best-performing model, GPT-4, achieves 72.29% in contrast to the human score of 89.38%, showing considerable room for enhancing the risk awareness of LLMs. Notably, leveraging risk descriptions as environment feedback significantly improves model performance, revealing the importance of salient safety risk feedback. Furthermore, we design an effective chain-of-safety-analysis technique to aid the judgment of safety risks and conduct an in-depth case study to facilitate future research. R-Judge is publicly available at https://github.com/Lordog/R-Judge.
@article{yuan2024r, title={R-Judge: Benchmarking Safety Risk Awareness for LLM Agents}, author={Yuan, Tongxin and He, Zhiwei and Dong, Lingzhong and Wang, Yiming and Zhao, Ruijie and Xia, Tian and Xu, Lizhen and Zhou, Binglin and Li, Fangqi and Zhang, Zhuosheng and Wang, Rui and Liu, Gongshen}, journal={arXiv preprint arXiv:2401.10019}, year={2024} }
The advent of large language models (LLMs) has spurred considerable interest in advancing autonomous LLM-based agents, particularly in intriguing applications within smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. However, a key challenge lies in devising effective plans to guide action prediction in GUI tasks, even though planning has been widely recognized as effective for decomposing complex tasks into a series of steps. Specifically, given the dynamic nature of environmental GUIs following action execution, it is crucial to dynamically adapt plans based on environmental feedback and action history. We show that the widely used ReAct approach fails due to excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT involves the dynamic adjustment of planning based on environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpasses the strong GPT-4V baseline by +12.7% (34.66% → 47.36%) in accuracy. The analysis highlights the generality of dynamic planning across different backbone LLMs, as well as its benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at https://github.com/sqzhang-lazy/D-PoT.
@article{zhang2024dynamic, title={Dynamic Planning for LLM-based Graphical User Interface Automation}, author={Zhang, Shaoqing and Zhang, Zhuosheng and Chen, Kehai and Ma, Xinbei and Yang, Muyun and Zhao, Tiejun and Zhang, Min}, journal={arXiv preprint arXiv:2410.00467}, year={2024} }
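A minimal sketch of the dynamic-planning loop follows; llm and the env interface (observe/execute) are assumed placeholders, and the prompts are illustrative rather than the paper's exact templates.

    def run_gui_agent(goal, env, llm, max_steps=20):
        # llm(prompt) -> str; env.observe() -> screen description;
        # env.execute(action) -> feedback string. All assumed interfaces.
        history = []
        for _ in range(max_steps):
            screen = env.observe()
            # Re-plan at every step from the goal, the current screen, and
            # the execution history -- the core of dynamic planning.
            plan = llm(f"Goal: {goal}\nScreen: {screen}\nHistory: {history}\n"
                       "Update the remaining plan as a numbered list.")
            action = llm(f"Plan: {plan}\nScreen: {screen}\nNext single action:")
            feedback = env.execute(action)
            history.append((action, feedback))
            if "TASK_COMPLETE" in feedback:
                break
        return history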
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17% → 91.68%) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at https://github.com/amazon-science/mm-cot.
@article{zhang2023multicot, title={Multimodal Chain-of-Thought Reasoning in Language Models}, author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Zhao, Hai and Karypis, George and Smola, Alex}, journal={arXiv preprint arXiv:2302.00923}, year={2023} }
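The two-stage framework can be summarized in a few lines; model.generate is an assumed interface over a fused vision-language model, not the released code's API.

    def multimodal_cot(model, question, context, image_features):
        # Stage 1: rationale generation, conditioned on text + vision.
        rationale = model.generate(text=f"{question} {context}",
                                   vision=image_features, target="rationale")
        # Stage 2: answer inference, conditioned on the same inputs plus the
        # generated rationale, so vision features ground both stages.
        answer = model.generate(text=f"{question} {context} {rationale}",
                                vision=image_features, target="answer")
        return rationale, answer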
Recent work has showcased the powerful capability of large language models (LLMs) in recalling knowledge and reasoning. However, the reliability of LLMs in combining these two capabilities into reasoning through multi-hop facts has not been widely explored. This paper systematically investigates the possibilities for LLMs to utilize shortcuts based on direct connections between the initial and terminal entities of multi-hop knowledge. We first explore the existence of factual shortcuts through Knowledge Neurons, revealing that: (i) the strength of factual shortcuts is highly correlated with the frequency of co-occurrence of initial and terminal entities in the pre-training corpora; (ii) few-shot prompting leverages more shortcuts than chain-of-thought prompting in answering multi-hop questions. Then, we analyze the risks posed by factual shortcuts from the perspective of multi-hop knowledge editing. Analysis shows that approximately 20% of the failures are attributed to shortcuts, and the initial and terminal entities in these failure instances usually have higher co-occurrences in the pre-training corpus. Finally, we propose erasing shortcut neurons to mitigate the associated risks and find that this approach significantly reduces failures in multi-hop knowledge editing caused by shortcuts.
@article{ju2024investigating, title={Investigating Multi-Hop Factual Shortcuts in Knowledge Editing of Large Language Models}, author={Ju, Tianjie and Chen, Yijin and Yuan, Xinwei and Zhang, Zhuosheng and Du, Wei and Zheng, Yubin and Liu, Gongshen}, journal={arXiv preprint arXiv:2402.11900}, year={2024} }
Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is susceptible to backdoor attacks. Prior research attempts to mitigate backdoor learning while training the LMs on the poisoned dataset, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoor LMs in the frequency space by Fourier analysis. Our findings indicate that the backdoor mapping presented on the poisoned datasets exhibits a more discernible inclination towards lower frequency compared to clean mapping, resulting in the faster convergence of backdoor mapping. To alleviate this dilemma, we propose Multi-Scale Low-Rank Adaptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. Through downscaling in the frequency space, MuScleLoRA encourages the model to prioritize the learning of relatively high-frequency clean mapping, consequently mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA outperforms baselines significantly. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, and Llama2.
@article{wu2024acquiring, title={Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space}, author={Wu, Zongru and Zhang, Zhuosheng and Cheng, Pengzhou and Liu, Gongshen}, journal={arXiv preprint arXiv:2402.12026}, year={2024} }
Text watermarking technology aims to tag and identify content produced by large language models (LLMs) to prevent misuse. In this study, we introduce the concept of "cross-lingual consistency" in text watermarking, which assesses the ability of text watermarks to maintain their effectiveness after being translated into other languages. Preliminary empirical results from two LLMs and three watermarking methods reveal that current text watermarking technologies lack consistency when texts are translated into various languages. Based on this observation, we propose a Cross-lingual Watermark Removal Attack (CWRA) to bypass watermarking by first obtaining a response from an LLM in a pivot language, which is then translated into the target language. CWRA can effectively remove watermarks by reducing the Area Under the Curve (AUC) from 0.95 to 0.67 without performance loss. Furthermore, we analyze two key factors that contribute to the cross-lingual consistency in text watermarking and propose a defense method that increases the AUC from 0.67 to 0.88 under CWRA.
@article{he2024can, title={Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models}, author={He, Zhiwei and Zhou, Binglin and Hao, Hongkun and Liu, Aiwei and Wang, Xing and Tu, Zhaopeng and Zhang, Zhuosheng and Wang, Rui}, journal={arXiv preprint arXiv:2402.14007}, year={2024} }
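The CWRA pipeline reduces to two translation hops around the watermarked model, as in this hedged sketch; llm and translate are assumed callables standing in for the watermarked model and any MT system.

    def cwra(query, llm, translate, pivot="zh", target="en"):
        # 1) Query the watermarked LLM in a pivot language.
        pivot_query = translate(query, src=target, tgt=pivot)
        pivot_response = llm(pivot_query)  # watermark is embedded here
        # 2) Translate the response back into the target language; the
        # watermark does not survive the cross-lingual mapping.
        return translate(pivot_response, src=pivot, tgt=target)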
Autonomous user interface (UI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-UI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-UI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-UI.
@article{zhang2023autoui, title={You Only Look at Screens: Multimodal Chain-of-Action Agents}, author={Zhang, Zhuosheng and Zhang, Aston}, journal={arXiv preprint arXiv:2309.11436}, year={2023} }
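The chain-of-action conditioning can be pictured as prompt assembly, as in the illustrative sketch below; note that Auto-UI itself fuses screen features directly rather than through text, and all field names here are hypothetical.

    def chain_of_action_prompt(goal, screen, previous_actions, future_plan):
        # Condition the next-action prediction on both the executed action
        # history and the current plan of future actions.
        history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(previous_actions))
        return (f"Goal: {goal}\n"
                f"Previous actions:\n{history}\n"
                f"Planned future actions: {future_plan}\n"
                f"Screen: {screen}\n"
                "Predict the next action (type, touch point, typed text):")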
Bargaining is an important and unique part of negotiation between humans. As LLM-driven agents learn to negotiate and act like real humans, how to evaluate agents' bargaining abilities remains an open problem. For the first time, we formally describe the bargaining task as an asymmetric incomplete-information game, defining the gains of the Buyer and Seller across multiple bargaining processes. This allows us to quantitatively assess an agent's performance in the bargaining task. We collected a real product price dataset, AmazonHistoryPrice, and evaluated the bargaining abilities of various LLM agents. We find that playing the Buyer is much harder than playing the Seller, and that increasing model size cannot effectively improve the Buyer's performance. To address this challenge, we propose a novel approach called OG-Narrator that integrates a deterministic Offer Generator to control the price range of the Buyer's offers and an LLM Narrator to create natural-language utterances for the generated offers. Experimental results show that OG-Narrator improves the Buyer's deal rate from 26.67% to 88.88% and yields a tenfold increase in profits across all baselines, even for a model that has not been aligned.
@article{xia2024measuring, title={Measuring Bargaining Abilities of LLMs: A Benchmark and A Buyer-Enhancement Method}, author={Xia, Tian and He, Zhiwei and Ren, Tong and Miao, Yibo and Zhang, Zhuosheng and Yang, Yang and Wang, Rui}, journal={arXiv preprint arXiv:2402.15813}, year={2024} }
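A compact sketch of the OG-Narrator split is below; the concession step (0.3) and the prompt are illustrative assumptions, not the paper's exact offer rule.

    def og_narrator_turn(state, llm):
        # Deterministic Offer Generator: a rule, not the LLM, fixes the price.
        offer = state["buyer_last"] + 0.3 * (state["seller_price"] - state["buyer_last"])
        # LLM Narrator: wrap the numeric offer in natural bargaining language.
        utterance = llm("You are the buyer. Politely offer exactly "
                        f"${offer:.2f}, given this dialogue: {state['dialogue']}")
        return offer, utterance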
@article{ma2024comprehensive, title={Comprehensive Cognitive LLM Agent for Smartphone GUI Automation}, author={Ma, Xinbei and Zhang, Zhuosheng and Zhao, Hai}, journal={arXiv preprint arXiv:2402.11941}, year={2024} }
Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages role-playing LLM-based agents participating in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free and interpretable framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarizing these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work focuses on the zero-shot scenario; our results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that the proposed MC framework excels at mining and harnessing the medical expertise in LLMs, as well as extending their reasoning abilities. Based on these outcomes, we further conduct a human evaluation to pinpoint and categorize common errors within our method, as well as ablation studies aimed at understanding the impact of various factors on overall performance.
@article{tang2023medagents, title={MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning}, author={Tang, Xiangru and Zou, Anni and Zhang, Zhuosheng and Zhao, Yilun and Zhang, Xingyao and Cohan, Arman and Gerstein, Mark}, journal={arXiv preprint arXiv:2311.10537}, year={2023} }
This paper studies the problem of solving complex chemistry problems with large language models (LLMs). Despite the extensive general knowledge in LLMs (such as GPT-4), they struggle with chemistry reasoning, which requires faithful grounded reasoning with diverse chemical knowledge and an integrative understanding of chemical interactions. We propose InstructChem, a new structured reasoning approach that substantially boosts LLMs' chemical reasoning capabilities. InstructChem explicitly decomposes the reasoning into three critical phases: chemical formulae generation by LLMs, which offers the basis for subsequent grounded reasoning; step-by-step reasoning, which makes multi-step derivations with the identified formulae for a preliminary answer; and iterative review-and-refinement, which steers LLMs to progressively revise the previous phases for increased confidence, leading to the final high-confidence answer. We conduct extensive experiments on four different chemistry challenges, including quantum chemistry, quantum mechanics, physical chemistry, and chemical kinetics. Our approach significantly enhances GPT-4 on chemistry reasoning, yielding an 8% average absolute improvement and a 30% peak improvement. We further use the reasoning generated by GPT-4 to fine-tune smaller LMs (e.g., Vicuna) and observe strong improvements in the smaller LMs. This validates our approach and enables LLMs to generate high-quality reasoning.
@article{ouyang2023structured, title={Structured Chemistry Reasoning with Large Language Models}, author={Ouyang, Siru and Zhang, Zhuosheng and Yan, Bing and Liu, Xuan and Han, Jiawei and Qin, Lianhui}, journal={arXiv preprint arXiv:2311.09656}, year={2023} }
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing specific background documents. This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models. While recent Large Language Models (LLMs) like GPT-3 have demonstrated their effectiveness in zero-shot ODQA using direct prompting methods, these methods still fall short of fully harnessing the potential of LLMs when implicitly invoked. In this paper, we propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of LLMs and their strong instruction understanding abilities. Concretely, we prompt LLMs step by step to generate multiple pseudo QA pairs with background passages and explanations entirely from scratch. These generated elements are then utilized for in-context learning. Experimental results show that our method significantly surpasses previous state-of-the-art zero-shot methods on three widely-used ODQA datasets and even achieves comparable performance with various customized fine-tuned models on full training data. Our code is available at https://github.com/lockon-n/self-prompting.
@article{li2022self, title={Self-Prompting Large Language Models for Open-Domain QA}, author={Li, Junlong and Zhang, Zhuosheng and Zhao, Hai}, journal={arXiv preprint arXiv:2212.08635}, year={2022} }
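The self-prompting loop is easy to sketch; llm is an assumed completion function and the prompts are illustrative, not the paper's templates.

    def self_prompting_odqa(question, llm, n_demos=4):
        demos = []
        for _ in range(n_demos):
            # Generate pseudo (passage, QA, explanation) demonstrations from
            # scratch, purely from the LLM's parametric knowledge.
            passage = llm("Write a short factual background passage on any topic.")
            qa = llm(f"Passage: {passage}\nWrite one question answerable from the "
                     "passage, its short answer, and a one-line explanation.")
            demos.append(f"Passage: {passage}\n{qa}")
        # Use the self-generated demonstrations for in-context learning.
        prompt = "\n\n".join(demos) + f"\n\nQuestion: {question}\nAnswer:"
        return llm(prompt)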
Insufficient modeling of human preferences within the reward model is a major obstacle for leveraging human feedback to improve translation quality. Fortunately, quality estimation (QE), which predicts the quality of a given translation without reference, has achieved impressive alignment with human evaluations in the last two years. In this work, we investigate the potential of employing the QE model as the reward model (the QE-based reward model) to predict human preferences for feedback training. We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines. We examine the problem and argue that the vulnerability of the QE model might lead to high rewards for incorrect translations, resulting in overoptimization and error propagation. To address the problem, we adopt a simple yet effective method that uses heuristic rules to detect the incorrect translations and assigns a penalty term to the QE-based rewards for the detected incorrect translations. Experimental results show that the proposed QE-based feedback training achieves consistent and significant improvements across various settings, further verified through human preference studies. Our subsequent analysis demonstrates the high data efficiency of the proposed QE-based feedback training: the proposed approach using a small amount of monolingual data can outperform systems using larger parallel corpora.
@article{he2024improving, title={Improving machine translation with human feedback: An exploration of quality estimation as a reward model}, author={He, Zhiwei and Wang, Xing and Jiao, Wenxiang and Zhang, Zhuosheng and Wang, Rui and Shi, Shuming and Tu, Zhaopeng}, journal={arXiv preprint arXiv:2401.12873}, year={2024} }
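The penalty idea amounts to one conditional on top of the QE score, as in this sketch; qe_score, looks_incorrect, and the penalty weight are assumed stand-ins for the QE model and the paper's heuristic rules.

    def penalized_qe_reward(source, translation, qe_score, looks_incorrect,
                            penalty=1.0):
        # qe_score(source, translation) -> reference-free quality estimate.
        reward = qe_score(source, translation)
        # Heuristic rules (e.g., empty or truncated output, copied source,
        # wrong language) flag incorrect translations that QE may overrate.
        if looks_incorrect(source, translation):
            reward -= penalty  # damp over-optimization on degenerate outputs
        return reward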
Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions through step-by-step reasoning chains. Despite its success, the efficacy of such reasoning is inherently contingent upon the quality of CoT. However, flawless CoT reasoning cannot be guaranteed due to the presence of indecomposable questions and the potential for erroneous reasoning chains, particularly in the case of small-scale language models. To tackle this challenge, we propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain. Then, we proceed with CoT reasoning when the reasoning chain demonstrates confidence; otherwise, we opt to predict the answer directly. SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks.
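The selective-filtering decision is a single gate, sketched below; generate_chain, entail_score, answer, and the 0.5 threshold are assumptions standing in for the fine-tuned components.

    def self_reasoner(question, generate_chain, entail_score, answer, tau=0.5):
        chain = generate_chain(question)
        # Keep the chain only if the entailment model judges it consistent
        # with the question; otherwise answer directly without CoT.
        if entail_score(question, chain) >= tau:
            return answer(f"{question}\nReasoning: {chain}\nTherefore:")
        return answer(question)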
Large language models (LLMs) empowered by chain-of-thought (CoT) prompting have yielded remarkable prowess in reasoning tasks. Nevertheless, current methods predominantly lean on handcrafted or task-specific demonstrations and lack a reliable knowledge basis, and thus struggle to produce trustworthy responses in an automated fashion. While recent works endeavor to improve one particular aspect, they ignore the importance and necessity of establishing an integrated and interpretable reasoning system. To address these drawbacks and provide a universal solution, we propose AuRoRA: a one-for-all platform for augmented reasoning and refining based on CoT prompting that excels in adaptability, reliability, integrity, and interpretability. The system exhibits superior performance across six reasoning tasks and offers real-time visual analysis, which has pivotal academic and application value in the era of LLMs.
Dialogue-related machine reading comprehension requires language models to effectively decouple and model multi-turn dialogue passages. As a dialogue develops following the intentions of its participants, its topic may not remain constant throughout the passage. Hence, it is non-trivial to detect and leverage topic shifts in dialogue modeling. Topic modeling, although widely studied for plain text, deserves far more attention in dialogue reading comprehension. This paper proposes to model multi-turn dialogues from a topic-aware perspective. We start with a dialogue segmentation algorithm that splits a dialogue passage into topic-concentrated fragments in an unsupervised way. We then use these fragments as topic-aware language processing units in further dialogue comprehension. On one hand, the split segments indicate specific topics rather than mixed intentions, making them convenient for in-domain topic detection and location. For this task, we design a clustering system with a self-training auto-encoder and construct two datasets for evaluation. On the other hand, the split segments are appropriate elements for multi-turn dialogue response selection. For this purpose, we further present a novel model, the Topic-Aware Dual-Attention Matching (TADAM) network, which takes topic segments as processing elements and matches response candidates with dual cross-attention. Empirical studies on three public benchmarks show great improvements over baselines. Our work continues previous studies on document topics and brings dialogue modeling to a novel topic-aware perspective with exhaustive experiments and analyses.
@article{ma2023multi, title={Multi-turn Dialogue Comprehension from a Topic-aware Perspective}, author={Ma, Xinbei and Xu, Yi and Zhao, Hai and Zhang, Zhuosheng}, journal={arXiv preprint arXiv:2309.09666}, year={2023} }
Recent years have witnessed an increasing interest in training machines with reasoning ability, which deeply relies on accurately and clearly presented clue forms. The clues are usually modeled as entity-aware knowledge in existing studies. However, those entity-aware clues are primarily focused on commonsense, making them insufficient for tasks that require knowledge of temporary facts or events, particularly in logical reasoning for reading comprehension. To address this challenge, we are motivated to cover both commonsense and temporary knowledge clues hierarchically. Specifically, we propose a general formalism of knowledge units by extracting backbone constituents of the sentence, such as the subject-verb-object formed "facts". We then construct a supergraph on top of the fact units, allowing for the benefit of sentence-level (relations among fact groups) and entity-level interactions (concepts or actions inside a fact). Experimental results on logical reasoning benchmarks and dialogue modeling datasets show that our approach improves the baselines substantially, and it is general across backbone models.
@article{ouyang2024fact, title={Fact-driven Logical Reasoning for Machine Reading Comprehension}, author={Ouyang, Siru and Zhang, Zhuosheng and Zhao, Hai}, journal={The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)}, year={2024} }
Large language models (LLMs) have demonstrated impressive capabilities in general scenarios, exhibiting a level of aptitude that approaches, and in some aspects even surpasses, human-level intelligence. Among their numerous skills, the translation abilities of LLMs have received considerable attention. In contrast to traditional machine translation that focuses solely on source-target mapping, LLM-based translation can potentially mimic the human translation process, which takes many preparatory steps to ensure high-quality translation. This work aims to explore this possibility by proposing the MAPS framework, which stands for Multi-Aspect Prompting and Selection. Specifically, we enable LLMs to first analyze the given source text and extract three aspects of translation-related knowledge: keywords, topics, and relevant demonstrations to guide the translation process. To filter out noisy and unhelpful knowledge, we employ a selection mechanism based on quality estimation. Experiments suggest that MAPS brings significant and consistent improvements over text-davinci-003 and Alpaca on eight translation directions from the latest WMT22 test sets. Our further analysis shows that the extracted knowledge is critical in resolving up to 59% of hallucination mistakes in translation. Code is available at https://github.com/zwhe99/MAPS-mt.
@article{he2023exploring, title={Exploring Human-Like Translation Strategy with Large Language Models}, author={He, Zhiwei and Liang, Tian and Jiao, Wenxiang and Zhang, Zhuosheng and Yang, Yujiu and Wang, Rui and Tu, Zhaopeng and Shi, Shuming and Wang, Xing}, journal={arXiv preprint arXiv:2305.04118}, year={2023} }
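MAPS is essentially knowledge elicitation, candidate generation, and QE-based selection; the sketch below assumes llm and qe_score callables and illustrative prompts.

    def maps_translate(src, llm, qe_score, direction="English to German"):
        # Step 1: elicit three kinds of translation knowledge from the source.
        keywords = llm(f"Extract keyword pairs for translating ({direction}): {src}")
        topic = llm(f"State the topic of this text in a few words: {src}")
        demo = llm(f"Write one short related example translation ({direction}): {src}")
        # Step 2: translate with each knowledge type (and with none).
        hints = ["", f"Keywords: {keywords}", f"Topic: {topic}", f"Demonstration: {demo}"]
        candidates = [llm(f"{h}\nTranslate ({direction}): {src}") for h in hints]
        # Step 3: filter noisy knowledge by keeping the QE-preferred candidate.
        return max(candidates, key=lambda t: qe_score(src, t))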
Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot -- i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community because it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning), while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.
@article{qin2023chatgpt, title={Is ChatGPT a General-Purpose Natural Language Processing Task Solver?}, author={Qin, Chengwei and Zhang, Aston and Zhang, Zhuosheng and Chen, Jiaao and Yasunaga, Michihiro and Yang, Diyi}, journal={arXiv preprint arXiv:2302.06476}, year={2023} }
Automatic summarization generates concise summaries that contain the key ideas of source documents. As the most mainstream datasets for the news sub-domain, CNN/DailyMail and BBC XSum have been widely used for performance benchmarking. However, the reference summaries of those datasets turn out to be noisy, mainly in terms of factual hallucination and information redundancy. To address this challenge, we first annotate new expert-written Element-aware test sets following the "Lasswell Communication Model" proposed by Lasswell (1948), allowing reference summaries to focus objectively and comprehensively on more fine-grained news elements. Utilizing the new test sets, we observe the surprising zero-shot summarization ability of LLMs, which resolves the inconsistency between human preference and automatic evaluation metrics of LLMs' zero-shot summaries reported in prior work. Further, we propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step, which helps them integrate more fine-grained details of source documents into the final summaries, in line with the human writing mindset. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two datasets, respectively. Dataset and code are publicly available at https://github.com/Alsace08/SumCoT.
@inproceedings{wang2023element, title={Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method}, author={Wang, Yiming and Zhang, Zhuosheng and Wang, Rui}, booktitle={The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, year={2023} }
Masked Language Modeling (MLM) has been widely used as the denoising objective in pre-training language models (PrLMs). Existing PrLMs commonly adopt a random-token masking strategy where a fixed masking ratio is applied and different contents are masked with equal probability throughout training. However, the model may be affected in complicated ways by the pre-training status, which changes as training proceeds. In this paper, we show that such time-invariant MLM settings for masking ratio and masked content are unlikely to deliver an optimal outcome, which motivates us to explore the influence of time-variant MLM settings. We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages, improving the pre-training efficiency and effectiveness as verified on downstream tasks. Our work is a pioneering study of time-variant masking strategies for ratio and content, and it provides a better understanding of how masking ratio and masked content influence MLM pre-training.
@inproceedings{yang2023learning, title={Learning Better Masking for Better Language Model Pre-training}, author={Yang, Dongjie and Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, year={2023} }
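A time-variant masking ratio can be as simple as a schedule evaluated at each step; the linear decay and endpoints below are illustrative, not the paper's tuned settings.

    def masking_ratio(step, total_steps, start=0.30, end=0.15):
        # Decay the masking ratio over pre-training instead of holding it at
        # a fixed value (e.g., BERT's constant 15%).
        t = step / max(1, total_steps)
        return start + t * (end - start)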
Commonsense fact verification, as a challenging branch of commonsense question answering (QA), aims to verify through facts whether a given commonsense claim is correct. Answering commonsense questions necessitates a combination of knowledge from various levels. However, existing studies primarily rest on grasping either unstructured evidence or potential reasoning paths from structured knowledge bases, failing to exploit the benefits of heterogeneous knowledge simultaneously. In light of this, we propose Decker, a commonsense fact verification model capable of bridging heterogeneous knowledge by uncovering latent relationships between structured and unstructured knowledge. Experimental results on two commonsense fact verification benchmark datasets, CSQA2.0 and CREAK, demonstrate the effectiveness of our Decker, and further analysis verifies its capability to capture more valuable information through reasoning.
@inproceedings{zou2023decker, title={Decker: Double Check with Heterogeneous Knowledge for Commonsense Fact Verification}, author={Zou, Anni and Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)}, year={2023} }
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the "Let's think step by step" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot.
@inproceedings{zhang2023automatic, title={Automatic Chain of Thought Prompting in Large Language Models}, author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Smola, Alex}, booktitle={The Eleventh International Conference on Learning Representations (ICLR 2023)}, year={2023} }
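The Auto-CoT recipe (cluster for diversity, then zero-shot-generate one chain per cluster) fits in a short sketch; embed and llm are assumed callables, and picking the first member of each cluster simplifies the paper's distance-and-heuristics selection.

    import numpy as np
    from sklearn.cluster import KMeans

    def auto_cot_demos(questions, embed, llm, k=8):
        # Step 1: partition questions into k clusters to encourage diversity.
        X = np.stack([embed(q) for q in questions])
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        demos = []
        for c in range(k):
            rep = questions[[i for i, l in enumerate(labels) if l == c][0]]
            # Step 2: generate the chain zero-shot with the magic prompt.
            chain = llm(f"Q: {rep}\nA: Let's think step by step.")
            demos.append(f"Q: {rep}\nA: Let's think step by step. {chain}")
        return demos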
Multi-party multi-turn dialogue comprehension brings unprecedented challenges in handling complicated scenarios, as the co-occurrence of multiple speakers causes complexity and inconsistency. As a result of the multiple participants, the shift of speaker roles and the crisscrossed discourse relations among utterances hinder reading comprehension. Motivated by this, we integrate enhancements of speaker-related features for dialogue comprehension. This work proposes a novel model with enhancement from both sides: speaker roles and speaker-aware relations. At the token level, we apply a speaker mask for attention, while at the discourse level, we utilize heterogeneous graph networks for comprehensive speaker-aware discourse clues. Experimental results show that our Enhanced Speaker-Aware (ESA) method helps achieve state-of-the-art performance on the Molweni dataset, as well as significant improvements on the FriendsQA dataset. We find that our method makes steady improvements on stronger backbones. Analysis shows that our model enhances the connections between utterances and their own speakers and captures the speaker-aware discourse relations. Discussions on data features and error cases are presented, along with a visualized case. The findings reveal the importance of speaker-aware signals in dialogue comprehension.
@article{10147329, author={Ma, Xinbei and Zhang, Zhuosheng and Zhao, Hai}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, title={Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension}, year={2023}, volume={}, number={}, pages={1-16}, doi={10.1109/TASLP.2023.3284516} }
Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as assistant signals for general NLP tasks. For each sentence, we first retrieve a flexible number of images either from a light topic-image lookup table extracted over the existing sentence-image pairs or from a shared cross-modal embedding space that is pre-trained on off-the-shelf text-image pairs. Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively. The two sequences of representations are further fused by an attention layer for the interaction of the two modalities. In this study, the retrieval process is controllable and flexible. The universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs. Our method can be easily applied to text-only tasks without manually annotated multimodal parallel corpora. We apply the proposed method to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that our method is generally effective for different tasks and languages. Analysis indicates that the visual signals enrich textual representations of content words, provide fine-grained grounding information about the relationship between concepts and events, and potentially contribute to disambiguation.
@article{zhang2023universal, author={Zhang, Zhuosheng and Chen, Kehai and Wang, Rui and Utiyama, Masao and Sumita, Eiichiro and Li, Zuchao and Zhao, Hai}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, title={Universal Multimodal Representation for Language Understanding}, year={2023}, volume={}, number={}, pages={1-18}, doi={10.1109/TPAMI.2023.3234170}}
Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones. Taking the former text as positive and the latter as negative samples, the PLM can be trained effectively for contextualized representation. However, the training of such PLMs highly relies on the quality of the automatically constructed samples. Existing PLMs simply treat all corrupted texts as equally negative without any examination, which inevitably lets the resulting model suffer from the false-negative issue, where training is carried out on pseudo-negative data, leading to less efficiency and less robustness in the resulting PLMs. In this work, on the basis of defining the long-ignored false-negative issue in discriminative PLMs, we design enhanced pre-training methods to counteract false-negative predictions and encourage pre-training language models on true negatives by correcting the harmful gradient updates subject to false-negative predictions. Experimental results on GLUE and SQuAD benchmarks show that our counter-false-negative pre-training methods indeed bring about better performance together with stronger robustness.
@inproceedings{zhang2023TrueNeg, title={Language Model Pre-training on True Negatives}, author={Zhang, Zhuosheng and Zhao, Hai and Utiyama, Masao and Sumita, Eiichiro}, booktitle={The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023)}, year={2023} }
Understanding human language is one of the key themes of artificial intelligence. For language representation, the capacity to effectively model the linguistic knowledge in detail-riddled and lengthy texts while discarding the noise is essential for performance. Traditional attentive models attend to all words without explicit constraint, which results in inaccurate concentration on some dispensable words. In this work, we propose using syntax to guide text modeling by incorporating explicit syntactic constraints into attention mechanisms for better linguistically motivated word representations. In detail, for a Transformer-based encoder built on the self-attention network (SAN), we introduce a syntactic dependency of interest (SDOI) design into the SAN to form an SDOI-SAN with syntax-guided self-attention. The syntax-guided network (SG-Net) is then composed of this extra SDOI-SAN and the SAN from the original Transformer encoder through a dual contextual architecture for better linguistically inspired representation. The proposed SG-Net is applied to typical Transformer encoders. Extensive experiments on popular benchmark tasks, including machine reading comprehension, natural language inference, and neural machine translation, show the effectiveness of the proposed SG-Net design.
@article{zhang2022sg, title={SG-Net: Syntax Guided Transformer for Language Representation}, author={Zhang, Zhuosheng and Wu, Yuwei and Zhou, Junru and Duan, Sufeng and Zhao, Hai and Wang, Rui}, journal={IEEE Transactions on Pattern Analysis \& Machine Intelligence}, volume={44}, number={06}, pages={3285--3299}, year={2022}, publisher={IEEE Computer Society} }
Text encoding is one of the most important steps in Natural Language Processing (NLP). It has been done well by the self-attention mechanism in the current state-of-the-art Transformer encoder, which has brought about significant improvements in the performance of many NLP tasks. Though the Transformer encoder may effectively capture general information in its resulting representations, the backbone information, meaning the gist of the input text, is not specifically focused on. In this paper, we propose explicit and implicit text compression approaches to enhance the Transformer encoding and evaluate models using this approach on several typical downstream tasks that rely on the encoding heavily. Our explicit text compression approaches use dedicated models to compress text, while our implicit text compression approach simply adds an additional module to the main model to handle text compression. We propose three ways of integration, namely backbone source-side fusion, target-side fusion, and both-side fusion, to integrate the backbone information into Transformer-based models for various downstream tasks. Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines. We therefore conclude, when comparing the encodings to the baseline models, text compression helps the encoders to learn better language representations.
@article{li2022text, title={Text Compression-Aided Transformer Encoding}, author={Li, Zuchao and Zhang, Zhuosheng and Zhao, Hai and Wang, Rui and Chen, Kehai and Utiyama, Masao and Sumita, Eiichiro}, journal={IEEE Transactions on Pattern Analysis \& Machine Intelligence}, volume={44}, number={07}, pages={3840--3857}, year={2022}, publisher={IEEE Computer Society} }
Training machines to understand natural language and interact with humans is one of the major goals of artificial intelligence. Recent years have witnessed an evolution from matching networks to pre-trained language models (PrLMs). In contrast to the plain-text modeling that is the focus of PrLMs, dialogue texts involve multiple speakers and reflect special characteristics such as topic transitions and structure dependencies between distant utterances. However, the related PrLM models commonly represent dialogues sequentially by processing the pairwise dialogue history as a whole. Thus, the hierarchical information on either utterance interrelation or speaker roles coupled in such representations is not well addressed. In this work, we propose compositional learning for holistic interaction across the utterances beyond the sequential contextualization from PrLMs, in order to capture the utterance-aware and speaker-aware representations entailed in a dialogue history. We decouple the contextualized word representations by masking mechanisms in the Transformer-based PrLM, making each word focus only on the words in the current utterance, other utterances, and the two speaker roles (i.e., utterances of the sender and utterances of the receiver), respectively. In addition, we employ domain-adaptive training strategies to help the model adapt to the dialogue domains. Experimental results show that our method substantially boosts strong PrLM baselines on four public benchmark datasets, achieving new state-of-the-art performance over previous methods. |
@article{zhang2022cdn, author={Zhang, Zhuosheng and Zhao, Hai and Liu, Longxiang}, journal={IEEE Transactions on Neural Networks and Learning Systems}, title={Channel-Aware Decoupling Network for Multiturn Dialog Comprehension}, year={2022}, volume={}, number={}, pages={1-12}, doi={10.1109/TNNLS.2022.3220047} }
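A rough sketch of the masking idea (the four channel definitions below are one plausible reading built from per-token utterance and speaker ids; the paper's exact masks may differ):

```python
import torch

def decoupling_masks(utt_ids, spk_ids, sender_id):
    """utt_ids: (L,) utterance index per token; spk_ids: (L,) speaker id per token.
    Returns four (L, L) boolean masks, one attention channel each."""
    same_utt = utt_ids.unsqueeze(0) == utt_ids.unsqueeze(1)
    cur_mask = same_utt                                  # words in the same utterance
    oth_mask = ~same_utt                                 # words in other utterances
    snd_mask = (spk_ids == sender_id).unsqueeze(0).expand_as(same_utt)  # sender's words
    rcv_mask = (spk_ids != sender_id).unsqueeze(0).expand_as(same_utt)  # receiver's words
    return cur_mask, oth_mask, snd_mask, rcv_mask

# Each mask gates one attention channel: scores.masked_fill(~mask, float("-inf"))
```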
Discriminative pre-trained language models (PrLMs) can be generalized as denoising auto-encoders that work with two procedures, ennoising and denoising. First, an ennoising process corrupts texts with arbitrary noising functions to construct training instances. Then, a denoising language model is trained to restore the corrupted tokens. Existing studies have made progress by optimizing independent strategies of either ennoising or denoising. They treat training instances equally throughout the training process, paying little attention to the individual contribution of each instance. To model explicit signals of instance contribution, this work proposes to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training. The estimations involve the corruption degree in the ennoising data construction process and the prediction confidence in the denoising counterpart. Experimental results on natural language understanding and reading comprehension benchmarks show that our approach improves pre-training efficiency, effectiveness, and robustness. |
@article{zhang2022instance, title={Instance Regularization for Discriminative Language Model Pre-training}, author={Zhang, Zhuosheng and Zhao, Hai and Zhou, Ming}, journal={arXiv preprint arXiv:2210.05471}, year={2022} }
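A minimal sketch of such instance-level weighting, assuming the simplest estimator where heavily corrupted instances restored with low confidence count as harder (alpha and beta are illustrative hyperparameters; the paper's estimator may differ):

```python
import torch

def instance_weights(corruption_degree, restore_confidence, alpha=1.0, beta=1.0):
    """corruption_degree: (B,) fraction of corrupted tokens per instance;
    restore_confidence: (B,) mean model confidence when restoring them."""
    complexity = alpha * corruption_degree + beta * (1.0 - restore_confidence)
    return complexity / complexity.mean().clamp(min=1e-8)   # mean weight ~ 1

# Usage: loss = (per_instance_loss * instance_weights(deg, conf)).mean()
```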
Leveraging task-aware annotated data as supervised signals to assist with self-supervised learning on large-scale unlabeled data has become a new trend in pre-training language models. Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks. To tackle this challenge, we propose a task-prefix-guided multi-task pre-training framework to explore the relationships among tasks. We conduct extensive experiments on 40 datasets, which show that our model can not only serve as a strong foundation backbone for a wide range of tasks but is also feasible as a probing tool for analyzing task relationships. The task relationships reflected by the prefixes align with transfer learning performance between tasks. They also suggest directions for data augmentation with complementary tasks, which help our model achieve human-parity results on commonsense reasoning leaderboards. |
@article{zhang2022task, title={Task Compass: Scaling Multi-task Pre-training with Task Prefix}, author={Zhang, Zhuosheng and Wang, Shuohang and Xu, Yichong and Fang, Yuwei and Yu, Wenhao and Liu, Yang and Zhao, Hai and Zhu, Chenguang and Zeng, Michael}, journal={arXiv preprint arXiv:2210.06277}, year={2022} }
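A toy illustration of the prefix mechanism (the bracketed format and the task name are invented for illustration; the paper's actual prefixes may differ):

```python
def add_task_prefix(task_name: str, text: str) -> str:
    """Prepend a task prefix so one multi-task model can condition on task identity."""
    return f"[{task_name.upper()}] {text}"

# add_task_prefix("nli", "A man plays guitar. Someone makes music.")
# -> "[NLI] A man plays guitar. Someone makes music."
```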
Multi-turn dialogue modeling, as a challenging branch of natural language understanding (NLU), aims to build representations for machines to understand human dialogues, providing a solid foundation for multiple downstream tasks. Recent studies of dialogue modeling commonly employ pre-trained language models (PrLMs) to encode the dialogue history as successive tokens, which is insufficient for capturing the temporal characteristics of dialogues. Therefore, we propose the Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder, which explicitly incorporates both the past and future contexts and can be generalized to a wide range of dialogue-related tasks. Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN. |
@article{li2022back, title={Back to the Future: Bidirectional Information Decoupling Network for Multi-turn Dialogue Modeling}, author={Li, Yiyang and Zhao, Hai and Zhang, Zhuosheng}, journal={arXiv preprint arXiv:2204.08152}, year={2022} }
Machine reading comprehension (MRC) poses new challenges to logical reasoning, which aims to understand the implicit logical relations entailed in the given contexts and perform inference over them. Due to the complexity of logic, logical connections exist at different granularity levels. However, most existing methods of logical reasoning individually focus on either entity-aware or discourse-based information but ignore the hierarchical relations that may even have mutual effects. This paper proposes a holistic graph network (HGN) that deals with context at both discourse-level and word-level as the basis for logical reasoning to provide a more fine-grained relation extraction. Specifically, node-level and type-level relations, which can be interpreted as bridges in the reasoning process, are modeled by a hierarchical interaction mechanism to improve the interpretation of MRC systems. Experimental results on logical reasoning QA datasets (ReClor and LogiQA) and natural language inference datasets (SNLI and ANLI) show the effectiveness and generalization of our method, and in-depth analysis verifies its capability to understand complex logical relations. |
@inproceedings{chen2022hgm, title={Modeling Hierarchical Reasoning Chains by Linking Discourse Units and Key Phrases for Reading Comprehension}, author={Chen, Jialin and Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 29th International Conference on Computational Linguistics (COLING 2022)}, year={2022} }
Machine reading comprehension is a heavily studied research and test field for evaluating new pre-trained models and fine-tuning strategies, and recent studies have enriched pre-trained models with syntactic, semantic, and other linguistic information to improve model performance. In this paper, we imitate the human reading process of connecting anaphoric expressions and explicitly leverage coreference information to enhance the word embeddings from the pre-trained model, in order to highlight the coreference mentions that must be identified for coreference-intensive question answering in QUOREF, a relatively new dataset specifically designed to evaluate the coreference-related performance of a model. We use an additional BERT layer to focus on the coreference mentions and a Relational Graph Convolutional Network to model the coreference relations. We demonstrate that explicitly incorporating coreference information in the fine-tuning stage performs better than incorporating it when training a pre-trained language model. |
@inproceedings{huang2021tracing, title={Tracing Origins: Coref-aware Machine Reading Comprehension}, author={Huang, Baorong and Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, year={2022} }
Tangled multi-party dialogue contexts lead to challenges for dialogue reading comprehension, where multiple dialogue threads flow simultaneously within the same dialogue history, increasing the difficulty of understanding a dialogue history for both humans and machines. Dialogue disentanglement aims to clarify conversation threads in a multi-party dialogue history, thus reducing the difficulty of comprehending the long, disordered dialogue passage. Existing studies commonly focus on utterance encoding with carefully designed feature-engineering-based methods but pay inadequate attention to dialogue structure. This work designs a novel model to disentangle a multi-party history into threads by taking dialogue structure features into account. Specifically, based on the fact that dialogues are constructed through successive participation of speakers and interactions between users of interest, we extract clues of speaker properties and references to users to model the structure of a long dialogue record. The novel method is evaluated on the Ubuntu IRC dataset and achieves state-of-the-art experimental results in dialogue disentanglement. |
@inproceedings{ma2022structural, title={Structural Modeling for Dialogue Disentanglement}, author={Ma, Xinbei and Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, year={2022} }
Training dense passage representations via contrastive learning has been shown effective for Open-Domain Passage Retrieval (ODPR). Existing studies focus on further optimization by improving the negative sampling strategy or adding extra pre-training. However, these studies overlook passages with internal representation conflicts that arise from improper modeling granularity. This work thus presents a refined model built on a smaller granularity, contextual sentences, to alleviate the concerned conflicts. In detail, we introduce an in-passage negative sampling strategy to encourage diverse sentence representations within the same passage. Experiments on three benchmark datasets verify the efficacy of our method, especially on datasets where conflicts are severe. Extensive experiments further show good transferability of our method across datasets. |
@inproceedings{wu2022sentence, title={Sentence-aware Contrastive Learning for Open-Domain Passage Retrieval}, author={Wu, Bohong and Zhang, Zhuosheng and Wang, Jinyuan and Zhao, Hai}, booktitle={The 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)}, year={2022} }
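A minimal sketch of in-passage negative sampling in an InfoNCE-style objective (the full method also uses standard cross-passage negatives; the shapes and temperature here are illustrative):

```python
import torch
import torch.nn.functional as F

def in_passage_contrastive(query, pos_sent, in_passage_negs, temp=0.05):
    """query, pos_sent: (d,) encodings; in_passage_negs: (k, d) encodings of
    other sentences from the same passage, pushed apart to stay diverse."""
    cands = torch.cat([pos_sent.unsqueeze(0), in_passage_negs], dim=0)  # (1+k, d)
    sims = F.cosine_similarity(query.unsqueeze(0), cands, dim=-1) / temp
    target = torch.zeros(1, dtype=torch.long)            # index 0 = the positive
    return F.cross_entropy(sims.unsqueeze(0), target)
```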
Recently, the robustness of pre-trained language models (PrLMs) has received increasing research interest. The latest studies on adversarial attacks achieve high attack success rates against PrLMs, claiming that PrLMs are not robust. However, we find that the adversarial samples on which PrLMs fail are mostly non-natural and do not appear in reality. We question the validity of current evaluations of PrLM robustness based on these non-natural adversarial samples and propose an anomaly detector to evaluate the robustness of PrLMs with more natural adversarial samples. We also investigate two applications of the anomaly detector: (1) in data augmentation, we employ the anomaly detector to force the generation of augmented data that are distinguished as non-natural, which brings larger gains in the accuracy of PrLMs; (2) we apply the anomaly detector to a defense framework to enhance the robustness of PrLMs. It can be used to defend against all types of attacks and achieves higher accuracy on both adversarial samples and compliant samples than other defense frameworks. |
@inproceedings{wang2022distinguishing, title={Distinguishing Non-natural from Natural Adversarial Samples for More Robust Pre-trained Language Model}, author={Wang, Jiayi and Bao, Rongzhou and Zhang, Zhuosheng and Zhao, Hai}, booktitle={Findings of the Association for Computational Linguistics: ACL 2022}, year={2022} }
In this paper, we report our discovery of named entity distribution in a general word embedding space, which supports an open definition of multilingual named entities rather than the previous closed and constrained definition given by a named entity dictionary, which is usually derived from human labor and relies on scheduled updates. Our initial visualization of monolingual word embeddings indicates that named entities tend to gather together regardless of entity types and language differences, which enables us to model all named entities using a specific geometric structure inside the embedding space, namely, the named entity hypersphere. For the monolingual case, the proposed named entity model gives an open description of diverse named entity types and different languages. For the cross-lingual case, mapping the proposed named entity model provides a novel way to build named entity datasets for resource-poor languages. Finally, the proposed named entity model may serve as a very useful clue to significantly enhance state-of-the-art named entity recognition systems in general. |
@article{luo2022open, author={Luo, Ying and Zhao, Hai and Zhang, Zhuosheng and Tang, Bingjie}, journal={IEEE Transactions on Knowledge and Data Engineering}, title={Open Named Entity Modeling From Embedding Distribution}, year={2022}, volume={34}, number={11}, pages={5472-5483}, doi={10.1109/TKDE.2021.3049654} }
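A minimal sketch of the hypersphere model, assuming the simplest fitting rule of a centroid plus a quantile radius (the paper's fitting procedure may differ):

```python
import numpy as np

def fit_hypersphere(entity_vecs, coverage=0.95):
    """entity_vecs: (N, d) embeddings of known named entities.
    Returns (center, radius) covering `coverage` of the entities."""
    center = entity_vecs.mean(axis=0)
    dists = np.linalg.norm(entity_vecs - center, axis=1)
    return center, np.quantile(dists, coverage)

def looks_like_entity(vec, center, radius):
    return np.linalg.norm(vec - center) <= radius
```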
Recent pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level context for modeling. Although the PrLMs generally provide more effective contextualized word representations than non-contextualized models, they are still subject to a sequence of text contexts without diverse hints from multimodality. This paper thus proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance. In detail, we build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images. Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach. Analysis shows that our method with visual guidance pays more attention to content words, improves the representation diversity, and is potentially beneficial for enhancing the accuracy of disambiguation. |
@article{zhang2022apple, title={Which Apple Keeps Which Doctor Away? Colorful Word Representations With Visual Oracles}, author={Zhang, Zhuosheng and Yu, Haojie and Zhao, Hai and Utiyama, Masao}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume={30}, pages={49--59}, year={2022}, publisher={IEEE} }
This paper presents a novel method to generate answers for non-extractive machine reading comprehension (MRC) tasks whose answers cannot be simply extracted as one span from the given passages. Using a pointer-network-style extractive decoder for this type of MRC may result in unsatisfactory performance when the ground-truth answers are given by human annotators or are highly re-paraphrased from parts of the passages. On the other hand, using a generative decoder cannot guarantee that the resulting answers have well-formed syntax and semantics when encountering long sentences. Therefore, to alleviate the obvious drawbacks of both sides, we propose a method that composes answers from extracted multi-spans learned by our model as highly confident n-gram candidates in the given passage. That is, the returned answers are composed of discontinuous multi-spans rather than just one consecutive span in the given passages. The proposed method is simple but effective: empirical experiments on MS MARCO show that the proposed method performs better at accurately generating long answers and substantially outperforms two typical competitive one-span and Seq2Seq baseline decoders. |
@article{zhang2022syntax, title={Syntax-Aware Multi-Spans Generation for Reading Comprehension}, author={Zhang, Zhuosheng and Zhang, Yiqing and Zhao, Hai}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume={30}, pages={260--268}, year={2022}, publisher={IEEE} }
Multi-choice Machine Reading Comprehension (MRC) requires models to decide the correct answer from a set of answer options when given a passage and a question. Thus, in addition to a powerful Pre-trained Language Model (PrLM) as an encoder, multi-choice MRC especially relies on a matching network design that is supposed to effectively capture the relationships among the triplet of passage, question, and answers. While newer and more powerful PrLMs have shown their strength even without the support of a matching network, we propose a new DUal Multi-head Co-Attention (DUMA) model. It is inspired by the human transposition thinking process for solving the multi-choice MRC problem: considering each other's focus from the standpoints of the passage and the question. The proposed DUMA has been shown to be effective and is capable of generally promoting PrLMs. Our proposed method is evaluated on two benchmark multi-choice MRC tasks, DREAM and RACE. The results show that, on top of powerful PrLMs, DUMA can further boost the models to obtain higher performance. |
@article{9664302, author={Zhu, Pengfei and Zhang, Zhuosheng and Zhao, Hai and Li, Xiaoguang}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, title={DUMA: Reading Comprehension With Transposition Thinking}, year={2022}, volume={30}, pages={269--279}, doi={10.1109/TASLP.2021.3138683} }
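A minimal sketch of dual co-attention between the passage and the question-answer sequence (mean pooling and fusion by concatenation are simple placeholder choices; the paper's fusion may differ):

```python
import torch
import torch.nn as nn

class DualCoAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.p2q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q2p = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, passage, qa):
        # passage: (B, Lp, d), qa: (B, Lq, d) -- encoder outputs.
        p_attn, _ = self.p2q(passage, qa, qa)        # passage attends to QA
        q_attn, _ = self.q2p(qa, passage, passage)   # QA attends to passage
        fused = torch.cat([p_attn.mean(dim=1), q_attn.mean(dim=1)], dim=-1)
        return fused   # (B, 2*d); feed to a classifier over answer options
```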
Multi-choice Machine Reading Comprehension (MRC) is a challenging task that requires a model to select the most appropriate answer from a set of candidates given a passage and a question. Most existing research focuses on modeling the task datasets without explicitly referring to external fine-grained knowledge sources, which could greatly compensate for the deficiencies of the given passage. Thus, we propose a novel reference-based knowledge enhancement model called Reference Knowledgeable Network (RekNet), which refines critical information from the passage and quotes explicit knowledge when necessary. In detail, RekNet refines fine-grained critical information, defined as the Reference Span, and then quotes explicit knowledge quadruples using the co-occurrence information of the Reference Span and the candidates. The proposed RekNet is evaluated on three multi-choice MRC benchmarks, RACE, DREAM, and Cosmos QA, and shows consistent and remarkable performance improvement with statistical significance over strong baselines. |
@article{9748021, author={Zhao, Yilin and Zhang, Zhuosheng and Zhao, Hai}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, title={Reference Knowledgeable Network for Machine Reading Comprehension}, year={2022}, volume={30}, pages={1461--1473}, doi={10.1109/TASLP.2022.3164219} }
Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into generating incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint that enables current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drastically decreases, which reveals that the robustness of PrLMs is not as fragile as claimed. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowing the specific attack. Empirical results show that our universal defense framework achieves comparable or even higher after-attack accuracy than other, attack-specific defenses, while preserving higher original accuracy at the same time. Our work discloses the essence of textual adversarial attacks and indicates that (i) further work on adversarial attacks should focus more on how to overcome detection and resist randomization, otherwise the adversarial examples would be easily detected and invalidated; and (ii) compared with unnatural and perceptible adversarial examples, it is the undetectable adversarial examples that pose real risks to PrLMs and require more attention in future robustness-enhancing strategies. |
@article{9833338, author={Wang, Jiayi and Bao, Rongzhou and Zhang, Zhuosheng and Zhao, Hai}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, title={Rethinking Textual Adversarial Defense for Pre-Trained Language Models}, year={2022}, pages={1--15}, doi={10.1109/TASLP.2022.3192097} }
Conversational machine reading (CMR) requires machines to communicate with humans through multi-turn interactions between two salient dialogue states: decision making and question generation. In the open CMR setting, the more realistic scenario, the retrieved background knowledge can be noisy, which poses severe challenges for information transmission. Existing studies commonly train independent or pipeline systems for the two subtasks. However, those methods rely on hard-label decisions to activate question generation, which eventually hinders model performance. In this work, we propose an effective gating strategy that smooths the two dialogue states in a single decoder and bridges decision making and question generation to provide a richer dialogue state reference. Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results. |
@inproceedings{zhang2021oscar, title={Smoothing Dialogue States for Open Conversational Machine Reading}, author={Zhang, Zhuosheng and Ouyang, Siru and Zhao, Hai and Utiyama, Masao and Sumita, Eiichiro}, booktitle={The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)}, year={2021} }
Pre-trained language models (PrLMs) have to carefully manage input units when training on very large texts with vocabularies of millions of words. Previous works have shown that incorporating span-level information over consecutive words in pre-training can further improve the performance of PrLMs. However, given that span-level clues are introduced and fixed in pre-training, such methods are time-consuming and lack flexibility. To alleviate this inconvenience, this paper presents a novel span fine-tuning method for PrLMs, in which the span setting is adaptively determined by the specific downstream task during the fine-tuning phase. In detail, any sentence processed by the PrLM is segmented into multiple spans according to a pre-sampled dictionary. The segmentation information is then sent through a hierarchical CNN module together with the representation outputs of the PrLM to ultimately generate a span-enhanced representation. Experiments on the GLUE benchmark show that the proposed span fine-tuning method significantly enhances the PrLM and, at the same time, offers more flexibility in an efficient way. |
@inproceedings{bao2021spanft, title={Span Fine-tuning for Pre-trained Language Models}, author={Bao, Rongzhou and Zhang, Zhuosheng and Zhao, Hai}, booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021}, year={2021} }
Multi-party dialogue machine reading comprehension (MRC) raises an even more challenging understanding goal on dialogues with more than two involved speakers, compared with traditional plain-passage-style MRC. To accurately perform the question-answering (QA) task on such multi-party dialogues, models have to handle fundamentally different discourse relationships from common non-dialogue plain text, where discourse relations are supposed to connect two far-apart utterances in a linguistically motivated way. To further explore the role of such unusual discourse structure in the correlated QA task in terms of MRC, we propose the first multi-task model for jointly performing QA and discourse parsing (DP) on the multi-party dialogue MRC task. Our proposed model is evaluated on the latest benchmark, Molweni, and the results indicate that training with complementary tasks indeed benefits not only the QA task but also the DP task itself. We further find that the joint model is distinctly stronger when handling longer dialogues, which again verifies the necessity of DP in the related MRC. |
@inproceedings{he2021mtldlg, title={Multi-tasking Dialogue Comprehension with Discourse Parsing}, author={He, Yuchen and Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 35th Pacific Asia Conference on Language, Information and Computation (PACLIC 35)}, year={2021} }
Multi-turn dialogue reading comprehension aims to teach machines to read dialogue contexts and solve tasks such as response selection and answering questions. The major challenges involve noisy history contexts and the special prerequisite of commonsense knowledge that is unseen in the given material. Existing works mainly focus on context and response matching approaches. This work thus makes the first attempt to tackle the above two challenges by extracting substantially important turns as pivot utterances and utilizing external knowledge to enhance the representation of context. We propose a pivot-oriented deep selection model (PoDS) on top of Transformer-based language models for dialogue comprehension. In detail, our model first picks out the pivot utterances from the conversation history according to their semantic matching with the candidate response or question, if any. Besides, knowledge items related to the dialogue context are extracted from a knowledge graph as external knowledge. Then, the pivot utterances and the external knowledge are combined with a well-designed mechanism for refining predictions. Experimental results on four dialogue comprehension benchmark tasks show that our proposed model achieves substantial improvements over baselines. A series of empirical comparisons is conducted to show how our selection strategies and the extra knowledge injection influence the results. |
@article{zhang2021multi, title={Multi-Turn Dialogue Reading Comprehension With Pivot Turns and Knowledge}, author={Zhang, Zhuosheng and Li, Junlong and Zhao, Hai}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume={29}, pages={1161--1173}, year={2021}, publisher={IEEE} }
Pre-trained language models (PrLMs) have demonstrated superior performance due to their strong ability to learn universal language representations from self-supervised pre-training. However, even with the help of the powerful PrLMs, it is still challenging to effectively capture task-related knowledge from dialogue texts which are enriched by correlations among speaker-aware utterances. In this work, we present SPIDER, Structural Pre-traIned DialoguE Reader, to capture dialogue exclusive features. To simulate the dialogue-like features, we propose two training objectives in addition to the original LM objectives: 1) utterance order restoration, which predicts the order of the permuted utterances in dialogue context; 2) sentence backbone regularization, which regularizes the model to improve the factual correctness of summarized subject-verb-object triplets. Experimental results on widely used dialogue benchmarks verify the effectiveness of the newly introduced self-supervised tasks. |
@inproceedings{zhang2021structural, title={Structural Pre-training for Dialogue Comprehension}, author={Zhang, Zhuosheng and Zhao, Hai}, booktitle={The 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021)}, year={2021} }
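A small sketch of constructing a training instance for the utterance order restoration objective (the sentence backbone regularization objective needs an SVO extractor and is omitted here):

```python
import random

def order_restoration_instance(utterances):
    """Shuffle a dialogue's utterances; the labels are each shuffled
    utterance's original position, which the model must predict."""
    order = list(range(len(utterances)))
    random.shuffle(order)
    shuffled = [utterances[i] for i in order]
    return shuffled, order   # order[k] = original index of shuffled[k]
```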
Conversational Machine Reading (CMR) aims at answering questions in complicated interactive scenarios. The machine needs to answer questions through interactions with the user based on a given rule document, the user scenario, and the dialogue history, and even initiatively asks questions for clarification if necessary. Namely, the machine needs to respond with either \textsl{Yes, No, Irrelevant} or a follow-up question for further clarification. To effectively capture the multiple objects in such a challenging task, graph modeling is supposed to be adopted; surprisingly, this had not happened before this work, which proposes a dialogue graph modeling framework that incorporates two complementary graph models, i.e., an explicit discourse graph and an implicit discourse graph, which respectively capture the explicit and implicit interactions hidden in the rule documents. The proposed model is evaluated on the ShARC benchmark and achieves a new state of the art, being the first to exceed the milestone accuracy score of 80\%. |
@inproceedings{ouyang2021dialogue, title={Dialogue Graph Modeling for Conversational Machine Reading}, author={Ouyang, Siru and Zhang, Zhuosheng and Zhao, Hai}, booktitle={Findings of the Association for Computational Linguistics: ACL 2021}, year={2021} }
Machine reading comprehension (MRC) is an AI challenge that requires machines to determine the correct answers to questions based on a given passage. MRC systems must not only answer questions when necessary but also distinguish when no answer is available according to the given passage and then tactfully abstain from answering. When unanswerable questions are involved in the MRC task, an essential verification module called a verifier is especially required in addition to the encoder, though the latest practice in MRC modeling still benefits most from adopting well pre-trained language models as the encoder block while focusing only on the "reading". This paper devotes itself to exploring better verifier design for the MRC task with unanswerable questions. Inspired by how humans solve reading comprehension questions, we propose a retrospective reader (Retro-Reader) that integrates two stages of reading and verification strategies: 1) sketchy reading that briefly investigates the overall interactions of passage and question and yields an initial judgment; 2) intensive reading that verifies the answer and gives the final prediction. The proposed reader is evaluated on two benchmark MRC challenge datasets, SQuAD2.0 and NewsQA, achieving new state-of-the-art results. Significance tests show that our model is significantly better than the strong ELECTRA and ALBERT baselines. A series of analyses is also conducted to interpret the effectiveness of the proposed reader. |
@inproceedings{zhang2021retro, title={Retrospective Reader for Machine Reading Comprehension}, author={Zhang, Zhuosheng and Yang, Junjie and Zhao, Hai}, booktitle={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021)}, year={2021} }
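A schematic of combining the two stages into an answerability decision, assuming a weighted sum of the sketchy reader's external verification score and the intensive reader's null-versus-span score against a threshold (the weights and threshold here are illustrative; the paper tunes its own):

```python
def rear_verification(span_score, null_score, ext_verifier_score,
                      beta1=0.5, beta2=0.5, delta=0.0):
    """span_score / null_score: intensive reader's best-span and no-answer scores;
    ext_verifier_score: sketchy reader's unanswerability score."""
    internal = null_score - span_score           # higher => more likely no answer
    v = beta1 * internal + beta2 * ext_verifier_score
    return "no_answer" if v > delta else "answer"
```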
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles. Thus, utterance- and speaker-aware clues are supposed to be well captured in models. However, in existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely by taking the pairwise dialogue history and candidate response as a whole, so the hierarchical information on either utterance interrelation or speaker roles coupled in such representations is not well addressed. In this work, we propose a novel model to fill this gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history. In detail, we decouple the contextualized word representations by masking mechanisms in the Transformer-based PrLM, making each word focus only on the words in the current utterance, other utterances, and the two speaker roles (i.e., utterances of the sender and utterances of the receiver), respectively. Experimental results show that our method substantially boosts the strong ELECTRA baseline on four public benchmark datasets and achieves new state-of-the-art performance over previous methods. A series of ablation studies is conducted to demonstrate the effectiveness of our method. |
@inproceedings{liu2021filling, title={Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue}, author={Liu, Longxiang and Zhang, Zhuosheng and Zhao, Hai and Zhou, Xi and Zhou, Xiang}, booktitle={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021)}, year={2021} }
In retrieval-based multi-turn dialogue modeling, it remains a challenge to select the most appropriate response by extracting salient features from context utterances. As a conversation goes on, topic shifts at the discourse level naturally happen throughout the continuous multi-turn dialogue context. However, all known retrieval-based systems settle for exploiting local topic words for context utterance representation but fail to capture such essential global topic-aware clues at the discourse level. Instead of taking topic-agnostic n-gram utterances as the processing unit for matching purposes, as in existing systems, this paper presents a novel topic-aware solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way, so that the resulting model is capable of capturing salient topic shifts at the discourse level as needed and thus effectively tracks topic flow during multi-turn conversation. Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and a Topic-Aware Dual-attention Matching (TADAM) Network, which matches each topic segment with the response in a dual cross-attention way. Experimental results on three public datasets show that TADAM outperforms the state-of-the-art method by a large margin, especially by 3.4% on the E-commerce dataset, which has obvious topic shifts. |
@inproceedings{xu2021topic, title={Topic-aware multi-turn dialogue modeling}, author={Xu, Yi and Zhao, Hai and Zhang, Zhuosheng}, booktitle={The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021)}, year={2021} }
Though visual information has been introduced for enhancing neural machine translation (NMT), its effectiveness strongly relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we present a universal visual representation learned over monolingual corpora with image annotations, which overcomes the lack of large-scale bilingual sentence-image pairs, thereby extending image applicability in NMT. In detail, a group of images with topics similar to the source sentence is retrieved from a light topic-image lookup table learned over existing sentence-image pairs and then encoded as image representations by a pre-trained ResNet. An attention layer with gated weighting is used to fuse the visual information and text information as input to the decoder for predicting target translations. In particular, the proposed method enables the visual information to be integrated into large-scale text-only NMT in addition to multimodal NMT. Experiments on four widely used translation datasets, including WMT'16 English-to-Romanian, WMT'14 English-to-German, WMT'14 English-to-French, and Multi30K, show that the proposed approach achieves significant improvements over strong baselines. |
@inproceedings{zhang2020neural, title={Neural Machine Translation with Universal Visual Representation}, author={Zhuosheng Zhang and Kehai Chen and Rui Wang and Masao Utiyama and Eiichiro Sumita and Zuchao Li and Hai Zhao}, booktitle={International Conference on Learning Representations}, year={2020}, url={https://openreview.net/forum?id=Byl8hhNYPS} }
The latest work on language representations carefully integrates contextualized features into language model training, which enables a series of successes, especially in various machine reading comprehension and natural language inference tasks. However, existing language representation models, including ELMo, GPT, and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor with light fine-tuning and without substantial task-specific modifications. Compared with BERT, SemBERT is as simple in concept but more powerful. It obtains new state-of-the-art results or substantially improves on existing results on ten reading comprehension and language inference tasks. |
@inproceedings{zhang2020semantics, title={Semantics-aware BERT for Language Understanding}, author={Zhang, Zhuosheng and Wu, Yuwei and Zhao, Hai and Li, Zuchao and Zhang, Shuailiang and Zhou, Xi and Zhou, Xiang}, booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020)}, volume={34}, number={05}, pages={9628--9635}, year={2020} }
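A minimal sketch of fusing semantic role label embeddings with BERT outputs (concatenation plus a linear projection is one simple fusion; SemBERT's actual aggregation of multiple predicate-wise label sequences is more involved):

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, hidden=768, n_labels=100, label_dim=10):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels, label_dim)
        self.proj = nn.Linear(hidden + label_dim, hidden)

    def forward(self, bert_out, srl_labels):
        # bert_out: (B, L, hidden); srl_labels: (B, L) ids from an SRL tagger.
        fused = torch.cat([bert_out, self.label_emb(srl_labels)], dim=-1)
        return self.proj(fused)   # semantics-enriched token representations
```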
For machine reading comprehension, the capacity to effectively model the linguistic knowledge in detail-riddled and lengthy passages while getting rid of the noise is essential for improving performance. Traditional attentive models attend to all words without explicit constraint, which results in inaccurate concentration on some dispensable words. In this work, we propose using syntax to guide text modeling by incorporating explicit syntactic constraints into the attention mechanism for better linguistically motivated word representations. In detail, for the Transformer-based encoder built on the self-attention network (SAN), we introduce a syntactic dependency of interest (SDOI) design into the SAN to form an SDOI-SAN with syntax-guided self-attention. The syntax-guided network (SG-Net) is then composed of this extra SDOI-SAN and the SAN from the original Transformer encoder through a dual contextual architecture for better linguistically inspired representation. To verify its effectiveness, the proposed SG-Net is applied to the typical pre-trained language model BERT, which is built directly on a Transformer encoder. Extensive experiments on popular benchmarks, including SQuAD 2.0 and RACE, show that the proposed SG-Net design helps achieve substantial performance improvement over strong baselines. |
@inproceedings{zhang2020sg, title={SG-Net: Syntax-Guided Machine Reading Comprehension}, author={Zhang, Zhuosheng and Wu, Yuwei and Zhou, Junru and Duan, Sufeng and Zhao, Hai and Wang, Rui}, booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020)}, pages={9636--9643}, year={2020} }
Multi-choice reading comprehension is a challenging task in which an answer must be selected from a set of candidate options given a passage and a question. Previous approaches usually only calculate the question-aware passage representation and ignore the passage-aware question representation when modeling the relationship between passage and question, which obviously cannot make the best use of the information between passage and question. In this work, we propose the dual co-matching network (DCMN), which models the relationship among passage, question, and answer options bidirectionally. Besides, inspired by how humans solve multi-choice questions, we integrate two reading strategies into our model: (i) passage sentence selection, which finds the most salient supporting sentences to answer the question, and (ii) answer option interaction, which encodes the comparison information between answer options. DCMN integrated with the two strategies (DCMN+) obtains state-of-the-art results on five multi-choice reading comprehension datasets from different domains: RACE, SemEval-2018 Task 11, ROCStories, COIN, and MCTest. |
@inproceedings{zhang2020dcmn+, title={{DCMN+}: Dual co-matching network for multi-choice reading comprehension}, author={Zhang, Shuailiang and Zhao, Hai and Wu, Yuwei and Zhang, Zhuosheng and Zhou, Xi and Zhou, Xiang}, booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020)}, volume={34}, number={05}, pages={9563--9570}, year={2020} }
In this paper, we present Linguistic Informed Multi-Task BERT (LIMIT-BERT), which learns language representations across multiple linguistic tasks by Multi-Task Learning (MTL). LIMIT-BERT includes five key linguistic syntax and semantics tasks: Part-Of-Speech (POS) tagging, constituent and dependency syntactic parsing, and span and dependency semantic role labeling (SRL). Besides, LIMIT-BERT adopts a linguistics mask strategy, Syntactic and Semantic Phrase Masking, which masks all of the tokens corresponding to a syntactic/semantic phrase. Different from recent Multi-Task Deep Neural Networks (MT-DNN) (Liu et al., 2019), LIMIT-BERT is linguistically motivated and learns in a semi-supervised manner, which provides large amounts of linguistic-task data on the same scale as the BERT learning corpus. As a result, LIMIT-BERT not only improves performance on linguistic tasks but also benefits from a regularization effect and linguistic information that leads to more general representations, helping it adapt to new tasks and domains. LIMIT-BERT obtains new state-of-the-art or competitive results on both span and dependency semantic parsing on Propbank benchmarks and on both dependency and constituent syntactic parsing on the Penn Treebank. |
@inproceedings{zhou2020limit, title={LIMIT-BERT: Linguistic informed multi-task bert}, author={Zhou, Junru and Zhang, Zhuosheng and Zhao, Hai and Zhang, Shuailiang}, booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020", year = "2020", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.findings-emnlp.399", doi = "10.18653/v1/2020.findings-emnlp.399", pages = "4450--4461", }
Memory-based learning can be characterized as a lazy learning method in machine learning terminology because it delays the processing of input by storing the input until needed. Linguistic structure parsing, which has been in a performance improvement bottleneck since the latest series of works was presented, determines the syntactic or semantic structure of a sentence. In this article, we construct a memory component and use it to augment a linguistic structure parser which allows the parser to directly extract patterns from the known training treebank to form memory. The experimental results show that existing state-of-the-art parsers reach new heights of performance on the main benchmarks for dependency parsing and semantic role labeling with this memory network. |
@article{li2020memory, title={Memory Network for Linguistic Structure Parsing}, author={Li, Zuchao and Guan, Chaoyu and Zhao, Hai and Wang, Rui and Parnow, Kevin and Zhang, Zhuosheng}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume={28}, pages={2743--2755}, year={2020}, publisher={IEEE} }
Pinyin-to-character (P2C) conversion is the core component of the pinyin-based Chinese input method engine (IME). However, the conversion is seriously compromised by the ambiguity of Chinese characters corresponding to pinyin as well as by predefined fixed vocabularies. To alleviate these inconveniences, we propose a neural P2C conversion model augmented by a large online-updating vocabulary with a target vocabulary sampling mechanism to support open vocabulary learning while the IME is working. Our experiments show that the proposed approach reduces decoding time on CPUs by up to 50$\%$ on P2C tasks with the same or only negligible change in conversion accuracy, and the online-updated vocabulary indeed helps our IME effectively follow user input behavior. |
@inproceedings{zhang2019acl, title = "Open Vocabulary Learning for Neural {Chinese} Pinyin {IME}", author = "Zhang, Zhuosheng and Huang, Yafang and Zhao, Hai", booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)", url = "https://www.aclweb.org/anthology/P19-1154", pages = "1584--1594", year = "2019", }
Representation learning is the foundation of machine reading comprehension and inference. In state-of-the-art models, character-level representations have been broadly adopted to alleviate the problem of effectively representing rare or complex words. However, the character itself is not a natural minimal linguistic unit for representation or word embedding composition, as this ignores the linguistic coherence of consecutive characters inside a word. This paper presents a general subword-augmented embedding framework for learning and composing computationally derived subword-level representations. We survey a series of unsupervised segmentation methods for subword acquisition and different subword-augmented strategies for text understanding, showing that subword-augmented embedding significantly improves our baselines in various text understanding tasks on both English and Chinese benchmarks. |
@article{Zhang2019subword, title={Effective Subword Segmentation for Text Comprehension}, author={Zhang, Zhuosheng and Zhao, Hai and Ling, Kangwei and Li, Jiangtong and He, Shexia and Fu, Guohong}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)}, year={2019}, volume={27}, number={11}, pages={1664-1674}, doi={10.1109/TASLP.2019.2922537} }
Who did what to whom is a major focus in natural language understanding, which is precisely the aim of the semantic role labeling (SRL) task. Despite sharing many processing characteristics and even task purpose, it is surprising that jointly considering these two related tasks has never been formally reported in previous work. Thus, this paper makes the first attempt to let SRL enhance text comprehension and inference by specifying verbal predicates and their corresponding semantic roles. In terms of deep learning models, our embeddings are enhanced by explicit contextual semantic role labels for more fine-grained semantics. We show that the salient labels can be conveniently added to existing models and significantly improve deep learning models on challenging text comprehension tasks. Extensive experiments on benchmark machine reading comprehension and inference datasets verify that the proposed semantic learning helps our system reach a new state of the art over strong baselines that have been enhanced by well pre-trained language models from the latest progress. |
@inproceedings{zhang2019explicit, title = "Explicit Contextual Semantics for Text Comprehension", author = "Zhang, Zhuosheng and Wu, Yuwei and Li, Zuchao and Zhao, Hai", booktitle = "Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33)", year = "2019", }
Semantic role labeling (SRL) aims to discover the predicate-argument structure of a sentence. End-to-end SRL without syntactic input has received great attention. However, most of them focus on either span-based or dependency-based semantic representation form and only show specific model optimization respectively. Meanwhile, handling these two SRL tasks uniformly was less successful. This paper presents an end-to-end model for both dependency and span SRL with a unified argument representation to deal with two different types of argument annotations in a uniform fashion. Furthermore, we jointly predict all predicates and arguments, especially including the long-term ignored predicate identification subtask. Our single model achieves new state-of-the-art results on both span (CoNLL 2005, 2012) and dependency (CoNLL 2008, 2009) SRL benchmarks. |
Multi-turn conversation understanding is a major challenge for building intelligent dialogue systems. This work focuses on retrieval-based response matching for multi-turn conversation, where related work simply concatenates the conversation utterances, ignoring the interactions among previous utterances for context modeling. In this paper, we formulate previous utterances into context using a proposed deep utterance aggregation model to form a fine-grained context representation. In detail, a self-matching attention is first introduced to route the vital information in each utterance. Then the model matches a response with each refined utterance, and the final matching score is obtained after attentive turn aggregation. Experimental results show that our model outperforms the state-of-the-art methods on three multi-turn conversation benchmarks, including a newly introduced e-commerce dialogue corpus. |
@inproceedings{zhang2018dua, title = {Modeling Multi-turn Conversation with Deep Utterance Aggregation}, author = {Zhang, Zhuosheng and Li, Jiangtong and Zhu, Pengfei and Zhao, Hai}, booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)}, pages = {3740--3752}, year = {2018}, }
Representation learning is the foundation of machine reading comprehension. In state-of-the-art models, deep learning methods broadly use word- and character-level representations. However, the character is not naturally the minimal linguistic unit. In addition, with a simple concatenation of character and word embeddings, previous models actually give a suboptimal solution. In this paper, we propose to use subwords rather than characters for word embedding enhancement. We also empirically explore different augmentation strategies on subword-augmented embedding to enhance the cloze-style reading comprehension model reader. In detail, we present a reader that uses subword-level representations to augment word embeddings with a short list to handle rare words effectively. A thorough examination is conducted to evaluate the comprehensive performance and generalization ability of the proposed reader. Experimental results show that the proposed approach helps the reader significantly outperform the state-of-the-art baselines on various public datasets. |
@inproceedings{zhang2018mrc, title = {Subword-augmented Embedding for Cloze Reading Comprehension}, author = {Zhang, Zhuosheng and Huang, Yafang and Zhao, Hai}, booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)}, pages = {1802--1814}, year = {2018}, }
Answering questions from university admission exams (Gaokao in Chinese) is a challenging AI task, since it requires effective representations to capture the complicated semantic relations between questions and answers. In this work, we propose a hybrid neural model for the deep question-answering task on history examinations. Our model employs a cooperative gated neural network to retrieve answers with the assistance of extra labels given by a neural Turing machine labeler. An empirical study shows that the labeler works well with only a small training dataset, and the gated mechanism is good at fetching the semantic representation of lengthy answers. Experiments on question answering demonstrate that the proposed model obtains substantial performance gains over various neural baselines in terms of multiple evaluation metrics. |
@inproceedings{zhang2018gaokao, title = {One-shot Learning for Question-Answering in Gaokao History Challenge}, author = {Zhang, Zhuosheng and Zhao, Hai}, booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)}, pages = {449--461}, year = {2018}, }
Traditional chatbots usually need a mass of human dialogue data, especially when using supervised machine learning methods. Though they can easily deal with single-turn question answering, their performance on multi-turn conversations is usually unsatisfactory. In this paper, we present Lingke, an information-retrieval-augmented chatbot that is able to answer questions based on a given product introduction document and deal with multi-turn conversations. We introduce a fine-grained pipeline processing to distill responses from unstructured documents, and attentive sequential context-response matching for multi-turn conversations. |
@inproceedings{zhu2018lingke, title = {Lingke: A Fine-grained Multi-turn Chatbot for Customer Service}, author = {Zhu, Pengfei and Zhang, Zhuosheng and Li, Jiangtong and Huang, Yafang and Zhao, Hai}, booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), System Demonstrations}, pages = {108--112}, year = {2018}, }
Machine reading comprehension is a task of modeling the relationship between passage and query. In terms of deep learning frameworks, most state-of-the-art models simply concatenate word- and character-level representations, which has been shown to be suboptimal for the concerned task. In this paper, we empirically explore different integration strategies for word and character embeddings and propose a character-augmented reader that attends to character-level representations to augment word embeddings with a short list to improve word representations, especially for rare words. Experimental results show that the proposed approach helps the baseline model significantly outperform state-of-the-art baselines on various public benchmarks. |
@inproceedings{zhang2018char, title = {Effective Character-augmented Word Embedding for Machine Reading Comprehension}, author = {Zhang, Zhuosheng and Huang, Yafang and Zhu, Pengfei and Zhao, Hai}, booktitle = {Proceedings of the Seventh CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2018)}, pages = {27-39}, year = {2018}, }
The Chinese pinyin input method engine (IME) lets users conveniently input Chinese into a computer by typing pinyin through a common keyboard. In addition to offering high conversion quality, a modern pinyin IME is supposed to aid user input with extended association functions. However, existing solutions for such functions are roughly based on oversimplified word-level matching algorithms, whose resulting products provide only limited extensions associated with user inputs. This work presents the Moon IME, a pinyin IME that integrates an attention-based neural machine translation (NMT) model and information retrieval (IR) to offer an amusive and customizable association ability. The released IME is implemented on Windows via the Text Services Framework. |
@inproceedings{Huang2018Moon, title={{Moon IME:} Neural-based Chinese Pinyin Aided Input Method with Customizable Association}, author={Huang, Yafang and Li, Zuchao and Zhang, Zhuosheng and Zhao, Hai}, booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), System Demonstrations}, pages={140--145}, year={2018} }
Semantic role labeling (SRL) aims to recognize the predicate-argument structure of a sentence. Great attention has been paid to the role of syntactic information in enhancing SRL. However, the latest advances show that syntax may not be so important for SRL, given the emerging, much smaller gap between syntax-aware and syntax-agnostic SRL. To comprehensively explore the role of syntax in the SRL task, we extend existing models and propose a unified framework to investigate more effective and more diverse ways of incorporating syntax into sequential neural networks. Exploring the effect of syntactic input quality on SRL performance, we confirm that high-quality syntactic parses can still effectively enhance syntactically driven SRL. Using an empirically optimized integration strategy, we even enlarge the gap between syntax-aware and syntax-agnostic SRL. Our framework achieves state-of-the-art results on the CoNLL-2009 benchmarks for both English and Chinese, substantially outperforming all previous models. |
@inproceedings{li2018unified, title={A unified syntax-aware framework for semantic role labeling}, author={Li, Zuchao and He, Shexia and Cai, Jiaxun and Zhang, Zhuosheng and Zhao, Hai and Liu, Gongshen and Li, Linlin and Si, Luo}, booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)}, pages={2401--2411}, year={2018} }