Large Language Models (LLMs) have demonstrated impressive performance on tasks such as natural language understanding, text generation, and summarization. However, they still struggle in more complex settings: tasks that require using tools to solve problems, working with structured data, or carrying out multi-step reasoning. For instance, although LLMs are adept at comprehending unstructured text, they have trouble interpreting and operating over structured data such as spreadsheets, tables, and databases. They also frequently underperform on multi-hop question answering (MHQA), which requires combining information from several sources, and on tool-dependent tasks such as answering questions about tables with SQL.
To address these issues, researchers from Meta, Oxford University, and University College London have introduced a new technique called Source2Synth. Its primary benefit is the ability to teach LLMs new skills without expensive and time-consuming human annotation. Conventional approaches to improving LLM performance often require extensive manual annotation, which is costly and hard to scale, particularly for complex tasks. Source2Synth removes this requirement by generating synthetic data that imitates real-world situations and reasoning processes.
Source2Synth starts from a real data source, such as tables from the web or related articles, and uses it to create synthetic examples with intermediate reasoning steps. Because these examples are grounded in actual data, the synthetic dataset stays diverse, realistic, and factually correct. The method's core step is selecting a seed topic, which might be an entity or a factual statement, and expanding it into a complete example consisting of the task instruction, the reasoning steps needed to solve the problem, and the final answer. Through this procedure, Source2Synth generates intricate, realistic data points that mirror how an LLM should handle structured data or carry out multi-step tasks.
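To make this pipeline concrete, here is a minimal Python sketch of the generation step. The `llm` callable, the dataclass, the prompts, and the function names are all illustrative assumptions, not the paper's actual code; in the real method, seed selection and example expansion are performed by prompting an LLM over the source.

```python
# Minimal sketch of a Source2Synth-style generation step, assuming a
# hypothetical `llm` callable that maps a prompt string to generated text.

from dataclasses import dataclass

@dataclass
class SyntheticExample:
    instruction: str            # the task posed to the model
    reasoning_steps: list[str]  # intermediate steps grounded in the source
    answer: str                 # final answer supported by the source

def pick_seed(source: str, llm) -> str:
    """Select a seed topic (an entity or factual statement) from the source."""
    return llm(f"Pick one entity or factual statement from:\n{source}")

def build_example(seed: str, source: str, llm) -> SyntheticExample:
    """Expand the seed into a full example: instruction, reasoning, answer."""
    instruction = llm(f"Write a question about '{seed}' answerable from:\n{source}")
    steps = llm(f"List the reasoning steps to answer: {instruction}").splitlines()
    answer = llm(f"Answer '{instruction}' using only:\n{source}")
    return SyntheticExample(instruction, steps, answer)
```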
A key component of Source2Synth is its dataset curation step. Not all generated data points are equally valuable, and low-quality examples can degrade model performance. To address this, Source2Synth filters examples based on whether they are answerable: if the generated data does not lead to the correct answer within a fixed number of attempts, the example is discarded. This quality-control step ensures that only high-quality examples, those that actually teach the LLM the intended skill, are retained for the final round of fine-tuning.
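The sketch below shows what such an answerability filter could look like. The `generate_answer` callable is a hypothetical stand-in for a model call, and exact-match scoring is an assumption made here for simplicity; the paper's actual filtering criteria may differ.

```python
# Sketch of an answerability filter: an example is kept only if the model
# reproduces its reference answer within k attempts.

def is_answerable(example, generate_answer, k: int = 3) -> bool:
    """Return True if the model matches the reference answer within k tries."""
    for _ in range(k):
        prediction = generate_answer(example.instruction)
        if prediction.strip().lower() == example.answer.strip().lower():
            return True
    return False

def curate(dataset, generate_answer, k: int = 3):
    """Keep only answerable examples for the final fine-tuning round."""
    return [ex for ex in dataset if is_answerable(ex, generate_answer, k)]
```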
The technique has been evaluated in two challenging domains:
Multi-Hop Question Answering (MHQA): Here the LLM must analyze and synthesize information from several sources to answer a single question. Evaluated on HotPotQA, a dataset built for multi-hop reasoning, Source2Synth outperformed baseline models fine-tuned with conventional techniques by 22.57%.
Tabular Question Answering (TQA): Answering questions over structured data, which frequently requires SQL queries to interact with tables. Evaluated on WikiSQL, a dataset focused on answering questions about tables with SQL, Source2Synth achieved a 25.51% improvement over baseline models; a sketch of the SQL-verification idea follows this list.
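To illustrate the TQA setting, here is a hedged sketch of how a synthetic example's SQL could be verified by executing it against the source table, using Python's built-in sqlite3 module. The table, question, and query are invented for illustration and are not drawn from WikiSQL itself.

```python
# Verify a synthetic TQA example by running its SQL on the source table:
# the example is kept only if the query reproduces the stated answer.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Alice", "Red", 31), ("Bob", "Blue", 24), ("Cara", "Red", 17)],
)

synthetic_example = {
    "question": "Which player on the Red team scored the most points?",
    "sql": "SELECT name FROM players WHERE team = 'Red' "
           "ORDER BY points DESC LIMIT 1",
    "answer": "Alice",
}

result = conn.execute(synthetic_example["sql"]).fetchone()[0]
assert result == synthetic_example["answer"]
print("Example verified:", result)
```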
These results demonstrate that Source2Synth can improve LLM performance on challenging tasks without large amounts of human-annotated data. By producing grounded, realistic examples and rigorously filtering the dataset for quality, Source2Synth offers a scalable way to train LLMs in domains that demand sophisticated reasoning and tool use.
In conclusion, Source2Synth is a novel method for teaching LLMs new skills, particularly where human annotation is not feasible. By grounding synthetic data generation in real-world sources and ensuring that only high-quality examples are used for fine-tuning, it addresses current limitations of LLMs on complex tasks such as multi-step reasoning and structured data manipulation.