A group of Russian researchers has developed a method that significantly reduces the cost of labeling the data needed to train artificial intelligence (AI) systems that use large language models. The approach is based on active learning, in which the model itself selects the examples most useful for improving its accuracy.
One of the main problems in building AI for narrow domains, such as medicine or law, is the need for a large amount of carefully labeled data. Preparing such data requires either the participation of qualified specialists, which is expensive, or significant computing resources when large language models are used.
The new method makes it possible to start training on a limited set of already labeled data, after which the model independently selects which additional examples would help it improve accuracy. This reduces the amount of labeling required by a factor of two to four without a loss in the quality of the result.
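The loop described above can be illustrated with a minimal sketch. The article does not say how the model ranks candidate examples, so the sketch assumes uncertainty sampling, a common active learning criterion, and swaps the language model for a toy nearest-centroid classifier on one-dimensional points. All names here (`train_centroids`, `uncertainty`, `active_learning`, `oracle`) are hypothetical and are not taken from the published tools.

```python
import random


def train_centroids(labeled):
    """Fit a toy nearest-centroid classifier: one mean value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def uncertainty(centroids, x):
    """Score a point: a smaller gap between the two nearest centroids
    means the model is less sure, so the returned score is higher."""
    dists = sorted(abs(x - c) for c in centroids.values())
    return -(dists[1] - dists[0])


def active_learning(seed, pool, oracle, rounds, batch_size):
    """Start from a small labeled seed set, then repeatedly ask the
    oracle (a stand-in for a human annotator) to label only the
    examples the current model is least certain about."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train_centroids(labeled)
        # Most uncertain examples first (assumed selection criterion).
        pool.sort(key=lambda x: uncertainty(model, x), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend((x, oracle(x)) for x in batch)
    return train_centroids(labeled), labeled


def oracle(x):
    """Hypothetical annotator: the true class boundary is at 0.5."""
    return int(x > 0.5)


rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
seed = [(0.1, 0), (0.9, 1)]  # the seed must contain every class

model, labeled = active_learning(seed, pool, oracle, rounds=5, batch_size=4)
print(len(labeled), "labeled examples out of", len(pool) + len(seed))
```

In this toy setup only 20 of the 200 pool points are sent to the annotator, and they cluster near the class boundary, where a label is most informative. The savings reported in the article come from the researchers' own experiments, not from this sketch.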
The researchers tested the technology on four popular tasks: generating answers, solving logical problems, text comprehension, and summarization. The results showed that a model trained with the new approach achieves quality comparable to that of random selection methods while requiring about three times less labeled data.
The tools implementing this method are publicly available. Specialists from T-Technologies, the AIRI Institute, HSE, Innopolis, and Sberbank took part in the development.