The remarkable capabilities demonstrated by large language models (LLMs) today are fueled by vast amounts of data that imbue them with extensive human knowledge. If these models are likened to high-speed trains propelled by technological innovation, then data corpora serve as the vital fuel powering them. Enhancing the quality of these corpora is essential for achieving significant improvements in model performance.
However, high-quality corpora are being depleted rapidly, and Chinese LLM companies face a severe shortage.
For example, Gao Wen, an academician at the Chinese Academy of Engineering, highlighted that, in the current global dataset of around 50 billion data points used to train LLMs, Chinese corpora account for only 1.3%. This is significantly lower in both quantity and quality compared to English and other languages. A substantial amount of high-value corpus data remains untapped in reports, papers, newspapers, and other documents, which are difficult to process and extract due to their complex formats and the limitations of current large model training capabilities.
Addressing the issues of insufficient and low-quality Chinese data, as well as managing diverse data types, remains a significant challenge for companies. To assist enterprises in overcoming these limitations, Intsig Information, a Chinese tech company specializing in data, mobile, and artificial intelligence applications, has launched a large model accelerator designed to enhance pre-training, corpus development, practical implementation, and other processes for large model applications.
Named the “TextIn” smart document processing platform, this accelerator can swiftly parse unstructured data from lengthy documents and reconstruct the correct reading order. It was unveiled during the World Artificial Intelligence Conference (WAIC) 2024.
In the initial stages of training, TextIn’s document parsing engine overcomes layout parsing barriers in books, papers, research reports, and other documents, providing outputs optimized for model training and application. Additionally, it includes a text vectorization model to address the hallucination problem often associated with LLMs.
Intsig’s approach begins at the source, utilizing a standardized platform for corpus structuring. This improves pre-training data efficiency and helps large model companies achieve effective performance enhancement and iteration.
Handling complex corpora
Intsig’s TextIn platform is designed with three main features: document parsing, embedding, and a tool called OpenKIE. Currently, processing complex elements like borderless tables, cross-page tables, and formulas remains a major challenge for large model corpora.
For instance, consider the fund statements from bank custodians. These statements come in various styles, and their complex table formats make data extraction and organization labor-intensive and time-consuming, even though this step is crucial for model training. Even a minor error in interpreting a single cell can lead to significant inaccuracies in the overall results, so the accuracy of table restoration directly affects the effectiveness of the model.
TextIn’s document parsing feature can process a hundred-page document in as little as 1.5 seconds while also restoring the document’s reading order. To handle various data types, Intsig has emphasized algorithm design in this feature: it can restore more than ten common chart types, such as bar charts, line charts, pie charts, and radar charts, into JSON or Markdown formats.
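To make the idea of restoring a chart into JSON or Markdown concrete, here is a minimal Python sketch. The chart schema (`x_label`, `categories`, `series`) and the function name `chart_to_markdown` are assumptions made for illustration; they are not TextIn’s actual output format or API.

```python
import json


def chart_to_markdown(chart: dict) -> str:
    """Render a parsed chart (hypothetical schema) as a Markdown table."""
    categories = chart["categories"]
    series = chart["series"]
    header = [chart.get("x_label", "Category")] + [s["name"] for s in series]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    # One table row per category, one column per data series.
    for i, category in enumerate(categories):
        row = [str(category)] + [str(s["values"][i]) for s in series]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Hypothetical parser output for a two-series bar chart.
chart = {
    "type": "bar",
    "x_label": "Quarter",
    "categories": ["Q1", "Q2"],
    "series": [
        {"name": "Revenue", "values": [120, 135]},
        {"name": "Cost", "values": [80, 90]},
    ],
}

markdown_table = chart_to_markdown(chart)
chart_json = json.dumps(chart, ensure_ascii=False)
print(markdown_table)
```

Either form gives a model the chart’s numbers as text it can read directly, rather than as pixels.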
The parsed data corpora are clear and easy to understand, enabling large models to better comprehend chart data and learn the argumentation logic in professional documents such as business reports and academic papers. Even when charts do not display specific values, the feature can estimate them based on the coordinate axes.
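One plausible way to estimate a value from the coordinate axes is linear interpolation between two labeled tick marks. The sketch below assumes a linear axis and known pixel positions for the ticks; the function name `pixel_to_value` and the sample coordinates are hypothetical, and the article does not describe how TextIn performs this step.

```python
def pixel_to_value(pixel, pixel_a, value_a, pixel_b, value_b):
    """Map a pixel coordinate to a data value on a linear axis.

    (pixel_a, value_a) and (pixel_b, value_b) are two labeled tick marks.
    """
    if pixel_a == pixel_b:
        raise ValueError("tick marks must be at different pixel positions")
    fraction = (pixel - pixel_a) / (pixel_b - pixel_a)
    return value_a + fraction * (value_b - value_a)


# Assumed layout: the tick labeled 0 sits at y=400 and the tick labeled 100
# sits at y=100 (image y coordinates grow downward).
bar_top_y = 175
estimated = pixel_to_value(bar_top_y, 400, 0.0, 100, 100.0)
print(estimated)
```

A logarithmic axis would need the same interpolation applied to the logarithm of the tick values.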
Large models often struggle with professional inquiries, frequently presenting hallucinations that can have serious consequences if not handled carefully. Tests show that using TextIn’s embedding model improves the quality, efficiency, and accuracy of information search and question answering in large models.
TextIn’s embedding model, the ACGE text embedding model, functions like a compass, quickly searching the full text to find information, extracting effective text features, and accurately completing classification and clustering tasks through extensive learning of Chinese corpora.
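The search role of an embedding model can be illustrated with a small cosine-similarity ranking in Python. The three-dimensional vectors below are toy values invented for this sketch; real embeddings would come from a model such as ACGE and have far more dimensions, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (invented values, not real model output).
doc_vecs = {
    "fund_statement": [0.9, 0.1, 0.0],
    "research_report": [0.1, 0.8, 0.3],
    "news_article": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]
best = top_k(query_vec, doc_vecs, k=1)
print(best)
```

Classification and clustering build on the same similarity measure, grouping texts whose vectors point in similar directions.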
Compared to other open-source models, the ACGE model is smaller and uses fewer resources, and its input length of 1,024 tokens meets the needs of most scenarios. Although large models support ever more tokens, they still face issues like catastrophic forgetting. To address this, the ACGE model uses a continual learning training method. It also supports variable output dimensions, allowing enterprises to allocate resources according to specific scenarios, thereby enhancing system performance and user experience.
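One common way an embedding model can offer variable output dimensions is to train it so that a prefix of the full vector is itself a usable embedding, which the caller then truncates and renormalizes. The sketch below shows only that truncation step. Whether ACGE works this way is an assumption here, not something the article states, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        return list(prefix)
    return [x / norm for x in prefix]


# Toy four-dimensional vector (invented values) cut down to two dimensions.
full = [3.0, 4.0, 12.0, 0.0]
small = truncate_embedding(full, 2)
print(small)
```

Shorter vectors reduce storage and speed up similarity search at some cost in accuracy, which is the resource trade-off the paragraph above describes.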
In practice, a company that relies on a distributed system and open-source tools without a vector database may quickly hit a bottleneck as its corpus grows, and traditional single-threaded programs limit how fast it can process billions of data points a day. Introducing the ACGE model significantly improves overall document processing speed and, with sufficient data, can eliminate some hallucinations, improve multi-document element recognition, and resolve layout analysis issues.
OpenKIE is a tool for extracting information from image documents, including field extraction, list extraction, and element extraction modes. Customers only need to create document types, set the fields to be extracted, and upload files; OpenKIE will automatically extract the required information for direct application or import into other systems.
For example, in LLM document processing scenarios, Intsig collaborated with Baichuan Intelligence to address longstanding problems of multi-document element recognition and layout analysis, enhancing the processing speed of hundred-page documents by more than ten times.
Tang Qi, general manager of Intsig’s smart intelligence unit, told 36Kr that the current TextIn platform covers up to 47 scenarios across finance, medicine, and media, and is compatible with more than 3,200 types of documents. It has been used in the pre-training processes of leading large model companies such as Baichuan and has also attracted a small group of developer users.
Generalized engineering capabilities for a broad range of scenarios
Currently, each improvement in large model capabilities depends heavily on several dimensions of the pre-training data, including its quantity, quality, and domain coverage. For data processing, most Chinese companies choose between two main approaches. One is to rely on third-party companies that provide infrastructure services, such as Intsig’s TextIn platform or Amazon’s Textract text extraction service. The other is to combine traditional optical character recognition (OCR) algorithms with internally developed models, which is better suited to vertical use cases at banks and securities firms, among others.
Tang told 36Kr that “companies’ selection criteria for suppliers mainly focus on three dimensions: speed, stability, and accuracy.”
- Speed: The document parsing engine must be fast. According to Intsig, the TextIn platform maintains a parsing time within 1.5 seconds, while some similar tools on the market are 3–5 times slower.
- Stability: The platform must handle a large volume of complex corpora, such as PDF files and forms, with high accuracy.
- Accuracy: The platform should precisely restore document information into tables.
Currently, the shortage of high-quality, curated corpora is a major issue, “especially for Chinese data, which is even scarcer,” Tang said. Both Chinese and international large models are trained mainly on English data drawn from open-source datasets such as Common Crawl, RedPajama, BooksCorpus, The Pile, and ROOTS, among others. Despite the abundance of data, the quality varies. A large amount of high-quality Chinese corpus data remains dormant in reports, papers, newspapers, and other documents, yet to be utilized.
Processing corpus data in the pre-training stage is crucial, as the focus shifts from acquiring massive amounts of data to acquiring high-value data. Providing large model companies and developers with the foundational means of achieving this capability is therefore essential for seed user adoption.
Tang has encountered such a situation. A merchant engaged in secondhand luxury goods trade had accumulated many receipts. To calculate profits, he needed to manually subtract the original price from the selling price and record the final result in the backend. This process involved complex formula calculations, including price differences and inventory issues for various styles, which traditional OCR models could not handle. After the merchant reached out to Tang, the problem was quickly solved by adjusting a few parameters on the accelerator platform.
This is just one small problem in a specific scenario. In the era of large models, platform tools differ fundamentally from the traditional logic of single-layer private deployment. Instead, what matters is generalized engineering capability that can be applied across a wide array of scenarios.
Accordingly, Intsig did several things in the product design stage. First, it anticipated various scenarios by enriching models with substantial high-quality expertise in specific vertical domains during the pre-customization phase. This approach targets common challenges in specific industries like finance, law, and education, offering solutions in product design tailored to user needs. Consequently, this strategy enhances the performance of the large model accelerator in key application scenarios.
Second, it emphasized productization by not only offering APIs for general scenarios but also providing a range of tool-based products. This approach lowers the barriers to adoption and makes the tools usable out of the box. It is particularly beneficial for traditional companies with limited technical resources, small and medium enterprises, and individual developers.
In the current trend of large model transformation, the importance of data-centricity is widely acknowledged by industry professionals involved in large model research and application. However, significant work remains in the upstream stages of large models, particularly in areas like text parsing, logical layout, and document question answering.
Looking ahead, Intsig plans to introduce specialized products for vertical domains such as finance and healthcare. Additionally, it will promote internal testing programs for developers, aiming to attract more users to participate in the co-creation and optimization of these products.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Huang Nan for 36Kr.