How to Extract Structure from Paper PDFs with the VIsual LAyout (VILA) Model

1. Introduction

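Papers generally follow a common structure: introduction, related work, method, experiments, and conclusion.
However, recovering this structure automatically from a paper distributed as a PDF is not easy.
Various Document-Image Understanding models have been proposed to address this problem.
Among them, I used the VIsual LAyout (VILA) model to extract the structure of a paper.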

2. VILA: VIsual LAyout

VILA: Improving structured content extraction from scientific PDFs using visual layout groups [1]
Document Layout Analysis, the task of recognizing a document's structure, is usually recast as either token classification (NLP-centric) or object detection (vision-centric).
- VILA takes the token classification approach.
VILA emphasizes the "group uniformity assumption": tokens within a visual group, such as a line or a block, share the same label.
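
To make the assumption concrete, here is a minimal sketch that forces every token in a visual group to take the group's majority label. This is only an illustration of what "uniform within a group" means, not the way VILA itself enforces it; the function name `enforce_group_uniformity` and the toy labels are hypothetical.

```python
from collections import Counter


def enforce_group_uniformity(token_labels, group_ids):
    """Relabel each token with the majority label of its visual group."""
    votes = {}
    for label, gid in zip(token_labels, group_ids):
        votes.setdefault(gid, Counter())[label] += 1
    # Pick the most frequent label per group.
    majority = {gid: counter.most_common(1)[0][0] for gid, counter in votes.items()}
    return [majority[gid] for gid in group_ids]


# Toy input: the second token of group 0 was mislabeled as "section".
labels = ["paragraph", "section", "paragraph", "title", "title"]
groups = [0, 0, 0, 1, 1]
smoothed = enforce_group_uniformity(labels, groups)
print(smoothed)
```

After smoothing, all tokens in group 0 carry the same label, which is exactly the property the assumption describes.
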
To satisfy the group uniformity assumption, the paper proposes two methods.
- I-VILA: inserts a special token [BLK] between groups
- H-VILA: applies self-attention within each group, extracts a group representation, and then applies self-attention across groups
I-VILA gives better accuracy, while H-VILA is more efficient.
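
The I-VILA idea can be sketched in a few lines: walk over the tokens and emit a [BLK] marker whenever the visual group changes. This is a simplified illustration under my own assumptions, not the official preprocessing code; `insert_layout_indicators` is a hypothetical helper, and a real pipeline would also have to register [BLK] with the tokenizer and keep the labels aligned.

```python
BLK = "[BLK]"


def insert_layout_indicators(tokens, group_ids, indicator=BLK):
    """Insert an indicator token at every boundary between visual groups."""
    out = []
    prev_gid = None
    for token, gid in zip(tokens, group_ids):
        if prev_gid is not None and gid != prev_gid:
            out.append(indicator)  # group boundary
        out.append(token)
        prev_gid = gid
    return out


# Toy input: a section header (group 0) followed by a paragraph line (group 1).
tokens = ["1", "Introduction", "Papers", "usually", "follow"]
group_ids = [0, 0, 1, 1, 1]
marked = insert_layout_indicators(tokens, group_ids)
print(marked)
```
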

3. Code

I use a VILA model fine-tuned on the DocBank dataset [2].
The code and model come from the official VILA repository.

3.1. Setup

clone repository

# git clone https://github.com/allenai/vila.git
import sys
sys.path.append('vila/src')

import libraries

import layoutparser as lp
from collections import defaultdict

from utils import download_paper
from vila.pdftools.pdf_extractor import PDFExtractor
from vila.predictors import HierarchicalPDFPredictor, LayoutIndicatorPDFPredictor

3.2. Modules

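predict: reads the PDF and runs token classification
construct_token_groups: groups consecutive tokens predicted as the same class
join_group_text: joins the tokens of a group into a single text
- decides whether to insert a space based on the tokens' bboxes
construct_section_groups: extracts each section (introduction, body, conclusion, etc.) and the paragraphs that belong to it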

def predict(pdf_path, pdf_extractor, vision_model, layout_model):
    page_tokens, page_images = pdf_extractor.load_tokens_and_image(pdf_path)

    pred_tokens = []
    for page_token, page_image in zip(page_tokens, page_images):
        blocks = vision_model.detect(page_image)
        page_token.annotate(blocks=blocks)
        pdf_data = page_token.to_pagedata().to_dict()
        pred_tokens += layout_model.predict(pdf_data, page_token.page_size)

    return pred_tokens
    

def construct_token_groups(pred_tokens):
    groups, group, group_type, prev_bbox = [], [], None, None
    
    for token in pred_tokens:
        if group_type is None:
            is_continued = True
            
        elif token.type == group_type:
            if group_type == 'section':
                is_continued = abs(prev_bbox[3] - token.coordinates[3]) < 1.
            else:
                is_continued = True

        else:
            is_continued = False

        
        # print(token.text, token.type, is_continued)
        group_type = token.type
        prev_bbox = token.coordinates
        if is_continued:
            group.append(token)
        
        else:
            groups.append(group)
            group = [token]
    
    if group:
        groups.append(group)

    return groups

def join_group_text(group):
    text = ''
    prev_bbox = None
    for token in group:
        if not text:
            text += token.text
    
        else:        
            if abs(prev_bbox[2] - token.coordinates[0]) > 2:
                text += ' ' + token.text
    
            else:
                text += token.text
    
        prev_bbox = token.coordinates
    return text

def construct_section_groups(token_groups):
    section_groups = defaultdict(list)

    section = None
    for group in token_groups:
        group_type = group[0].type
        group_text = join_group_text(group)
        
        if group_type == 'section':
            section = group_text
            section_groups[section]  # touch the key so a section without paragraphs still appears
    
        elif group_type == 'paragraph' and section is not None:
            section_groups[section].append(group_text)

    section_groups = {k: ' '.join(v) for k,v in section_groups.items()}
    return section_groups
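
The bbox-based spacing rule in join_group_text is easier to see on toy data. The sketch below re-implements the same gap check in a standalone form, assuming coordinates are ordered (x1, y1, x2, y2); `ToyToken`, `join_by_gap`, and the `max_gap` parameter are hypothetical names for illustration, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyToken:
    text: str
    coordinates: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def join_by_gap(tokens: List[ToyToken], max_gap: float = 2.0) -> str:
    """Join token texts, adding a space when the horizontal gap exceeds max_gap."""
    parts = []
    prev = None
    for token in tokens:
        if prev is not None:
            gap = abs(prev.coordinates[2] - token.coordinates[0])
            if gap > max_gap:
                parts.append(" ")
        parts.append(token.text)
        prev = token
    return "".join(parts)


toy = [
    ToyToken("Intro", (10, 100, 40, 112)),
    ToyToken("duction", (40.5, 100, 80, 112)),  # gap 0.5: same word
    ToyToken("to", (85, 100, 95, 112)),         # gap 5.0: new word
]
joined = join_by_gap(toy)
print(joined)
```

Because the gap is taken as an absolute value, a token that wraps to the start of the next line also ends up separated by a space.
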

3.3. Run

prepare models

pdf_extractor = PDFExtractor("pdfplumber")
vision_model = lp.EfficientDetLayoutModel("lp://PubLayNet") 
layout_model = HierarchicalPDFPredictor.from_pretrained("allenai/hvila-row-layoutlm-finetuned-docbank")

inference

pdf_path = '2307.03170v1.pdf'
pred_tokens = predict(pdf_path, pdf_extractor, vision_model, layout_model)
token_groups = construct_token_groups(pred_tokens)
section_groups = construct_section_groups(token_groups)
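
Since inference takes a while, it may be worth saving the result to disk. The following is a small sketch using only the standard library; `save_sections` and `load_sections` are hypothetical helpers, and the example dictionary is placeholder data standing in for the real section_groups.

```python
import json
import tempfile
from pathlib import Path


def save_sections(section_groups, out_path):
    """Write the {section title: text} mapping to a UTF-8 JSON file."""
    out_path = Path(out_path)
    out_path.write_text(
        json.dumps(section_groups, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out_path


def load_sections(path):
    """Read the mapping back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder data in the same shape as section_groups.
example = {"1 Introduction": "Placeholder paragraph text."}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_sections(example, Path(tmp) / "sections.json")
    restored = load_sections(saved)
print(restored)
```
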

3.4. Results

section list

sections = list(section_groups.keys())
print(sections)

section text

print(section_groups['6 Limitations and future work'])
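
Looking up a section by its exact title is brittle, because the key depends on how the header was tokenized and joined. As a rough alternative, a case-insensitive keyword search could look like the sketch below; `find_sections` is a hypothetical helper and the example dictionary holds placeholder text.

```python
def find_sections(section_groups, keyword):
    """Return the sections whose title contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return {
        title: text
        for title, text in section_groups.items()
        if needle in title.casefold()
    }


# Placeholder data in the same shape as section_groups.
example = {
    "1 Introduction": "Placeholder introduction text.",
    "6 Limitations and future work": "Placeholder limitations text.",
}
matches = find_sections(example, "limitations")
print(list(matches))
```
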

Reference

[1] Shen, Z., Lo, K., Wang, L. L., Kuehl, B., Weld, D. S., & Downey, D. (2022). VILA: Improving structured content extraction from scientific PDFs using visual layout groups. Transactions of the Association for Computational Linguistics, 10, 376-392.

[2] Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., & Zhou, M. (2020). DocBank: A benchmark dataset for document layout analysis. arXiv preprint arXiv:2006.01038.


yongsun's blog
