
Using XGBoost for NLP Classification Tasks

Task

Semantic Relation Extraction and Classification in Scientific Papers

Subtask: 1 - Relation classification

1.1 Relation classification on clean data

1.2 Relation classification on noisy data

Classes - Semantic Relations

relations = ['usage', 'result', 'model-feature', 'part-whole', 'topic', 'comparison']

Features

Lexical Features

Feature  Remarks                                Value
L1       Distance between the two entities      Int
L2       hasIn (model-feature, part-whole)      Int (0, 1)
L3       hasOf (topic, result)                  Int (0, 1)
L4       hasFor (usage)                         Int (0, 1)
L5       hasWith (comparison)                   Int (0, 1)
L6       hasThan (comparison)                   Int (0, 1)
L7       hasAnd                                 Int (0, 1)
L8       hasFrom                                Int (0, 1)
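
The lexical features in the table can be sketched roughly as follows. This is only an illustration under assumptions the notes do not confirm: the distance is taken to be the number of tokens between the two entities, the cue words are looked for only in that span, and the function name `lexical_features` and its arguments are hypothetical.

```python
# Cue words matching the hasIn ... hasFrom rows of the table, in the same order.
CUE_WORDS = ["in", "of", "for", "with", "than", "and", "from"]


def lexical_features(tokens, e1_end, e2_start):
    """Return [distance, hasIn, hasOf, hasFor, hasWith, hasThan, hasAnd, hasFrom].

    tokens   : list of word tokens for the sentence
    e1_end   : index of the last token of the first entity (assumed input)
    e2_start : index of the first token of the second entity (assumed input)
    """
    # Tokens strictly between the two entities (assumption: only this span matters).
    between = [t.lower() for t in tokens[e1_end + 1:e2_start]]
    # Assumption: distance = number of tokens separating the entities.
    distance = len(between)
    # One 0/1 flag per cue word.
    flags = [int(word in between) for word in CUE_WORDS]
    return [distance] + flags


tokens = "a parser for dependency trees".split()
# Entity 1 is "parser" (index 1); entity 2 is "dependency trees" (starts at index 3).
features = lexical_features(tokens, e1_end=1, e2_start=3)
print(features)  # [1, 0, 0, 1, 0, 0, 0, 0]
```

Here only hasFor fires, which is the cue the table associates with the usage relation.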

Entity Features

Feature  Remarks                                             Value
L1       Similarity (sim200), needed to detect comparison    Float
L2       Similarity bucket                                   Int (0, 1, 2, 3, 4)
L3       Position of entity (text)                           LabelEncoder (text index)
L4       Start entity index
L5       End entity index
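
As a rough illustration of the similarity bucket feature, a float similarity score can be mapped to one of five bucket ids. The cut points below are hypothetical, since the notes do not state the actual boundaries, and `similarity_bucket` is an illustrative name.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries; the actual cut points are not given in the notes.
BUCKET_EDGES = [0.2, 0.4, 0.6, 0.8]


def similarity_bucket(similarity):
    """Map a similarity score to a bucket id in {0, 1, 2, 3, 4}."""
    # bisect_right counts how many edges are <= similarity.
    return bisect_right(BUCKET_EDGES, similarity)


print(similarity_bucket(0.1))   # 0
print(similarity_bucket(0.5))   # 2
print(similarity_bucket(0.95))  # 4
```

Bucketing turns the continuous score into a small categorical feature, which may make split points easier for a tree model to find.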

Data Preprocessing

  • input format

    import numpy as np
    # one row per sample: the feature values followed by the label
    np.array([[...feature_values..., label], ...])
  • output format: .csv
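
A minimal sketch of this layout, assuming the label is stored as an integer index into the relations list and that the .csv file holds the same rows. The sample rows and the file name `features.csv` are made up for illustration.

```python
import numpy as np

# Hypothetical rows: eight lexical feature values followed by an integer label.
rows = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # label 0 -> 'usage'
    [3, 1, 0, 0, 0, 0, 1, 0, 3],  # label 3 -> 'part-whole'
]
data = np.array(rows)

# The label is the last column; everything before it is a feature.
X, y = data[:, :-1], data[:, -1]

# Write the rows to a CSV file (hypothetical file name) and read them back.
np.savetxt("features.csv", data, delimiter=",", fmt="%d")
restored = np.loadtxt("features.csv", delimiter=",")
print(X.shape, y.tolist(), restored.shape)
```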

Model Training

It seems better to use XGBoost through its scikit-learn-compatible interface. In other words, call fit() on an estimator such as XGBClassifier rather than using the native xgb.train() function.

Accuracy