The dissertation will be defended at the public meeting of the Dissertation Defence Council of the Scientific Field of Informatics Engineering in the SRA-I Meeting Hall of Vilnius Gediminas Technical University at 14:00 on 12 June 2024.
The dissertation explores the identification, extraction, and documentation of content blocks in web pages. The object of research is the methodologies for these processes, which are intended to improve computer perception of web page data. The primary goal is to conduct an in-depth analysis of datasets containing web page content blocks in order to enhance their granularity and minimise the number of blocks requiring manual labelling. The dissertation undertakes four tasks: (1) conducting a systematic analysis of the latest research on data extraction from web pages; (2) developing a structured web page dataset that accommodates a variety of features for different content blocks and is compatible with various data extraction methods; (3) creating a solution for partly automated labelling of content blocks in web pages, which establishes relationships between content blocks and groups them, thereby reducing the need for manual review; (4) evaluating the effectiveness of the developed dataset and labelling solution in identifying, grouping, and establishing relationships between web page content blocks. The dissertation comprises an introduction, four main chapters, conclusions, references, and appendices. The introduction presents the research problem, significance, objectives, methodology, novelty, practical implications, and defended statements, lists the author’s conference presentations, and outlines the dissertation’s structure. The first chapter focuses on Web Mining and examines the challenges and evolution of data extraction and classification techniques. The second chapter explores methods for determining HTML block similarity, considering both data and structure. The third chapter details the creation of a dataset for improved data extraction, highlighting the need for diverse information about block types, features, and structures. The fourth chapter presents advanced methods for identifying HTML content blocks and enhancing the accuracy and efficiency of content extraction.
Four articles were published on the topic of the dissertation: two in publications on the main list of Clarivate Analytics Web of Science and two in scientific conference proceedings. Research results were presented at three international conferences: 6th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), 2018, Vilnius, Lithuania; Open Conference of Electrical, Electronic and Information Sciences (eStream), 2018, Vilnius, Lithuania; International Conference on Science & Technology (STRA), 2023, Prague, Czech Republic.
Readers can find the doctoral dissertation via the VILNIUS TECH Virtual Library.