Seminars in 2017
    The authors have conducted studies on recognizing Arabic news captions to develop a video retrieval system for indexing and editing Arabic broadcast programs received daily and stored in a large database. This paper describes a dedicated OCR for recognizing low-resolution news captions in video images. A news caption recognition system consisting of text line extraction, word segmentation, and segmentation-recognition of words is developed, and its performance is experimentally evaluated using datasets of frame images extracted from AlJazeera broadcast programs. Character recognition of moving news captions is difficult due to combing noise caused by the interlacing of scan lines. A technique to detect and eliminate the combing noise so that moving news captions can be recognized correctly is proposed. This paper also proposes a technique based on inter-frame text difference to detect transition frames of still news captions; detecting transition frames is necessary for efficient video retrieval and playback. The proposed technique is experimentally tested, shown to be robust to quick motion of the background, and able to detect transition frames correctly with an F-measure higher than 90%. Compared with the ABBYY FineReader 11® commercial OCR, the dedicated OCR improves the recall of Arabic characters in AlJazeera broadcast news from 70.74% to 95.85% for non-interlaced moving news captions and from 23.82% to 96.29% for interlaced moving news captions.
    Attached files: Recognition and transition.pdf
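    The combing-noise handling described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual algorithm: frames are lists of grayscale rows, combing is scored by how much each row deviates from its vertical neighbors, and the deinterlacer simply rebuilds odd rows by neighbor averaging.

    ```python
    def combing_score(frame):
        """Mean absolute difference between each interior row and the
        average of its two neighbors; high values suggest interlace combing."""
        total, count = 0.0, 0
        for y in range(1, len(frame) - 1):
            for x in range(len(frame[0])):
                interp = (frame[y - 1][x] + frame[y + 1][x]) / 2.0
                total += abs(frame[y][x] - interp)
                count += 1
        return total / count

    def deinterlace(frame):
        """Keep even rows; rebuild odd rows by averaging vertical neighbors."""
        out = [row[:] for row in frame]
        for y in range(1, len(frame) - 1, 2):
            out[y] = [(frame[y - 1][x] + frame[y + 1][x]) / 2.0
                      for x in range(len(frame[0]))]
        return out
    ```

    On a synthetic combed frame (alternating bright and dark rows), the score drops to zero after deinterlacing, which is the kind of signal a detector could threshold on.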
    Forest fires are a serious hazard in many places around the world. For such threats, video-based smoke detection is particularly important for early warning, because smoke arises in any forest fire and can be seen from a long distance. This paper presents a novel and robust approach for smoke detection that employs Deep Belief Networks. The proposed method is divided into three phases. In the preprocessing phase, regions of high motion are extracted by a background subtraction method. In the next phase, smoke pixel intensities are extracted from the RGB (red, green, blue) and YCbCr (luminance, blue-chroma, red-chroma) color spaces for the foreground regions. Subsequently, a second, texture-based feature is computed for detecting smoke regions: the Local Extrema Co-occurrence Pattern, an improved version of local binary patterns, is extracted from the foreground regions and captures not only the texture of smoke but also its intensity and color using the Hue-Saturation-Value color space. Finally, a Deep Belief Network is employed for classification. The proposed method proves its accuracy and robustness when tested on a variety of scenarios: wildfire smoke, hill-based smoke, and indoor and outdoor smoke videos.
    Attached files: seminar_2017-07-08.pdf
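    The preprocessing phase relies on background subtraction to isolate high-motion regions. A minimal running-average sketch is shown below; the abstract does not specify the background model, so the update rule, the learning rate `alpha`, and the threshold `thresh` are assumptions for illustration.

    ```python
    def update_background(bg, frame, alpha=0.05):
        """Running-average background update (alpha is a hypothetical
        learning rate, not taken from the paper)."""
        return [[(1 - alpha) * b + alpha * f for b, f in zip(br, fr)]
                for br, fr in zip(bg, frame)]

    def foreground_mask(bg, frame, thresh=20):
        """Mark pixels that differ from the background model by more
        than a fixed threshold as foreground (candidate motion)."""
        return [[1 if abs(f - b) > thresh else 0 for b, f in zip(br, fr)]
                for br, fr in zip(bg, frame)]
    ```

    The resulting mask would then feed the color- and texture-feature stages described in the abstract.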
    Markov random fields are widely used to model many computer vision problems that can be cast in an energy minimization framework composed of unary and pairwise potentials. While computationally tractable discrete optimizers such as Graph Cuts and belief propagation (BP) exist for multi-label discrete problems, they still face prohibitively high computational challenges when the labels reside in a huge or very densely sampled space. Integrating key ideas from PatchMatch of effective particle propagation and resampling, PatchMatch belief propagation (PMBP) has been demonstrated to have good performance in addressing continuous labeling problems and runs orders of magnitude faster than Particle BP (PBP). However, the quality of the PMBP solution is tightly coupled with the local window size, over which the raw data cost is aggregated to mitigate ambiguity in the data constraint. This dependency heavily influences the overall complexity, increasing linearly with the window size. This paper proposes a novel algorithm called sped-up PMBP (SPM-BP) to tackle this critical computational bottleneck and speeds up PMBP by 50-100 times. The crux of SPM-BP is on unifying efficient filter-based cost aggregation and message passing with PatchMatch-based particle generation in a highly effective way. Though simple in its formulation, SPM-BP achieves superior performance for sub-pixel accurate stereo and optical-flow on benchmark datasets when compared with more complex and task-specific approaches.
    Matching cost aggregation is one of the oldest and still popular methods for stereo correspondence. While effective and efficient, cost aggregation methods typically aggregate the matching cost by summing/averaging over a user-specified, local support region. This is obviously only locally-optimal, and the computational complexity of the full-kernel implementation usually depends on the region size. In this paper, the cost aggregation problem is reexamined and a non-local solution is proposed. The matching cost values are aggregated adaptively based on pixel similarity on a tree structure derived from the stereo image pair to preserve depth edges. The nodes of this tree are all the image pixels, and the edges are all the edges between the nearest neighboring pixels. The similarity between any two pixels is decided by their shortest distance on the tree. The proposed method is non-local as every node receives supports from all other nodes on the tree. As can be expected, the proposed non-local solution outperforms all local cost aggregation methods on the standard (Middlebury) benchmark. Besides, it has great advantage in extremely low computational complexity: only a total of 2 addition/subtraction operations and 3 multiplication operations are required for each pixel at each disparity level. It is very close to the complexity of unnormalized box filtering using integral image which requires 6 addition/subtraction operations. Unnormalized box filter is the fastest local cost aggregation method but blurs across depth edges. The proposed method was tested on a MacBook Air laptop computer with a 1.8 GHz Intel Core i7 CPU and 4 GB memory. The average runtime on the Middlebury data sets is about 90 milliseconds, and is only about 1.25× slower than unnormalized box filter. A non-local disparity refinement method is also proposed based on the non-local cost aggregation method.
    Attached files: A Non-Local Aggregation Method Stereo Matching.pdf
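    The constant-per-pixel cost of the tree aggregation comes from two linear passes over the tree (leaf-to-root, then root-to-leaf). The sketch below illustrates that two-pass scheme on an explicit tree; the exponential similarity weight and the tree encoding (parent/children arrays, edge weights) are assumptions chosen for clarity, not the paper's exact data structures.

    ```python
    import math

    def aggregate_on_tree(cost, parent, children, weight, sigma=0.1):
        """Two-pass non-local cost aggregation sketch.
        cost[v]: raw matching cost at node v; parent[v]: parent index
        (-1 for the root); children[v]: child indices; weight[v]: edge
        weight between v and its parent."""
        n = len(cost)
        root = parent.index(-1)
        # per-node similarity to its parent: S = exp(-edge_weight / sigma)
        sim = [math.exp(-weight[v] / sigma) if parent[v] >= 0 else 0.0
               for v in range(n)]
        order = [root]
        for v in order:            # BFS: parents are visited before children
            order.extend(children[v])
        up = list(cost)
        for v in reversed(order):  # pass 1: accumulate subtree costs upward
            for c in children[v]:
                up[v] += sim[c] * up[c]
        agg = list(up)
        for v in order[1:]:        # pass 2: blend in the rest of the tree
            p = parent[v]
            agg[v] = up[v] + sim[v] * (agg[p] - sim[v] * up[v])
        return agg
    ```

    With unit similarity (zero edge weights) every node receives the full sum of all costs, which matches the intuition that each node is supported by every other node on the tree.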
    We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a non-parametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
    Attached files: CVPR_Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.pdf
    With the increasing number of machine learning methods used for segmenting images and analyzing videos, there has been a growing need for large datasets with pixel-accurate ground truth. In this letter, we propose a highly accurate semi-automatic method for segmenting foreground moving objects pictured in surveillance videos. Given a limited number of user interventions, the goal of the method is to provide results sufficiently accurate to be used as ground truth. In this paper, we show that by manually outlining a small number of moving objects, we can get our model to learn the appearance of the background and the foreground moving objects. Since the background and foreground moving objects are highly redundant from one image to another (videos come from surveillance cameras) the model does not need a large number of examples to accurately fit the data. Our end-to-end model is based on a multi-resolution convolutional neural network (CNN) with a cascaded architecture. Tests performed on the largest publicly-available video dataset with pixel-accurate ground truth (changedetection.net) reveal that on videos from 11 categories, our approach has an average F-measure of 0.95 which is within the error margin of a human being. With our model, the amount of manual work for ground truthing a video gets reduced by a factor of up to 40. Code is made publicly available at: https://github.com/zhimingluo/MovingObjectSegmentation
    Attached files: Interactive deep learning method for segmenting moving object.pdf
    In this work, a deep learning approach has been developed to carry out road detection using only LIDAR data. Starting from an unstructured point cloud, top-view images encoding several basic statistics such as mean elevation and density are generated. By considering a top-view representation, road detection is reduced to a single-scale problem that can be addressed with a simple and fast fully convolutional neural network (FCN). The FCN is specifically designed for the task of pixel-wise semantic segmentation by combining a large receptive field with high-resolution feature maps. The proposed system achieved excellent performance and it is among the top-performing algorithms on the KITTI road benchmark. Its fast inference makes it particularly suitable for real-time applications.
    Attached files: Fast LIDAR-based Road Detection Using Fully Convolutional Neural Networks.pdf
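    The top-view encoding step (binning points into a grid and recording per-cell statistics) can be sketched as below. The cell size, grid extent, and choice of statistics (point density and mean elevation) follow the abstract's description only loosely; the exact parameters are assumptions.

    ```python
    def topview_stats(points, cell=0.1, grid=5):
        """Bin (x, y, z) points into a top-view grid; for each cell record
        point density and mean elevation (cell size and grid extent are
        illustrative assumptions)."""
        density = [[0] * grid for _ in range(grid)]
        mean_z = [[0.0] * grid for _ in range(grid)]
        for x, y, z in points:
            i, j = int(x / cell), int(y / cell)
            if 0 <= i < grid and 0 <= j < grid:
                density[i][j] += 1
                mean_z[i][j] += z
        for i in range(grid):
            for j in range(grid):
                if density[i][j]:
                    mean_z[i][j] /= density[i][j]
        return density, mean_z
    ```

    Each statistic becomes one channel of the top-view image fed to the FCN.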
    A robust vanishing point estimation method is proposed that uses a probabilistic voting procedure based on intersection points of line segments extracted from an input image. The proposed voting function is defined with line segment strength that represents relevance of the extracted line segments. Next, candidate line segments for lanes are selected by considering geometric constraints. Finally, the host lane is detected by using the proposed score function, which is designed to remove outliers in the candidate line segments. Also, the detected host lane is refined by using inter-frame similarity that considers location consistency of the detected host lane and the estimated vanishing point in consecutive frames. Furthermore, in order to reduce computational costs in the vanishing point estimation process, a method using a lookup table is proposed.
    Attached files: A Robust Lane Detection Method Based on.pdf
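    The voting procedure above accumulates pairwise intersections of line segments into an accumulator and picks the strongest cell. The sketch below uses segment length as a stand-in for the paper's line-segment strength, and a coarse grid accumulator instead of the paper's lookup-table optimization; both are assumptions.

    ```python
    import math

    def intersect(l1, l2):
        """Intersection of two infinite lines given as ((x1,y1),(x2,y2))."""
        (x1, y1), (x2, y2) = l1
        (x3, y3), (x4, y4) = l2
        d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        if abs(d) < 1e-9:
            return None  # parallel lines
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / d
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

    def vote_vanishing_point(segments, bins=10, extent=100.0):
        """Accumulate pairwise intersections into a bins x bins grid,
        weighted by the product of segment lengths."""
        def length(s):
            (x1, y1), (x2, y2) = s
            return math.hypot(x2 - x1, y2 - y1)
        acc = {}
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                p = intersect(segments[i], segments[j])
                if p is None:
                    continue
                cell = (int(p[0] * bins / extent), int(p[1] * bins / extent))
                acc[cell] = acc.get(cell, 0.0) + \
                    length(segments[i]) * length(segments[j])
        return max(acc, key=acc.get) if acc else None
    ```

    The winning cell approximates the vanishing point, which then constrains the lane candidate selection.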
    When considering person re-identification (re-ID) as a retrieval process, re-ranking is a critical step to improve its accuracy. Yet in the re-ID community, limited effort has been devoted to re-ranking, especially those fully automatic, unsupervised solutions. In this paper, we propose a k-reciprocal encoding method to re-rank the re-ID results. Our hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match. Specifically, given an image, a k-reciprocal feature is calculated by encoding its k-reciprocal nearest neighbors into a single vector, which is used for re-ranking under the Jaccard distance. The final distance is computed as the combination of the original distance and the Jaccard distance. Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets. Experiments on the large-scale Market-1501, CUHK03, MARS, and PRW datasets confirm the effectiveness of our method.
    Attached files: re-ranking with reciprocal encoding.pdf
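    The two core ingredients named above, k-reciprocal nearest neighbors and the Jaccard distance between neighbor sets, can be sketched directly. This is a simplified illustration on a plain distance matrix; the paper's full method additionally expands the reciprocal sets and encodes them as soft weighted vectors, which is omitted here.

    ```python
    def k_reciprocal_neighbors(dist, q, k):
        """Indices i such that i is among q's k nearest neighbors AND
        q is among i's k nearest neighbors (dist: full distance matrix)."""
        def knn(a):
            # k+1 closest indices to a, including a itself (self-distance 0)
            return sorted(range(len(dist)), key=lambda b: dist[a][b])[:k + 1]
        neighbors = set(knn(q)) - {q}
        return {i for i in neighbors if q in knn(i)}

    def jaccard_distance(set_a, set_b):
        """1 - |intersection| / |union| of two neighbor sets."""
        if not set_a and not set_b:
            return 0.0
        return 1.0 - len(set_a & set_b) / len(set_a | set_b)
    ```

    A gallery image whose reciprocal neighbor set overlaps heavily with the probe's gets a small Jaccard distance, and is therefore promoted in the re-ranked list.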
    There is a huge proliferation of surveillance systems that require strategies for detecting different kinds of stationary foreground objects (e.g., unattended packages or illegally parked vehicles). As these strategies must be able to detect foreground objects remaining static in crowd scenarios, regardless of how long they have not been moving, several algorithms for detecting different kinds of such foreground objects have been developed over the last decades. This paper presents an efficient and high-quality strategy to detect stationary foreground objects, which is able to detect not only completely static objects but also partially static ones. Three parallel nonparametric detectors with different absorption rates are used to detect currently moving foreground objects, short-term stationary foreground objects, and long-term stationary foreground objects. The results of the detectors are fed into a novel finite state machine that classifies the pixels among background, moving foreground objects, stationary foreground objects, occluded stationary foreground objects, and uncovered background. Results show that the proposed detection strategy is not only able to achieve high quality in several challenging situations but it also improves upon previous strategies.
    Attached files: 20170513-Saturday Seminar-Wahyono.pdf
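    The finite state machine driven by the three detectors can be sketched as a per-pixel transition function. The state set and transitions below are a simplified illustration (the paper's machine also models occluded stationary objects and uncovered background, which are omitted here); the flag names are assumptions.

    ```python
    # Simplified per-pixel states (the paper uses a richer state set).
    BG, MOVING, SHORT_STATIC, LONG_STATIC = range(4)

    def step(state, in_moving, in_short, in_long):
        """One FSM transition. in_*: foreground flags from the three
        detectors with fast, medium, and slow absorption rates."""
        if state == BG:
            return MOVING if in_moving else BG
        if state == MOVING:
            if not in_moving and in_short:
                return SHORT_STATIC     # object stopped moving
            return MOVING if in_moving else BG
        if state == SHORT_STATIC:
            if in_long:
                return LONG_STATIC      # static long enough for slow detector
            return SHORT_STATIC if in_short else BG
        return LONG_STATIC if in_long else BG   # state == LONG_STATIC
    ```

    Running the transitions over time lets a pixel progress from moving, to short-term stationary, to long-term stationary, and back to background once the object leaves.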
    Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional networks achieve improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
    Attached files: Fully Convolutional Networks for Semantic Segmentation.pdf
    Compared with other video semantic clues, such as gestures, motions etc., video text generally provides highly useful and fairly precise semantic information, the analysis of which can to a great extent facilitate video and scene understanding. It can be observed that video texts show stronger edges. The Nonsubsampled Contourlet Transform (NSCT) is a fully shift-invariant, multi-scale, and multi-direction expansion, which can preserve the edge/silhouette of the text characters well. Therefore, in this paper, a new approach has been proposed to detect video text based on NSCT. First of all, the 8 directional coefficients of NSCT are combined to build the directional edge map (DEM), which can keep the horizontal, vertical and diagonal edge features and suppress other directional edge features. Then various directional pixels of DEM are integrated into a whole binary image (BE). Based on the BE, text frame classification is carried out to determine whether the video frames contain the text lines. Finally, text detection based on the BE is performed on consecutive frames to discriminate the video text from non-text regions. Experimental evaluations based on our collected TV video data set demonstrate that our method significantly outperforms the other 3 video text detection algorithms in both detection speed and accuracy, especially when there are challenges such as video text with various sizes, languages, colors, fonts, short or long text lines.
    Attached files: art%3A10.1007%2Fs11042-017-4619-8.pdf
    Estimating the disparity and normal direction of one pixel simultaneously instead of only disparity, also known as 3D label methods, can achieve much higher sub-pixel accuracy in the stereo matching problem. However, it is extremely difficult to assign an appropriate 3D label to each pixel from the continuous label space R^3 while maintaining global consistency because of the infinite parameter space. In this paper, we propose a novel algorithm called PatchMatch-based Superpixel Cut (PMSC) to assign 3D labels of an image more accurately. In order to achieve robust and precise stereo matching between local windows, we develop a bilayer matching cost, where a bottom-up scheme is exploited to design the two layers. The bottom layer is employed to measure the similarity between small square patches locally by exploiting a pre-trained convolutional neural network, and then the top layer is developed to assemble the local matching costs in large irregular windows induced by the tangent planes of object surfaces. To optimize the spatial smoothness of local assignments, we propose a novel strategy to update 3D labels. In the procedure of optimization, both segmentation information and random refinement of PatchMatch are exploited to update candidate 3D label set for each pixel with high probability of achieving lower loss. Since pairwise energy of general candidate label sets violates the submodular property of graph cut, we propose a novel multi-layer superpixel structure to group candidate label sets into candidate assignments, which thereby can be efficiently fused by α-expansion graph cut. Extensive experiments demonstrate that our method can achieve higher sub-pixel accuracy in different datasets, and currently ranks 1st on the new challenging Middlebury 3.0 benchmark among all the existing methods.
    Humans have various complex postures and movements. Considerable attention has been given to the problem of recognizing a human fall. However, for practical applications, recognition rates must be improved beyond those obtained in previous research. In this paper, a new recognition method based on the analysis of a human fall is provided. Five eigenvectors that describe a fall are defined, i.e., the aspect ratio, effective area ratio, human point margin, body axis angle, and centrifugal rate of the body contour. Then, a support vector machine based on the Gaussian radial basis function is trained to obtain a better identification result. The simulation results show that the model, through the combination of the five eigenvectors, has a recognition rate of 94.5%, which is a significant improvement compared to the previous research.
    Attached files: seminar paper.pdf
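    The first of the five features, the aspect ratio, can be sketched in a few lines: it is the width-to-height ratio of the body contour's bounding box, and it grows as a posture becomes horizontal. The contour representation as a point list is an assumption for illustration.

    ```python
    def aspect_ratio(points):
        """Width/height of the bounding box of a body contour; a fall
        tends to raise this ratio as the posture becomes horizontal."""
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        height = max(ys) - min(ys)
        return (max(xs) - min(xs)) / height if height else float("inf")
    ```

    In the paper's pipeline this value, together with the other four eigenvectors, forms the feature vector fed to the RBF-kernel SVM.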
    We address two difficulties in establishing an accurate system for image matching. First, image matching relies on the descriptor for feature extraction, but the optimal descriptor often varies from image to image, or even patch to patch. Second, conventional matching approaches carry out geometric checking on a small set of correspondence candidates due to the concern of efficiency. It may result in restricted performance in recall. We aim at tackling the two issues by integrating adaptive descriptor selection and progressive candidate enrichment into image matching. We consider that the two integrated components are complementary: The high-quality matching yielded by adaptively selected descriptors helps in exploring more plausible candidates, while the enriched candidate set serves as a better reference for descriptor selection. It motivates us to formulate image matching as a joint optimization problem, in which adaptive descriptor selection and progressive correspondence enrichment are alternately conducted. Our approach is comprehensively evaluated and compared with the state-of-the-art approaches on two benchmarks. The promising results manifest its effectiveness.
    Attached files: ??.pdf
    This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.
    Attached files: Kazemi_One_Millisecond_Face_2014_CVPR_paper.pdf
    Convolutional network techniques have recently achieved great success in vision based detection tasks. This paper introduces the recent development of our research on transplanting the fully convolutional network technique to the detection tasks on 3D range scan data. Specifically, the scenario is set as the vehicle detection task from the range data of Velodyne 64E lidar. We propose to present the data in a 2D point map and use a single 2D end-to-end fully convolutional network to predict the objectness confidence and the bounding boxes simultaneously. By carefully designing the bounding box encoding, the network is able to predict full 3D bounding boxes even using a 2D convolutional network. Experiments on the KITTI dataset show the state-of-the-art performance of the proposed method.
    Background subtraction is usually based on low-level or hand-crafted features such as raw color components, gradients, or local binary patterns. As an improvement, we present a background subtraction algorithm based on spatial features learned with convolutional neural networks (ConvNets). Our algorithm uses a background model reduced to a single background image and a scene-specific training dataset to feed ConvNets that prove able to learn how to subtract the background from an input image patch. Experiments conducted on the 2014 ChangeDetection.net dataset show that our ConvNet based algorithm at least reproduces the performance of state-of-the-art methods, and that it even outperforms them significantly when scene-specific knowledge is considered.
    Attached files: BS with scene specific.pdf
    We propose a novel object localization methodology with the purpose of boosting the localization accuracy of state-of-the-art object detection systems. Our model, given a search region, aims at returning the bounding box of an object of interest inside this region. To accomplish its goal, it relies on assigning conditional probabilities to each row and column of this region, where these probabilities provide useful information regarding the location of the boundaries of the object inside the search region and allow the accurate inference of the object bounding box under a simple probabilistic framework. For implementing our localization model, we make use of a convolutional neural network architecture that is properly adapted for this task, called LocNet. We show experimentally that LocNet achieves a very significant improvement on the mAP for high IoU thresholds on PASCAL VOC2007 test set and that it can be very easily coupled with recent state-of-the-art object detection systems, helping them to boost their performance. Finally, we demonstrate that our detection approach can achieve high detection accuracy even when it is given as input a set of sliding windows, thus proving that it is independent of box proposal methods.
    Attached files: Gidaris_LocNet_Improving_Localization_CVPR_2016_paper.pdf
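    The idea of recovering a box from per-row and per-column probabilities can be sketched very simply: take the span of rows and columns whose "inside the object" probability is high. The paper performs proper probabilistic inference; the fixed 0.5 threshold here is an assumption used only to make the mechanism concrete.

    ```python
    def box_from_probs(row_p, col_p, thresh=0.5):
        """Sketch: infer (x1, y1, x2, y2) from LocNet-style in-out
        probabilities per row/column (threshold is a simplification
        of the paper's probabilistic inference)."""
        rows = [i for i, p in enumerate(row_p) if p > thresh]
        cols = [j for j, p in enumerate(col_p) if p > thresh]
        if not rows or not cols:
            return None
        return (min(cols), min(rows), max(cols), max(rows))
    ```

    Because the rows and columns are scored independently, the box boundaries can be located with sub-window precision inside the search region.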
    Matching pedestrians across multiple camera views, known as human re-identification (re-id), is a challenging problem in visual surveillance. In the existing works concentrating on feature extraction, representations are formed locally and independent of other regions. We present a novel siamese Long Short-Term Memory (LSTM) architecture that can process image regions sequentially and enhance the discriminative capability of local feature representation by leveraging contextual information. The feedback connections and internal gating mechanism of the LSTM cells enable our model to memorize the spatial dependencies and selectively propagate relevant contextual information through the network. We demonstrate improved performance compared to the baseline algorithm with no LSTM units and promising results compared to state-of-the-art methods on Market-1501, CUHK03 and VIPeR datasets. Visualization of the internal mechanism of LSTM cells shows meaningful patterns can be learned by our method.
    Attached files: 1607.08381.pdf
    Research on video analysis for fire detection has become a hot topic in computer vision. However, conventional algorithms rely exclusively on rule-based models and hand-crafted feature vectors to classify whether a frame contains fire. These features are difficult to define and depend largely on the kind of fire observed, which leads to low detection rates and high false-alarm rates. A different approach is to use a learning algorithm to extract useful features instead of having an expert build them. In this paper, we propose a convolutional neural network (CNN) for identifying fire in videos. Convolutional neural networks have been shown to perform very well in object classification, and they can perform feature extraction and classification within the same architecture. Tested on real video sequences, the proposed approach achieves better classification performance than some relevant conventional video fire detection methods and indicates that using CNNs to detect fire in videos is very promising.
    Attached files: paper.pdf
    We present a block-wise approach to detect stationary objects based on spatio-temporal change detection. First, block candidates are extracted by filtering out consecutive blocks containing moving objects. Then, an online clustering approach groups similar blocks at each spatial location over time via statistical variation of pixel ratios. The stability changes are identified by analyzing the relationships between the most repeated clusters at regular sampling instants. Finally, stationary objects are detected as those stability changes that exceed an alarm time and have not been visualized before. Unlike previous approaches making use of Background Subtraction, the proposed approach does not require foreground segmentation and provides robustness to illumination changes, crowds and intermittent object motion. The experiments over a heterogeneous dataset demonstrate the ability of the proposed approach for short- and long-term operation while overcoming challenging issues.
    Text displayed in a video is an essential part for the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate the false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds to make it processible for standard OCR (Optical Character Recognition) software. Operability and accuracy of proposed text detection and binarization methods have been evaluated by using publicly available test data sets.
    Attached files: art%3A10.1007%2Fs11042-012-1250-6.pdf
    This article tackles the problem of estimating non-rigid human 3D shape and motion from image sequences taken by uncalibrated cameras. Similar to other state-of-the-art solutions we factorize 2D observations in camera parameters, base poses and mixing coefficients. Existing methods require sufficient camera motion during the sequence to achieve a correct 3D reconstruction. To obtain convincing 3D reconstructions from arbitrary camera motion, our method is based on a priori trained base poses. We show that strong periodic assumptions on the coefficients can be used to define an efficient and accurate algorithm for estimating periodic motion such as walking patterns. For the extension to non-periodic motion we propose a novel regularization term based on temporal bone length constancy. In contrast to other works, the proposed method does not use a predefined skeleton or anthropometric constraints and can handle arbitrary camera motion. We achieve convincing 3D reconstructions, even under the influence of noise and occlusions. Multiple experiments based on a 3D error metric demonstrate the stability of the proposed method. Compared to other state-of-the-art methods our algorithm shows a significant improvement.
    Attached files: pami_final.pdf